
textToPrint <- "this is some text to print"

# our old friend print()

print(textToPrint)

# the nchar() function tells you the number of characters in a variable

nchar(textToPrint)

# the c() function concatenates (strings together) all its arguments into a single vector

c(textToPrint, textToPrint, textToPrint)

# we can check the data type of a variable using the function str() (like "structure")

str(textToPrint)

# we can tell this is a character because its structure is "chr"

# let's create some numeric variables

hoursPerDay <- 24

daysPerWeek <- 7

# we can check to make sure that these actually are numeric

class(hoursPerDay)

class(daysPerWeek)

# since this is numeric data, we can do math with it!

# "*" is the symbol for multiplication

hoursPerWeek <- hoursPerDay * daysPerWeek

hoursPerWeek

# Important! Just because something is a *number* doesn't mean R thinks it's numeric!

a <- 5

b <- "6"

# this will get you the error "non-numeric argument to binary operator", because b is
# stored as a character, even though it looks like a number!

a*b

# You can change character data to numeric data using the as.numeric() function.

# This will let you do math with it again. :)

a * as.numeric(b)

# check out the structure: note that b changes from "chr" to "num"

str(b)

str(as.numeric(b))

# to convert b to a number permanently, uncomment the next line

# b <- as.numeric(b)
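
One caveat with as.numeric(): it only works on strings that actually look like numbers. A minimal sketch, assuming a made-up character vector `mixed` (not part of the notes above):

```r
# hypothetical character vector; "six" cannot be parsed as a number
mixed <- c("6", "7.5", "six")

# as.numeric() turns unparseable strings into NA and raises a warning;
# suppressWarnings() just keeps the example output quiet
converted <- suppressWarnings(as.numeric(mixed))

converted          # 6.0 7.5  NA
is.na(converted)   # FALSE FALSE  TRUE
```

Checking for NA after a conversion is a cheap way to catch values that were not really numbers.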

# let's make a vector!

listOfNumbers <- c(1,5,91,42.8,100008.41)

listOfNumbers

# because this is a numeric vector, we can do math on it! When you do math to a vector,
# it happens to every number in the vector. (If you're familiar with matrix
# multiplication, it's the same thing as multiplying a 1x1 matrix by a 1xN matrix.)

# multiply every number in the vector by 5

5 * listOfNumbers
# add one to every number in the vector

listOfNumbers + 1

# get the first item from "listOfNumbers"

listOfNumbers[1]
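
Square brackets can select more than one item. A short sketch using the same vector; the variable names `firstAndThird`, `withoutFirst` and `bigOnes` are made up for illustration:

```r
listOfNumbers <- c(1, 5, 91, 42.8, 100008.41)

# R counts from 1, so this picks the first and third items
firstAndThird <- listOfNumbers[c(1, 3)]

# a negative index drops that position instead of selecting it
withoutFirst <- listOfNumbers[-1]

# a logical condition keeps only the items for which it is TRUE
bigOnes <- listOfNumbers[listOfNumbers > 10]

firstAndThird   # 1 91
withoutFirst
bigOnes
```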

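# tell R that we're going to use the tidyverse library (it provides read_csv())
library(tidyverse)

# read in our dataset as a data_frame
chocolateData <- read_csv("../input/chocolate-bar-ratings/flavors_of_cacao.csv")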

# (changed to)

# some of our column names have spaces in them. This line changes the column names to

# versions without spaces, which lets us talk about the columns by their names.

names(chocolateData) <- make.names(names(chocolateData), unique=TRUE)

# the head() function shows just the first few rows of a data frame.

head(chocolateData)

# the tail() function shows just the last few rows of a data frame.
# we can also give both functions a specific number of rows to show.
# This line will show the last three rows of "chocolateData".

tail(chocolateData, 3)

# get the contents of the cell in the sixth row and the fourth column

chocolateData[6,4]

# general form: dataframe[row, column]
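
The same [row, column] pattern on a tiny data frame. This is only a sketch; `toy` and its values are invented for illustration:

```r
# hypothetical three-row data frame
toy <- data.frame(bar = c("A", "B", "C"),
                  rating = c(3.5, 2.75, 4),
                  stringsAsFactors = FALSE)

toy[2, 2]          # row 2, column 2: 2.75
toy[2, ]           # leaving the column blank returns the whole second row
toy[, "rating"]    # leaving the row blank returns the whole column
toy$rating[3]      # the $ shortcut, then an ordinary vector index: 4
```

Note that a data_frame (tibble) created by read_csv() returns a one-cell table for `[6, 4]` rather than a bare value.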

# Before we get going, let's get rid of the white spaces in the column names of this

# dataset. This will make it possible for us to refer to columns by their names, since

# any white space in a name will mess R up.

# "[[:space:]]+" matches one or more whitespace characters in a row
names(chocolateData) <- gsub("[[:space:]]+", "_", names(chocolateData))
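
These notes use two ways of cleaning column names, gsub() and make.names(). A small sketch comparing them on made-up names (`messyNames` is hypothetical; the pattern "[[:space:]]+" means "one or more whitespace characters"):

```r
# hypothetical column names; the second one contains two spaces in a row
messyNames <- c("Cocoa Percent", "Review  Date", "Rating")

# gsub() replaces each run of whitespace with a single underscore
underscored <- gsub("[[:space:]]+", "_", messyNames)

# make.names() replaces every invalid character with a dot
dotted <- make.names(messyNames, unique = TRUE)

underscored   # "Cocoa_Percent" "Review_Date"   "Rating"
dotted        # "Cocoa.Percent" "Review..Date"  "Rating"
```

Whichever one is used decides whether later code must say `Cocoa_Percent` or `Cocoa.Percent`.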

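str(chocolateData)

# sapply() with typeof() gives the storage type of every column.
# Example output for a data frame called "my.data":
sapply(my.data, typeof)
#        y        x1        x2        X3
# "double" "integer" "logical" "integer"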

#print the first few values from the column named "Rating" in the dataframe "chocolateData"

head(chocolateData$Rating)

The tidyverse has a helper for fixing column types: type_convert(), which will look at the first 1000 rows
of each column, guess what the data type of that column should be and then convert that column into that data type.

# remove all the percent signs in the fifth column. You don't really need to worry about

# all the different things that are happening in this line right now.

chocolateData$Cocoa_Percent <- sapply(chocolateData$Cocoa_Percent, function(x) gsub("%", "", x))


# remove the first row of our dataset using a negative index
chocolateData <- chocolateData[-1,]

# make sure we removed the row we didn't want


head(chocolateData)

# try the type_convert() function again

chocolateData <- type_convert(chocolateData)

# check the structure to make sure Cocoa_Percent is now numeric

str(chocolateData)
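
The whole clean-then-convert step on a toy table, as a sketch. It assumes the readr package (part of the tidyverse) is installed; `toy` and its values are made up:

```r
library(readr)

# hypothetical data where every column was read in as text
toy <- data.frame(Cocoa_Percent = c("70%", "63%", "85%"),
                  Rating = c("3.75", "2.5", "3"),
                  stringsAsFactors = FALSE)

# gsub() is vectorised, so it can strip the "%" from the whole column at once
toy$Cocoa_Percent <- gsub("%", "", toy$Cocoa_Percent)

# type_convert() re-guesses the type of each character column;
# suppressMessages() hides the "column specification" message
toy <- suppressMessages(type_convert(toy))

str(toy)   # both columns should now be reported as "num"
```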

# source notebook: https://www.kaggle.com/ericson61/getting-started-in-r-summarize-data/edit

# splitting a variable
strsplit(LOAN$RATE,"%")
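
What strsplit() returns, sketched on a made-up vector (`rates` stands in for a column like the one above and is purely illustrative):

```r
# hypothetical percentages stored as text
rates <- c("7.5%", "12%", "3.25%")

# strsplit() returns a list with one character vector per input string
pieces <- strsplit(rates, "%")

# take the first piece of each element, then convert to numbers
rateNumbers <- as.numeric(sapply(pieces, `[`, 1))

rateNumbers   # 7.50 12.00  3.25
```

Because the result is a list, it usually needs a step like the sapply() call above before it can go back into a data frame column.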

> coklat = type_convert(coklat)
Parsed with column specification:
cols(
  Cocoa.Percent = col_character()
)
> coklat$Cocoa.Percent = sapply(coklat$Cocoa.Percent, function(x) gsub("%", "", x))
> coklat = type_convert(coklat)
Parsed with column specification:
cols(
  Cocoa.Percent = col_double()
)

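# get the mean of every column (funs() is deprecated in recent dplyr versions,
# so the function is passed directly; non-numeric columns give NA with a warning)
summarise_all(chocolateData, mean)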

## Summarizing a specific variable

# return a data_frame containing the mean and sd of the Rating column
# of the chocolate dataset

chocolateData %>%

summarise(averageRating = mean(Rating),

sdRating = sd(Rating))

The functions we used above give you an overview of the entire dataset, but often you're only
interested in one or two variables. We can look at specific variables really easily using the summarise()
function and pipes. Pipes are part of the Tidyverse package we loaded in the beginning: if you try to use
them without loading the package, you'll get an error.
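
A pipe simply passes whatever is on its left as the first argument of the function on its right. A minimal sketch, assuming dplyr is installed; `toyRatings` is a made-up table:

```r
library(dplyr)

# hypothetical ratings
toyRatings <- data.frame(Rating = c(3, 3.5, 4))

# with a pipe: "take toyRatings, then summarise it"
piped <- toyRatings %>%
  summarise(averageRating = mean(Rating))

# the same call written without a pipe
nested <- summarise(toyRatings, averageRating = mean(Rating))

piped$averageRating   # 3.5
```

Both forms give the same result; the pipe version just reads top to bottom when several steps are chained.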

## Summarizing a specific variable by group

chocolateData %>%

group_by(Review_Date) %>%

summarise(averageRating = mean(Rating),

sdRating = sd(Rating))
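
How group_by() changes the result, sketched on made-up data (assumes dplyr is installed; `toy` and `byYear` are illustrative names):

```r
library(dplyr)

# hypothetical ratings from two review years
toy <- data.frame(Review_Date = c(2015, 2015, 2016, 2016),
                  Rating = c(3, 4, 2, 3))

# one output row per year instead of one row for the whole table
byYear <- toy %>%
  group_by(Review_Date) %>%
  summarise(averageRating = mean(Rating),
            sdRating = sd(Rating),
            nRatings = n())   # n() counts the rows in each group

byYear
```

A group that contains a single row gets an sd of NA, since a standard deviation needs at least two values.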

# tell R that we're going to use the tidyverse library
library(tidyverse)

# read in our dataset as a data_frame
coklat = read.csv("…")

# remove the first line of our dataset using a negative index
chocolateData <- chocolateData[-1,]

# remove the white spaces in the column names

names(chocolateData) <- gsub("[[:space:]]+", "_", names(chocolateData))

# remove percentage signs in the Cocoa_Percent

chocolateData <- type_convert(chocolateData)

chocolateData$Cocoa_Percent <- sapply(chocolateData$Cocoa_Percent, function(x) gsub("%", "", x))

chocolateData <- type_convert(chocolateData)

# Poker and roulette winnings from Monday to Friday:

poker_vector <- c(140, -50, 20, -120, 240)

roulette_vector <- c(-24, -50, 100, -350, 10)

days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

names(poker_vector) <- days_vector

names(roulette_vector) <- days_vector

# Which days did you make money on roulette?

selection_vector <- roulette_vector > 0

# Select from roulette_vector these days

roulette_winning_days <- roulette_vector[selection_vector]

print(roulette_winning_days)
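
The same logical-selection idea can answer other questions. A sketch using the poker figures from above (the result names `poker_winnings_total` and `poker_losing_days` are made up):

```r
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

# total won on winning days only: 140 + 20 + 240
poker_winnings_total <- sum(poker_vector[poker_vector > 0])

# names of the days that ended with a loss
poker_losing_days <- names(poker_vector)[poker_vector < 0]

poker_winnings_total   # 400
poker_losing_days      # "Tuesday"  "Thursday"
```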

# Box office Star Wars (in millions!)

new_hope <- c(460.998, 314.4)

empire_strikes <- c(290.475, 247.900)

return_jedi <- c(309.306, 165.8)


# Construct matrix

star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)

# Vectors region and titles, used for naming

region <- c("US", "non-US")

titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

# Name the columns with region

colnames(star_wars_matrix) = region

# Name the rows with titles

rownames(star_wars_matrix) = titles

# Print out star_wars_matrix

star_wars_matrix

# The worldwide box office figures

worldwide_vector <- rowSums(star_wars_matrix)

worldwide_vector

# Bind the new variable worldwide_vector as a column to star_wars_matrix

all_wars_matrix <- cbind(star_wars_matrix,worldwide_vector)

all_wars_matrix
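
Once a matrix has row and column names, cells can be looked up by name, and colSums() works like rowSums() but down the columns. A short sketch rebuilding the same matrix in one call with `dimnames` (`region_totals` is a made-up name):

```r
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

# dimnames = list(row names, column names) names the matrix in one step
star_wars_matrix <- matrix(c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(titles, region))

# look up one cell by row name and column name
star_wars_matrix["A New Hope", "US"]   # 460.998

# total box office per region
region_totals <- colSums(star_wars_matrix)
region_totals
```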

# Create speed_vector

speed_vector <- c("medium", "slow", "slow", "medium", "fast")

# Add your code below

factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "medium", "fast"))

# Print factor_speed_vector
factor_speed_vector

summary(factor_speed_vector)
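
What `ordered = TRUE` buys you is the ability to compare values. A small sketch with the same factor (`atLeastMedium` is a made-up name):

```r
factor_speed_vector <- factor(c("medium", "slow", "slow", "medium", "fast"),
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

# is the second observation ("slow") slower than the fifth ("fast")?
factor_speed_vector[2] < factor_speed_vector[5]   # TRUE

# which observations are "medium" or faster?
atLeastMedium <- factor_speed_vector >= "medium"
atLeastMedium
```

With an unordered factor the same comparisons return NA with a warning, because R has no ranking for the levels.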

# Create a data frame from the vectors

planets_df <- data.frame(name,type,diameter,rotation,rings)

# planets_df is pre-loaded in your workspace

# Use order() to create positions

positions <- order(planets_df$diameter)

# Use positions to sort planets_df

planets_df[positions,]
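
The vectors behind planets_df are not shown in these notes, so here is a self-contained sketch with a shortened, made-up version (`small_planets_df` and its approximate diameters relative to Earth are illustrative only):

```r
# hypothetical, shortened stand-in for the planets data
name <- c("Mercury", "Earth", "Jupiter", "Mars")
diameter <- c(0.382, 1, 11.209, 0.532)   # approximate, relative to Earth
rings <- c(FALSE, FALSE, TRUE, FALSE)

small_planets_df <- data.frame(name, diameter, rings, stringsAsFactors = FALSE)

# order() returns the row positions that would sort the column
positions <- order(small_planets_df$diameter)
positions                                   # 1 4 2 3

# use those positions as row indices to get the sorted data frame
sorted_df <- small_planets_df[positions, ]
sorted_df$name                              # "Mercury" "Mars" "Earth" "Jupiter"

# decreasing = TRUE sorts from largest to smallest
largest_first <- small_planets_df[order(small_planets_df$diameter, decreasing = TRUE), ]
```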
# draw a blank plot with "Review_Date" as the x axis and "Rating" as the y axis.

ggplot(chocolateData, aes(x= Review_Date, y = Rating))

# reference list of available geoms: http://ggplot2.tidyverse.org/reference/#section-layer-geoms

# draw a plot with "Review_Date" as the x axis and "Rating" as the y axis, and add a point for each data point

ggplot(chocolateData, aes(x= Review_Date, y = Rating)) + geom_point()

# draw a plot with "Review_Date" as the x axis and "Rating" as the y axis, add a point for each data point,
# move each point slightly so they don't overlap, and add a smoothed line (lm = linear model)

ggplot(chocolateData, aes(x= Review_Date, y = Rating)) +

geom_point() +

geom_jitter() +

geom_smooth(method = 'lm')

# save our plot to a variable with an informative name

chocolateRatingByReviewDate <- ggplot(chocolateData, aes(x = Review_Date, y = Rating, color = Cocoa_Percent)) +

geom_point() +

geom_jitter() +

geom_smooth(method = 'lm')

# save our plot

ggsave("chocolateRatingByReviewDate.png", # the name of the file where it will be saved

plot = chocolateRatingByReviewDate, # what plot to save

height=6, width=10, units="in") # the size of the plot & units of the size

# notice that this cell doesn't have any output in place! That's because in the first section we're

# giving the plot a name rather than printing it, and in the second we're saving our plot rather

# than printing it. We've never actually said to print our plot at any point.
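
To actually see a plot that was saved to a variable, print it. A minimal sketch, assuming ggplot2 is installed; `toy` and `toyPlot` are made-up names:

```r
library(ggplot2)

# hypothetical data
toy <- data.frame(Review_Date = c(2014, 2015, 2016),
                  Rating = c(3, 3.5, 3.25))

# assigning the plot to a name builds the plot object but draws nothing
toyPlot <- ggplot(toy, aes(x = Review_Date, y = Rating)) +
  geom_point()

# printing the object is what draws it
print(toyPlot)
```

Typing just `toyPlot` on its own line at the console has the same effect, because R prints the value of a bare expression.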

# Return the average rating by the year a rating was given

averageRatingByYear <- chocolateData %>%

group_by(Review_Date) %>%

summarise(averageRating = mean(Rating))

# plot only the average rating by year

ggplot(averageRatingByYear, aes(y= averageRating, x = Review_Date )) +

geom_point() + # plot individual points

geom_line() # plot line
