Академический Документы
Профессиональный Документы
Культура Документы
Data Manipulation
Agenda
03 dplyr Package
Data Manipulation
Summarize
Transform Normalize
Aggregate
Apply Family
The apply family consists of vectorized functions which minimize your need to explicitly create loops
tapply() For vectors. Useful when we need to break up a vector into groups
The apply() function is most often used to apply a function to rows or columns (margins) of matrices or
dataframes
apply(m,1,sum)
apply(m,2,sum)
lapply() Function
• It applies a function to each element of the list (a function that we specify) apply(x, FUN, ...)
lapply(my_list,
mean)
sapply() Function
sapply(my_list, mean)
tapply() Function
tapply() splits the array based on specified data, usually at factor levels, and then applies the function to
it
mapply() is a multivariate version of sapply(). It will apply the specified function to the first element of
each argument first, followed by the second element, and so on
mapply(FUN, ...)
mapply(sum, x, b)
The R package dplyr is an extremely useful resource for data cleaning, manipulation and analysis
select() filter()
mutate() Sample_n()
Sample_frac() summarise()
select()
select()
Col 1
Dataset
Col 2
select() Syntax
select(dataframe_name,col_num1,col_num2…..col_num_n)
select(customer_churn,1,3,6)
select() Syntax
select(dataframe_name,col_num1:col_num_n)
select(customer_churn,2:5)
select() Syntax
select(dataframe_name,column_name)
select(customer_churn,Partner)
starts_with() & ends_with()
select(dataframe_name,starts_with(“”)) select(dataframe_name,ends_with(“”))
select(customer_churn,starts_with(“P”)) select(customer_churn,ends_with(“e”))
Extract all columns which start with “P” Extract all columns which end with “e”
filter()
filter()
filter()
Dataset Set 1
filter() Syntax
filter(data-frame_name, condition)
filter(customer_churn,gender==“female”)
Customer_churn dataframe
filter() Syntax
Random Set
1
Dataset
Random Set
2
sample_n()
sample_n() function is used to randomly sample “n” rows from the entire
dataframe
sample_n(dataframe_name, sample_size )
sample_n(customer_churn, 2)
Customer_churn dataframe
sample_frac()
sample_frac(dataframe_name, fraction_value )
sample_frac(customer_churn,
.3)
Customer_churn dataframe
summarise()
summarise()
Max_salary
Dataset
Min_salary
summarise()
summarise(dataframe_name, function )
summarise(customer_churn,mean_tenure=mean(ten
ure))
Customer_churn dataframe
group_by()
group_by()
Male Mean_salary
Dataset
Female Mean_salary
group_by()
summarise(group_by(data.frame_name,column_name),function)
summarise(group_by(customer_churn,InternetService),mean(tenure))
Customer_churn dataframe
Pipe Operator (%>%)
Pipe Operator %>%
Using the pipe operator extract only the first 5 columns and only ‘Male’ customers from the customer_churn dataset
Using the pipe operator get summarized result of mean “MonthlyCharges” of only those records where
“InternetService” is DSL grouped w.r.t “gender”
customer_churn %>%
filter(InternetService=="DSL") %>%
group_by(gender) %>%
summarise(mean_mc=mean(MonthlyCharges))
Pipe Operator Examples
Using the pipe operator select only 1st,2nd, 18th and 19th columns. After that filter out those records
where ‘MonthlyCharges’ are greater than 100 and gender is ‘Male’
customer_churn %>%
select(1,2,18,19) %>%
filter(MonthlyCharges>100 & gender =="Male")
Quiz
Quiz
Which of the following functions is used to add a new column in the data?
a. summary()
b. head()
c. tail()
d. mutate()
Quiz
Which of the following functions is used to add a new column in the data?
Solution:
d. mutate()
Quiz
Solution:
Solution:
Solution:
Solution:
How would you select the 3rd, 4th & 5th columns from the ‘customer_churn’ dataframe?
a. select(customer_churn,(3,4,5))
b. select(customer_churn,3,4,5)
c. select(customer_churn,list(3,4,5))
d. select(3:5,customer_churn)
Quiz
How would you select the 3rd, 4th & 5th columns from the ‘customer_churn’ dataframe?
Solution:
b. select(customer_churn,3,4,5)
Quiz
How would you extract those customers who have subscribed to both ‘StreamingTV’ &
‘StreamingMovies’?
How would you extract those customers who have subscribed to both ‘StreamingTV’ &
‘StreamingMovies’?
Solution:
How would you extract a random sample of 33 records from the ‘customer_churn’ dataframe?
a. sample(customer_churn,33)
b. sample_frac(customer_churn,.33)
c. sample_frac(customer_churn,33)
d. sample_n(customer_churn,33)
Quiz
How would you extract a random sample of 33 records from the ‘customer_churn’ dataframe?
Solution:
d. sample_n(customer_churn,33)
Quiz
How would you get a summarized result for the mean of ‘MonthlyCharges’ grouped w.r.t
‘PaymentMethod’?
a. summarise(group_by(customer_churn,PaymentMethod),mean_MC=mean(
MonthlyCharges))
b. group_by(mean_MC=mean(MonthlyCharges),summarise(customer_churn,
PaymentMethod))
b. group_by(summarise(customer_churn,PaymentMethod),mean_MC=mean(
MonthlyCharges))
c. summarise(group_by(PaymentMethod),mean_MC=mean(customer_churn,
MonthlyCharges))
Quiz
How would you get a summarized result for the mean of ‘MonthlyCharges’ grouped w.r.t
‘PaymentMethod’?
Solution:
a. summarise(group_by(customer_churn,PaymentMethod),mean_MC=mean(Mo
nthlyCharges))
Thank You