
Automatic Activity Recognition of Weight Lifting Exercises Using Sensor Data
Summary
This analysis aims to build a machine learning model to automatically recognize the activity
type, given data from various wearable sensors (such as accelerometers). The data used
for this analysis come from a publication that performs a similar analysis [1]. Several
classification models are built on the training dataset, and cross-validation is used to
pick the best-performing model. The random forest model appears to perform best for the
given data.
Data Processing
The training and testing datasets provided for this analysis are derived from the dataset
available at [2].
Exploratory Data Analysis
trainDS <- read.csv("pml-training.csv", stringsAsFactors = FALSE)
testDS <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)
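The basic dimensions described below can be verified directly from the loaded data frames; a minimal check:

dim(trainDS)  # 19622 observations of 160 variables
dim(testDS)   # 20 observations of 160 variables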
The training and the testing datasets have 160 variables each. The training dataset has 19622
observations; the testing dataset has 20 observations. In the training dataset, some columns
hold raw measurements such as acceleration, pitch, roll, and yaw from the various
sensor units (belt, forearm, dumbbell, etc.); other variables hold aggregates and
descriptive statistics of the aforementioned raw measurements, such as min, max, avg, stddev,
var, amplitude, skewness, and kurtosis. These aggregate columns take on non-NA values only for a very
small fraction of the total number of observations (2.07%); the non-NA values occur only when
the new_window variable takes on the value yes. It appears these values are computed over a time
window of measurements and are missing for all other observations. These
variables are excluded from the training feature set, since the vast majority of observations
have them missing. The first 7 variables record identification values such as
timestamps, usernames, and other flags; these are also excluded from the
training set. Subsets of the training and testing datasets are then created from the 59 raw
measurement variables, together with the target variable classe for the training dataset and the
variable problem_id for the test dataset. These subsets are stored in trainDS2 and testDS2.
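A construction of these subsets could look roughly like the following. This is a sketch, not the author's exact code: it assumes the aggregate columns can be identified by their name prefixes, and it converts classe to a factor because the classifiers used later expect a factor target.

# Aggregate/descriptive columns, identified here by their name prefixes (assumption)
aggPattern <- "^(min|max|avg|stddev|var|amplitude|skewness|kurtosis)_"
aggCols <- grep(aggPattern, names(trainDS), value = TRUE)

# Drop the aggregate columns and the first 7 identification columns
idCols <- names(trainDS)[1:7]
keepCols <- setdiff(names(trainDS), c(aggCols, idCols))

trainDS2 <- trainDS[, keepCols]
trainDS2$classe <- factor(trainDS2$classe)  # target as factor for classification

# Same raw-measurement columns for the test set, plus problem_id instead of classe
testDS2 <- testDS[, c(setdiff(keepCols, "classe"), "problem_id")]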
Analysis
In order to evaluate the fitness of the candidate models, a 10-fold cross-validation
scheme is used. The following learning methods are compared: Random Forest,
Support Vector Machines, and Gradient Boosting (GBM).
library(randomForest)  # randomForest()
library(e1071)         # svm()
library(gbm)           # gbm(), gbm.perf()

# 10-fold cross-validation: hold out one contiguous chunk of rows per fold
kFolds <- 10
rfAcc <- rep(NA, kFolds)
svmAcc <- rep(NA, kFolds)
gbmAcc <- rep(NA, kFolds)
nC <- floor(nrow(trainDS2)/kFolds)

for (j in 1:kFolds) {
    # Indices of the held-out validation chunk for this fold
    minI <- (j - 1) * nC + 1
    maxI <- j * nC
    cvChunkIndex <- minI:maxI
    cvChunk <- trainDS2[cvChunkIndex, ]
    trainChunk <- trainDS2[-cvChunkIndex, ]

    # Fit the three candidate models on the remaining folds
    rfM <- randomForest(classe ~ ., data = trainChunk)
    svmM <- svm(classe ~ ., data = trainChunk)
    gbmM <- gbm(classe ~ ., data = trainChunk, cv.folds = 3,
                distribution = "multinomial")

    # Use the CV-selected number of boosting iterations for the GBM predictions,
    # then map the per-class scores back to class labels
    best.iter <- gbm.perf(gbmM, method = "cv", plot.it = FALSE)
    gbmP <- predict(gbmM, newdata = cvChunk, n.trees = best.iter)
    gbmP <- levels(cvChunk$classe)[sapply(1:nrow(cvChunk), function(i) {
        which.max(gbmP[i, , 1])
    })]

    # Per-fold accuracy on the held-out chunk
    rfAcc[j] <- sum(predict(rfM, newdata = cvChunk) == cvChunk$classe)/nrow(cvChunk)
    svmAcc[j] <- sum(predict(svmM, newdata = cvChunk) == cvChunk$classe)/nrow(cvChunk)
    gbmAcc[j] <- sum(gbmP == cvChunk$classe)/nrow(cvChunk)
}
The 10-fold cross-validated accuracies are plotted in the following figure.
The average cross-validated accuracies and the average out-of-sample error estimates can be
seen in the following table.
Method   Accuracy (%)   OOS Error
RF       58.80          0.41
SVM      50.91          0.49
GBM      21.60          0.78
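For reference, the summary values in the table could be computed from the per-fold accuracies produced by the loop above; a minimal sketch (variable names follow that loop, and the out-of-sample error is estimated as one minus accuracy):

# Average CV accuracy (in %) and estimated out-of-sample error per method
cvSummary <- data.frame(
    Method   = c("RF", "SVM", "GBM"),
    Accuracy = 100 * c(mean(rfAcc), mean(svmAcc), mean(gbmAcc)),
    OOSError = c(mean(1 - rfAcc), mean(1 - svmAcc), mean(1 - gbmAcc))
)
cvSummary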
Based on these results, it appears that the random forest method is best suited among the
attempted methods for the given data. So a random forest model is built using the entire training
set and is used to predict the activity type for the test set observations.
library(caret)  # train(), trainControl()

# Final model: random forest trained on the full training subset
tControl <- trainControl(method = "cv", number = 3)
rfM1 <- train(classe ~ ., data = trainDS2, trControl = tControl, method = "rf")
rfP1 <- predict(rfM1, newdata = testDS2)
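The prediction table in the Results section below could then be assembled along these lines (a sketch; it assumes testDS2 retains the problem_id column):

# Pair each test-set problem id with its predicted activity class
predictions <- data.frame(ProblemId = testDS2$problem_id, Prediction = rfP1)
predictions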
Results
A 10-fold cross-validation approach is used to pick a model for training and testing the data.
The random forest model is selected based on its higher average accuracy and lower out-of-
sample error. The predictions made for the test dataset, using the model built on the entire
training set, are as follows.
ProblemId Prediction
1 B
2 A
3 B
4 A
5 A
6 E
7 D
8 B
9 A
10 A
11 B
12 C
13 B
14 A
15 E
16 E
17 A
18 B
19 B
20 B
References
[1]. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition
of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with
SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.
[2]. Groupware.les.inf.puc-rio.br (2014). [online] Available at:
http://groupware.les.inf.puc-rio.br/static/WLE/WearableComputing_weight_lifting_exercises_biceps_curl_variations.csv
[Accessed 22 Jun. 2014].
