Classification modeling example
You have previously prepared a set of Russian tweets for classification. Of the 20,000 tweets, you have filtered to tweets with an account_type of Left or Right, and selected the first 2,000 tweets of each. You have already tokenized the tweets into words, removed stop words, and performed stemming. Furthermore, you converted the word counts into a document-term matrix weighted by TF-IDF values and saved this matrix as left_right_matrix_small.
You will use this matrix to predict whether a tweet was generated by a left-leaning or a right-leaning tweet bot. The labels can be found in the vector left_right_labels.
This exercise is part of the course
Introduction to Natural Language Processing in R
Instructions
- Set the random seed to 1111 for reproducibility.
- Create training and test datasets. Use a 75% sample for the training data.
- Run a random forest model on the training data, using left_right_labels for the response vector.
- Print the random forest results.
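The split described in the instructions can be sketched on a toy matrix using base R only. Here toy_matrix is a hypothetical stand-in for left_right_matrix_small, not the course data; the point is only to show how floor(), sample(), and negative indexing divide the rows 75/25.

```r
# Toy stand-in (an assumption) for left_right_matrix_small:
# 8 "documents" by 3 "terms" of random weights.
set.seed(1111)
toy_matrix <- matrix(runif(24), nrow = 8, ncol = 3)

# 75% of the rows go to training; floor() keeps the size a whole number.
sample_size <- floor(0.75 * nrow(toy_matrix))
train_ind <- sample(nrow(toy_matrix), size = sample_size)

# Positive indices select the training rows;
# negative indices drop them, leaving the test rows.
train <- toy_matrix[train_ind, ]
test <- toy_matrix[-train_ind, ]

nrow(train)  # 6
nrow(test)   # 2
```

Because sample() draws row numbers without replacement by default, no row can end up in both sets.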
Hands-on interactive exercise
Try this exercise by completing this sample code.
library(randomForest)
# Create train/test split
set.___(___)
sample_size <- floor(___ * nrow(left_right_matrix_small))
train_ind <- ___(nrow(left_right_matrix_small), size = ___)
train <- left_right_matrix_small[___, ]
test <- left_right_matrix_small[-___, ]
# Create a random forest classifier
rfc <- randomForest(x = as.data.frame(as.matrix(___)),
                    y = ___[___],
                    ntree = 50)
# Print the results
___
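For reference, one way the completed workflow might look is sketched below on synthetic data. The names fake_matrix and fake_labels are hypothetical stand-ins for left_right_matrix_small and left_right_labels, and the values are random, so the printed error rates say nothing about the real tweets. The sketch assumes the randomForest package is installed and that the labels are stored as a factor, which is what makes randomForest fit a classifier rather than a regression. Note that the tree-count argument of randomForest is spelled ntree, in lowercase.

```r
library(randomForest)

set.seed(1111)

# Synthetic stand-ins (assumptions, not the course data):
# 200 "tweets" by 10 "terms", with 100 Left and 100 Right labels.
n_docs <- 200
fake_matrix <- matrix(runif(n_docs * 10), nrow = n_docs,
                      dimnames = list(NULL, paste0("term", 1:10)))
fake_labels <- factor(rep(c("Left", "Right"), each = n_docs / 2))

# Shift one term for the Right class so there is a signal to learn.
is_right <- fake_labels == "Right"
fake_matrix[is_right, "term1"] <- fake_matrix[is_right, "term1"] + 1

# 75/25 train/test split on row indices.
sample_size <- floor(0.75 * nrow(fake_matrix))
train_ind <- sample(nrow(fake_matrix), size = sample_size)
train <- fake_matrix[train_ind, ]
test <- fake_matrix[-train_ind, ]

# The labels are subset with the same indices as the training rows,
# so each row of x stays aligned with its response in y.
rfc <- randomForest(x = as.data.frame(as.matrix(train)),
                    y = fake_labels[train_ind],
                    ntree = 50)

# Printing shows the out-of-bag error estimate and a confusion matrix.
print(rfc)
```

The as.matrix() call is a no-op on this dense matrix; it is kept to mirror the exercise scaffold, where the document-term matrix may be stored in a sparse format.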