WebR in Quarto HTML Documents
This R Markdown document accompanies the tidymodels Get Started article "Preprocess your data with recipes". Use it to get started with building a model.
If you ever get lost, you can visit the links provided next to section headers to see the accompanying section in the online article.
Introduction
Load necessary packages:
Load and wrangle data:
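The original chunk is not reproduced in this text. The following is a minimal sketch of the wrangling step, assuming the data come from the nycflights13 package (its flights and weather tables) and that a flight counts as late when it arrives 30 or more minutes behind schedule, as in the accompanying article. The column selection is likewise an assumption.

```r
library(dplyr)
library(nycflights13)  # assumed data source: the flights and weather tables

flight_data <- flights %>%
  mutate(
    # assumed threshold: "late" means arriving 30 or more minutes behind schedule
    arr_delay = factor(ifelse(arr_delay >= 30, "late", "on_time")),
    # keep the calendar date so date-based features can be derived later
    date = lubridate::as_date(time_hour)
  ) %>%
  # attach the hourly weather records for the same airport and hour
  inner_join(weather, by = c("origin", "time_hour")) %>%
  # keep only the columns used in this sketch
  select(dep_time, flight, origin, dest, air_time, distance,
         carrier, date, arr_delay, time_hour) %>%
  # drop rows with missing values
  na.omit() %>%
  # treat qualitative columns as factors for modeling
  mutate(across(where(is.character), as.factor))
```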
Before moving forward, let’s reduce the size of our data so we can run these analyses with the default computational resources on RStudio Cloud. By doing so we will avoid aborting our session.
Let’s sample 20% of the rows and assign it as our data:
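One way to draw such a sample is dplyr's slice_sample(). The sketch below uses a made-up 100-row tibble (toy_flights is hypothetical) and an arbitrary seed, so it illustrates the idea rather than the exact chunk from the original document.

```r
library(dplyr)

set.seed(123)  # arbitrary seed, only so the sample is reproducible

# hypothetical stand-in for the full flight data
toy_flights <- tibble(
  flight_id = 1:100,
  arr_delay = factor(rep(c("late", "on_time"), times = c(16, 84)))
)

# keep a random 20% of the rows
sampled_flights <- toy_flights %>% slice_sample(prop = 0.2)

nrow(sampled_flights)  # 20 of the 100 rows remain
```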
Note that since we are using a subset of the original data set, the results you generate here will differ slightly from those in the Preprocess your data with recipes article.
Check the number of delayed flights:
For example, the counts of late and on_time flights you get here are smaller than those you see in the article. The proportions are very close, though, which suggests that the random sampling did not over- or under-sample one category relative to the other.
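The counts and proportions discussed above can be computed as in the following sketch. It uses a made-up tibble (toy_flights is hypothetical) in place of the real flight data, so the numbers are illustrative only.

```r
library(dplyr)

# hypothetical stand-in for the sampled flight data
toy_flights <- tibble(
  arr_delay = factor(rep(c("late", "on_time"), times = c(16, 84)))
)

# count flights per class and convert the counts to proportions
delay_counts <- toy_flights %>%
  count(arr_delay) %>%
  mutate(prop = n / sum(n))

delay_counts
```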
Take a look at data types and data points:
Summarise the dataset:
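A minimal sketch of both inspection steps on a made-up tibble (toy_flights is hypothetical). Base R's summary() is used here as a stand-in; the original document may use a different summary function.

```r
library(dplyr)

# hypothetical stand-in for the flight data
toy_flights <- tibble(
  dest      = factor(c("ATL", "BOS", "ATL", "ORD")),
  carrier   = factor(c("DL", "B6", "DL", "UA")),
  distance  = c(760, 190, 760, 740),
  arr_delay = factor(c("late", "on_time", "on_time", "on_time"))
)

# column types and the first few values of each column
glimpse(toy_flights)

# per-column summaries: level counts for factors, quantiles for numerics
summary(toy_flights)
```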
Data splitting
Create training and test sets:
Try typing ?initial_split in the console to get more details about the splitting function from the rsample package.
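As a sketch of the splitting step, the code below applies initial_split() to a made-up data frame (toy_flights is hypothetical). The 3/4 proportion and the seed are assumptions, not values taken from this document.

```r
library(rsample)

set.seed(222)  # arbitrary seed for a reproducible split

# hypothetical stand-in for the flight data
toy_flights <- data.frame(
  flight_id = 1:100,
  arr_delay = factor(rep(c("late", "on_time"), times = c(16, 84)))
)

# assumed proportion: three quarters of the rows go to training
data_split <- initial_split(toy_flights, prop = 3/4)

train_data <- training(data_split)
test_data  <- testing(data_split)

c(train = nrow(train_data), test = nrow(test_data))  # 75 and 25 rows
```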
Create recipe and roles
Let’s initiate a new recipe:
You can see more details about how to create recipes by typing ?recipe in the console.
Update the roles of variables in a recipe with update_role():
You can also read more about adding/updating/removing roles with ?roles.
To get the current set of variables and roles, use the summary() function:
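The three steps above (creating a recipe, updating roles, and summarising) can be sketched together on a made-up training set. The data frame toy_train and the choice of flight and time_hour as identification variables are assumptions for illustration.

```r
library(recipes)

# hypothetical stand-in for the training data
toy_train <- data.frame(
  flight    = c(1545L, 1714L, 1141L, 725L),
  time_hour = as.POSIXct("2013-01-01 05:00:00", tz = "UTC") + 3600 * 0:3,
  distance  = c(1400, 1416, 1089, 1576),
  arr_delay = factor(c("on_time", "on_time", "late", "on_time"))
)

# arr_delay is the outcome; every other column starts out as a predictor
flights_rec <- recipe(arr_delay ~ ., data = toy_train) %>%
  # keep flight and time_hour in the data, but not as predictors
  update_role(flight, time_hour, new_role = "ID")

# one row per variable, with its type and current role
rec_summary <- summary(flights_rec)
rec_summary
```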
Create features
What happens if we transform the date column to numeric?
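A quick way to see the answer is to convert a few dates by hand. In R, a Date converted to numeric becomes the number of days since 1970-01-01; the example dates below are arbitrary.

```r
# three arbitrary example dates
example_dates <- as.Date(c("2013-01-01", "2013-01-02", "2013-12-31"))

# numeric value = days elapsed since 1970-01-01
numeric_dates <- as.numeric(example_dates)
numeric_dates  # 15706 15707 16070
```

A model given this column would only see a linear trend over time, which is why derived features are often more useful.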
From date we can derive more meaningful features such as:
- the day of the week,
- the month, and
- whether or not the date corresponds to a holiday.
Add steps to your recipe to generate these features:
Check out the help documents for these step functions with ?step_date, ?step_holiday, and ?step_rm.
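A minimal sketch of these three steps on a made-up data frame (toy_dates is hypothetical). The two holidays named here are an assumption chosen to keep the example small; the original recipe may use a longer list.

```r
library(recipes)

# hypothetical data: New Year's Day, Independence Day, and an ordinary day
toy_dates <- data.frame(
  date      = as.Date(c("2013-01-01", "2013-07-04", "2013-07-05")),
  arr_delay = factor(c("late", "on_time", "on_time"))
)

date_rec <- recipe(arr_delay ~ ., data = toy_dates) %>%
  # day of the week and month as factor columns
  step_date(date, features = c("dow", "month")) %>%
  # one 0/1 indicator column per named holiday
  step_holiday(date, holidays = c("USNewYearsDay", "USIndependenceDay")) %>%
  # drop the raw date once the features are derived
  step_rm(date)

# estimate the recipe and return the processed training data
baked_dates <- date_rec %>% prep() %>% bake(new_data = NULL)
names(baked_dates)
```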
Create dummy variables using step_dummy():
Check whether any destinations in the test set are missing from the training set:
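One way to run this check is an anti-join, which keeps the rows of one table that have no match in another. The sketch below uses two made-up tibbles (toy_train and toy_test are hypothetical).

```r
library(dplyr)

# hypothetical training and test sets
toy_train <- tibble(dest = c("ATL", "BOS", "ORD", "ATL"))
toy_test  <- tibble(dest = c("ATL", "LEX"))

# destinations in the test set with no match in the training set
unseen_dest <- toy_test %>%
  distinct(dest) %>%
  anti_join(toy_train, by = "dest")

unseen_dest  # one row: LEX
```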
Remove variables that contain only a single value with step_zv():
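The sketch below shows why step_zv() is useful after step_dummy(). In the made-up training data (toy_train is hypothetical), the factor dest has a level, LEX, that never occurs, so its dummy column is all zeros and gets removed.

```r
library(recipes)

# hypothetical training data: "LEX" is a known level but has no rows
toy_train <- data.frame(
  dest      = factor(c("ATL", "BOS", "ATL", "BOS"),
                     levels = c("ATL", "BOS", "LEX")),
  distance  = c(760, 190, 760, 190),
  arr_delay = factor(c("late", "on_time", "on_time", "late"))
)

dummy_rec <- recipe(arr_delay ~ ., data = toy_train) %>%
  # factor predictors become 0/1 indicator columns
  step_dummy(all_nominal_predictors()) %>%
  # columns with a single unique value carry no information; drop them
  step_zv(all_predictors())

baked_train <- dummy_rec %>% prep() %>% bake(new_data = NULL)
names(baked_train)  # dest_BOS is kept; the all-zero dest_LEX is dropped
```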
Fit a model with a recipe
Recall the Build a model article.
This time we build a model specification for logistic regression using the glm engine:
For more details, try typing ?set_engine and ?glm in the console.
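A minimal sketch of such a specification, using the name lr_mod that the text refers to. Nothing is fitted at this point; the object only records the model type and the engine.

```r
library(tidymodels)

# logistic regression, to be fitted with stats::glm()
lr_mod <- logistic_reg() %>%
  set_engine("glm")

lr_mod
```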
Bundle the model specification (lr_mod) with the recipe (flights_rec) to create a model workflow:
Prepare the recipe and train the model:
Be patient; this step will take a little time to compute.
Pull the fitted model object, then use the broom::tidy() function to get a tidy tibble of model coefficients:
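The bundling, fitting, and tidying steps can be sketched end to end on simulated data. The tibble toy_train is hypothetical and its outcome is random, so the coefficients mean nothing. The object names follow the text, and extract_fit_parsnip() is used here as one way to pull the fitted model, which may differ from the call in the original document.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed for reproducible simulated data

# hypothetical training data with a random outcome
toy_train <- tibble(
  distance  = runif(200, 100, 2500),
  dep_time  = sample(500:2300, 200, replace = TRUE),
  arr_delay = factor(sample(c("late", "on_time"), 200,
                            replace = TRUE, prob = c(0.2, 0.8)))
)

flights_rec <- recipe(arr_delay ~ ., data = toy_train)
lr_mod      <- logistic_reg() %>% set_engine("glm")

# bundle the preprocessing and the model specification
flights_wflow <- workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(flights_rec)

# fit() prepares the recipe on the data, then trains the model
flights_fit <- flights_wflow %>% fit(data = toy_train)

# one row per coefficient: term, estimate, std.error, statistic, p.value
coefs <- flights_fit %>%
  extract_fit_parsnip() %>%
  tidy()

coefs
```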
Use a trained workflow to predict
Simply apply the fitted model to test_data and predict outcomes.
Get predicted class probabilities and bind them with some variables from the test data:
Note that the results you get here will differ from those in the online article, since we fitted the model to only a subset of the actual data set.
Let’s look at model performance with a ROC curve (roc_curve()) and plot it by piping the result to autoplot().
Similarly, roc_auc() estimates the area under the curve:
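The prediction and evaluation steps in this section can be sketched on simulated data. Everything about the data here is an assumption (the tibble toy_flights, the single predictor, the seed), and a formula is used instead of the full recipe to keep the example short.

```r
library(tidymodels)

set.seed(2)  # arbitrary seed for reproducible simulated data

# hypothetical data: longer flights are more likely to be labelled late
toy_flights <- tibble(
  distance  = runif(300, 100, 2500),
  arr_delay = factor(ifelse(distance + rnorm(300, sd = 800) > 1800,
                            "late", "on_time"))
)

data_split <- initial_split(toy_flights, prop = 3/4)
train_data <- training(data_split)
test_data  <- testing(data_split)

toy_fit <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_formula(arr_delay ~ distance) %>%
  fit(data = train_data)

# hard class predictions for the test set
class_pred <- predict(toy_fit, test_data)

# class probabilities, bound to the true outcome from the test data
flights_pred <- predict(toy_fit, test_data, type = "prob") %>%
  bind_cols(test_data %>% select(arr_delay))

# ROC curve; "late" is the first factor level, so it is treated as the event
roc_plot <- flights_pred %>%
  roc_curve(truth = arr_delay, .pred_late) %>%
  autoplot()

# area under the ROC curve
auc <- flights_pred %>% roc_auc(truth = arr_delay, .pred_late)
auc
```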
Good job!
Now it’s your turn to test out this workflow without this recipe!
In the Build a model article, we did not use a recipe but used a formula instead.
You can use workflows::add_formula(arr_delay ~ .) instead of add_recipe() (remember to remove the identification variables first!), and see whether our recipe improved our model’s ability to predict late arrivals.
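A possible starting point for this exercise, sketched on simulated data. The tibble toy_train is hypothetical, and flight and time_hour are assumed to be the identification variables to drop before fitting.

```r
library(tidymodels)

set.seed(3)  # arbitrary seed for reproducible simulated data

# hypothetical training data, including two identification columns
toy_train <- tibble(
  flight    = sample(1:6000, 200),
  time_hour = as.POSIXct("2013-01-01 05:00:00", tz = "UTC") + 3600 * (0:199),
  distance  = runif(200, 100, 2500),
  arr_delay = factor(sample(c("late", "on_time"), 200,
                            replace = TRUE, prob = c(0.2, 0.8)))
)

# a workflow that uses a formula in place of a recipe
formula_wflow <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_formula(arr_delay ~ .)

# drop the identification variables so the formula does not use them
formula_fit <- formula_wflow %>%
  fit(data = toy_train %>% select(-flight, -time_hour))

# terms that ended up in the model
formula_terms <- formula_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  pull(term)

formula_terms
```

Comparing the roc_auc() of this workflow with that of the recipe-based workflow on the same test set shows whether the engineered features helped.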