WebR in Quarto HTML Documents

Get started with building a model in this R Markdown document, which accompanies the tidymodels Get Started article Preprocess your data with recipes.

If you ever get lost, you can visit the links provided next to section headers to see the accompanying section in the online article.

Introduction

Load necessary packages:

Load and wrangle data:

Before moving forward, let’s reduce the size of our data so we can run these analyses with the default computational resources on RStudio Cloud. By doing so we will avoid aborting our session.

Let’s sample 20% of the rows and use that sample as our data:
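
The sampling step can be sketched as follows. This is a minimal illustration, not the article's own chunk: `toy_flights` and `toy_sample` are made-up names, and the 1,000 generated rows only stand in for the real flight data.

```r
library(dplyr)

# Hypothetical stand-in for the full flight data (1,000 made-up rows)
set.seed(123)
toy_flights <- tibble(
  flight = 1:1000,
  arr_delay = factor(
    sample(c("late", "on_time"), 1000, replace = TRUE, prob = c(0.16, 0.84))
  )
)

# Keep a random 20% of the rows and treat the result as the working data
toy_sample <- slice_sample(toy_flights, prop = 0.2)

nrow(toy_sample)
```

Calling `set.seed()` before sampling makes the subset reproducible from one run to the next.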

Note that because we are using a subset of the original data set, the results you generate here will differ slightly from those in the Preprocess your data with recipes article.

Check the number of delayed flights:

For example, the numbers of late and on_time flights you get here are smaller than the numbers you see in the article. The proportions are very close, though, which suggests that the random sample did not over- or under-represent either category.
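
One way to get both the counts and the proportions is sketched below, assuming a tiny made-up `toy_flights` tibble with the same `arr_delay` outcome column.

```r
library(dplyr)

# Hypothetical data: 2 late flights and 8 on-time flights
toy_flights <- tibble(
  arr_delay = factor(rep(c("late", "on_time"), times = c(2, 8)))
)

# Count each class, then turn the counts into proportions
delay_counts <- toy_flights |>
  count(arr_delay) |>
  mutate(prop = n / sum(n))

delay_counts
```

Running the same pipeline on the full data and on the sample lets you compare the `prop` column directly, which is what matters when the raw counts differ.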

Take a look at data types and data points:

Summarise the dataset:

Data splitting

Create training and test sets:

Try typing ?initial_split in the console to get more details about the splitting function from the rsample package.
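
A minimal sketch of the split, assuming a made-up 100-row `toy_flights` data frame in place of the real data. The seed and the `prop` value are illustrative choices.

```r
library(rsample)

# Hypothetical data: 100 rows with a two-level outcome
toy_flights <- data.frame(
  flight = 1:100,
  arr_delay = factor(rep(c("late", "on_time"), times = c(20, 80)))
)

# Fix the random number stream so the split is reproducible
set.seed(555)

# Put 3/4 of the rows into the training set and the rest into the test set
data_split <- initial_split(toy_flights, prop = 3/4)
train_data <- training(data_split)
test_data  <- testing(data_split)

c(train = nrow(train_data), test = nrow(test_data))
```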

Create recipe and roles

Let’s initiate a new recipe:

You can see more details about how to create recipes by typing ?recipe in the console.
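
As a rough sketch, a recipe starts from a formula and a data set. The four-row `train_data` below is made up for illustration; only the column names follow the tutorial.

```r
library(recipes)

# Hypothetical training data with the outcome and a few predictors
train_data <- data.frame(
  flight = c(1L, 2L, 3L, 4L),
  date = as.Date(c("2013-01-01", "2013-01-02", "2013-07-04", "2013-12-25")),
  distance = c(200, 1400, 760, 2475),
  arr_delay = factor(c("late", "on_time", "on_time", "late"))
)

# arr_delay is the outcome; every other column starts out as a predictor
flights_rec <- recipe(arr_delay ~ ., data = train_data)

flights_rec
```

At this point nothing has been computed from the data beyond recording the variables and their types; preprocessing steps are added afterwards.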

Update the roles of variables in a recipe with update_role():

You can also read more about adding/updating/removing roles with ?roles.

To get the current set of variables and roles, use the summary() function:
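
The two steps above can be sketched together. The data are made up, and the role name `"ID"` is just a label chosen for this example; any string can be used for a custom role.

```r
library(recipes)

# Hypothetical training data with two identifier-like columns
train_data <- data.frame(
  flight = c(1L, 2L, 3L, 4L),
  time_hour = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-02 06:00:00",
      "2013-07-04 07:00:00", "2013-12-25 08:00:00"),
    tz = "UTC"
  ),
  distance = c(200, 1400, 760, 2475),
  arr_delay = factor(c("late", "on_time", "on_time", "late"))
)

# Keep flight and time_hour in the data, but stop treating them as predictors
flights_rec <- recipe(arr_delay ~ ., data = train_data) |>
  update_role(flight, time_hour, new_role = "ID")

# One row per variable, with its type and current role
role_table <- summary(flights_rec)
role_table
```

Variables with a custom role stay available for troubleshooting but are not passed to the model as predictors.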

Create features

What happens if we transform the date column to numeric?
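
A small base R illustration with three made-up dates: converting a `Date` to numeric yields the number of days since 1970-01-01.

```r
# Three illustrative dates from 2013
dates <- as.Date(c("2013-01-01", "2013-01-02", "2013-12-31"))

# Dates are stored as a day count relative to 1970-01-01
date_num <- as.numeric(dates)
date_num
# 15706 15707 16070
```

A model can use such a number as a linear trend, but it hides calendar structure such as weekdays and holidays.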

From date we can derive more meaningful features such as:

the day of the week,
the month, and
whether or not the date corresponds to a holiday.

Add steps to your recipe to generate these features:

Check out help documents for these step functions with ?step_date, ?step_holiday, ?step_rm.
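
The date-derived features can be sketched like this. The data and the three holiday names are assumptions made for a compact example, and `step_holiday()` may need the timeDate package to be installed.

```r
library(recipes)

# Hypothetical training data with a date column
train_data <- data.frame(
  date = as.Date(c("2013-01-01", "2013-03-15", "2013-07-04", "2013-12-25")),
  distance = c(200, 1400, 760, 2475),
  arr_delay = factor(c("late", "on_time", "on_time", "late"))
)

date_rec <- recipe(arr_delay ~ ., data = train_data) |>
  # Factor columns for day of the week and month
  step_date(date, features = c("dow", "month")) |>
  # One 0/1 indicator column per listed holiday
  step_holiday(
    date,
    holidays = c("NewYearsDay", "USIndependenceDay", "ChristmasDay")
  ) |>
  # Drop the original date column once the features exist
  step_rm(date)

# Estimate the recipe on the training data and return the processed result
baked <- bake(prep(date_rec), new_data = NULL)
names(baked)
```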

Create dummy variables using step_dummy():
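
A minimal sketch of dummy encoding on a made-up data frame. The selector `all_nominal_predictors()` picks every factor or character predictor and leaves the outcome alone.

```r
library(recipes)

# Hypothetical training data with one categorical predictor
train_data <- data.frame(
  carrier = factor(c("AA", "DL", "UA", "AA")),
  distance = c(200, 1400, 760, 2475),
  arr_delay = factor(c("late", "on_time", "on_time", "late"))
)

dummy_rec <- recipe(arr_delay ~ ., data = train_data) |>
  step_dummy(all_nominal_predictors())

baked <- bake(prep(dummy_rec), new_data = NULL)
names(baked)
```

The first factor level (`AA` here) serves as the reference level, so it gets no column of its own.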

Check whether any destinations present in the test set are missing from the training set:
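
One way to run this check is an anti-join, sketched here on two tiny made-up tibbles.

```r
library(dplyr)

# Hypothetical destinations in each partition
train_data <- tibble(dest = c("ATL", "ORD", "LAX", "ATL"))
test_data  <- tibble(dest = c("ATL", "LEX", "ORD"))

# Destinations that appear in the test set but never in the training set
unseen_dest <- test_data |>
  distinct(dest) |>
  anti_join(train_data, by = "dest")

unseen_dest
```

A non-empty result means some dummy columns will contain only zeros in the training set.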

Remove variables that contain only a single value with step_zv():
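
The effect of `step_zv()` can be sketched with a made-up factor that has a level (`LEX`) with no rows in the training data.

```r
library(recipes)

# Hypothetical training data: "LEX" is a known level but never occurs
train_data <- data.frame(
  dest = factor(c("ATL", "ORD", "ATL", "ORD"), levels = c("ATL", "LEX", "ORD")),
  distance = c(200, 1400, 760, 2475),
  arr_delay = factor(c("late", "on_time", "on_time", "late"))
)

zv_rec <- recipe(arr_delay ~ ., data = train_data) |>
  # Creates dest_LEX (all zeros) and dest_ORD
  step_dummy(all_nominal_predictors()) |>
  # Drops any predictor with a single unique value in the training set
  step_zv(all_predictors())

baked <- bake(prep(zv_rec), new_data = NULL)
names(baked)
```

Placing `step_zv()` after `step_dummy()` matters, because the constant column only exists once the dummy variables have been created.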

Fit a model with a recipe

Recall the Build a model article.

This time we build a model specification for logistic regression using the glm engine:

For more details try typing ?set_engine and ?glm in the console.

Bundle the model specification (lr_mod) with the recipe (flights_rec) to create a model workflow:
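
The model specification and the bundling step can be sketched as follows. The names `lr_mod`, `flights_rec`, and `flights_wflow` follow the text, while the training data are made up.

```r
library(recipes)
library(parsnip)
library(workflows)

# Hypothetical training data
train_data <- data.frame(
  flight = c(1L, 2L, 3L, 4L),
  distance = c(200, 1400, 760, 2475),
  arr_delay = factor(c("late", "on_time", "on_time", "late"))
)

# Preprocessing: treat flight as an identifier, not a predictor
flights_rec <- recipe(arr_delay ~ ., data = train_data) |>
  update_role(flight, new_role = "ID")

# Model specification: logistic regression fitted with stats::glm()
lr_mod <- logistic_reg() |>
  set_engine("glm")

# Bundle both into a single workflow object (nothing is fitted yet)
flights_wflow <- workflow() |>
  add_model(lr_mod) |>
  add_recipe(flights_rec)

flights_wflow
```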

Prepare the recipe and train the model:

Be patient; this step will take a little time to compute.

Pull the fitted model object, then use broom::tidy() to get a tidy tibble of model coefficients:
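
A sketch of fitting and inspecting the workflow on simulated data. Everything about `toy_flights` is an assumption, including the made-up relationship between `dep_time` and lateness. `extract_fit_parsnip()` is used here as the current way to pull the fitted model out of a workflow.

```r
library(tidymodels)

# Simulated data: later departures are more likely to arrive late
set.seed(42)
n <- 200
toy_flights <- tibble(
  flight = seq_len(n),
  distance = runif(n, 100, 2500),
  dep_time = runif(n, 500, 2300),
  arr_delay = factor(
    ifelse(runif(n) < plogis(-3 + 0.0015 * dep_time), "late", "on_time")
  )
)

flights_rec <- recipe(arr_delay ~ ., data = toy_flights) |>
  update_role(flight, new_role = "ID")

lr_mod <- logistic_reg() |> set_engine("glm")

flights_wflow <- workflow() |>
  add_model(lr_mod) |>
  add_recipe(flights_rec)

# fit() preps the recipe on the data and then trains the model
flights_fit <- fit(flights_wflow, data = toy_flights)

# One row per model term, with estimate, standard error, and p-value
coefs <- flights_fit |>
  extract_fit_parsnip() |>
  tidy()

coefs
```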

Use a trained workflow to predict

Simply apply the fitted workflow to test_data to predict the outcomes.

Get predicted class probabilities and bind them with some variables from the test data:
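
Prediction with a trained workflow can be sketched like this, again on simulated data. The split, the seed, and the data-generating formula are all assumptions made for the example.

```r
library(tidymodels)

# Simulated data standing in for the flight data
set.seed(42)
n <- 200
toy_flights <- tibble(
  flight = seq_len(n),
  distance = runif(n, 100, 2500),
  dep_time = runif(n, 500, 2300),
  arr_delay = factor(
    ifelse(runif(n) < plogis(-3 + 0.0015 * dep_time), "late", "on_time")
  )
)

data_split <- initial_split(toy_flights, prop = 3/4)
train_data <- training(data_split)
test_data  <- testing(data_split)

flights_fit <- workflow() |>
  add_model(logistic_reg() |> set_engine("glm")) |>
  add_recipe(
    recipe(arr_delay ~ ., data = train_data) |>
      update_role(flight, new_role = "ID")
  ) |>
  fit(data = train_data)

# Hard class predictions: a single .pred_class column
class_pred <- predict(flights_fit, test_data)

# Class probabilities, bound to the true outcome and the flight ID
flights_pred <- predict(flights_fit, test_data, type = "prob") |>
  bind_cols(test_data |> select(arr_delay, flight))

flights_pred
```

The workflow applies the trained recipe to `test_data` before predicting, so no separate preprocessing call is needed.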

Note that the results you get here will differ from those in the online article because we fitted the model to only a subset of the full data set.

Let’s look at model performance with a ROC curve: compute it with roc_curve() and plot it by piping the result to autoplot().

Similarly, roc_auc() estimates the area under the curve:
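
Both metrics can be sketched on a made-up prediction table. `flights_pred` here is simulated directly instead of coming from a fitted model: late flights get scores drawn from a distribution with a higher mean.

```r
library(tidymodels)

# Simulated predictions: true class plus predicted probability of "late"
set.seed(1)
flights_pred <- tibble(
  arr_delay = factor(
    rep(c("late", "on_time"), times = c(40, 160)),
    levels = c("late", "on_time")
  ),
  .pred_late = c(rbeta(40, 3, 2), rbeta(160, 2, 3))
)

# Sensitivity and specificity at every probability threshold
roc_df <- flights_pred |>
  roc_curve(truth = arr_delay, .pred_late)

# ggplot2 plot of the curve
roc_plot <- autoplot(roc_df)

# Area under the ROC curve as a one-row tibble
auc <- flights_pred |>
  roc_auc(truth = arr_delay, .pred_late)

auc
```

By default yardstick treats the first factor level as the event of interest, which is why the probability column for `late` is the one supplied.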

Good job!

Now it’s your turn to test out this workflow without this recipe!

In the Build a model article, we did not use a recipe but used a formula instead.

You can use workflows::add_formula(arr_delay ~ .) instead of add_recipe() (remember to remove the identification variables first!), and see whether our recipe improved our model’s ability to predict late arrivals.
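
A possible starting point for this exercise, sketched on simulated data. The identifier column is dropped with `select()` before fitting because a formula with `.` would otherwise use it as a predictor.

```r
library(tidymodels)

# Simulated data standing in for the flight data
set.seed(42)
n <- 200
toy_flights <- tibble(
  flight = seq_len(n),
  distance = runif(n, 100, 2500),
  dep_time = runif(n, 500, 2300),
  arr_delay = factor(
    ifelse(runif(n) < plogis(-3 + 0.0015 * dep_time), "late", "on_time")
  )
)

# Workflow with a formula preprocessor instead of a recipe
formula_wflow <- workflow() |>
  add_model(logistic_reg() |> set_engine("glm")) |>
  add_formula(arr_delay ~ .)

# Remove the identification variable, then fit
formula_fit <- fit(formula_wflow, data = toy_flights |> select(-flight))

formula_terms <- formula_fit |>
  extract_fit_parsnip() |>
  tidy()

formula_terms
```

Comparing `roc_auc()` for this fit against the recipe-based fit on the same test set shows how much the engineered features contribute.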