How to run linear regression analysis with tidymodels in R for temporal prediction (Tidymodels Series PART 2)

Packages we will need:

library(tidyverse)
library(tidymodels)
library(magrittr)

We will look at Varieties of Democracy dataset

vdem %>% 
select(judic_corruption = v2jucorrdc,
corruption = v2xnp_regcorr,
freedom_kill= v2clkill,
freedom_torture = v2cltort,
freedom_religion = v2clrelig,
ps_theft = v2exthftps,
country_name,
year) -> vdem

We will create two datasets: one for all years EXCEPT 2020 and one for only 2020

vdem %>% 
filter(year > 1989 & year < 2020) -> vdem_1990_2019

vdem %>%
filter(year == 2020) -> vdem_2020

First we build the model. We will look at whether level of public sector theft can predict the judicial corruption levels.

The model will have three parts

linear_reg() : This is the foundational step indicating the type of regression we want to run

set_engine() : This is used to specify which package or system will be used to fit the model, along with any arguments specific to that software. With a linear regression, we don’t really need any special package.

set_mode("regression") : In our regression model, the model predicts continuous outcomes. If we wanted to use a categorical variable, we would choose “classification

linear_spec <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")

If we print out the linear_spec object, it gives us information about the model specifications we had just fed in!

> linear_spec
Linear Regression Model Specification (regression)

Computational engine: lm 

class(linear_spec)
[1] "linear_reg" "model_spec"

Now we create a “recipe” with our model formula and any steps we take to change the variables.

recipe_spec <- recipe(judic_corruption ~ ps_theft, data = vdem_1990_2019) %>%
step_normalize(all_predictors(), -all_outcomes())

Click here to look at all the possible steps we can add to the regression recipe

https://recipes.tidymodels.org/reference

When we print this out, it shows us the stages in our lm “recipe”

── Recipe 

── Inputs 
Number of variables by role
outcome:   1
predictor: 2

── Operations 
• Centering and scaling for: all_predictors(), 

Finally, we are going to feed the recipe and the model specification into the workflow “oven”, if we want to keep the cooking metaphor going.

linear_workflow <- workflow() %>%
add_recipe(recipe_spec) %>%
add_model(linear_spec) %>%
fit(data = vdem_1990_2019)

We can tidy up the results of the linear regression with the tidy() function

linear_workflow %>% tidy()

Printing out a tidy version of the data

linear_workflow %>% tidy()
# A tibble: 31 × 5
   term                   estimate std.error statistic p.value
   <chr>                     <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)            0.0822      0.0584   1.41      0.160
 2 ps_theft                    1.29        0.0105 123.        0    
 3 year-1.57179471038752 -0.0175      0.0825  -0.212     0.832
 4 year-1.45612480773099 -0.0119      0.0824  -0.144     0.885
 5 year-1.34045490507446 -0.0321      0.0823  -0.390     0.697
 6 year-1.22478500241793  0.00679     0.0823   0.0825    0.934
 7 year-1.1091150997614   0.000206    0.0823   0.00250   0.998
 8 year-0.99344519710487 -0.00175     0.0823  -0.0213    0.983
 9 year-0.87777529444834  0.00339     0.0823   0.0412    0.967
10 year-0.76210539179181 -0.00774     0.0822  -0.0942    0.925
# ℹ 21 more rows
# ℹ Use `print(n = ...)` to see more rows

Now we can see does the model from 1990 to 2019 predict the values in 2020?

Natasha Lyonne GIF by Golden Globes - Find & Share on GIPHY
predictions <- predict(linear_workflow, new_data = vdem_2020) %>%
bind_cols(vdem_2020)
> predictions
# A tibble: 179 × 6
    .pred judic_corruption ps_theft country_name   year  diff
    <dbl>            <dbl>    <dbl> <chr>         <int> <dbl>
 1 -0.447           -2.21    -0.693 Pakistan       2020  1.77
 2  1.03            -0.722    1.04  Tunisia        2020  1.76
 3  0.319           -1.38     0.205 Burkina Faso   2020  1.70
 4  0.228           -1.33     0.098 Bolivia        2020  1.56
 5 -0.569           -2.05    -0.837 Comoros        2020  1.48
 6 -1.73            -3.01    -2.19  Azerbaijan     2020  1.28
 7  0.469           -0.781    0.381 Peru           2020  1.25
 8  0.623           -0.578    0.562 Burma/Myanmar  2020  1.20
 9  1.64             0.478    1.76  Benin          2020  1.16
10  2.65             1.52     2.94  Luxembourg     2020  1.13
# ℹ 169 more rows
# ℹ Use `print(n = ...)` to see more rows

We can look at the RMSE, R square and MAE scores

In regression analysis, Root Mean Square Error (RMSE), R-squared (R²), and Mean Absolute Error (MAE) are metrics used to evaluate the performance and accuracy of regression models.

We use the metrics function from the yardstick package.

predictions represents the predicted values generated by your model. These are typically the values your model predicts for the outcome variable based on the input data.

truth is the actual or true values of the outcome variable. It is what you are trying to predict with your model. In the context of your code, judic_corruption is likely the true values of judicial corruption, against which the predictions are being compared.

The estimate argument is optional. It is used when the predictions are stored under a different name or within a different object.

metrics <- yardstick::metrics(predictions, truth = judic_corruption, estimate = .pred)

And here are the three metrics to judge how “good” our predictions are~

> metrics
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.774
2 rsq     standard       0.718
3 mae     standard       0.622

Determining whether RMSE, R-squared (R2), and MAE values are “good” depends on several factors, including the context of your specific problem, the scale of the outcome variable, and the performance of other models in the same domain.

RMSE (Root mean square deviation)

rmse <- sqrt(mean(predictions$diff^2))
  • RMSE measures the average magnitude of the errors between predicted and actual values.
  • Lower RMSE values indicate better model performance, with 0 representing a perfect fit.
  • Lower RMSE values indicate better model performance.
  • It’s common to compare the RMSE of your model to the RMSE of a baseline model or other competing models in the same domain.
  • The interpretation of “good” RMSE depends on the scale of your outcome variable. A small RMSE relative to the range of the outcome variable suggests better predictive accuracy.

R-squared

  • Higher R-squared values indicate a better fit of the model to the data.
  • However, R-squared alone does not indicate whether the model is “good” or “bad” – it should be interpreted in conjunction with other factors.
  • A high R-squared does not necessarily mean that the model makes accurate predictions.
  • This is especially if it is overfitting the data.
  • R-squared represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
  • Values closer to 1 indicate a better fit, with 1 representing a perfect fit.

MAE (Mean Absolute Error):

  • Similar to RMSE, lower MAE values indicate better model performance.
  • MAE is less sensitive to outliers compared to RMSE, which may be advantageous depending on the characteristics of your data.
  • MAE measures the average absolute difference between predicted and actual values.
  • Like RMSE, lower MAE values indicate better model performance, with 0 representing a perfect fit.

And now we can plot out the differences between predicted values and actual values for judicial corruption scores

We can add labels for the countries that have the biggest difference between the predicted values and the actual values – i.e. the countries that our model does not predict well. These countries can be examined in more detail.

First we add a variable that calculates the absolute difference between the actual judicial corruption variable and the value that our model predicted.

Then we use filter to choose the top ten countries

And finally in the geom_repel() layer, we use this data to add the labels to the plot

predictions <- predictions %>%
mutate(difference = abs(judic_corruption - .pred),
rank = rank(-difference))

top_countries <- predictions %>%
filter(rank <= 10)

predictions %>%
ggplot(aes(x = judic_corruption, y = .pred)) +
geom_point(color = "#1d3557", alpha = 0.6, size = 4) +
ggrepel::geom_label_repel(data = top_countries, aes(label = country_name),
nudge_y = 0.15, color = "#14213d", size = 3.5, alpha = 0.7) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed",
color = "#6a040f", size = 3, alpha = 0.75) +
ggtitle("Actual vs. Predicted Judicial Corruption Scores") +
xlab("Actual Scores") +
ylab("Predicted Scores") +
xlim(c(-3, 3)) +
ylim(c(-3, 3)) +
bbplot::bbc_style() +
theme(plot.title = element_text(hjust = 0.5))

7 thoughts on “How to run linear regression analysis with tidymodels in R for temporal prediction (Tidymodels Series PART 2)

  1. Ready to boost your income for free? Dive into our exclusive affiliate opportunities – no upfront costs!

    Unlock Mastery Now!
    Discover secrets from novice to pro with complimentary courses and tools.

    Limited-Time Offer: Earn Instant Commissions!
    Seize the chance for zero-cost affiliate success. Start earning now!

    Claim Your Free Access https://bit.ly/freeaffiliates87

    Like

  2. Hey there! We absolutely love reading people’s blogs and the inspiring content that creators like you share . Your unique voice enriches the engaging online community that we all value . Keep writing and empowering your audience, because your words can make a significant impact on the world. We can’t wait to read what you’ll create next!

    Thanks- Jason http://www.pomeranianpuppies.uk

    Like

  3. Hey there! We regretfully apologize for the comment on your website. We’re committed to improving and learning. Join PassiveIncomePro, our vibrant community dedicated to empowering individuals to unlock the secrets of passive income. Discover premium content, powerful resources, and connect with like-minded individuals on your journey to financial freedom. Join us on this transformative journey together and make a positive impact! Please note that PassiveIncomePro is open to USA residents only.

    Gain financial independence through passive income strategies at http://passiveincomepro.website.

    Like

  4. Hey there! Came across your post on the WordPress feed and couldn’t resist saying hello. I’m already hooked and eagerly looking forward to more captivating posts. Can’t seem to find the follow button, haha! Guess I’ll have to bookmark your blog instead. But rest assured, I’ll be keeping an eye out for your updates!

    may i leave a link to my blog here? feel free to post a comment on my site and leave your link 🙂 helps both our sites !!
    Hope to see your comment soon 🙂
    https://pomeranianpuppies.uk/2023/05/06/the-different-types-of-pomeranian-coats/

    Like

  5. Hey there! We deeply regret for the comment on your website. We’re committed to improving and learning. Join PassiveIncomePro, our vibrant community empowering individuals like you to unlock the secrets of passive income. Discover exclusive content, powerful resources, and connect with like-minded individuals on your journey to financial freedom. Let’s embark on this transformative journey together and make a positive impact! Kindly note that this website is exclusively available to residents of the USA.

    Join the passive income revolution with http://passiveincomepro.website as your ally.

    Like

Leave a comment