How to use the assign() function in R

We can use the assign function to create new variables.

Most often I want to assign variables that I create to the Global Environment.

assign() is particularly useful in loops, simulations, and scenarios involving conditional variable naming or creation.

The basic syntax of the assign function is:

assign(x, value, pos = -1, envir = as.environment(pos), inherits = FALSE)

envir: The environment in which to place the new variable. If not specified, it defaults to the current environment. .GlobalEnv is often used to assign variables in the global environment.
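To see what the envir argument does, here is a minimal sketch that assigns into a separate environment instead of the global one. The names scratch and threshold are made up for this illustration.

```r
# Create a separate environment so the example leaves the global one untouched
scratch <- new.env()

# Create a variable called "threshold" inside scratch
assign("threshold", 0.5, envir = scratch)

# The variable exists in scratch and can be read back by name
exists("threshold", envir = scratch, inherits = FALSE)  # TRUE
get("threshold", envir = scratch)                       # 0.5
```

Passing envir = .GlobalEnv instead of scratch would place the variable in the global environment.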

Generate variables with dynamic names in a loop.

for (i in 1:3) {
  assign(paste("var", i, sep = "_"), i^2)
}


var_3
[1] 9
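Once variables have been created by name, they can also be read back by name. The following is a small sketch using the base R functions get() and mget(); the helper names var_names and values are only for illustration.

```r
# Recreate the variables from the loop above
for (i in 1:3) {
  assign(paste("var", i, sep = "_"), i^2)
}

# Build the same names again and fetch all three values as a named list
var_names <- paste("var", 1:3, sep = "_")
values <- mget(var_names)

# Look up a single variable from its name stored as a string
get("var_2")  # 4

# Flatten the list into a named numeric vector
unlist(values)
```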

Next, we will make a for loop that iterates over each element in the years vector.

The paste0() function concatenates its arguments into a single string without any separator.

Here, it is used to dynamically create variable names by combining the string "sales_" with the current year. For example, if year is 2020, the result would be "sales_2020".

data.frame(month = 1:12, sales = sample(100:200, 12, replace = TRUE)) creates a new data frame for each iteration of the loop. The data frame has two columns: month, which runs from 1 to 12, and sales, a random sample of 12 integers between 100 and 200 drawn with replacement. This simulates monthly sales data.

The assign() function assigns a value to a variable in the R environment. The first argument is the name of the variable (as a string), and the second argument is the value to assign. In this snippet, assign() is used to create a new variable with the name generated by paste0() and assign the newly created data frame to it. This means that after each iteration, a new variable (e.g., sales_2020) is created in the current environment, which is the global environment when the loop runs at the top level, containing the corresponding data frame.

years <- 2018:2022
for (year in years) {
  assign(
    paste0("sales_", year),
    data.frame(month = 1:12, sales = sample(100:200, 12, replace = TRUE))
  )
}

sales_2022
   month sales
1      1   118
2      2   157
3      3   163
4      4   177
5      5   185
6      6   171
7      7   151
8      8   142
9      9   141
10    10   157
11    11   137
12    12   152
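For comparison, the same simulated data can be kept in a single named list rather than in separate variables. This is only a sketch of an alternative, not the approach used in this post; the name sales_list and the seed value are arbitrary choices for the illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

years <- 2018:2022

# Build one data frame per year and collect them in a list
sales_list <- lapply(years, function(year) {
  data.frame(month = 1:12, sales = sample(100:200, 12, replace = TRUE))
})

# Name the list elements the same way the loop above names its variables
names(sales_list) <- paste0("sales_", years)

# Access one year by name
head(sales_list[["sales_2022"]], 3)
```

A list keeps related objects together, whereas assign() is the tool to reach for when each object really does need its own variable.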

Next, we build a data frame with one row for each country-year pair:

set.seed(1111)
years <- 2018:2022
countries <- c("Austria", "Bahamams", "Canada", "Denmark")
data <- expand.grid(year = years, country = countries)
data$value <- runif(n = nrow(data), min = 100, max = 200)
data
   year  country    value
1  2018  Austria 146.5503
2  2019  Austria 141.2925
3  2020  Austria 190.7003
4  2021  Austria 113.7105
5  2022  Austria 173.8817
6  2018 Bahamams 197.6327
7  2019 Bahamams 187.9960
8  2020 Bahamams 111.6784
9  2021 Bahamams 154.6289
10 2022 Bahamams 114.0116
11 2018   Canada 100.1690
12 2019   Canada 174.8958
13 2020   Canada 175.0958
14 2021   Canada 163.3406
15 2022   Canada 186.8168
16 2018  Denmark 115.9363
17 2019  Denmark 191.6828
18 2020  Denmark 155.7007
19 2021  Denmark 190.0419
20 2022  Denmark 176.5887

We can split this data frame into a list with one element per year:

data_list <- split(data, data$year)
data_list
$`2018`
   year  country    value
1  2018  Austria 146.5503
6  2018 Bahamams 197.6327
11 2018   Canada 100.1690
16 2018  Denmark 115.9363

$`2019`
   year  country    value
2  2019  Austria 141.2925
7  2019 Bahamams 187.9960
12 2019   Canada 174.8958
17 2019  Denmark 191.6828

$`2020`
   year  country    value
3  2020  Austria 190.7003
8  2020 Bahamams 111.6784
13 2020   Canada 175.0958
18 2020  Denmark 155.7007

$`2021`
   year  country    value
4  2021  Austria 113.7105
9  2021 Bahamams 154.6289
14 2021   Canada 163.3406
19 2021  Denmark 190.0419

$`2022`
   year  country    value
5  2022  Austria 173.8817
10 2022 Bahamams 114.0116
15 2022   Canada 186.8168
20 2022  Denmark 176.5887

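We store the target environment, here the global environment, in a variable:

env <- .GlobalEnv

Now we can write a function that dynamically creates variables within that environment: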

assign_year_country_dataframes <- function(data, year_col, country_col, env) {
  # Get unique combinations of year and country
  combinations <- unique(data[, c(year_col, country_col)])

  # Iterate over each combination (seq_len() also handles zero rows safely)
  for (i in seq_len(nrow(combinations))) {
    combination <- combinations[i, ]
    year <- combination[[year_col]]
    country <- combination[[country_col]]

    # Subset the data for the current combination
    data_subset <- data[data[[year_col]] == year & data[[country_col]] == country, ]

    # Create a dynamic variable name based on country and year, e.g. Austria2018
    variable_name <- paste0(gsub(" ", "_", country), year)

    # Assign the subset data to a dynamically named variable in the specified environment
    assign(x = variable_name, value = data_subset, envir = env)
  }
}

Now we can run the function and put all the country-year pairs into the global environment:

assign_year_country_dataframes(data = data, year_col = "year", country_col = "country", env = env)
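To check which names this pattern produces, here is a compact, self-contained sketch of the same naming scheme. It writes into a separate environment called scratch rather than the global environment, and the small data set (including the country "Costa Rica", chosen to show the space replacement) is invented for the illustration.

```r
# Stand-in data: one row per country-year pair
data <- expand.grid(
  year = 2018:2019,
  country = c("Austria", "Costa Rica"),
  stringsAsFactors = FALSE
)
data$value <- seq_len(nrow(data))

scratch <- new.env()

# Each row is a unique pair here, so one row becomes one variable
for (i in seq_len(nrow(data))) {
  variable_name <- paste0(gsub(" ", "_", data$country[i]), data$year[i])
  assign(variable_name, data[i, ], envir = scratch)
}

# List the variables that were created
ls(scratch)
# "Austria2018" "Austria2019" "Costa_Rica2018" "Costa_Rica2019"

# Retrieve one of them by name
get("Costa_Rica2019", envir = scratch)
```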

How to run regressions with the tidymodels package in R: PART 1


The tidymodels framework in R is a collection of packages for modeling.

Within tidymodels, the parsnip package is primarily responsible for specifying models in a way that is independent of the underlying modeling engines. The set_engine() function in parsnip allows users to specify which computational engine to use for modeling, enabling the same model specification to be used across different packages and implementations.
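As a rough sketch of how this looks in practice, the code below defines one model type with two different engines and fits the first one. It assumes the parsnip package is installed; the built-in mtcars data, the formula, and the penalty value are arbitrary choices for the illustration.

```r
library(parsnip)

# Same model type, two different engines
lm_spec <- set_engine(linear_reg(), "lm")

# Specification only: fitting this one would also require the glmnet package
glmnet_spec <- set_engine(linear_reg(penalty = 0.1, mixture = 1), "glmnet")

# Fit the lm-based specification on the built-in mtcars data
lm_fit <- fit(lm_spec, mpg ~ wt + hp, data = mtcars)

# Predictions come back as a tibble with a .pred column
predict(lm_fit, new_data = mtcars[1:3, ])
```

The formula and data stay the same whichever engine is chosen; only the set_engine() call changes.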

In this blog series, we will look at some commonly used models and engines within the tidymodels framework:

  1. Linear Regression (linear_reg): The classic linear regression model. The default engine is lm, which calls lm() from the base R stats package.
  2. Logistic Regression (logistic_reg): Used for binary classification problems, with engines like glm for the base R implementation and glmnet for regularized regression.
  3. Random Forest (rand_forest): A popular ensemble method for classification and regression tasks, with engines like ranger and randomForest.
  4. Boosted Trees (boost_tree): Gradient-boosted tree ensembles, with engines such as xgboost (the default) and, through extension packages, lightgbm and catboost.
  5. Decision Trees (decision_tree): A base model for classification and regression, with engines like rpart and C5.0.
  6. K-Nearest Neighbors (nearest_neighbor): A simple yet effective non-parametric method, with the kknn engine.
  7. Principal Component Analysis: For dimensionality reduction. In tidymodels this is a preprocessing step (step_pca() in the recipes package) rather than a parsnip model, and it relies on prcomp() from the stats package.
  8. Lasso and Ridge Regression (linear_reg): For regression with regularization, specifying the penalty and mixture arguments (mixture = 1 for lasso, mixture = 0 for ridge) and using engines like glmnet.
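The engines registered for a given model type can be listed with show_engines() from parsnip. This is a short sketch that assumes the parsnip package is installed; the exact rows returned depend on the installed version and on which extension packages are loaded.

```r
library(parsnip)

# List the engines (and their modes) currently registered for random forests
engines <- show_engines("rand_forest")
engines

# The same call works for other model types, e.g. show_engines("linear_reg")
```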

Some resources I found:

  1. https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels
  2. https://rpubs.com/chenx/tidymodels_tutorial
  3. https://bookdown.org/paul/ai_ml_for_social_scientists/06_01_ml_with_tidymodels.html