Removing variables from V-DEM according to string suffixes

In this blog post, I just want to keep the code that removes the Varieties of Democracy (V-DEM) variables that are not continuous and then runs an exploratory correlation analysis.

Click here to read more about downloading the V-DEM dataset directly into R via the vdemdata package in R

Click here to download the V-DEM dataset from the website instead.

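library(tidyverse)  # loads stringr, dplyr, purrr and tibble, all used below

# Assumes the vdem data frame is already loaded (see the links above)
suffixes <- str_extract(names(vdem), pattern = "_\\w+$")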

The str_extract() function from the stringr package extracts the first match of a regular expression (regex) pattern from each string.

The pattern "_\\w+$" matches an underscore followed by one or more word characters, running to the end of the string. Because \w (written \\w inside an R string) also matches underscores, the match starts at the first underscore in the name, which is why multi-part suffixes such as _ord_codehigh come out whole.

Essentially, this pattern is designed to extract suffixes from the column names that follow an underscore.
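
To see what the pattern picks up, here is a small sketch on a handful of column names. The names are chosen only for illustration; the real vdem data frame has thousands of columns.

```r
library(stringr)

# Illustrative column names, not an exhaustive sample of vdem
example_names <- c("v2x_polyarchy", "v2elmulpar_osp_codehigh", "country_name", "year")

# First match of: underscore + word characters up to the end of the string
example_suffixes <- str_extract(example_names, pattern = "_\\w+$")
print(example_suffixes)
```

A name with no underscore, such as year, returns NA, which is why the NA values are filtered out in the next step. Note also that v2elmulpar_osp_codehigh yields _osp_codehigh rather than just _codehigh.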

suffixes_df <- data.frame(suffixes = suffixes) %>%
filter(!is.na(suffixes))

Above we create a suffixes_df data.frame to store all the strings that follow the _*** pattern.

Next we will find the most common suffixes:

suffixes_df %>%
group_by(suffixes) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
print(n = 30)

# A tibble: 631 × 2
   suffixes      count
   <chr>         <int>
 1 _nr             288
 2 _codehigh       260
 3 _codelow        260
 4 _mean           260
 5 _sd             260
 6 _ord            257
 7 _ord_codehigh   257
 8 _ord_codelow    257
 9 _osp            257
10 _osp_codehigh   257
11 _osp_codelow    257
12 _osp_sd         257
13 _1               30
14 _2               30
15 _3               30
16 _0               29
17 _4               27
18 _5               27
19 _6               26
20 _7               25
21 _8               23
22 _9               20
23 _10              17
24 _11              15
25 _12              15
26 _13              14
27 _14               4
28 _15               4
29 _16               4
30 _17               4

According to the V-Dem codebook, these twelve suffixes mark alternative versions of an indicator (or its uncertainty) rather than separate concepts:

  1. _nr: Number of coders. The number of country experts who coded the indicator for that country-year.
  2. _codehigh: Upper bound of the uncertainty interval around the measurement-model point estimate, roughly one standard deviation above it.
  3. _codelow: Lower bound of the uncertainty interval around the measurement-model point estimate, roughly one standard deviation below it.
  4. _mean: Simple mean of the raw expert ratings for that country-year, before the measurement model is applied.
  5. _sd: Standard deviation associated with the estimate for that country-year, another way of expressing its uncertainty.
  6. _ord: Ordinal version. The measurement-model estimate converted to the most likely category on the original ordinal scale of the question.
  7. _ord_codehigh: Upper bound of the uncertainty interval for the _ord version.
  8. _ord_codelow: Lower bound of the uncertainty interval for the _ord version.
  9. _osp: Original scale version. The measurement-model estimate translated linearly back onto the original scale of the question, as an interval-level value.
  10. _osp_codehigh: Upper bound of the uncertainty interval for the _osp version.
  11. _osp_codelow: Lower bound of the uncertainty interval for the _osp version.
  12. _osp_sd: Standard deviation for the _osp version.

We choose the top twelve suffixes:

string_vector <- c("_nr", "_codehigh", "_codelow", "_mean", "_sd", "_ord",
"_ord_codehigh", "_ord_codelow", "_osp", "_osp_codehigh",
"_osp_codelow", "_osp_sd")

pattern <- paste(string_vector, collapse = "|")
[1] "_nr|_codehigh|_codelow|_mean|_sd|_ord|_ord_codehigh|_ord_codelow|_osp|_osp_codehigh|_osp_codelow|_osp_sd"

And we create a new data frame, vdem_sub, that drops every variable whose name matches any of the above suffixes and keeps only the numeric columns.

vdem_sub <- vdem %>% 
select(-matches(pattern)) %>%
select(where(is.numeric))
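
One thing to be aware of: matches() looks for the pattern anywhere in a column name, so an unanchored alternative like _sd would also drop a column that merely contains _sd in the middle of its name. Whether any V-DEM column is affected is not checked here. The sketch below uses made-up column names to show the difference an end-of-string anchor ($) makes.

```r
library(dplyr)

# Toy data frame; the columns and values are made up for illustration
toy <- tibble(
  v2x_polyarchy    = c(0.1, 0.9),
  v2x_polyarchy_sd = c(0.01, 0.02),
  v2elmulpar_ord   = c(1, 4),
  x_sdg_index      = c(3, 5)   # hypothetical name: "_sd" appears mid-name, not as a suffix
)

string_vector <- c("_sd", "_ord")

# Unanchored: an alternative may match anywhere in the name
loose_pattern <- paste(string_vector, collapse = "|")

# Anchored: an alternative must sit at the very end of the name
strict_pattern <- paste0("(", loose_pattern, ")$")

loose_names  <- toy %>% select(-matches(loose_pattern)) %>% names()
strict_names <- toy %>% select(-matches(strict_pattern)) %>% names()

print(loose_names)
print(strict_names)
```

The unanchored version also drops x_sdg_index, while the anchored version keeps it. If the anchored pattern were used on the real data, the number of remaining columns could differ from the output shown in this post.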

I also want to keep here my code to create correlations:

We will look at the v2xed_ed_ptcon variable.

It is the variable that measures patriotic indoctrination content in education.

The V-DEM question asks to what extent the indoctrination content in education is patriotic.

They argue that patriotism is a key tool that regimes can use to build political support for the broader political community. This v2xed_ed_ptcon index measures the extent of patriotic content in education by focusing on patriotic content in the curriculum as well as the celebration of patriotic symbols in schools more generally.

target_var <- vdem_sub$v2xed_ed_ptcon 

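Next we will run the correlations.

We wrap cor() in a safe_cor function because there could be pairs of variables with no rows where both are non-missing. In that case, cor() with use = "complete.obs" stops with an error.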

That would throw a proverbial spanner in the correlations.

safe_cor <- possibly(function(x, y) cor(x, y, use = "complete.obs"), otherwise = NA_real_)

correlations <- map_dbl(vdem_sub, ~safe_cor(target_var, .x))

The possibly() function from the purrr package creates a new function that wraps around an existing one, but with an important difference: it allows the new function to return a default value when an error occurs, instead of stopping execution and throwing an error message.

It’s useful when we are mapping a function over a list or vector and some of the calls might fail.
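
As a minimal sketch of that behaviour, separate from the V-DEM data, here is possibly() wrapped around log(). The inputs are made up; the second one is a string, which makes log() throw an error.

```r
library(purrr)

# log() throws an error on non-numeric input such as "a";
# the wrapper returns NA_real_ instead of stopping
safe_log <- possibly(function(x) log(x), otherwise = NA_real_)

inputs <- list(1, "a", 10)
results <- map_dbl(inputs, safe_log)
print(results)
```

The failing element becomes NA and the other two are computed as usual, so one bad input does not abort the whole map.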

The map_dbl() function, also from the purrr package, iterates over the columns of the data frame and returns a numeric vector.

The tilde ~ creates an anonymous function within map_dbl().

And safe_cor(target_var, .x), which wraps cor(target_var, .x, use = "complete.obs"), is the function being applied.

Here, cor() is used to calculate the correlation between our target_var and each element of vdem_sub.

The .x is a placeholder that represents each variable of vdem_sub as map_dbl() iterates over it.
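
The same iteration on a toy data frame may make the role of .x easier to see. The data here is invented for the example, and plain cor() is used because there are no missing values.

```r
library(purrr)

# Toy data: column a rises with the target, column b falls
toy <- data.frame(a = c(1, 2, 3), b = c(30, 20, 10))
target <- c(2, 4, 6)

# .x stands for each column of toy in turn
toy_cors <- map_dbl(toy, ~ cor(target, .x))
print(toy_cors)
```

The result is a numeric vector with one element per column, already named after the columns of the data frame.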

The use = "complete.obs" argument passed to cor() specifies that the correlation should be calculated using complete cases only, i.e., pairs of observations where neither is missing (NA).
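
A short base R sketch, with made-up vectors, shows what the argument changes and the one case where it fails outright:

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, NA, 10)

# Default behaviour: any missing value makes the result NA
default_cor <- cor(x, y)

# Complete cases only: uses the first three pairs, where both x and y are observed
complete_cor <- cor(x, y, use = "complete.obs")

# No complete pairs at all: cor() throws an error instead of returning NA
all_missing <- rep(NA_real_, 5)
outcome <- tryCatch(
  cor(x, all_missing, use = "complete.obs"),
  error = function(e) "error"
)

print(default_cor)
print(complete_cor)
print(outcome)
```

The last case is the one safe_cor guards against: a pair of variables that never overlap would otherwise stop the whole run.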

names(correlations) <- names(vdem_sub)

correlations_df <- tibble(
  variable = names(correlations),
  correlation = correlations)

# Drop the target variable itself, which correlates perfectly with itself
correlations_df <- correlations_df %>% filter(variable != "v2xed_ed_ptcon")

correlations_df %>% 
  arrange(desc(correlation)) %>% 
  print(n = 100)
# A tibble: 1,146 × 2
    variable               correlation
    <chr>                        <dbl>
  1 v2xed_ed_con                 1    
  2 v2xed_ed_dmcon               0.987
  3 v3lgbudgup                   0.985
  4 v3lgdomchm                   0.985
  5 v3lglegpup                   0.985
  6 v2edcritical                 0.857
  7 v2edplural                   0.829
  8 v2edideolch_rec              0.801
  9 v3ellocelc                   0.783
 10 v3elreappuc                  0.767
 11 v3lgbudglo                   0.764
 12 v2x_diagacc                  0.753
 13 v2edideolch_4                0.750
 14 v2xca_academ                 0.746
 15 v2x_accountability           0.742
 16 v2cafexch                    0.741
 17 v3lgcomslo                   0.738
 18 v2x_clpol                    0.736
 19 v2x_civlib                   0.736
 20 v2x_cspart                   0.732
 21 v2x_freexp                   0.732
 22 v2xcs_ccsi                   0.732
 23 v2cafres                     0.730
 24 v2x_liberal                  0.729
 25 v2clpolcl                    0.726
 26 v2x_freexp_altinf            0.724
 27 v2csreprss                   0.724
 28 v2cldiscw                    0.722
 29 v2x_libdem                   0.721
 30 v2x_delibdem                 0.721


Without the initial suffix deletion, there are lots and lots of _ord and _codelow et cetera variables.

# A tibble: 4,602 × 2
    variable                     correlation
    <chr>                              <dbl>
  1 v2xed_ed_con                       1    
  2 v2xed_ed_dmcon                     0.987
  3 v3lginsesup_sd                     0.985
  4 v3lgbudgup_codelow                 0.985
  5 v3lgdomchm_ord_codelow             0.985
  6 v3lgdomchm_ord_codehigh            0.985
  7 v3lgdomchm_mean                    0.985
  8 v3lglegpup_codelow                 0.985
  9 v3lglegpup_osp_codelow             0.985
 10 v3lgbudgup                         0.985
 11 v3lgbudgup_sd                      0.985
 12 v3lgbudgup_osp_sd                  0.985
 13 v3lginsesup_osp_sd                 0.985
 14 v3lgdomchm                         0.985
 15 v3lgdomchm_codelow                 0.985
 16 v3lgdomchm_osp_codelow             0.985
 17 v3lgdomchm_osp_codehigh            0.985
 18 v3lglegpup                         0.985
 19 v3lglegpup_osp                     0.985
 20 v3lgbudgup_codehigh                0.985
 21 v3lgdomchm_codehigh                0.985
 22 v3lgdomchm_osp                     0.985
 23 v3lglegpup_codehigh                0.985
 24 v2xed_ed_con_codelow               0.982
 25 v2xed_ed_con_codehigh              0.981
 26 v2xed_ed_dmcon_codelow             0.971
 27 v2xed_ed_dmcon_codehigh            0.968
 28 v2edcritical_osp                   0.866
 29 v2edcritical                       0.861
 30 v2edcritical_codelow               0.860
