In this blog post, I just want to keep the code that removes the Varieties of Democracy (V-DEM) variables that are not continuous and then runs an exploratory correlation analysis.
Click here to read more about downloading the V-DEM dataset directly into R via the vdemdata package in R
Click here to download the V-DEM dataset from the website instead.
library(tidyverse)  # loads stringr, dplyr, purrr and tibble, all used below
suffixes <- str_extract(names(vdem), pattern = "_\\w+$")
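The str_extract() function from the stringr package extracts the first match of a regular expression from each string.
The pattern "_\\w+$" matches an underscore followed by one or more word characters up to the end of the string. Because \w also matches underscores, the match starts at the first underscore in a typical column name, so a name with several underscores returns everything from the first one onwards. That is why multi-part suffixes such as _ord_codehigh show up in the counts below.
Essentially, this pattern extracts the part of each column name that follows an underscore. Names without an underscore return NA.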
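To make the behaviour of the pattern concrete, here is a small sketch on made-up column names (the names in toy_names are hypothetical, not taken from the dataset). It also shows an alternative pattern, "_[^_]+$", which is my own addition and keeps only the last underscore-separated piece.

```r
library(stringr)

# Hypothetical column names, chosen only to illustrate the regex
toy_names <- c("v2x_polyarchy_codehigh", "v2elembaut_ord", "year")

# The pattern used in this post: \w also matches "_", so the match
# begins at the first underscore and runs to the end of the name
from_first <- str_extract(toy_names, pattern = "_\\w+$")

# Alternative (not used in this post): only the final "_xyz" piece
last_only <- str_extract(toy_names, pattern = "_[^_]+$")

print(from_first)
print(last_only)
```

Which of the two is more useful depends on whether you want compound suffixes such as _ord_codehigh counted as their own group, as they are in this post.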
suffixes_df <- data.frame(suffixes = suffixes) %>%
  filter(!is.na(suffixes))
Above, we create a suffixes_df data frame that stores every string matching the _*** pattern.
Next, we find the most common suffixes.
suffixes_df %>%
  group_by(suffixes) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  print(n = 30)
# A tibble: 631 × 2
suffixes count
<chr> <int>
1 _nr 288
2 _codehigh 260
3 _codelow 260
4 _mean 260
5 _sd 260
6 _ord 257
7 _ord_codehigh 257
8 _ord_codelow 257
9 _osp 257
10 _osp_codehigh 257
11 _osp_codelow 257
12 _osp_sd 257
13 _1 30
14 _2 30
15 _3 30
16 _0 29
17 _4 27
18 _5 27
19 _6 26
20 _7 25
21 _8 23
22 _9 20
23 _10 17
24 _11 15
25 _12 15
26 _13 14
27 _14 4
28 _15 4
29 _16 4
30 _17 4
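As a side note, the group_by() / summarise() / arrange() chain can probably be shortened with dplyr's count(). The sketch below uses a made-up toy_suffixes tibble, not the real suffixes_df, to check that both versions give the same table.

```r
library(dplyr)

# Made-up suffixes, standing in for suffixes_df
toy_suffixes <- tibble::tibble(
  suffixes = c("_sd", "_nr", "_sd", "_mean", "_nr", "_sd")
)

# The long form used in this post
by_hand <- toy_suffixes %>%
  group_by(suffixes) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# The shortcut: count() groups, tallies and sorts in one call
shortcut <- toy_suffixes %>%
  count(suffixes, sort = TRUE, name = "count")

print(by_hand)
print(shortcut)
```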
As I understand the V-DEM codebook, these suffixes mark different versions of the same underlying indicator (check the codebook for the dataset version you downloaded):
- _nr: Number of coders. The number of country experts who rated the indicator for a given country-year.
- _codehigh: Upper bound of the uncertainty interval around the measurement model's point estimate (roughly one standard deviation above it).
- _codelow: Lower bound of the same uncertainty interval (roughly one standard deviation below the point estimate).
- _mean: The simple mean of the raw expert ratings for a country-year, before the measurement model is applied.
- _sd: The standard deviation of the measurement model's estimate for a country-year, i.e. a measure of its uncertainty.
- _ord: Ordinal version. The model estimate translated back into the original ordinal answer categories of the survey question.
- _ord_codehigh: Upper uncertainty bound for the _ord version.
- _ord_codelow: Lower uncertainty bound for the _ord version.
- _osp: Original scale version. The model estimate converted back to the original scale of the survey question, but kept as a continuous value.
- _osp_codehigh: Upper uncertainty bound for the _osp version.
- _osp_codelow: Lower uncertainty bound for the _osp version.
- _osp_sd: Standard deviation for the _osp version.
We keep the twelve most common suffixes, which are the ones described above, and collapse them into a single regex pattern:
string_vector <- c("_nr", "_codehigh", "_codelow", "_mean", "_sd", "_ord",
                   "_ord_codehigh", "_ord_codelow", "_osp", "_osp_codehigh",
                   "_osp_codelow", "_osp_sd")
pattern <- paste(string_vector, collapse = "|")
[1] "_nr|_codehigh|_codelow|_mean|_sd|_ord|_ord_codehigh|_ord_codelow|_osp|_osp_codehigh|_osp_codelow|_osp_sd"
And we create a new data frame, vdem_sub, that drops the variables matching any of the above suffixes and keeps only the numeric columns.
vdem_sub <- vdem %>%
  select(-matches(pattern)) %>%
  select(where(is.numeric))
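One caveat that may be worth checking: matches() treats the pattern as a regular expression that can match anywhere in a column name, not only at the end. The sketch below uses a toy data frame with made-up column names (v2x_sdfoo is hypothetical) to show the difference between the unanchored pattern and a version anchored with $. I have not checked whether any real V-DEM column is affected.

```r
library(dplyr)

# Toy data frame; all column names are made up for illustration
toy <- tibble::tibble(
  v2x_demo          = c(0.1, 0.5),
  v2x_demo_codehigh = c(0.2, 0.6),
  v2x_sdfoo         = c(1, 2),        # contains "_sd", but not as a suffix
  v2x_demo_sd       = c(0.05, 0.04),
  country           = c("A", "B")
)

suffixes_to_drop <- c("_codehigh", "_sd")

# Same construction as in this post: matches anywhere in the name
unanchored <- paste(suffixes_to_drop, collapse = "|")

# Assumed alternative: only match when the suffix ends the name
anchored <- paste0("(", unanchored, ")$")

kept_unanchored <- names(select(toy, -matches(unanchored)))

kept_anchored <- toy %>%
  select(-matches(anchored)) %>%
  select(where(is.numeric)) %>%
  names()

print(kept_unanchored)
print(kept_anchored)
```

With the unanchored pattern the hypothetical v2x_sdfoo column is dropped along with the real suffix variants; with the anchored one it survives.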
I also want to keep my code for creating correlations here.
We will look at the v2xed_ed_ptcon variable.
It is the variable that measures patriotic indoctrination content in education.
The V-DEM question asks: to what extent is the indoctrination content in education patriotic?
The V-DEM authors argue that patriotism is a key tool that regimes can use to build political support for the broader political community. The v2xed_ed_ptcon index measures the extent of patriotic content in education by focusing on patriotic content in the curriculum as well as the celebration of patriotic symbols in schools more generally.
target_var <- vdem_sub$v2xed_ed_ptcon
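Next we will run the correlations.
We wrap cor() in a safe_cor function because some pairs of variables may have no rows where both are observed. With use = "complete.obs", cor() then throws an error instead of returning NA.
That would throw a proverbial spanner in the correlations.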
safe_cor <- possibly(function(x, y) cor(x, y, use = "complete.obs"), otherwise = NA_real_)
correlations <- map_dbl(vdem_sub, ~safe_cor(target_var, .x))
The possibly() function from the purrr package creates a new function that wraps around an existing one, but with an important difference: it allows the new function to return a default value when an error occurs, instead of stopping execution and throwing an error message.
It’s useful when we are mapping a function over a list or vector and some of the calls might fail.
The map_dbl() function, also from the purrr package, iterates over the columns of the data frame and returns a numeric vector with one value per column.
The tilde ~ creates an anonymous function within map_dbl().
Inside safe_cor, cor(x, y, use = "complete.obs") is the function being applied.
Here, cor() calculates the correlation between our target_var and each column of vdem_sub.
The .x is a placeholder that represents each variable of vdem_sub as map_dbl() iterates over it.
By passing use = "complete.obs" to cor(), we specify that the correlation should be calculated using complete cases only, i.e., pairs of observations where neither value is missing (NA).
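Here is a minimal sketch of the same possibly() and map_dbl() combination on a toy data frame, so the fallback is easy to see. The toy columns (a, b, all_missing) are made up; the all_missing column has no complete pairs with the target, which makes cor() fail and safe_cor return NA instead.

```r
library(purrr)

# Same wrapper as in this post
safe_cor <- possibly(
  function(x, y) cor(x, y, use = "complete.obs"),
  otherwise = NA_real_
)

# Made-up data: one column has no observed values at all
toy <- data.frame(
  a           = c(1, 2, 3, 4),
  b           = c(2, 4, 6, 9),
  all_missing = c(NA_real_, NA_real_, NA_real_, NA_real_)
)

toy_target <- toy$a

# One correlation per column; failures become NA instead of stopping the loop
res <- map_dbl(toy, ~ safe_cor(toy_target, .x))
print(res)
```

Without possibly(), the call for all_missing would stop the whole map_dbl() run with an error.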
names(correlations) <- names(vdem_sub)
correlations_df <- tibble(
  variable = names(correlations),
  correlation = correlations)
# Drop the target variable's correlation with itself
correlations_df <- correlations_df %>% filter(variable != "v2xed_ed_ptcon")
correlations_df %>%
  arrange(desc(correlation)) %>%
  print(n = 100)
# A tibble: 1,146 × 2
variable correlation
<chr> <dbl>
1 v2xed_ed_con 1
2 v2xed_ed_dmcon 0.987
3 v3lgbudgup 0.985
4 v3lgdomchm 0.985
5 v3lglegpup 0.985
6 v2edcritical 0.857
7 v2edplural 0.829
8 v2edideolch_rec 0.801
9 v3ellocelc 0.783
10 v3elreappuc 0.767
11 v3lgbudglo 0.764
12 v2x_diagacc 0.753
13 v2edideolch_4 0.750
14 v2xca_academ 0.746
15 v2x_accountability 0.742
16 v2cafexch 0.741
17 v3lgcomslo 0.738
18 v2x_clpol 0.736
19 v2x_civlib 0.736
20 v2x_cspart 0.732
21 v2x_freexp 0.732
22 v2xcs_ccsi 0.732
23 v2cafres 0.730
24 v2x_liberal 0.729
25 v2clpolcl 0.726
26 v2x_freexp_altinf 0.724
27 v2csreprss 0.724
28 v2cldiscw 0.722
29 v2x_libdem 0.721
30 v2x_delibdem 0.721
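Two things may be worth checking in a ranking like this one. First, arrange(desc(correlation)) pushes strong negative correlations to the bottom of the table. Second, a correlation computed from only a handful of overlapping rows can look very high by construction. The sketch below, on a made-up toy tibble, records the number of complete pairs next to each correlation and sorts by absolute value. I have not verified whether either point explains any of the values above.

```r
library(dplyr)
library(purrr)
library(tibble)

# Made-up data: "sparse" overlaps with the target in only two rows
toy <- tibble(
  target = c(1, 2, 3, 4, 5, 6),
  pos    = c(2, 4, 5, 8, 10, 13),
  neg    = c(12, 10, 8, 6, 4, 2),
  sparse = c(NA, NA, NA, NA, 1, 2)
)

# For each non-target column: count complete pairs and correlate on them
cor_summary <- bind_rows(map(
  setdiff(names(toy), "target"),
  function(v) {
    ok <- complete.cases(toy$target, toy[[v]])
    tibble(
      variable    = v,
      n_pairs     = sum(ok),
      correlation = if (sum(ok) > 1) cor(toy$target[ok], toy[[v]][ok]) else NA_real_
    )
  }
))

# Rank by strength rather than by sign
ranked <- cor_summary %>% arrange(desc(abs(correlation)))
print(ranked)
```

In the toy data, the sparse column correlates perfectly with the target simply because any two points lie on a line, which is the kind of result an n_pairs column makes easy to spot.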
Without the initial suffix deletion, there are lots and lots of _ord, _codelow, et cetera variables.
# A tibble: 4,602 × 2
variable correlation
<chr> <dbl>
1 v2xed_ed_con 1
2 v2xed_ed_dmcon 0.987
3 v3lginsesup_sd 0.985
4 v3lgbudgup_codelow 0.985
5 v3lgdomchm_ord_codelow 0.985
6 v3lgdomchm_ord_codehigh 0.985
7 v3lgdomchm_mean 0.985
8 v3lglegpup_codelow 0.985
9 v3lglegpup_osp_codelow 0.985
10 v3lgbudgup 0.985
11 v3lgbudgup_sd 0.985
12 v3lgbudgup_osp_sd 0.985
13 v3lginsesup_osp_sd 0.985
14 v3lgdomchm 0.985
15 v3lgdomchm_codelow 0.985
16 v3lgdomchm_osp_codelow 0.985
17 v3lgdomchm_osp_codehigh 0.985
18 v3lglegpup 0.985
19 v3lglegpup_osp 0.985
20 v3lgbudgup_codehigh 0.985
21 v3lgdomchm_codehigh 0.985
22 v3lgdomchm_osp 0.985
23 v3lglegpup_codehigh 0.985
24 v2xed_ed_con_codelow 0.982
25 v2xed_ed_con_codehigh 0.981
26 v2xed_ed_dmcon_codelow 0.971
27 v2xed_ed_dmcon_codehigh 0.968
28 v2edcritical_osp 0.866
29 v2edcritical 0.861
30 v2edcritical_codelow 0.860

