Compare clusters with the dendextend package in R

Packages we need

install.packages("dendextend")
library(dendextend)

This blog post creates a dendrogram to examine whether Asian countries cluster together when it comes to the extent of judicial compliance. I’m examining Asian countries with populations over 1 million, and the data come from the year 2019.

Judicial compliance measures how often a government complies with important decisions by courts with which it disagrees.

Higher scores indicate that the government often or always complies, even when they are unhappy with the decision. Lower scores indicate the government rarely or never complies with decisions that it doesn’t like.

It is important to make sure there are no NA values, so I will impute any missing values.

Click here to read how to impute missing values in your dataset.

library(mice)
imputed_data <- mice(asia_df, method="cart")
asia_df <- complete(imputed_data)

Next we can scale the dataset. This step matters when you are clustering on more than one variable and the variables are not measured in equivalent units, because the distance between observations depends on the scale on which each variable is measured.

Therefore, it’s good to scale all variables to a common unit before measuring any inter-observation dissimilarities.

asia_scale <- scale(asia_df)
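
To see what scale() does, here is a minimal sketch on a made-up data frame (toy_df, its column names and its values are hypothetical, not taken from the judicial compliance data). After scaling, each column should have a mean of 0 and a standard deviation of 1, so no single variable dominates the distance calculation.

```r
# Hypothetical toy data: two variables measured on very different scales
toy_df <- data.frame(
  compliance = c(0.5, 1.2, 2.8, 3.1),
  gdp        = c(1200, 45000, 9800, 31000),
  row.names  = c("A", "B", "C", "D")
)

# Centre each column on its mean, then divide by its standard deviation
toy_scale <- scale(toy_df)

round(colMeans(toy_scale), 10)   # column means after scaling
apply(toy_scale, 2, sd)          # column standard deviations after scaling
```

Note that scale() expects numeric columns only, so identifiers such as country names are kept as row names here rather than as a column.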

Next we calculate the distance between the countries (i.e. different rows) on the variables of interest and create a dist object.

There are many different methods you can use to calculate the distances. Click here for a description of the main formulae you can use to calculate distances. In the linked article, they provide a helpful table to summarise all the common methods such as the “euclidean”, “manhattan” or “canberra” formulae.

I will go with the “euclidean” method, but make sure your method suits the data type (binary, continuous, categorical etc.).

asia_judicial_dist <- dist(asia_scale, method = "euclidean")
class(asia_judicial_dist)
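
As a small illustration of why the choice of method matters, this sketch computes two kinds of distance for the same made-up points (toy_mat is hypothetical). The euclidean distance is the straight-line distance, while the manhattan distance sums the absolute differences on each variable.

```r
# Hypothetical toy matrix: three observations measured on two variables
toy_mat <- matrix(c(0, 0,
                    3, 4,
                    6, 8),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("A", "B", "C"), c("x", "y")))

toy_euclid    <- dist(toy_mat, method = "euclidean")
toy_manhattan <- dist(toy_mat, method = "manhattan")

toy_euclid      # A to B is sqrt(3^2 + 4^2) = 5
toy_manhattan   # A to B is |3| + |4| = 7
```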

We now have a dist object we can feed into the hclust() function.

With this function, we will need to make another decision regarding the method we will use.

The possible methods we can use are "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).

Click here for a more in-depth discussion of the different algorithms that you can use.

Again I will choose a common method, "ward.D2". At each stage, it merges the two clusters whose union gives the smallest increase in the combined error sum of squares.

asia_judicial_hclust <- hclust(asia_judicial_dist, method = "ward.D2")
class(asia_judicial_hclust)
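
Once you have an hclust object, one way to read off which observations fall into which group is the cutree() function from base R. This is a sketch on made-up data (toy_mat and the choice of k = 2 are assumptions for illustration), not the judicial compliance results.

```r
# Hypothetical toy data: two tight pairs of points that sit far apart
toy_mat <- matrix(c(1.0, 1.1,
                    1.2, 0.9,
                    8.0, 8.2,
                    8.1, 7.9),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("A", "B", "C", "D"), c("x", "y")))

toy_hclust <- hclust(dist(toy_mat, method = "euclidean"), method = "ward.D2")

# Cut the tree into k = 2 groups; the result is a named vector of memberships
toy_groups <- cutree(toy_hclust, k = 2)
toy_groups
```

With the real data, the same call on asia_judicial_hclust would give the cluster number for each country, which is handy when you want to compare the groups against other variables.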

We next convert our hclust object into a dendrogram object so we can plot it and visualise the different clusters of judicial compliance.

asia_judicial_dend <- as.dendrogram(asia_judicial_hclust)
class(asia_judicial_dend)

When we plot the different clusters, there are many options to change the color, size and dimensions of the dendrogram. To do this we use the set() function.

Click here to see a very comprehensive list of all the set() attributes you can use to modify your dendrogram from the dendextend package.

asia_judicial_dend %>%
set("branches_k_color", k=5) %>%    # five clustered groups of different colors
set("branches_lwd", 2) %>%          # size of the lines (thick or thin)
set("labels_colors", k=5) %>%       # color the country labels, also five groups
plot(horiz = TRUE)                  # plot the dendrogram horizontally

I choose to divide the countries into five clusters by color:

And if I zoom in on the ends of the branches, we can examine the groups.

The top branches appear to be less democratic countries. We can see that North Korea is its own cluster with no other countries sharing similar judicial compliance scores.

The bottom branches appear to be more democratic with more judicial independence. However, once we have our final dendrogram, it is our job to research and investigate the characteristics that the countries in each cluster share regarding the role of the judiciary and its relationship with executive compliance.

Singapore, even though it is not a democratic country in the way that Japan is, shows a highly similar level of respect by the executive for judicial decisions.

Also, South Korean executive compliance with the judiciary appears to be more similar to that of India and Sri Lanka than to that of Japan and Singapore.

So we can see that dendrograms are helpful for exploratory research and show us a starting place to begin grouping different countries together regarding a concept.

A really quick way to complete all the steps in one go is the following code. However, as written it relies on the default methods for the dist() and hclust() functions ("euclidean" and "complete" respectively), so the tree can differ from the "ward.D2" one above. If you want to fine-tune your methods to suit your data, this quicker option may be too blunt.

asia_df %>%
scale %>%
dist %>%
hclust %>%
as.dendrogram %>%
set("branches_k_color", k=5) %>%
set("branches_lwd", 2) %>%
set("labels_colors", k=5) %>%
plot(horiz = TRUE)
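
If the defaults do not suit your data, the pipe can still carry explicit arguments, because %>% passes the left-hand side in as the first argument of each call. The sketch below is an illustration rather than the analysis above: it uses the first ten rows of R's built-in USArrests dataset as stand-in data, and the name toy_dend and the choice of k = 3 are assumptions.

```r
library(dendextend)   # as in this post, %>% is available once dendextend is loaded

# Stand-in data: the first ten rows of R's built-in USArrests dataset
toy_dend <- USArrests[1:10, ] %>%
  scale() %>%
  dist(method = "euclidean") %>%     # distance method made explicit
  hclust(method = "ward.D2") %>%     # linkage method made explicit
  as.dendrogram() %>%
  set("branches_k_color", k = 3) %>% # three coloured groups of branches
  set("branches_lwd", 2) %>%
  set("labels_colors", k = 3)        # colour the labels by the same groups

plot(toy_dend, horiz = TRUE)
```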