Comparative Data Visualisation in R: an Example of Building a Data Frame from the Ground Up

When we do data analysis, we often work with existing datasets from an official database or survey. R makes this easy because it can import a wide range of data formats, including Excel, SPSS and Stata files. On most occasions we do not need to create a data frame ourselves. Even if we do, the simplest way is probably to fill in an Excel sheet and then import it into RStudio.

However, sometimes we do want to stay in R and create our own small data frames to generate some simple statistical graphs. R has some very convenient functions for this. What is more, RStudio is no longer necessary, since we can run R code on a free web-based platform: Nextjournal.

In this post, I am going to give a very simple and detailed example of how to build a data frame and visualise it on the web-based platform. I am not using RStudio, but the same steps work in RStudio as well. This way, you do not need to carry your own computer with RStudio (and numerous necessary packages) everywhere. Simply open Chrome on any computer, create an account on the Nextjournal platform, add a new notebook from the R template, and the platform is ready for you.

The example here compares the Infant Mortality Rate (IMR) of different population groups. It is a very simple piece of population health research: I want to compare the IMR of migrants in Shanghai with that of local residents of Shanghai. To put the numbers in context, I also include the average IMR of China and the UK as a reference. Since the data come from different databases and the dataset is very small, I will create the data frame myself.

First of all, we can test whether the web-based platform is ready for R code. So I run this line:

print("This is an example")

The result shows up like this:

[1] "This is an example"

The platform also reports that the code took 0.8s to run. More importantly, it really does run in your browser!
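Beyond printing a string, a slightly more informative readiness check might look like the sketch below. It uses only base R. Whether tidyverse comes preinstalled on the platform is an assumption I have not verified, so the check simply reports what it finds.

```r
# Report which R version the notebook (or RStudio session) is running
cat("Running", R.version.string, "\n")

# Check whether tidyverse is installed without attaching it;
# requireNamespace() returns TRUE or FALSE instead of raising an error
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  cat("tidyverse is available\n")
} else {
  cat("tidyverse is missing; install.packages(\"tidyverse\") would add it\n")
}
```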

Let's treat it as your RStudio, since all the code used here also works in the real RStudio. The goal is to build a small data frame. For my research, I collected annual IMR data from 2009 to 2015 for four categories: Shanghai Migrants, Shanghai Local Residents, China and the UK. The unit is the number of deaths of infants under one year old per 1,000 live births.

The first step is to create the vectors, after loading the tidyverse package:

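library(tidyverse)
# IMR per 1,000 live births, one value per year from 2009 to 2015
ri1 <- c(2.89, 3.12, 2.92, 2.72, 2.81, 2.90, 2.46)  # Shanghai local residents
ri2 <- c(7.92, 7.47, 7.39, 6.79, 7.70, 6.58, 6.58)  # Shanghai migrants
ci1 <- c(14.6, 13.6, 12.6, 11.6, 10.8, 10, 9.2)     # China
ui1 <- c(4.6, 4.4, 4.2, 4.1, 3.9, 3.8, 3.8)         # UK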
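Because the numbers are typed in by hand, it can be worth confirming that every vector has one value per year before combining them. The following is a minimal sketch in base R; the names years, series and lens are my own helpers for illustration and are not part of the original notebook.

```r
# Vectors as entered in the post (IMR per 1,000 live births, 2009 to 2015)
ri1 <- c(2.89, 3.12, 2.92, 2.72, 2.81, 2.90, 2.46)
ri2 <- c(7.92, 7.47, 7.39, 6.79, 7.70, 6.58, 6.58)
ci1 <- c(14.6, 13.6, 12.6, 11.6, 10.8, 10, 9.2)
ui1 <- c(4.6, 4.4, 4.2, 4.1, 3.9, 3.8, 3.8)

# Hypothetical helpers: the year range and a named list of the four series
years  <- 2009:2015
series <- list(ri1 = ri1, ri2 = ri2, ci1 = ci1, ui1 = ui1)

# lengths() returns the length of each list element
lens <- lengths(series)
print(lens)

# Stop with an error if any series does not have one value per year
stopifnot(all(lens == length(years)))
```

A mistyped or missing value is then caught here rather than when the data frame is built.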

The second step is to build a data frame that includes all the vectors:

dfimr <- data.frame(year = 2009:2015, Shanghai_Local_IMR = ri1, Shanghai_Migrants_IMR = ri2, China_IMR = ci1, UK_IMR = ui1)

dfimr

  year Shanghai_Local_IMR Shanghai_Migrants_IMR China_IMR UK_IMR
1 2009               2.89                  7.92      14.6    4.6
2 2010               3.12                  7.47      13.6    4.4
3 2011               2.92                  7.39      12.6    4.2
4 2012               2.72                  6.79      11.6    4.1
5 2013               2.81                  7.70      10.8    3.9
6 2014               2.90                  6.58      10.0    3.8
7 2015               2.46                  6.58       9.2    3.8

Notice that the year column is created directly inside the data frame with 2009:2015, because a regular sequence like this is much easier to generate than to type out. Each vector is also given a descriptive column name so that the columns are easy to understand. The small data frame, named dfimr, is now ready to use.
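Before plotting, a quick look at the structure confirms that the columns have the expected names and types. This is a small sketch using only base R functions; it rebuilds the same data frame inline so that it runs on its own.

```r
# Rebuild the data frame from the post so this snippet is self-contained
dfimr <- data.frame(
  year                  = 2009:2015,
  Shanghai_Local_IMR    = c(2.89, 3.12, 2.92, 2.72, 2.81, 2.90, 2.46),
  Shanghai_Migrants_IMR = c(7.92, 7.47, 7.39, 6.79, 7.70, 6.58, 6.58),
  China_IMR             = c(14.6, 13.6, 12.6, 11.6, 10.8, 10, 9.2),
  UK_IMR                = c(4.6, 4.4, 4.2, 4.1, 3.9, 3.8, 3.8)
)

# str() lists each column with its type and first values
str(dfimr)

# dim() gives rows and columns; names() gives the column names
dim(dfimr)
names(dfimr)
```

The year column should come out as integer and the four IMR columns as numeric.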

Since the data frame is small, we can draw a simple comparative graph to see how the four series look together. Here the gather function from the tidyr package (loaded with tidyverse) reshapes the data into long format before it is passed to ggplot2:

library(ggplot2)

dfimr %>%
  gather(key, value, Shanghai_Local_IMR, Shanghai_Migrants_IMR, China_IMR, UK_IMR) %>%
  ggplot(aes(x = year, y = value, colour = key, linetype = key, shape = key)) +
  geom_line() +
  geom_point() +
  geom_text(aes(label = value), vjust = 1.1) +
  scale_x_continuous(breaks = dfimr$year) +
  labs(x = "Year", y = "Infant Mortality per 1000 live births", title = "Infant Mortality Rate of Shanghai migrants, Shanghai locals, China and UK (2009-2015)")

The code then generates the graph.

The graph shows all the details from the data frame, thanks to the pipe operator %>% and the gather function from tidyr, both loaded with the tidyverse package. The linetype and shape aesthetics make the differences between the series more visible, and vjust leaves space for the numbers to be displayed. The breaks argument of scale_x_continuous ensures that every year is displayed on the axis, and the labs function lets you set the axis labels and the title to whatever you want.
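To see what the reshaping step does on its own, the sketch below runs gather outside the plotting pipeline and compares it with pivot_longer, which the tidyr documentation recommends for new code now that gather is superseded. This is an alternative for illustration, not what the notebook above uses; the names long_gather and long_pivot are my own.

```r
library(tidyr)

# Rebuild the data frame from the post so this snippet is self-contained
dfimr <- data.frame(
  year                  = 2009:2015,
  Shanghai_Local_IMR    = c(2.89, 3.12, 2.92, 2.72, 2.81, 2.90, 2.46),
  Shanghai_Migrants_IMR = c(7.92, 7.47, 7.39, 6.79, 7.70, 6.58, 6.58),
  China_IMR             = c(14.6, 13.6, 12.6, 11.6, 10.8, 10, 9.2),
  UK_IMR                = c(4.6, 4.4, 4.2, 4.1, 3.9, 3.8, 3.8)
)

# The reshaping used in the post: four IMR columns become key/value pairs
long_gather <- gather(dfimr, key, value,
                      Shanghai_Local_IMR, Shanghai_Migrants_IMR,
                      China_IMR, UK_IMR)

# The same reshaping with pivot_longer: every column except year is stacked
long_pivot <- pivot_longer(dfimr, cols = -year,
                           names_to = "key", values_to = "value")

# Both results have one row per year and group
head(long_gather)
nrow(long_gather)
nrow(long_pivot)
```

The two results contain the same rows in a different order, so either one can be piped into the same ggplot call.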

The gap in IMR between the population groups is clearly visible in the graph: migrants unsurprisingly fare worse than local residents in Shanghai. However, their IMR is still well below the national average of China.
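The same comparison can be put into numbers. The following is a rough sketch in base R that computes, for each year, how far the migrant IMR sits above the local IMR and below the national figure. The column names gap_vs_local, ratio_vs_local and gap_vs_china are my own labels for illustration.

```r
# IMR series from the post (per 1,000 live births, 2009 to 2015)
year     <- 2009:2015
local    <- c(2.89, 3.12, 2.92, 2.72, 2.81, 2.90, 2.46)
migrants <- c(7.92, 7.47, 7.39, 6.79, 7.70, 6.58, 6.58)
china    <- c(14.6, 13.6, 12.6, 11.6, 10.8, 10, 9.2)

gaps <- data.frame(
  year           = year,
  gap_vs_local   = migrants - local,            # absolute difference
  ratio_vs_local = round(migrants / local, 2),  # migrant IMR as a multiple of local IMR
  gap_vs_china   = china - migrants             # how far below the national average
)

print(gaps)
```

A positive value in every row of gap_vs_local and gap_vs_china matches what the graph shows: the migrant series lies between the local and the national lines in every year.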

The Nextjournal platform also allows us to share what we do on it. I have saved the draft in my account, so the whole procedure of this simple example is available here:

https://nextjournal.com/a/LAC8qj1r97YjzToH3W7UN?token=P4QXZSM5hvivdYrR1jcfib
