Dummy variables in R – an example for logistic regression modeling

Doing social research quantitatively means we have to fit our data to our theoretical expectations. This is a very different approach from qualitative research, where a grounded theory is unlikely to be built purely from numbers. Thus, we sometimes need to recode our data to meet our theoretical needs.

Here is an example from my own experience of how to set up dummy variables with a proper baseline in logistic regression models. The dummy variable trap is a common mistake in logistic regression models: one category must always be left out as the baseline, and the way to handle this well is to choose a baseline that makes theoretical sense.

The tool I used here is RStudio.

In this example, I want to use logistic regression to explore the association between self-medication and hukou (household registration) status.

self_medication is a binary variable (1, 0), where 1 means yes and 0 means no.

hukou is a categorical variable with four categories: rural_out refers to rural hukou from other provinces, rural_in to local rural hukou, urban_out to urban hukou from other provinces, and urban_in to local urban hukou.
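
As a minimal sketch of how R treats such a variable, the four hukou categories can be stored as a factor, which R expands into indicator (dummy) columns when it builds a model. The toy vector below is made up for illustration and is not the h02 data.

```r
# Toy hukou vector (hypothetical values, not the h02 data)
hukou <- factor(c("rural_in", "rural_out", "urban_in", "urban_out",
                  "rural_in", "urban_in"))

# Levels are sorted alphabetically by default; the first one is the baseline
levels(hukou)

# Default treatment coding: an intercept plus k - 1 = 3 dummy columns
X <- model.matrix(~ hukou)
colnames(X)

# The dummy variable trap: an intercept plus all 4 dummies is rank deficient,
# because the four dummy columns add up to the intercept column
X_trap <- cbind(intercept = 1, model.matrix(~ hukou - 1))
ncol(X_trap)       # number of columns in the over-specified design
qr(X_trap)$rank    # rank is one less than the number of columns
```

Because R leaves out one level automatically, the practical question is not whether a baseline exists but which category it is.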

We can look at some basic descriptive results (code and output):

(1) self_medication

plot(h02$self_medication)

(2) self_medication by province

library(ggplot2)
ggplot(data = h02) +
  geom_bar(mapping = aes(x = provcode, fill = self_medication))

(3) hukou types by province

ggplot(data = h02) +
  geom_bar(mapping = aes(x = provcode, fill = hukou))
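
The bar plots show counts. As a numeric complement, a cross-tabulation gives the share of each hukou type reporting self-medication. The sketch below uses a simulated data frame (called toy here, with made-up values) as a stand-in for h02, so the numbers it prints say nothing about the real data.

```r
# Simulated stand-in for h02 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(
  hukou = factor(sample(c("rural_in", "rural_out", "urban_in", "urban_out"),
                        200, replace = TRUE)),
  self_medication = factor(sample(c(0, 1), 200, replace = TRUE))
)

# Counts of self_medication within each hukou type
counts <- table(toy$hukou, toy$self_medication)
counts

# Row proportions: share of each hukou type answering 0 or 1
props <- prop.table(counts, margin = 1)
round(props, 3)
```

With the real data, the same two calls would be applied to h02$hukou and h02$self_medication.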

However, we want to model the relationship between self_medication and hukou type with logistic regression, paying particular attention to rural-to-urban migrants. We can first fit a logistic regression, m1:

# glm() is part of base R (the stats package), so no extra package needs loading
m1 <- glm(self_medication ~ hukou, family = binomial(link = "logit"), data = h02)
summary(m1)
## 
## Call:
## glm(formula = self_medication ~ hukou, family = binomial(link = "logit"), 
##     data = h02)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8672  -0.7898  -0.6969  -0.6969   1.7517  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.29146    0.03409 -37.884  < 2e-16 ***
## hukourural_out  0.50734    0.08999   5.638 1.72e-08 ***
## hukouurban_in   0.28646    0.05178   5.532 3.17e-08 ***
## hukouurban_out  0.01203    0.18897   0.064    0.949    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10266  on 9282  degrees of freedom
## Residual deviance: 10216  on 9279  degrees of freedom
## AIC: 10224
## 
## Number of Fisher Scoring iterations: 4

From the output, we can see only three hukou types: rural_out, urban_in and urban_out. That is because R automatically takes one category as the baseline against which the others are compared. Here rural_in is chosen because it is the first level of the factor, and levels are sorted alphabetically by default. The first coefficient can be interpreted as: compared with rural_in hukou, rural_out hukou increases the log odds of self-medication by 0.507 (a tutorial on understanding the log odds ratio is available at https://www.youtube.com/watch?v=ARfXDSkQf1Y).

Similarly, compared with rural_in hukou, urban_in hukou increases the log odds of self-medication by 0.286. For urban_out the estimate is 0.012, which is not statistically significant (p = 0.949).
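
Log odds are easier to read once they are turned into odds ratios. As a small sketch, the coefficients reported in the m1 output can be exponentiated by hand; the vector name b is just a label chosen for this example.

```r
# Coefficients copied from the summary(m1) output (baseline: rural_in)
b <- c(rural_out = 0.50734, urban_in = 0.28646, urban_out = 0.01203)

# Exponentiating a log odds difference gives an odds ratio versus rural_in
odds_ratio <- exp(b)
round(odds_ratio, 2)
# rural_out  urban_in urban_out
#      1.66      1.33      1.01
```

So the odds of self-medication for rural_out are roughly 1.66 times those for rural_in. With the fitted model at hand, exp(coef(m1)) would give the same ratios along with the exponentiated intercept.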

However, these comparisons are not the ones we want in our paper, which looks specifically at rural-to-urban migrants from other provinces. We therefore need to focus on the rural_out hukou type. The solution is to change the baseline category to rural_out, so that each of the other types is compared against it.

Let’s look at the original order of the categories:

levels(h02$hukou)
## [1] "rural_in"  "rural_out" "urban_in"  "urban_out"

Then adjust them:

h02$hukou <- relevel(h02$hukou, ref = "rural_out")
levels(h02$hukou)
## [1] "rural_out" "rural_in"  "urban_in"  "urban_out"

rural_out is now the first level, so it will automatically be taken as the baseline category.

We can run the regression again, this time as m2:
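
The releveling step can be tried on its own with a toy factor. The sketch below uses made-up values rather than h02, and also shows an equivalent way of setting the order by listing all levels explicitly.

```r
# Toy factor with the same four levels (hypothetical values)
hukou <- factor(c("rural_in", "rural_out", "urban_in", "urban_out"))

# relevel() moves one level to the front and keeps the others in order
hukou_a <- relevel(hukou, ref = "rural_out")
levels(hukou_a)

# Equivalent: spell out the full level order explicitly
hukou_b <- factor(hukou, levels = c("rural_out", "rural_in",
                                    "urban_in", "urban_out"))
identical(levels(hukou_a), levels(hukou_b))

# The data values are unchanged; only the order of the levels differs
all(as.character(hukou_a) == as.character(hukou))
```

Note that relevel() expects an unordered factor, so a character column would first need to be converted with factor().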

# Same model as m1, refitted after releveling hukou
m2 <- glm(self_medication ~ hukou, family = binomial(link = "logit"), data = h02)
summary(m2)
## 
## Call:
## glm(formula = self_medication ~ hukou, family = binomial(link = "logit"), 
##     data = h02)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8672  -0.7898  -0.6969  -0.6969   1.7517  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.78412    0.08328  -9.415  < 2e-16 ***
## hukourural_in  -0.50734    0.08999  -5.638 1.72e-08 ***
## hukouurban_in  -0.22089    0.09195  -2.402   0.0163 *  
## hukouurban_out -0.49531    0.20367  -2.432   0.0150 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10266  on 9282  degrees of freedom
## Residual deviance: 10216  on 9279  degrees of freedom
## AIC: 10224
## 
## Number of Fisher Scoring iterations: 4

Now rural_out no longer appears in the coefficient table, which means it has been taken as the baseline category. We can thus interpret this model as: compared with rural_out hukou, rural_in hukou decreases the log odds of self-medication by 0.507, urban_in hukou decreases them by 0.221, and urban_out hukou decreases them by 0.495.
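
Changing the baseline changes which comparisons are reported, not the model itself: both outputs show the same residual deviance (10216) and AIC (10224). As a quick check, the sketch below rebuilds the log odds for each hukou type from the coefficients printed in the two summaries. The vector names are chosen for this example only.

```r
# Coefficients copied from the two summaries
m1_coef <- c(intercept = -1.29146, rural_out = 0.50734,
             urban_in = 0.28646, urban_out = 0.01203)    # baseline: rural_in
m2_coef <- c(intercept = -0.78412, rural_in = -0.50734,
             urban_in = -0.22089, urban_out = -0.49531)  # baseline: rural_out

# Log odds for each hukou type implied by each model
logit_m1 <- c(rural_in  = m1_coef[["intercept"]],
              rural_out = m1_coef[["intercept"]] + m1_coef[["rural_out"]],
              urban_in  = m1_coef[["intercept"]] + m1_coef[["urban_in"]],
              urban_out = m1_coef[["intercept"]] + m1_coef[["urban_out"]])
logit_m2 <- c(rural_in  = m2_coef[["intercept"]] + m2_coef[["rural_in"]],
              rural_out = m2_coef[["intercept"]],
              urban_in  = m2_coef[["intercept"]] + m2_coef[["urban_in"]],
              urban_out = m2_coef[["intercept"]] + m2_coef[["urban_out"]])

# The two sets agree up to rounding in the printed coefficients
round(logit_m1 - logit_m2, 4)

# plogis() is the inverse logit: implied probability of self-medication
prob <- plogis(logit_m1)
round(prob, 3)
```

The implied probability is highest for rural_out under either parameterization, which is the comparison the releveled model makes explicit.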

The p-values indicate the statistical significance of each comparison; here all three are below 0.05. We can therefore conclude that, in this dataset, non-local rural-to-urban migrant workers (rural_out) are more likely to self-medicate than people with the other hukou types.

  • Produced by Juntao Lyu, University of Leeds.
