Doing social research quantitatively means we have to fit our data to the theories we expect to test. This is a very different approach from qualitative research, where something like grounded theory is unlikely to be built purely from numbers. Thus, we sometimes need to restructure our data in order to meet our theoretical needs.

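Here is an example from my own experience of how to set up dummy variables with a proper baseline in logistic regression models. The dummy variable trap (including a dummy for every category alongside the intercept, which makes the dummies perfectly collinear) is a common mistake in logistic regression models. The way to avoid it is to leave one category out as the baseline and to choose that baseline so that the dummy variables make sense theoretically.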

The tool I used here is RStudio.

In this example, I want to use logistic regression to explore the relationship between people's self-medication and their hukou (household registration) status.

Self-medication is a binary variable (1, 0), where 1 means yes and 0 means no.

The hukou type is a categorical variable with four categories: rural_out refers to rural hukou from other provinces, rural_in refers to local rural hukou, urban_out refers to urban hukou from other provinces, and urban_in refers to local urban hukou.

We can have a look at some basic descriptive results (with code and output):

(1) the self_medication

plot(h02$self_medication)

(2) the self_medication in provinces

library(ggplot2)

ggplot(data = h02) +
  geom_bar(mapping = aes(x = provcode, fill = self_medication))

(3) the hukou types in provinces

ggplot(data = h02) +
  geom_bar(mapping = aes(x = provcode, fill = hukou))

What we really want, however, is a logistic regression of self_medication on hukou type, with particular attention to rural-to-urban migrants. We can first fit a logistic regression as m1. The glm() function is part of base R's stats package, so no extra library needs to be loaded:

m1 <- glm(self_medication ~ hukou, family = binomial(link = 'logit'), data = h02)

summary(m1)

##
## Call:
## glm(formula = self_medication ~ hukou, family = binomial(link = "logit"),
##     data = h02)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.8672  -0.7898  -0.6969  -0.6969   1.7517
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -1.29146    0.03409 -37.884  < 2e-16 ***
## hukourural_out  0.50734    0.08999   5.638 1.72e-08 ***
## hukouurban_in   0.28646    0.05178   5.532 3.17e-08 ***
## hukouurban_out  0.01203    0.18897   0.064    0.949
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 10266  on 9282  degrees of freedom
## Residual deviance: 10216  on 9279  degrees of freedom
## AIC: 10224
##
## Number of Fisher Scoring iterations: 4

In the output, we can only see three hukou types: rural_out, urban_in, and urban_out. That is because R automatically takes one category as the baseline against which the other categories are compared. Here rural_in is chosen because it is the first factor level, and levels are sorted alphabetically by default. The regression can be interpreted as: compared with rural_in hukou, rural_out hukou increases the log odds of taking self_medication by 0.507 (a tutorial on understanding the log odds ratio is available at https://www.youtube.com/watch?v=ARfXDSkQf1Y).

Similarly, the next one can be interpreted as: compared with rural_in hukou, urban_in hukou increases the log odds of taking self_medication by 0.286. For urban_out the estimated difference is only 0.012, and it is not statistically significant (p = 0.949).
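Log odds are hard to read directly, so it can help to convert them. The following is a minimal sketch that takes the coefficients reported in the summary(m1) output above, typed in by hand rather than pulled from the real h02 data, and turns them into odds ratios and predicted probabilities. The variable names (b, odds_ratio, p_rural_in, p_rural_out) are my own choices for illustration.

```r
# Coefficients copied by hand from the summary(m1) output (log-odds scale).
intercept <- -1.29146
b <- c(rural_out = 0.50734, urban_in = 0.28646, urban_out = 0.01203)

# Exponentiating a log-odds difference gives an odds ratio relative to
# the baseline category (rural_in in m1).
odds_ratio <- exp(b)
print(round(odds_ratio, 2))

# The intercept is the log odds for the baseline group itself.
# plogis() is the inverse logit, mapping log odds to a probability.
p_rural_in  <- plogis(intercept)
p_rural_out <- plogis(intercept + b[["rural_out"]])
print(round(c(rural_in = p_rural_in, rural_out = p_rural_out), 3))
```

Read this way, a rural_out hukou multiplies the odds of self-medication by roughly 1.66 relative to rural_in, which corresponds to a predicted probability of about 31% against about 22%.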

However, these comparisons against rural_in are not what we want in our paper, since the paper looks specifically at rural-to-urban migrants from other provinces. We thus need to focus on the rural_out hukou type. The solution here is to change the baseline category to rural_out so that each of the other types is compared with it.

Let’s look at the original order of the categories:

levels(h02$hukou)

## [1] "rural_in" "rural_out" "urban_in" "urban_out"

Then adjust them:

`h02$hukou <- relevel(h02$hukou, "rural_out")`

`levels(h02$hukou)`

## [1] "rural_out" "rural_in" "urban_in" "urban_out"

rural_out is now the first category, so it will automatically be taken as the baseline category. The remaining categories, however, simply keep their original relative order.
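To see why the first level ends up as the baseline, it may help to look at the dummy columns R builds behind the scenes. This is a small sketch on a toy factor that only reuses the four labels from the text (it is not the h02 data); the object names are hypothetical. It also illustrates the dummy variable trap: an intercept plus dummies for all four categories gives a rank-deficient design matrix.

```r
# Toy factor with the four labels used in the text (not the real h02 data).
hukou <- factor(rep(c("rural_in", "rural_out", "urban_in", "urban_out"), times = 3))

# Default coding: an intercept plus a 0/1 dummy for every level except the first.
cols_default <- colnames(model.matrix(~ hukou))
print(cols_default)

# After relevel(), rural_out is the omitted (baseline) level instead.
hukou_re <- relevel(hukou, ref = "rural_out")
cols_releveled <- colnames(model.matrix(~ hukou_re))
print(cols_releveled)

# The dummy variable trap: an intercept AND a dummy for all four levels.
# The four dummies add up to the intercept column, so one column is redundant.
X_trap <- cbind(intercept = 1, model.matrix(~ hukou - 1))
print(c(columns = ncol(X_trap), rank = qr(X_trap)$rank))
```

Because glm() drops the first level for us, the trap is avoided automatically; what remains is the theoretical question of which level should be dropped.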

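To reorder all of them, if necessary:

`h02$hukou <- factor(h02$hukou, levels = c("rural_out", "urban_out", "rural_in", "urban_in"))`

By doing this, we get the exact order we want for all of the categories. Only the first level determines the baseline; the order of the remaining levels just changes the order in which the coefficients are listed.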
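One thing worth checking after a manual reorder is that every label in levels is spelled exactly as in the data. The sketch below uses a toy factor (not h02) with hypothetical object names, and shows that relevel() and factor() both leave the observations untouched, while a misspelled label in factor() silently turns the affected observations into NA.

```r
# Toy factor reusing the labels from the text (not the real h02 data).
hukou <- factor(c("urban_in", "rural_out", "rural_in", "urban_out", "rural_out"))

# relevel() only moves one level to the front.
releveled <- relevel(hukou, ref = "rural_out")
print(levels(releveled))

# factor(..., levels = ) sets the complete order explicitly.
reordered <- factor(hukou, levels = c("rural_out", "urban_out", "rural_in", "urban_in"))
print(levels(reordered))

# Neither call changes the observations themselves.
same_values <- identical(as.character(hukou), as.character(reordered))
print(same_values)

# Pitfall: a label that does not match the data ("urban-in" with a hyphen)
# produces NA for those observations without any warning.
typo <- factor(hukou, levels = c("rural_out", "urban_out", "rural_in", "urban-in"))
n_lost <- sum(is.na(typo))
print(n_lost)
```

A quick sum(is.na(...)) or table(..., useNA = "ifany") after reordering is a cheap safeguard before fitting the model.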

We can run the regression model again, this time named m2 (in the output below, the coefficients appear in the order left by relevel()):

m2 <- glm(self_medication ~ hukou, family = binomial(link = 'logit'), data = h02)

summary(m2)

##
## Call:
## glm(formula = self_medication ~ hukou, family = binomial(link = "logit"),
##     data = h02)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.8672  -0.7898  -0.6969  -0.6969   1.7517
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -0.78412    0.08328  -9.415  < 2e-16 ***
## hukourural_in  -0.50734    0.08999  -5.638 1.72e-08 ***
## hukouurban_in  -0.22089    0.09195  -2.402   0.0163 *
## hukouurban_out -0.49531    0.20367  -2.432   0.0150 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 10266  on 9282  degrees of freedom
## Residual deviance: 10216  on 9279  degrees of freedom
## AIC: 10224
##
## Number of Fisher Scoring iterations: 4

Now the rural_out category is absent from the model output, which means it is automatically taken as the baseline. We can thus interpret this model as: compared with rural_out hukou, rural_in hukou decreases the log odds of taking self_medication by 0.507, urban_in hukou decreases them by 0.221, and urban_out hukou decreases them by 0.495.
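Changing the baseline does not change the model itself, only the comparisons it reports. The sketch below illustrates this on simulated data: the data frame toy, its group probabilities, and the object names are all assumptions made up for the example and have nothing to do with the real h02 data.

```r
set.seed(1)

# Simulated stand-in data; the probabilities are arbitrary illustration values.
n <- 2000
groups <- c("rural_in", "rural_out", "urban_in", "urban_out")
p_true <- c(rural_in = 0.22, rural_out = 0.31, urban_in = 0.27, urban_out = 0.22)
hukou <- factor(sample(groups, n, replace = TRUE))
toy <- data.frame(
  hukou = hukou,
  self_medication = rbinom(n, size = 1, prob = p_true[as.character(hukou)])
)

# Same model, two baselines: rural_in (default) and rural_out (releveled).
fit_a <- glm(self_medication ~ hukou, family = binomial(link = "logit"), data = toy)
toy_b <- transform(toy, hukou = relevel(hukou, ref = "rural_out"))
fit_b <- glm(self_medication ~ hukou, family = binomial(link = "logit"), data = toy_b)

# The fit is identical: same predicted probabilities, same deviance.
print(all.equal(fitted(fit_a), fitted(fit_b)))
print(all.equal(deviance(fit_a), deviance(fit_b)))

# The new coefficients are differences of the old ones.
b_a <- coef(fit_a)
rebuilt <- c(
  "(Intercept)"  = b_a[["(Intercept)"]] + b_a[["hukourural_out"]],
  hukourural_in  = -b_a[["hukourural_out"]],
  hukouurban_in  = b_a[["hukouurban_in"]] - b_a[["hukourural_out"]],
  hukouurban_out = b_a[["hukouurban_out"]] - b_a[["hukourural_out"]]
)
print(all.equal(rebuilt, coef(fit_b), tolerance = 1e-6))
```

The two outputs reported in this post follow the same pattern: the residual deviance and AIC are unchanged between m1 and m2, and, for example, 0.01203 - 0.50734 = -0.49531 is exactly the urban_out coefficient in m2.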

The p-values indicate the statistical significance of these differences, and all three are below 0.05. We are therefore able to conclude that in this dataset, compared with people holding the other hukou types, non-local rural-to-urban migrant workers are more likely to rely on self-medication.
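As a rough complement to the p-values, the estimates and standard errors can be turned into odds ratios with approximate 95% Wald confidence intervals. This is only a sketch based on the numbers printed in the summary(m2) output, typed in by hand; the object names are hypothetical, and the Wald interval is one common approximation rather than the only option.

```r
# Estimates and standard errors copied by hand from the summary(m2) output.
est <- c(rural_in = -0.50734, urban_in = -0.22089, urban_out = -0.49531)
se  <- c(rural_in =  0.08999, urban_in =  0.09195, urban_out =  0.20367)

# Approximate 95% Wald interval on the log-odds scale, then exponentiated.
z <- qnorm(0.975)
or_table <- data.frame(
  odds_ratio = exp(est),
  lower      = exp(est - z * se),
  upper      = exp(est + z * se)
)
print(round(or_table, 2))

# Two-sided p-values recomputed from the z statistics (estimate / std. error).
p_value <- 2 * pnorm(-abs(est / se))
print(signif(p_value, 3))
```

All three upper bounds stay below 1, which is consistent with the conclusion that each of the other hukou types has lower odds of self-medication than rural_out.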

- Produced by Juntao Lyu, University of Leeds.