GENDER WAGE GAP¶
Here we ask and answer the following question:
What is the difference in predicted wages between men and women with the same job-relevant characteristics?¶
Our data comes from the March Supplement of the U.S. Current Population Survey (CPS), year 2012. We focus on single (never married) workers whose education level is high school, some college, or college graduate. The sample size is about 4,000.
Our outcome variable Y is hourly wage, and our X’s are various characteristics of workers such as gender, experience, education, and geographical indicators.
Dataset¶
The dataset contains the following variables:
- wage : hourly wage
- female : female dummy
- cg : college graduate dummy
- sc : some college dummy
- hsg : high school graduate dummy
- mw : mid-west dummy
- so : south dummy
- we : west dummy
- ne : northeast dummy
- exp1 : experience (years)
- exp2 : experience squared (taken as experience squared/100)
- exp3 : experience cubed (taken as experience cubed/1000)
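The scaled polynomial terms can be reconstructed directly from exp1. A minimal sketch (the values here are made up, but the column names and scaling match the dataset description):

```python
import pandas as pd

# Hypothetical rows illustrating how exp2 and exp3 derive from exp1.
df = pd.DataFrame({'exp1': [2.0, 13.0, 33.0]})
df['exp2'] = df['exp1'] ** 2 / 100   # experience squared, scaled by 100
df['exp3'] = df['exp1'] ** 3 / 1000  # experience cubed, scaled by 1000

print(df)  # e.g. exp1=33 gives exp2=10.89 and exp3=35.937
```

The scaling keeps the polynomial terms on a similar numerical range as exp1, which improves the conditioning of the regression design matrix.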
Importing the necessary libraries and overview of the dataset¶
#importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Loading the data¶
# Load data
df = pd.read_csv('gender_wage_gap.csv')
# See variables in the dataset
df.head()
| | female | cg | sc | hsg | mw | so | we | ne | exp1 | exp2 | exp3 | wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 33.0 | 10.89 | 35.937 | 11.659091 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 27.0 | 7.29 | 19.683 | 12.825000 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 13.0 | 1.69 | 2.197 | 5.777027 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2.0 | 0.04 | 0.008 | 12.468750 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 15.0 | 2.25 | 3.375 | 18.525000 |
Checking the info of the data¶
# Checking info of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835 entries, 0 to 3834
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -------
 0   female  3835 non-null   int64
 1   cg      3835 non-null   int64
 2   sc      3835 non-null   int64
 3   hsg     3835 non-null   int64
 4   mw      3835 non-null   int64
 5   so      3835 non-null   int64
 6   we      3835 non-null   int64
 7   ne      3835 non-null   int64
 8   exp1    3835 non-null   float64
 9   exp2    3835 non-null   float64
 10  exp3    3835 non-null   float64
 11  wage    3835 non-null   float64
dtypes: float64(4), int64(8)
memory usage: 359.7 KB
Observations:
- The dataset has 3835 observations and 12 different variables.
- There is no missing value in the dataset.
- All of the dummy variables (cg, sc, hsg, etc.) have integer datatype; they are binary indicators taking values 0 and 1.
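Since the sample is restricted to the three education levels, the education dummies should be mutually exclusive and exhaustive. A quick sanity-check sketch (toy rows stand in for the real dataset here):

```python
import pandas as pd

# Hypothetical rows: each worker belongs to exactly one education group,
# so cg + sc + hsg should equal 1 in every row.
df = pd.DataFrame({'cg': [1, 0, 0], 'sc': [0, 1, 0], 'hsg': [0, 0, 1]})
assert (df[['cg', 'sc', 'hsg']].sum(axis=1) == 1).all()
print("education dummies are mutually exclusive and exhaustive")
```

The same check applies to the four regional dummies (mw, so, we, ne).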
Univariate Analysis¶
Checking the summary statistics of the dataset¶
# Printing the summary statistics for the dataset
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| female | 3835.0 | 0.417992 | 0.493293 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| cg | 3835.0 | 0.376271 | 0.484513 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| sc | 3835.0 | 0.323859 | 0.468008 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| hsg | 3835.0 | 0.299870 | 0.458260 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| mw | 3835.0 | 0.287614 | 0.452709 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| so | 3835.0 | 0.243546 | 0.429278 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1.000000 |
| we | 3835.0 | 0.211734 | 0.408591 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1.000000 |
| ne | 3835.0 | 0.257106 | 0.437095 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| exp1 | 3835.0 | 13.353194 | 8.639348 | 2.000000 | 6.00000 | 11.000000 | 19.500000 | 35.000000 |
| exp2 | 3835.0 | 2.529267 | 2.910554 | 0.040000 | 0.36000 | 1.210000 | 3.802500 | 12.250000 |
| exp3 | 3835.0 | 5.812103 | 9.033207 | 0.008000 | 0.21600 | 1.331000 | 7.414875 | 42.875000 |
| wage | 3835.0 | 15.533356 | 13.518165 | 0.004275 | 9.61875 | 13.028571 | 17.812500 | 348.333017 |
Observations:
- The average wage is about 15 dollars per hour, while the maximum is 348 dollars per hour, an extreme outlier.
- 42% of workers are women.
- The average experience is 13 years, with the minimum and maximum being 2 and 35 years, respectively, so the sample covers a wide range of experience levels.
- 38% of the people in the data are college graduates.
- 32% have gone to some college, and 30% hold only a high school diploma.
- Workers are fairly evenly distributed across the four major geographical regions (roughly 21-29% each), suggesting the sample was drawn from all regions in a reasonably uniform manner.
df[['exp1','exp2','exp3','wage']].boxplot(figsize=(20,10))
plt.show()
- The wage variable has pronounced outliers, which is expected: a small number of workers earn far more than the rest.
As the goal of this case study is to analyse the gender wage gap, let's look at the summary statistics separately for female and male workers.¶
# Mean value of all females
df[df['female'] == 1].mean()
female     1.000000
cg         0.406114
sc         0.354336
hsg        0.239551
mw         0.291329
so         0.255147
we         0.198378
ne         0.255147
exp1      13.037118
exp2       2.449453
exp3       5.599297
wage      14.720058
dtype: float64
# Mean value of all males
df[df['female'] == 0].mean()
female     0.000000
cg         0.354839
sc         0.301971
hsg        0.343190
mw         0.284946
so         0.235215
we         0.221326
ne         0.258513
exp1      13.580197
exp2       2.586588
exp3       5.964938
wage      16.117458
dtype: float64
Observations:
- We first look at the descriptive statistics for the subsamples of single women and single men, with educational attainment equal to high school, some college, or college.
- The mean hourly wage is about 16.1 dollars for men and about 14.7 dollars for women, so the unconditional difference is roughly 1.4 dollars, before controlling for job-relevant characteristics.
- If we take a look at some of these characteristics, we see that on average men have more experience, but women are more likely to have college degrees or some college education.
- Geographical distribution of both men and women is similar.
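The two filtered means above can also be obtained in one step with `groupby`, which is the more idiomatic pandas pattern. A sketch on toy data (the values below are made up):

```python
import pandas as pd

# Hypothetical mini-dataset: group by the female indicator instead of
# filtering the DataFrame twice.
df = pd.DataFrame({
    'female': [1, 1, 0, 0],
    'wage':   [14.0, 15.0, 16.0, 17.0],
    'exp1':   [10.0, 16.0, 12.0, 15.0],
})
print(df.groupby('female')[['wage', 'exp1']].mean())
```

On the real dataset, `df.groupby('female').mean()` reproduces both columns of subsample means side by side.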
Basic Model¶
#################### Linear and Quadratic specifications ##############################
# Wage linear regression
import statsmodels.api as sm
Y = df['wage'] # target variable
X = df[['female' , 'sc', 'cg', 'mw' , 'so' , 'we' , 'exp1' , 'exp2' , 'exp3']] #regressors
X = sm.add_constant(X) # adding constant for intercept
model = sm.OLS(Y, X)
results = model.fit() # train the model
print(results.summary()) # summary of the model
OLS Regression Results
==============================================================================
Dep. Variable: wage R-squared: 0.095
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 44.87
Date: Thu, 08 Jun 2023 Prob (F-statistic): 3.17e-77
Time: 17:07:11 Log-Likelihood: -15235.
No. Observations: 3835 AIC: 3.049e+04
Df Residuals: 3825 BIC: 3.055e+04
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.9154 1.299 3.784 0.000 2.368 7.462
female -1.8264 0.425 -4.302 0.000 -2.659 -0.994
sc 2.4865 0.534 4.654 0.000 1.439 3.534
cg 9.8708 0.562 17.567 0.000 8.769 10.972
mw -1.2142 0.566 -2.146 0.032 -2.323 -0.105
so 0.4046 0.588 0.688 0.491 -0.748 1.558
we -0.2508 0.611 -0.410 0.682 -1.449 0.947
exp1 1.0965 0.269 4.077 0.000 0.569 1.624
exp2 -4.0134 1.785 -2.248 0.025 -7.514 -0.513
exp3 0.4603 0.344 1.340 0.180 -0.213 1.134
==============================================================================
Omnibus: 6626.018 Durbin-Watson: 1.958
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8721375.159
Skew: 11.808 Prob(JB): 0.00
Kurtosis: 235.426 Cond. No. 198.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Observations:
- Model fit is poor: the R-squared is only about 0.10, so the regressors explain little of the variation in wages.
- The college-graduate dummy (cg) has a large positive coefficient (about 9.9), indicating that college graduates earn substantially higher wages than the omitted high-school-only group.
- The positive coefficient on exp1 together with the negative coefficient on exp2 indicates diminishing returns to experience: wages rise with experience, but at a decreasing rate.
- The coefficient on the female indicator is -1.83, meaning that, controlling for the other characteristics, women are paid about 1.8 dollars less per hour than men.
Flexible model¶
# import PolynomialFeatures to create polynomial/interaction features
from sklearn.preprocessing import PolynomialFeatures
# create an object of PolynomialFeatures with only the interaction terms
poly = PolynomialFeatures(interaction_only=True,include_bias = False)
# Drop the constant and female: we don't want interaction features for them
X.drop(['const', 'female'], axis=1, inplace=True)
print(X.shape)
# transform the data to add the polynomial features too
X_poly = poly.fit_transform(X)
# convert to dataframe
X_poly = pd.DataFrame(X_poly,columns= poly.get_feature_names_out(X.columns))
print(X_poly.shape)
X_poly['female'] = df['female']
X['female'] = df['female']
X_poly = sm.add_constant(X_poly)
#defining the model
model = sm.OLS(Y, X_poly) # Linear regression/OLS object
#fitting the model
results = model.fit() # train the model
# summary of the model
print(results.summary())
(3835, 8)
(3835, 36)
OLS Regression Results
==============================================================================
Dep. Variable: wage R-squared: 0.104
Model: OLS Adj. R-squared: 0.096
Method: Least Squares F-statistic: 13.79
Date: Thu, 08 Jun 2023 Prob (F-statistic): 5.53e-69
Time: 17:07:11 Log-Likelihood: -15217.
No. Observations: 3835 AIC: 3.050e+04
Df Residuals: 3802 BIC: 3.071e+04
Df Model: 32
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 16.5524 7.175 2.307 0.021 2.486 30.619
sc -2.3865 5.415 -0.441 0.659 -13.003 8.230
cg 2.2405 5.908 0.379 0.705 -9.342 13.823
mw -5.5194 3.375 -1.635 0.102 -12.137 1.098
so -2.9144 3.482 -0.837 0.403 -9.742 3.913
we -0.8054 3.646 -0.221 0.825 -7.953 6.342
exp1 -1.3215 2.073 -0.638 0.524 -5.386 2.743
exp2 12.5218 24.780 0.505 0.613 -36.062 61.106
exp3 -0.0484 0.151 -0.321 0.748 -0.344 0.247
sc cg 1.555e-13 4.48e-13 0.347 0.729 -7.23e-13 1.03e-12
sc mw -0.7226 1.445 -0.500 0.617 -3.555 2.110
sc so -0.6513 1.534 -0.424 0.671 -3.659 2.357
sc we -0.1047 1.592 -0.066 0.948 -3.227 3.018
sc exp1 0.8391 1.097 0.765 0.444 -1.311 2.989
sc exp2 -4.0608 6.512 -0.624 0.533 -16.827 8.706
sc exp3 0.6330 1.158 0.547 0.585 -1.637 2.904
cg mw -0.7609 1.536 -0.496 0.620 -3.772 2.250
cg so 1.7041 1.569 1.086 0.278 -1.373 4.781
cg we -1.4948 1.637 -0.913 0.361 -4.704 1.715
cg exp1 0.7859 1.245 0.631 0.528 -1.654 3.226
cg exp2 -0.0490 7.761 -0.006 0.995 -15.265 15.167
cg exp3 -0.5950 1.462 -0.407 0.684 -3.462 2.272
mw so -6.971e-15 1.95e-14 -0.358 0.720 -4.51e-14 3.12e-14
mw we -1.916e-14 2.95e-14 -0.650 0.516 -7.7e-14 3.87e-14
mw exp1 1.1076 0.724 1.530 0.126 -0.312 2.527
mw exp2 -6.0527 4.790 -1.264 0.206 -15.444 3.339
mw exp3 0.9063 0.919 0.987 0.324 -0.895 2.707
so we -7.762e-15 1.48e-14 -0.523 0.601 -3.68e-14 2.13e-14
so exp1 0.3947 0.748 0.528 0.598 -1.072 1.861
so exp2 -0.8914 4.973 -0.179 0.858 -10.641 8.858
so exp3 -0.0352 0.960 -0.037 0.971 -1.917 1.847
we exp1 0.4719 0.790 0.597 0.550 -1.077 2.021
we exp2 -3.9188 5.300 -0.739 0.460 -14.309 6.472
we exp3 0.8050 1.034 0.778 0.436 -1.223 2.833
exp1 exp2 -0.4839 1.505 -0.321 0.748 -3.435 2.467
exp1 exp3 0.0872 0.451 0.193 0.847 -0.797 0.972
exp2 exp3 -0.0580 0.503 -0.115 0.908 -1.044 0.928
female -1.8800 0.425 -4.426 0.000 -2.713 -1.047
==============================================================================
Omnibus: 6588.420 Durbin-Watson: 1.959
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8473264.499
Skew: 11.669 Prob(JB): 0.00
Kurtosis: 232.090 Cond. No. 1.14e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.61e-24. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Observations:
- R-squared improves only slightly (from 0.095 to 0.104), so the two-way interactions add little explanatory power.
- Interaction coefficients such as cg·exp1 (positive) and cg·mw (negative) hint that the return to a college degree may vary with experience and region, but their standard errors are large and none are statistically significant, so these patterns should be read with caution.
- Coefficients on sc·cg, mw·so, mw·we, and so·we are numerically zero: each pair of dummies is mutually exclusive, so the interaction column is identically zero and contributes nothing. This also explains the multicollinearity warning in the notes.
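The near-zero interaction coefficients come from products of mutually exclusive dummies, which are identically zero columns. A minimal illustration with toy rows:

```python
import pandas as pd

# A worker cannot be both "some college" and "college graduate",
# so the interaction sc * cg is zero in every row.
df = pd.DataFrame({'sc': [0, 1, 0, 1], 'cg': [1, 0, 0, 0]})
interaction = df['sc'] * df['cg']
print(interaction.tolist())  # [0, 0, 0, 0]
```

Such all-zero columns make the design matrix singular, which is why statsmodels reports an extremely small eigenvalue; dropping these columns before fitting would remove the warning without changing the other estimates.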
We estimate the linear regression model:
$Y = \beta_1 D + \beta_2' W + \varepsilon$
- D is the indicator of being a female (1 if female and 0 otherwise). W ’s are controls.
- Basic model: W ’s consist of education and regional indicators, experience, experience squared, and experience cubed.
- Flexible model: W ’s consist of controls in the basic model plus all of their two-way interactions.
| | Estimate | Standard Error | 95% Confidence Interval |
|---|---|---|---|
| basic reg | -1.8264 | 0.425 | [-2.659, -0.994] |
| flexible reg | -1.8800 | 0.425 | [-2.713, -1.047] |
In the above table we see the estimated regression coefficient, its standard error, and the 95% confidence interval, for both the basic and flexible regression model.
The results for basic and flexible regression models are in a very close agreement.
The estimated gender gap in hourly wage is about −1.9 dollars: controlling for experience, education, and geographical region, women are paid roughly 1.9 dollars less per hour than men on average.
The 95% confidence interval ranges from about −2.7 to −1 dollars and excludes zero. We can therefore conclude that the difference in hourly wages between men and women with the same recorded characteristics is both statistically and economically significant.
Illustration of Partialling Out: (Linear Specification - Basic Model)¶
“Partialling-out” is an important tool that provides a conceptual understanding of the regression coefficient $\beta_1$.
The Steps for partialling out are:
- We predict Y using W only and find its residuals. i.e. removing the dependence of W on Y.
- We predict D using W and find its residuals, i.e. removing the dependence of W on D.
- Then we model residuals from step 1 and step 2, and this will give us how Y is dependent on D only.
Step 1: Regress Y on W and keep the residuals, i.e. remove the part of Y explained by W.¶
# target variable
Y = df['wage']
#regressors W
W = df[['sc', 'cg', 'mw' , 'so' , 'we' , 'exp1' , 'exp2' , 'exp3']]
W = sm.add_constant(W)
# Linear regression/OLS object
model = sm.OLS(Y, W)
# train the model
results = model.fit()
# get the residuals
t_Y = results.resid
t_Y
0 -1.686401
1 -10.550023
2 -9.298414
3 -3.466715
4 -4.451256
...
3830 1.148402
3831 -0.049341
3832 2.728471
3833 -5.471606
3834 30.261356
Length: 3835, dtype: float64
Step 2: Regress D on W and keep the residuals, i.e. remove the part of D explained by W.¶
# target regressor D, i.e. female in our wage-prediction example
D = df['female']
#regressors W
W = df[['sc', 'cg', 'mw' , 'so' , 'we' , 'exp1' , 'exp2' , 'exp3']] #regressors
W = sm.add_constant(W)
# Linear regression/OLS object
model = sm.OLS(D, W)
# train the model
results = model.fit()
# get the residuals
t_D = results.resid
t_D
0 -0.323196
1 -0.448263
2 -0.443029
3 -0.485547
4 0.572763
...
3830 0.555462
3831 -0.294333
3832 0.577249
3833 0.575515
3834 -0.419835
Length: 3835, dtype: float64
Step 3: Regress the residuals from step 1 on the residuals from step 2 to obtain the coefficient on D.¶
# Run OLS to get the coefficient and its 95% confidence interval
# target variable
Y = t_Y
X = t_D
X = sm.add_constant(X)
# Linear regression/OLS object
model = sm.OLS(Y, X)
# train the model
results = model.fit()
# print the summary of the trained model
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.005
Model: OLS Adj. R-squared: 0.005
Method: Least Squares F-statistic: 18.55
Date: Thu, 08 Jun 2023 Prob (F-statistic): 1.70e-05
Time: 17:07:11 Log-Likelihood: -15235.
No. Observations: 3835 AIC: 3.047e+04
Df Residuals: 3833 BIC: 3.049e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -5.281e-15 0.208 -2.54e-14 1.000 -0.407 0.407
0 -1.8264 0.424 -4.307 0.000 -2.658 -0.995
==============================================================================
Omnibus: 6626.018 Durbin-Watson: 1.958
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8721375.159
Skew: 11.808 Prob(JB): 0.00
Kurtosis: 235.426 Cond. No. 2.04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Illustration of Partialling Out: (Quadratic Specification - Flexible model)¶
# target variable
Y = df['wage']
#regressors W
W = df[['sc', 'cg', 'mw' , 'so' , 'we' , 'exp1' , 'exp2' , 'exp3']]
# create an object of PolynomialFeatures with only the interaction terms
poly = PolynomialFeatures(interaction_only=True,include_bias = False)
# transform the data to add the polynomial features too
X_poly = poly.fit_transform(W)
X_poly = sm.add_constant(X_poly)
#defining the model
model = sm.OLS(Y, X_poly) # Linear regression/OLS object
# train the model
results = model.fit()
# get the residuals
t_Y = results.resid
t_Y
0 -2.269388
1 -11.721164
2 -8.577609
3 -4.626069
4 -4.974672
...
3830 1.726182
3831 0.775800
3832 3.639956
3833 -4.183105
3834 31.101298
Length: 3835, dtype: float64
# target regressor
D = df['female']
#Regressors w
W = df[['sc', 'cg', 'mw' , 'so' , 'we' , 'exp1' , 'exp2' , 'exp3']]
# create an object of PolynomialFeatures with only the interaction terms
poly = PolynomialFeatures(interaction_only=True,include_bias = False)
X_poly = poly.fit_transform(W)
X_poly = sm.add_constant(X_poly)
#defining the model
model = sm.OLS(D, X_poly)
# train the model
results = model.fit()
# get the residuals
t_D = results.resid
t_D
0 -0.255420
1 -0.443778
2 -0.408862
3 -0.576250
4 0.576618
...
3830 0.539392
3831 -0.267692
3832 0.551484
3833 0.588884
3834 -0.405553
Length: 3835, dtype: float64
# Run OLS to get the coefficient and its 95% confidence interval
Y = t_Y # target variable
X = t_D
X = sm.add_constant(X)
model = sm.OLS(Y, X)
results = model.fit() # train the model
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.005
Model: OLS Adj. R-squared: 0.005
Method: Least Squares F-statistic: 19.75
Date: Thu, 08 Jun 2023 Prob (F-statistic): 9.07e-06
Time: 17:07:11 Log-Likelihood: -15217.
No. Observations: 3835 AIC: 3.044e+04
Df Residuals: 3833 BIC: 3.045e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1.721e-13 0.207 -8.33e-13 1.000 -0.405 0.405
0 -1.8800 0.423 -4.444 0.000 -2.709 -1.051
==============================================================================
Omnibus: 6588.420 Durbin-Watson: 1.959
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8473264.499
Skew: 11.669 Prob(JB): 0.00
Kurtosis: 232.090 Cond. No. 2.05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
| | Estimate | Standard Error |
|---|---|---|
| basic reg | -1.8264 | 0.425 |
| flexible reg | -1.8800 | 0.425 |
| basic reg with partialling out | -1.8264 | 0.424 |
| flexible reg with partialling out | -1.8800 | 0.423 |
Conclusion:¶
- We have applied the ideas we have discussed so far to learn about the gender wage gap.
- The gender wage gap may partly reflect genuine discrimination against women in the labor market.
- It may also partly reflect the so-called selection effect, namely that women are more likely to end up in occupations that pay somewhat less (for example, school teachers).