MLS Case Study: PCA¶


Context:¶


In this case study, we will use the Education dataset, which contains information on educational institutions in the USA. The data has various attributes, such as the number of applications received, enrollments, faculty education, financial aspects, and the graduation rate of each institution.


Objective:¶


The objective of this problem is to reduce the number of features using dimensionality reduction techniques like PCA and to extract insights.


Dataset:¶


The Education dataset contains information on various colleges in the USA, with the following columns:

  • Names: Names of various universities and colleges
  • Apps: Number of applications received
  • Accept: Number of applications accepted
  • Enroll: Number of new students enrolled
  • Top10perc: Percentage of new students from the top 10% of their Higher Secondary class
  • Top25perc: Percentage of new students from the top 25% of their Higher Secondary class
  • F_Undergrad: Number of full-time undergraduate students
  • P_Undergrad: Number of part-time undergraduate students
  • Outstate: Out-of-state tuition
  • Room_Board: Cost of room and board
  • Books: Estimated book costs for a student
  • Personal: Estimated personal spending for a student
  • PhD: Percentage of faculties with a Ph.D.
  • Terminal: Percentage of faculties with terminal degree
  • S_F_Ratio: Student/faculty ratio
  • perc_alumni: Percentage of alumni who donate
  • Expend: The instructional expenditure per student
  • Grad_Rate: Graduation rate

Importing libraries and overview of the dataset¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# Importing PCA
from sklearn.decomposition import PCA

Connect to Google Drive¶

In [1]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Loading data¶

In [ ]:
data=pd.read_csv("Education_Post_12th_Standard.csv")
In [ ]:
data.head()
Out[ ]:
Names Apps Accept Enroll Top10perc Top25perc F_Undergrad P_Undergrad Outstate Room_Board Books Personal PhD Terminal S_F_Ratio perc_alumni Expend Grad_Rate
0 Abilene Christian University 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
1 Adelphi University 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
2 Adrian College 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
3 Agnes Scott College 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
4 Alaska Pacific University 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15

Check the info of the data¶

In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Names        777 non-null    object 
 1   Apps         777 non-null    int64  
 2   Accept       777 non-null    int64  
 3   Enroll       777 non-null    int64  
 4   Top10perc    777 non-null    int64  
 5   Top25perc    777 non-null    int64  
 6   F_Undergrad  777 non-null    int64  
 7   P_Undergrad  777 non-null    int64  
 8   Outstate     777 non-null    int64  
 9   Room_Board   777 non-null    int64  
 10  Books        777 non-null    int64  
 11  Personal     777 non-null    int64  
 12  PhD          777 non-null    int64  
 13  Terminal     777 non-null    int64  
 14  S_F_Ratio    777 non-null    float64
 15  perc_alumni  777 non-null    int64  
 16  Expend       777 non-null    int64  
 17  Grad_Rate    777 non-null    int64  
dtypes: float64(1), int64(16), object(1)
memory usage: 109.4+ KB

Observations:

  • There are 777 observations and 18 columns in the dataset.
  • All columns have 777 non-null values, i.e., there are no missing values.
  • All columns are numeric except the Names column, which is of the object data type.

Data Preprocessing and Exploratory Data Analysis¶

Check if all the college names are unique¶

In [ ]:
data.Names.nunique()
Out[ ]:
777

Observations:

  • All college names are unique
  • As all entries are unique, the column would not add value to our analysis. We can drop the Names column.
In [ ]:
#Dropping Names column
data.drop(columns="Names", inplace=True)

Summary Statistics¶

In [ ]:
data.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
Apps 777.0 3001.638353 3870.201484 81.0 776.0 1558.0 3624.0 48094.0
Accept 777.0 2018.804376 2451.113971 72.0 604.0 1110.0 2424.0 26330.0
Enroll 777.0 779.972973 929.176190 35.0 242.0 434.0 902.0 6392.0
Top10perc 777.0 27.558559 17.640364 1.0 15.0 23.0 35.0 96.0
Top25perc 777.0 55.796654 19.804778 9.0 41.0 54.0 69.0 100.0
F_Undergrad 777.0 3699.907336 4850.420531 139.0 992.0 1707.0 4005.0 31643.0
P_Undergrad 777.0 855.298584 1522.431887 1.0 95.0 353.0 967.0 21836.0
Outstate 777.0 10440.669241 4023.016484 2340.0 7320.0 9990.0 12925.0 21700.0
Room_Board 777.0 4357.526384 1096.696416 1780.0 3597.0 4200.0 5050.0 8124.0
Books 777.0 549.380952 165.105360 96.0 470.0 500.0 600.0 2340.0
Personal 777.0 1340.642214 677.071454 250.0 850.0 1200.0 1700.0 6800.0
PhD 777.0 72.660232 16.328155 8.0 62.0 75.0 85.0 103.0
Terminal 777.0 79.702703 14.722359 24.0 71.0 82.0 92.0 100.0
S_F_Ratio 777.0 14.089704 3.958349 2.5 11.5 13.6 16.5 39.8
perc_alumni 777.0 22.743887 12.391801 0.0 13.0 21.0 31.0 64.0
Expend 777.0 9660.171171 5221.768440 3186.0 6751.0 8377.0 10830.0 56233.0
Grad_Rate 777.0 65.463320 17.177710 10.0 53.0 65.0 78.0 118.0

Observations:

  • On average, US universities receive approx. 3,000 applications, of which around 2,000 are accepted and around 780 new students enroll.
  • The standard deviation is very high for Apps, Accept, and Enroll, which reflects the wide variety of universities and colleges in the data.
  • The average cost of room and board, books, and personal expenses is approx. 4,357, 550, and 1,350 dollars, respectively.
  • The average number of full-time undergraduate students is around 3,700, whereas the average number of part-time undergraduate students is much lower, at around 850.
  • PhD and Grad_Rate have maximum values greater than 100, which is not possible as these variables are percentages. Let's see how many such observations there are in the data.
In [ ]:
data[(data.PhD>100) | (data.Grad_Rate>100)]
Out[ ]:
Apps Accept Enroll Top10perc Top25perc F_Undergrad P_Undergrad Outstate Room_Board Books Personal PhD Terminal S_F_Ratio perc_alumni Expend Grad_Rate
95 3847 3433 527 9 35 1010 12 9384 4840 600 500 22 47 14.3 20 7697 118
582 529 481 243 22 47 1206 134 4860 3122 600 650 103 88 17.4 16 6415 43

There is just one such observation for each variable. We can cap these values at 100.

In [ ]:
# Capping PhD and Grad_Rate at 100 for the two offending rows
data.loc[582,"PhD"]=100
data.loc[95,"Grad_Rate"]=100
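A quick sanity check (a sketch, not part of the original analysis) that the capping worked:

In [ ]:
# After capping, no rows should remain with percentage values above 100
print(data[(data.PhD > 100) | (data.Grad_Rate > 100)].empty)  # expect True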

Let's check the distribution and outliers for each column in the data¶

In [ ]:
cont_cols = list(data.columns)
for col in cont_cols:
    print(col)
    print('Skew :',round(data[col].skew(),2))
    plt.figure(figsize=(15,4))
    plt.subplot(1,2,1)
    data[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1,2,2)
    sns.boxplot(x=data[col])
    plt.show()
[For each column below, a histogram and a boxplot are displayed]
Apps
Skew : 3.72
Accept
Skew : 3.42
Enroll
Skew : 2.69
Top10perc
Skew : 1.41
Top25perc
Skew : 0.26
F_Undergrad
Skew : 2.61
P_Undergrad
Skew : 5.69
Outstate
Skew : 0.51
Room_Board
Skew : 0.48
Books
Skew : 3.49
Personal
Skew : 1.74
PhD
Skew : -0.77
Terminal
Skew : -0.82
S_F_Ratio
Skew : 0.67
perc_alumni
Skew : 0.61
Expend
Skew : 3.46
Grad_Rate
Skew : -0.14

Observations:

  • The distribution plots show that Apps, Accept, Enroll, Top10perc, F_Undergrad, P_Undergrad, Books, Personal, and Expend are highly right-skewed. It is evident from the boxplots that all these variables have outliers (a programmatic cross-check follows this list).
  • Top25perc is the only variable that does not have outliers.
  • Outstate, Room_Board, S_F_Ratio, and perc_alumni seem to have a moderate right skew.
  • PhD and Terminal are moderately left-skewed.
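A minimal sketch that sorts the columns by skewness to cross-check the observations above:

In [ ]:
# List all columns sorted by skewness, most right-skewed first;
# large positive values indicate a long right tail, negative values a left tail
data.skew().sort_values(ascending=False)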

Now, let's check the correlation among different variables.

In [ ]:
plt.figure(figsize=(15,10))
sns.heatmap(data.corr(), annot=True, fmt='0.2f')
plt.show()
[correlation heatmap of all variables with annotated coefficients]

Observations:

  • We can see high positive correlation among the following variables (these pairs are extracted programmatically in the sketch after this list):
    1. Apps and Accept
    2. Apps and Enroll
    3. Apps and F_Undergrad
    4. Accept and Enroll
    5. Accept and F_Undergrad
    6. Enroll and F_Undergrad
    7. Top10perc and Top25perc
    8. PhD and Terminal
  • We can see high negative correlation among the following variables:
    1. S_F_Ratio and Top10perc
    2. S_F_Ratio and Expend
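A minimal sketch that extracts these highly correlated pairs programmatically, using an arbitrary cutoff of |corr| > 0.7 to define "high" correlation:

In [ ]:
# Keep the upper triangle of the correlation matrix (k=1 excludes the diagonal),
# flatten it to a Series of (var1, var2) -> corr, and filter by magnitude
corr = data.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(key=abs, ascending=False)
print(pairs[pairs.abs() > 0.7])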

Scaling the data¶

In [ ]:
scaler=StandardScaler()
data_scaled=pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
In [ ]:
data_scaled.head()
Out[ ]:
Apps Accept Enroll Top10perc Top25perc F_Undergrad P_Undergrad Outstate Room_Board Books Personal PhD Terminal S_F_Ratio perc_alumni Expend Grad_Rate
0 -0.346882 -0.321205 -0.063509 -0.258583 -0.191827 -0.168116 -0.209207 -0.746356 -0.964905 -0.602312 1.270045 -0.162859 -0.115729 1.013776 -0.867574 -0.501910 -0.317993
1 -0.210884 -0.038703 -0.288584 -0.655656 -1.353911 -0.209788 0.244307 0.457496 1.909208 1.215880 0.235515 -2.676529 -3.378176 -0.477704 -0.544572 0.166110 -0.551805
2 -0.406866 -0.376318 -0.478121 -0.315307 -0.292878 -0.549565 -0.497090 0.201305 -0.554317 -0.905344 -0.259582 -1.205112 -0.931341 -0.300749 0.585935 -0.177290 -0.668710
3 -0.668261 -0.681682 -0.692427 1.840231 1.677612 -0.658079 -0.520752 0.626633 0.996791 -0.602312 -0.688173 1.185939 1.175657 -1.615274 1.151188 1.792851 -0.376446
4 -0.726176 -0.764555 -0.780735 -0.655656 -0.596031 -0.711924 0.009005 -0.716508 -0.216723 1.518912 0.235515 0.204995 -0.523535 -0.553542 -1.675079 0.241803 -2.948375
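As a quick sanity check (a sketch, not part of the original analysis), the z-scored data should have column means close to 0 and standard deviations close to 1:

In [ ]:
# Every column mean should be ~0 and every column std ~1 after z-scoring
# (pandas uses ddof=1, so the std is ~1 rather than exactly 1)
print(data_scaled.mean().abs().max())  # largest absolute column mean, ~0
print(data_scaled.std().mean())        # average column std, ~1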

Principal Component Analysis¶

In [ ]:
#Defining the number of principal components to generate
n=data_scaled.shape[1]

#Finding principal components for the data
pca = PCA(n_components=n, random_state=1)
data_pca1 = pd.DataFrame(pca.fit_transform(data_scaled))

#The percentage of variance explained by each principal component
exp_var = pca.explained_variance_ratio_
In [ ]:
# Visualize the cumulative explained variance across components
plt.figure(figsize = (10,10))
plt.plot(range(1, n+1), exp_var.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variance by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
Out[ ]:
Text(0, 0.5, 'Cumulative Explained Variance')
[line plot of cumulative explained variance vs. number of components]
In [ ]:
# Find the smallest number of components that together explain more than 70% of the variance
cum_var = 0
for ix, i in enumerate(exp_var):
    cum_var += i
    if cum_var > 0.70:
        print("Number of PCs that explain at least 70% variance: ", ix+1)
        break
Number of PCs that explain at least 70% variance:  4
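The same answer can be obtained with a NumPy one-liner; a sketch:

In [ ]:
# np.argmax returns the index of the first True value, i.e. the first position
# at which the cumulative explained variance crosses 70%; +1 converts to a count
print(np.argmax(np.cumsum(exp_var) > 0.70) + 1)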

Observations:

  • Out of the 17 original features, we reduced the dimensionality to 4 principal components, which together explain approximately 70% of the original variance.

  • That is about a 76% reduction in dimensionality at the cost of losing about 30% of the variance.

  • Let us now look at these principal components as a linear combination of original features.

In [ ]:
pc_comps = ['PC1','PC2','PC3','PC4']
data_pca = pd.DataFrame(np.round(pca.components_[:4,:],2),index=pc_comps,columns=data_scaled.columns)
data_pca.T
Out[ ]:
PC1 PC2 PC3 PC4
Apps 0.25 0.33 -0.06 0.28
Accept 0.21 0.37 -0.10 0.27
Enroll 0.18 0.40 -0.08 0.16
Top10perc 0.35 -0.08 0.03 -0.05
Top25perc 0.34 -0.04 -0.02 -0.11
F_Undergrad 0.15 0.42 -0.06 0.10
P_Undergrad 0.03 0.32 0.14 -0.16
Outstate 0.29 -0.25 0.05 0.13
Room_Board 0.25 -0.14 0.15 0.19
Books 0.06 0.06 0.68 0.08
Personal -0.04 0.22 0.50 -0.24
PhD 0.32 0.06 -0.13 -0.53
Terminal 0.32 0.05 -0.07 -0.52
S_F_Ratio -0.18 0.25 -0.29 -0.16
perc_alumni 0.21 -0.25 -0.15 0.02
Expend 0.32 -0.13 0.23 0.08
Grad_Rate 0.25 -0.17 -0.21 0.26

Observations:

  • Each principal component is a linear combination of the original (scaled) features. For example, using the weights above, we can write the equation for PC1 as follows (verified numerically in the sketch after this list):

0.25 * Apps + 0.21 * Accept + 0.18 * Enroll + 0.35 * Top10perc + 0.34 * Top25perc + 0.15 * F_Undergrad + 0.03 * P_Undergrad + 0.29 * Outstate + 0.25 * Room_Board + 0.06 * Books - 0.04 * Personal + 0.32 * PhD + 0.32 * Terminal - 0.18 * S_F_Ratio + 0.21 * perc_alumni + 0.32 * Expend + 0.25 * Grad_Rate

  • From a business standpoint, the first two principal components capture around 58% of the variability in the data, which is a considerable amount of variation.

  • Another way to interpret the components is to examine their weights. For example, we can consider weights with an absolute value greater than 0.25 to be significant and analyze each component accordingly.

NOTE: The decision regarding which weight values are high or significant may vary from case to case; it depends on the problem at hand.
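To confirm that each principal-component score is indeed such a linear combination, here is a minimal sketch that reconstructs the PC1 score of the first college by hand and compares it with the value computed by sklearn:

In [ ]:
# PCA centers the data before projecting; data_scaled is already centered,
# so pca.mean_ is ~0, but we subtract it anyway for an exact match
manual_pc1 = ((data_scaled.iloc[0] - pca.mean_) * pca.components_[0]).sum()
print(manual_pc1, data_pca1.iloc[0, 0])  # the two values should agree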

In [ ]:
def color_high(val):
    if val < -0.25:  # you can decide any value as per your understanding
        return 'background: pink'
    elif val > 0.25:
        return 'background: skyblue'

data_pca.T.style.applymap(color_high)
Out[ ]:
PC1 PC2 PC3 PC4
Apps 0.250000 0.330000 -0.060000 0.280000
Accept 0.210000 0.370000 -0.100000 0.270000
Enroll 0.180000 0.400000 -0.080000 0.160000
Top10perc 0.350000 -0.080000 0.030000 -0.050000
Top25perc 0.340000 -0.040000 -0.020000 -0.110000
F_Undergrad 0.150000 0.420000 -0.060000 0.100000
P_Undergrad 0.030000 0.320000 0.140000 -0.160000
Outstate 0.290000 -0.250000 0.050000 0.130000
Room_Board 0.250000 -0.140000 0.150000 0.190000
Books 0.060000 0.060000 0.680000 0.080000
Personal -0.040000 0.220000 0.500000 -0.240000
PhD 0.320000 0.060000 -0.130000 -0.530000
Terminal 0.320000 0.050000 -0.070000 -0.520000
S_F_Ratio -0.180000 0.250000 -0.290000 -0.160000
perc_alumni 0.210000 -0.250000 -0.150000 0.020000
Expend 0.320000 -0.130000 0.230000 0.080000
Grad_Rate 0.250000 -0.170000 -0.210000 0.260000

Observations:

  • The first principal component, PC1, is associated with high values of student quality (Top10perc, Top25perc), out-of-state tuition (Outstate), faculty education (PhD and Terminal), and instructional expenditure per student (Expend). This component seems to capture attributes that generally define premier colleges, with high-quality students entering them and highly accomplished faculty teaching there; they also seem to attract affluent students from all over the country (an informal check follows this list).
  • The second principal component, PC2, is associated with high numbers of applications received (Apps), accepted (Accept), and enrolled (Enroll), and high numbers of full-time (F_Undergrad) and part-time (P_Undergrad) students. This component seems to capture attributes that generally define non-premier colleges that are comparatively easier to get admitted into.
  • The third principal component, PC3, is related to financial aspects, i.e., personal spending (Personal) and the cost of books (Books) for a student. It is also associated with low values of the student/faculty ratio (S_F_Ratio).
  • The fourth principal component, PC4, is associated with low values of faculty education and high values of students' graduation rate. This component seems to capture attributes that define colleges which lack highly educated faculty (PhD and Terminal) but are comparatively easier to graduate from (Grad_Rate).
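As an informal check on the PC1 interpretation, we can look at the colleges with the highest PC1 scores. A sketch, assuming the original CSV (with the Names column we dropped earlier) is still available in the working directory:

In [ ]:
# Recover the Names column; its row order matches data_pca1, since no rows were dropped
names = pd.read_csv("Education_Post_12th_Standard.csv")["Names"]

# Colleges scoring highest on PC1 -- per the interpretation above, these should
# be selective, well-funded institutions
print(names.iloc[data_pca1[0].nlargest(5).index].to_list())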

We can also visualize the data in two dimensions using the first two principal components¶

In [ ]:
plt.figure(figsize = (7,7))
sns.scatterplot(x=data_pca1[0],y=data_pca1[1])
plt.xlabel("PC1")
plt.ylabel("PC2")
Out[ ]:
Text(0, 0.5, 'PC2')
[scatter plot of the data projected onto PC1 vs. PC2]
In [ ]:
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/My Drive/Colab Notebooks/Copy of FDS_Project_LearnerNotebook_FullCode.ipynb"