Exploratory Data Analysis

MINI PROJECT

Exploratory Data Analysis

About the Dataset

The dataset is a synthetic dataset that emulates student admission data.
The dataset has 381 rows and 12 columns.
The dataset contains the following attributes:
- ID: the ID of each student (int)
- OfferOfAdmissionExtended: whether the school extended an offer of admission to the student (Factor: “YES”/“NO”)
- GPA: undergraduate GPA (numeric)
- GRE: Graduate Record Examinations score (int)
- TOEFL: TOEFL score (int)
- Major: undergraduate major (Factor with 10 levels: “Accounting”, “CS”, “Statistics”, “Engineering”, etc.)
- CollegeRegion: college region (Factor with 11 levels: “Canada”, “China”, “USA”, “Spain”, etc.)
- CollegeName: college/undergraduate institution name (Factor with 66 levels: “Arizona”, “Berkeley”, etc.)
- State: state where the institution is located (Factor with 32 levels: “Arizona”, “California”, etc.)
- Dom_Int: whether the student is domestic or international (Factor: “YES”/“NO”)
- Matriculating: whether the student is matriculating at the institution (Factor: “YES”/“NO”)
- Gender: male or female (Factor: “Male”/“Female”)
Three attributes contain NA values:
- GRE: 4 NAs
- TOEFL: 252 NAs
- State: 156 NAs
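
The NA counts above can be reproduced with a simple per-column tally. The following is a minimal sketch, not the original analysis code: the rows are made-up toy records, `None` stands in for NA, and the helper name `count_missing` is hypothetical.

```python
# Toy rows standing in for the real data (made-up values); None marks an NA.
rows = [
    {"GRE": 165, "TOEFL": None, "State": "California"},
    {"GRE": None, "TOEFL": 105, "State": None},
    {"GRE": 158, "TOEFL": None, "State": "Arizona"},
]


def count_missing(records):
    """Return {column: number of None entries} for columns with at least one None."""
    counts = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                counts[column] = counts.get(column, 0) + 1
    return counts


na_counts = count_missing(rows)
print(na_counts)
```

Run on the full dataset, a tally like this should report the three attributes listed above and no others.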

Exploratory Visualizations

To examine the dataset further, several exploratory visualizations were made to investigate possible relationships among the attributes.

Distributions of GPA

The above plot shows how undergraduate GPAs are distributed in the two groups: Offer of Admission Extended and Offer of Admission Not Extended.
From the plot, it is clear that candidates who were extended an offer tend to have higher GPAs than those who were not.
In addition, the distribution for the group that received an offer is more left-skewed: most candidates in this group are concentrated on the right side of the distribution, at higher GPAs, and the mean GPA for this group is 3.63. The other group's GPA is closer to normally distributed, with most of its mass in the middle, and its mean GPA is 3.30.
The difference between the two groups' GPAs suggests that undergraduate GPA could be one of the strong determinants of whether a student stands out among competitive peers, so it will be used in further analysis.
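
The group means and the direction of skew described above can be checked numerically. This is a small sketch under stated assumptions: the GPA values are made up for illustration (they are not the project data), and `skewness` is a hypothetical helper computing the population skewness, where a negative value means a longer left tail.

```python
from statistics import mean


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Made-up (offer, GPA) pairs, for illustration only.
records = [("YES", 3.9), ("YES", 3.8), ("YES", 3.7), ("YES", 3.0),
           ("NO", 3.0), ("NO", 3.3), ("NO", 3.6)]

# Group the GPAs by offer status.
groups = {}
for offer, gpa in records:
    groups.setdefault(offer, []).append(gpa)

summary = {offer: (mean(gpas), skewness(gpas)) for offer, gpas in groups.items()}
for offer, (avg, skew) in summary.items():
    print(f"{offer}: mean GPA = {avg:.2f}, skewness = {skew:.2f}")
```

In this toy example the "YES" group has negative skewness (left-skewed) while the "NO" group is symmetric.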

GPA vs. GRE

The above plot is a 2-dimensional density plot showing the relationship between undergraduate GPA and GRE scores.
In theory, one would expect a relationship between GPA and GRE, since higher GPAs normally correspond to higher GRE scores. However, the plot shows no strong correlation between the two attributes: the calculated correlation is 0.2284, which indicates only a weak positive relationship.
Most GPAs fall in the range of 3.5 to 3.8, and most GRE scores fall between 160 and 175, which seems reasonable, since both ranges are in line with the cutoffs commonly set by admission offices.
A close look at the plot reveals an outlier with GPA = 2.58 and GRE = 50. Although there is no information about which section of the GRE this score corresponds to, such an extremely low observation might affect the final result, so it will be removed.
Since GRE score is only weakly correlated with GPA, it is not redundant with GPA, and it will also be used as a potential indicator for selecting the stronger candidates.
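
A correlation such as the one quoted above is typically the Pearson coefficient; the source does not say which coefficient was used, so that is an assumption here. The sketch below computes it from scratch and shows how a single extreme observation can be filtered out first. Apart from the outlier (GPA = 2.58, GRE = 50), the data points are made up, and the cutoff of 100 is an arbitrary illustrative threshold.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# (GPA, GRE) pairs: made-up values plus the outlier named in the text.
pairs = [(3.5, 166), (3.6, 161), (3.7, 168), (3.8, 163), (2.58, 50)]

# Drop implausibly low GRE scores (the threshold 100 is an assumption).
cleaned = [(gpa, gre) for gpa, gre in pairs if gre >= 100]

r_all = pearson([p[0] for p in pairs], [p[1] for p in pairs])
r_clean = pearson([p[0] for p in cleaned], [p[1] for p in cleaned])
print(f"with outlier: {r_all:.3f}, without outlier: {r_clean:.3f}")
```

Comparing the two values shows how strongly one extreme point can inflate the apparent correlation, which is the motivation for removing it.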

Average GPA vs. Major

The plot in the left panel shows the average GPA within each major for the two groups: Offer of Admission Extended and Offer of Admission Not Extended.
The plot shows that different majors have different average GPAs. This observation raises questions such as whether some majors are preferred during the admission process, or whether candidates who share the same undergraduate major also share similar hidden traits.
This suggests that undergraduate Major could also be a potential factor.

Distributions of GRE

- The plot in the left panel shows the distribution of GRE scores for the two groups.
- At first glance, there is no large difference between the two distributions. For the offer-extended group, the peak is slightly higher than that of the non-extended group.
- The non-extended group has a longer left tail, which indicates more candidates with lower GRE scores.

Gender vs. Offer Extended or Not

	NO	YES
Female	57	153
Male	56	115

Domestic/International vs. Offer Extended or Not

	NO	YES
Domestic	54	170
International	59	98

From the two tables above, it is hard to conclude whether gender or domestic/international status has a strong impact on separating students, since the data is somewhat imbalanced.
Because there are more female applicants than male and more domestic applicants than international, further investigation is needed to draw a conclusion.
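
Because the group sizes differ, raw counts are easier to compare once converted to within-group offer rates. The sketch below does that arithmetic using the counts from the two tables above; the variable names are illustrative.

```python
# Counts copied from the two contingency tables above.
counts = {
    "Female": {"NO": 57, "YES": 153},
    "Male": {"NO": 56, "YES": 115},
    "Domestic": {"NO": 54, "YES": 170},
    "International": {"NO": 59, "YES": 98},
}

# Share of each group that was extended an offer.
rates = {
    group: cell["YES"] / (cell["YES"] + cell["NO"])
    for group, cell in counts.items()
}

for group, rate in rates.items():
    total = counts[group]["YES"] + counts[group]["NO"]
    print(f"{group}: {rate:.1%} of {total} applicants")
```

The rates come out at roughly 73% for female and 67% for male applicants, and roughly 76% for domestic and 62% for international applicants. These are descriptive figures only; a formal test (for example a chi-squared test of independence) would be needed before reading anything into the gaps.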

Data Preparation

Data Cleaning

As mentioned above, the dataset contains both NA values and an outlier, which need to be dealt with. The following methods were used:

GRE: since GRE could be an important measurement and the number of observations is limited, simply removing the rows with NA values is not advisable, so the NA values in GRE were replaced with the mean GRE.
TOEFL: the TOEFL score is an important measure of whether an international student satisfies the basic requirements for English communication. A careful examination of the dataset shows that the NA values in TOEFL most likely come from domestic students, or from international students who attended US-based undergraduate institutions and are therefore not required to submit a TOEFL score. Thus, the column was regenerated as TOEFLcut, where NA values were replaced with “waived” and the remaining scores were binned into 4 groups: 110 ~ 120 labeled as 1, 100 ~ 110 labeled as 2, 90 ~ 100 labeled as 3, and scores below 90 labeled as 4.
State: the NA values in State most likely belong to students who did not attend a US-based university/college. Since almost half of the State values are NA (156 of 381) and the column does not seem to contribute much to our goal, it was removed.
Outlier: the observation containing the outlier (GPA = 2.58, GRE = 50) was removed.
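
The GRE and TOEFL steps above can be sketched as two small functions. This is an illustration rather than the original code: the function names are hypothetical, `None` stands in for NA, and because the stated bins share their endpoints (a score of 110 appears in two ranges), the sketch assumes each boundary score belongs to the higher bin.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def toefl_cut(score):
    """Map a TOEFL score to its TOEFLcut label; None (NA) becomes 'waived'.

    Assumption: boundary scores (110, 100, 90) fall into the higher bin.
    """
    if score is None:
        return "waived"
    if score >= 110:
        return "1"
    if score >= 100:
        return "2"
    if score >= 90:
        return "3"
    return "4"


gre_filled = impute_mean([160, None, 170])
toefl_labels = [toefl_cut(s) for s in [None, 115, 104, 92, 85]]
print(gre_filled)
print(toefl_labels)
```

Labels are kept as strings so that “waived” and the numbered bins live in one categorical column.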

Feature Generation

Since university/college ranking is always one of the factors that students take into consideration when applying to schools, admission officers may likewise favor candidates whose undergraduate institutions are ranked higher.
Thus, a new feature Rank was generated by merging the ranking of each university/college into the original dataset. The rankings, collected from USNews, are only approximate: CollegeName is not clear for some observations, so observations with ambiguous names were assigned the mean ranking.
Then, all the rankings were binned into 5 groups: institutions ranked 1 ~ 20 were assigned 1, 20 ~ 40 were assigned 2, 40 ~ 60 were assigned 3, 60 ~ 80 were assigned 4, and institutions ranked lower than 80 were assigned 5.
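
The ranking lookup and binning described above can be sketched as follows. Everything specific here is an assumption for illustration: the college names and rank values are placeholders (not USNews data), the helper names are hypothetical, and since the stated bins overlap at 20, 40, 60 and 80, the sketch assumes each boundary rank belongs to the better (lower-numbered) bin.

```python
# Placeholder rankings; the real values would come from the ranking source.
rankings = {"College A": 5, "College B": 35, "College C": 95}

# Fallback for ambiguous or unknown names: the mean of the known rankings.
mean_rank = sum(rankings.values()) / len(rankings)


def lookup_rank(college_name):
    """Return the college's ranking, or the mean ranking if the name is unknown."""
    return rankings.get(college_name, mean_rank)


def rank_bin(rank):
    """Map a ranking to a group from 1 (top 20) to 5 (beyond 80)."""
    if rank <= 20:
        return 1
    if rank <= 40:
        return 2
    if rank <= 60:
        return 3
    if rank <= 80:
        return 4
    return 5


binned = {name: rank_bin(lookup_rank(name))
          for name in ["College A", "College B", "College C", "Unclear name"]}
print(binned)
```

Note that assigning the mean ranking to unclear names pushes those observations into a middle bin, which may blur the feature if many names are ambiguous.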

Cluster Analysis for Major

As mentioned above, each Major group may carry underlying information, so a cluster analysis was performed to check whether candidates with the same Major would be clustered together. The data contains both numerical and categorical attributes, so hierarchical clustering with method = “complete” was chosen, which can better handle this situation.

Since Major is the response of interest, only a subset of attributes was used to perform the cluster analysis:

- GPA
- GRE
- Gender
- TOEFLcut

Below is the visualization of the hierarchical clustering result when the tree was cut into 10 clusters (one per Major):

In the above dendrogram, the individual nodes are not very clear, but the color separation shows no convincing evidence for the assumption that candidates who share the same major have similar academic and standardized test performance, such as GPA and GRE scores.
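
To make the clustering step concrete, here is a minimal from-scratch sketch of complete-linkage hierarchical clustering on mixed numerical and categorical attributes. The source does not state which dissimilarity measure was used; the Gower dissimilarity below is an assumption, chosen because it handles both attribute types. The records are made up, the function names are hypothetical, and the naive merge loop is only suitable for small inputs.

```python
def gower(a, b, numeric_ranges):
    """Gower dissimilarity: range-scaled gap for numeric attributes, 0/1 mismatch otherwise."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def complete_linkage(records, k, numeric_ranges):
    """Merge clusters until k remain; cluster distance is the largest pairwise dissimilarity."""
    clusters = [[i] for i in range(len(records))]

    def dist(c1, c2):
        return max(gower(records[i], records[j], numeric_ranges)
                   for i in c1 for j in c2)

    while len(clusters) > k:
        # Find the closest pair of clusters under complete linkage.
        _, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Made-up records using the four attributes listed above.
records = [
    {"GPA": 3.9, "GRE": 168, "Gender": "Female", "TOEFLcut": "waived"},
    {"GPA": 3.8, "GRE": 167, "Gender": "Female", "TOEFLcut": "waived"},
    {"GPA": 3.0, "GRE": 150, "Gender": "Male", "TOEFLcut": "3"},
    {"GPA": 3.1, "GRE": 152, "Gender": "Male", "TOEFLcut": "3"},
]
numeric_ranges = {
    key: max(r[key] for r in records) - min(r[key] for r in records)
    for key in ("GPA", "GRE")
}

clusters = complete_linkage(records, k=2, numeric_ranges=numeric_ranges)
print(clusters)
```

With real data, the resulting cluster labels would then be cross-tabulated against Major; if majors were spread evenly across clusters, that would match the conclusion drawn from the dendrogram.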