MINI PROJECT
Find the Most Important Features
Random Forest
The goal of this mini project is to investigate how much weight factors such as GPA, GRE, major, gender, and domestic or international status carry in the admission process, so a random forest was fitted. A random forest can rank the features in a dataset by importance. It also tends to produce an accurate classifier while reducing the risk of overfitting compared with a single decision tree.
The OfferOfAdmissionExtended attribute is treated as the response, since it reflects how the admissions office values its candidates. The predictors are GPA, GRE, Major, Dom_Int, Gender, TOEFLcut and Rank. The feature importance plots below show Mean Decrease Accuracy and Mean Decrease Gini.
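To make the two importance measures concrete, here is a minimal sketch in plain Python. It is not the code used for this project: the toy records, the `predict` rule and all function names are hypothetical, and a real random forest averages these quantities over many trees (typically using out-of-bag samples for the accuracy measure). The Gini part computes the impurity decrease of a single split, which is the building block summed up by Mean Decrease Gini. The permutation part shuffles one column and measures the drop in accuracy, which is the idea behind Mean Decrease Accuracy.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(labels, left_mask):
    """Impurity decrease when labels are split into two groups by a boolean mask."""
    left = [y for y, m in zip(labels, left_mask) if m]
    right = [y for y, m in zip(labels, left_mask) if not m]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

def accuracy(predict, rows, labels):
    """Share of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_decrease(predict, rows, labels, col, rng):
    """Drop in accuracy after shuffling column `col` across the rows."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)

# Hypothetical toy records (not the project data): 1 = offer, 0 = no offer.
offers = [1, 1, 1, 0, 0, 0, 1, 0]
gpa = [3.9, 3.8, 3.7, 3.0, 2.9, 3.1, 3.6, 3.2]
gender = ["F", "M", "F", "M", "F", "M", "M", "F"]
rows = list(zip(gpa, gender))

# Gini decrease of one candidate split per feature.
gpa_gain = gini_decrease(offers, [g >= 3.5 for g in gpa])
gender_gain = gini_decrease(offers, [g == "F" for g in gender])

# Stand-in classifier (an assumption, not a fitted forest): offer if GPA >= 3.5.
def predict(row):
    return int(row[0] >= 3.5)

rng = random.Random(0)
gpa_drop = permutation_decrease(predict, rows, offers, 0, rng)
gender_drop = permutation_decrease(predict, rows, offers, 1, rng)

print(gpa_gain, gender_gain, gpa_drop, gender_drop)
```

In this toy data the GPA split separates the two classes perfectly while the gender split does not separate them at all, so GPA gets the full impurity decrease and gender gets none. Shuffling gender cannot change the predictions of a rule that ignores it, so its accuracy decrease is zero.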
Logistic Regression Model
A logistic regression model is a predictive tool that uses the logistic function to model a binary response, which suits the purpose of this mini project. The data is split into a training set (70%) and a test set (30%) to check the performance of the model. The ROC curve for the test set is shown below:
- The plot on the left panel is the resulting ROC curve using the predictors GPA, Major and TOEFLcut.
- The AUC is 0.761, which is not a bad prediction result.
- The prediction accuracy rate calculated from the confusion matrix is 0.7807.
- The prediction error rate calculated from the confusion matrix is 0.2193.
- A prediction accuracy rate of 78% is quite good for a classification problem of this kind.
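The evaluation steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code behind the reported numbers: the labels, the predicted probabilities, the 0.5 threshold and all function names are hypothetical. It shows the logistic function, a 70/30 split, the AUC computed as the share of (positive, negative) pairs ranked correctly, and the accuracy and error rates read off a confusion matrix.

```python
import math
import random

def sigmoid(z):
    """Logistic function: maps a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def split_train_test(rows, test_fraction, rng):
    """Shuffle a copy of the rows and hold out `test_fraction` of them for testing."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_counts(labels, scores, threshold=0.5):
    """Return (TP, FP, TN, FN) when predicting 1 for scores at or above the threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# 70/30 split of ten hypothetical record ids.
train, test = split_train_test(list(range(10)), 0.3, random.Random(0))

# Hypothetical test-set labels and predicted probabilities (not the project data).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)
tp, fp, tn, fn = confusion_counts(labels, scores)
accuracy = (tp + tn) / len(labels)
error_rate = 1.0 - accuracy

print(len(train), len(test), auc, accuracy, error_rate)
```

The error rate is simply one minus the accuracy rate, which matches the reported pair 0.7807 and 0.2193.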