R PROJECT
White Wine Quality
ANLY 512 - Extra Credit
Shiqi Ning
5/9/2019
Introduction
This dataset is downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). The dataset is related to white variants of the Portuguese “Vinho Verde” wine.
In this problem, I’m going to classify wines into quality groups based on their physicochemical characteristics (e.g., acidity, sweetness, density, pH). Specifically, I group the quality scores into two classes: scores greater than or equal to 6 are labeled 1 (good quality) and scores below 6 are labeled 0 (poor quality). I will also perform feature selection (Lasso) to choose the most relevant characteristics and fit the model. In addition, I will split the data into a training set and a test set to check the performance of my models, and I will use Logistic Regression and Naive Bayes for the classification.
#Import libraries
require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
require(psych)
## Loading required package: psych
require(ROCR)
## Loading required package: ROCR
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
require(pROC)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:glmnet':
##
## auc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
require(e1071)
## Loading required package: e1071
Exploratory Analysis
# Load dataset
data = read.csv("winequality-white.csv", sep=";")
colnames(data) = c("fixed.acid", "vola.acid", "citric.acid", "sugar", "chlorides", "free.SO2", "tot.SO2", "density", "pH", "sulphates", "alcohol", "quality")
#head(data)
#str(data)
cat("Summary of the dataset:\n")
summary(data)
# Check the dimension of the data
cat("\n\nThe dataset contains", dim(data)[1], "attributes, and", dim(data)[2], "instances.\n")
# Check for na values
cat("\nThere is", sum(colSums(is.na(data))), "NA values in the dataset.")
# Table showing the counts for each score
cat("\n\nTable showing the counts for each quality score:\n")
table(data$quality)
# Assign quality score with binary representation
data$class = 0
data$class[data$quality >= 6] = 1
# Scale the predictors (exclude the quality and class columns)
scaled_data = scale(data[, -(ncol(data)-1): -ncol(data)])
# Plot boxplot
par(mfrow = c(1, 2))
boxplot(data[, -ncol(data)],
main = "Boxplots For Each Attribute",
ylab = "Value",
col = "lightsalmon1",
border = "lightskyblue4",
las = 2
)
boxplot(scaled_data,
main = "Boxplots For Scaled Attribute",
ylab = "Value",
col = "lightsalmon1",
border = "lightskyblue4",
las = 2
)
## Summary of the dataset:
## fixed.acid vola.acid citric.acid sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.SO2 tot.SO2 density
## Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917
## Median :0.04300 Median : 34.00 Median :134.0 Median :0.9937
## Mean :0.04577 Mean : 35.31 Mean :138.4 Mean :0.9940
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961
## Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.180 Median :0.4700 Median :10.40 Median :6.000
## Mean :3.188 Mean :0.4898 Mean :10.51 Mean :5.878
## 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :3.820 Max. :1.0800 Max. :14.20 Max. :9.000
##
##
## The dataset contains 4898 instances and 12 attributes.
##
## There are 0 NA values in the dataset.
##
## Table showing the counts for each quality score:
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
From the boxplots, we can see that the attributes free.SO2 and tot.SO2 take much larger values than the others, which could dominate the classification result, so I standardized the data. After scaling, the attributes are on a comparable scale. Next, I will use the Lasso to perform feature selection and compare the full model with the model chosen by the Lasso.
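As a quick, optional check (not part of the original analysis), the snippet below confirms that scale() applies the usual (x - mean) / sd standardization to one column; the helper name manual_free_SO2 is introduced only for this sketch.
# Optional sanity check: scale() standardizes each column as (x - mean) / sd
manual_free_SO2 = (data$free.SO2 - mean(data$free.SO2)) / sd(data$free.SO2)
all.equal(as.numeric(scaled_data[, "free.SO2"]), manual_free_SO2)  # should return TRUE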
Fit Logistic Regression Model
Prepare Training and Test Set
set.seed(810)
# Prepare dataset
scaled_df = data.frame(scaled_data, class = as.factor(data$class))
scaled_index = sample(1:nrow(scaled_df), nrow(scaled_df)*0.7)
train_scaled = scaled_df[scaled_index, ]
test_scaled = scaled_df[-scaled_index, ]
I use a Logistic Regression model because it is a predictive analysis tool that uses a logistic function to model a binary response variable, which suits my problem.
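To make the logistic function explicit, here is a minimal, self-contained illustration (the helper name logistic is introduced only for this sketch): it maps any value of the linear predictor into a probability between 0 and 1.
# Illustration only: the logistic (sigmoid) function used by logistic regression
logistic = function(eta) 1 / (1 + exp(-eta))
logistic(c(-3, 0, 3))  # approximately 0.047, 0.500, 0.953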
# Fit logistic regression model
fit_full = glm(class ~ ., data = train_scaled, family = binomial)
probs_full = predict(fit_full, test_scaled, type="response")
# Confusion matrix
pred_full = ifelse(probs_full > 0.6, 1, 0)
table_full = table(pred_full, test_scaled$class)
table_full
# Function calculating the prediction error from a 2x2 confusion matrix
predict_error = function(conf_mat){
  error = (conf_mat[1,2] + conf_mat[2,1]) / sum(conf_mat)
  return(error)
}
# Print our results
cat("\nThe prediction accuracy for Naive Bayes on test set is", 1 - predict_error(table_full))
cat("\nThe prediction error for Naive Bayes on test set is", predict_error(table_full))
# Plot the roc curve and calculate the auc
plot(roc(test_scaled$class, probs_full), print.auc = TRUE, col = "lightsalmon1")
##
## pred_full 0 1
## 0 329 198
## 1 158 785
##
## The prediction accuracy for the full logistic model on the test set is 0.7578231
## The prediction error for the full logistic model on the test set is 0.2421769
Feature Selection With Lasso
# Produce a matrix to the predictors
train_mat = model.matrix(class ~ ., data = train_scaled)
test_mat = model.matrix(class ~ ., data = test_scaled)
# Use CV to find the lasso lambda (the factor class is coerced to numeric codes,
# so this fits a Gaussian lasso that is used here for variable selection)
lasso_cv = cv.glmnet(train_mat, as.numeric(train_scaled$class), alpha = 1)
# Use the lambda chosen by the one-standard-error rule
lam = lasso_cv$lambda.1se
# Predict using test set
lasso_predict = predict(lasso_cv, s = lam, newx = test_mat)
cat("The test error obtained by Lasso is", mean((test_scaled$class - lasso_predict)^2), "\nLambda:", lam)
#plot(lasso_cv, label = TRUE, xvar = "lambda", lwd = 3)
lasso = glmnet(train_mat, as.numeric(train_scaled$class), alpha = 1, lambda = lam)
# Print non-zero coefficient estimates
predict(lasso, s = lam, type = "coefficients")
# Fit logistic regression model chosen by Lasso
fit_lasso = glm(class ~ fixed.acid + vola.acid + sugar + free.SO2 + sulphates + alcohol, data = train_scaled, family = binomial)
probs_lasso = predict(fit_lasso, test_scaled, type="response")
# Confusion matrix
pred_lasso = ifelse(probs_lasso > 0.6, 1, 0)
table_lasso = table(pred_lasso, test_scaled$class)
table_lasso
# Print our results
cat("\nThe prediction accuracy for Naive Bayes on test set is", 1 - predict_error(table_lasso))
cat("\nThe prediction error for Naive Bayes on test set is", predict_error(table_lasso))
# Plot the roc curve and calculate the auc
plot(roc(test_scaled$class, probs_lasso), print.auc = TRUE, col = "lightskyblue4")
lines(roc(test_scaled$class, probs_full), col = "lightsalmon1")
legend("bottomright", c("ROC with lasso model", "ROC with full model"), lty = 1, lwd = 2, col = c("lightskyblue4", "lightsalmon1", bty="n"))
## Lambda: 0.01910057
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.665746490
## (Intercept) .
## fixed.acid -0.011007662
## vola.acid -0.099176399
## citric.acid .
## sugar 0.025969478
## chlorides -0.004476576
## free.SO2 0.007972106
## tot.SO2 .
## density .
## pH .
## sulphates 0.007091731
## alcohol 0.177665887
##
## pred_lasso 0 1
## 0 328 199
## 1 159 784
##
## The prediction accuracy for the Lasso-selected logistic model on the test set is 0.7564626
## The prediction error for the Lasso-selected logistic model on the test set is 0.2435374
Since I want to check whether certain variables contribute significantly more to the quality group, I perform Lasso with cross-validation to choose an appropriate lambda value and to identify important variables. The Lasso can produce a simpler, more interpretable model that involves only a subset of the predictors, and may lead to better prediction accuracy. From the results above, the Lasso kept fixed.acid, vola.acid, sugar, chlorides, free.SO2, sulphates, and alcohol; chlorides was dropped from the refitted model because its coefficient is essentially zero. However, after fitting the logistic regression using only the selected variables, the Lasso model's accuracy and AUC on the test set (accuracy: 0.7564626, AUC: 0.803) are both slightly lower than those of the full model (accuracy: 0.7578231, AUC: 0.808). This could indicate that all the wine characteristics (variables) contribute to the classification, and thus the full model is chosen.
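As a side note, cv.glmnet was run above on the numeric class codes (Gaussian family). A hedged alternative sketch, not run in this report, would use the binomial family so that the Lasso optimizes the classification deviance directly; the object names lasso_cv_bin and probs_lasso_bin are hypothetical.
# Alternative sketch (not run here): Lasso cross-validation with the binomial family
lasso_cv_bin = cv.glmnet(train_mat, train_scaled$class, family = "binomial", alpha = 1)
# Predicted class probabilities on the test set at the 1-SE lambda
probs_lasso_bin = predict(lasso_cv_bin, newx = test_mat, s = "lambda.1se", type = "response")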
# Print out model summary
summary(fit_full)
##
## Call:
## glm(formula = class ~ ., family = binomial, data = train_scaled)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1233 -0.9029 0.4485 0.8033 2.5262
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.918857 0.045279 20.293 < 2e-16 ***
## fixed.acid -0.015371 0.075404 -0.204 0.83848
## vola.acid -0.653935 0.049934 -13.096 < 2e-16 ***
## citric.acid -0.008387 0.043841 -0.191 0.84828
## sugar 0.851144 0.169147 5.032 4.85e-07 ***
## chlorides -0.014952 0.044004 -0.340 0.73402
## free.SO2 0.165203 0.056508 2.924 0.00346 **
## tot.SO2 -0.066279 0.061282 -1.082 0.27946
## density -0.788205 0.265663 -2.967 0.00301 **
## pH 0.122887 0.066195 1.856 0.06339 .
## sulphates 0.234242 0.049503 4.732 2.22e-06 ***
## alcohol 0.889148 0.140802 6.315 2.70e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4378.1 on 3427 degrees of freedom
## Residual deviance: 3470.2 on 3416 degrees of freedom
## AIC: 3494.2
##
## Number of Fisher Scoring iterations: 5
From the full logistic regression model, the prediction accuracy is 75.78% and the area under the ROC curve is 0.808, which is reasonably good, since the model correctly classifies more than 75% of the test observations.
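If the AUC is needed as a number rather than read off the plot, pROC's auc() can be applied to the same ROC object (a small optional check using the objects defined above).
# Optional: extract the AUC of the full model as a numeric value
auc(roc(test_scaled$class, probs_full))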
Naive Bayes Classifier
Source: https://www.rdocumentation.org/packages/e1071/versions/1.7-1/topics/naiveBayes
The reason I chose the Naive Bayes Classifier is that it uses Bayes' Theorem and conditional probabilities to separate the classes. Naive Bayes is said to have higher bias but lower variance than logistic regression, so if its conditional independence assumption roughly holds for the data, it can be the better classifier. That is why I include it for comparison.
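To illustrate the idea, here is a tiny worked example of Bayes' theorem with a single feature, using the a-priori probabilities and the alcohol conditional mean/sd printed by the fitted model below (values rounded); the standardized alcohol value x = 0.5 is chosen purely for illustration.
# Worked example: posterior for class 1 given one standardized feature (alcohol)
prior = c("0" = 0.336, "1" = 0.664)                   # rounded a-priori probabilities
x = 0.5                                               # hypothetical standardized alcohol value
lik = c("0" = dnorm(x, mean = -0.531, sd = 0.709),    # rounded conditional mean/sd for class 0
        "1" = dnorm(x, mean =  0.260, sd = 1.008))    # rounded conditional mean/sd for class 1
posterior = prior * lik / sum(prior * lik)            # Bayes' theorem
posterior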
# Check for correlation
pairs.panels(data[,-(12:13)])
# Fit Naive Bayes model
nb = naiveBayes(class ~ . - density - tot.SO2, data = train_scaled)
nb
# Predict on test data
pred_nb = predict(nb, test_scaled[, -12], type = "class")
table_nb <- table(pred_nb, test_scaled$class)
table_nb
# Print our results
cat("\nThe prediction accuracy for Naive Bayes on test set is", 1 - predict_error(table_nb))
cat("\nThe prediction error for Naive Bayes on test set is", predict_error(table_nb))
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.3363477 0.6636523
##
## Conditional probabilities:
## fixed.acid
## Y [,1] [,2]
## 0 0.11894367 1.0876059
## 1 -0.06771118 0.9349593
##
## vola.acid
## Y [,1] [,2]
## 0 0.3317783 1.1561622
## 1 -0.1536519 0.8760109
##
## citric.acid
## Y [,1] [,2]
## 0 -0.02431495 1.192906
## 1 -0.02046955 0.909020
##
## sugar
## Y [,1] [,2]
## 0 0.11201939 1.0442663
## 1 -0.06324405 0.9455623
##
## chlorides
## Y [,1] [,2]
## 0 0.2709917 1.2434350
## 1 -0.1313394 0.8237788
##
## free.SO2
## Y [,1] [,2]
## 0 -0.011562001 1.2182192
## 1 0.002768244 0.8965903
##
## tot.SO2
## Y [,1] [,2]
## 0 0.2344816 1.124020
## 1 -0.1199278 0.920727
##
## density
## Y [,1] [,2]
## 0 0.3608480 0.8691990
## 1 -0.1825124 0.9695631
##
## pH
## Y [,1] [,2]
## 0 -0.10081024 0.9905222
## 1 0.06405155 1.0258994
##
## sulphates
## Y [,1] [,2]
## 0 -0.08179715 0.8859231
## 1 0.03492716 1.0354279
##
## alcohol
## Y [,1] [,2]
## 0 -0.5307048 0.7089245
## 1 0.2603359 1.0079970
##
##
## pred_nb 0 1
## 0 265 212
## 1 222 771
##
## The prediction accuracy for Naive Bayes on test set is 0.7047619
## The prediction error for Naive Bayes on test set is 0.2952381
From the scatterplot matrix showing the potential correlations, I found that sugar and density, as well as free.SO2 and tot.SO2, seem to have relatively strong correlations, so I removed density and tot.SO2; the Lasso also indicated them as non-important variables. From the prediction results, we can see that the prediction accuracy of the Naive Bayes Classifier is 0.7047619, which is lower than that of the Logistic Regression (0.7578231). Thus, the Logistic Regression model with all the variables has the better performance.
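The two correlations mentioned above can also be checked numerically; this is an optional snippet not run in the original report.
# Optional: correlations of the two pairs flagged in the scatterplot matrix
cor(data$sugar, data$density)
cor(data$free.SO2, data$tot.SO2)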
Conclusion
For this problem, which is to predict the white wine quality group, the quality variable was transformed into binary data, with quality scores >= 6 set to 1 and scores < 6 set to 0. For better and more accurate results, the data were scaled to have mean 0 and standard deviation 1. Comparing the logistic model containing all variables (accuracy: 75.78%) with the model chosen by the Lasso, the former has a slightly higher accuracy and AUC on the test set, indicating better performance. The Naive Bayes Classifier was then used for classification; however, its result (accuracy: 70.48%) does not outperform the logistic model containing all variables. I would say a prediction accuracy of about 75% is quite good for a classification problem of this kind.
Open questions: 1. Some quality scores take extreme values; whether these extreme observations greatly affect the prediction result is unknown and can be examined in future work. 2. Besides the variables in this dataset, I also wonder whether the year, producing area, or company of the white wine affects the quality score. 3. I also wonder how other machine learning algorithms, both supervised (e.g., Random Forest, Bagging) and unsupervised (e.g., K-means or hierarchical clustering), would perform in terms of prediction accuracy.