R PROJECT

White Wine Quality

ANLY 512 - Extra Credit

Introduction

This dataset is downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). The dataset is related to white variants of the Portuguese “Vinho Verde” wine.

In this problem, I’m going to classify different wine into different quality groups with different characteristics of wine (eg. acidity, sweetness, density, pH, etc.). Specifically, I’m going to group the quality score into two groups, with the quality score greater or equal to 6 being classified as 1: good quality and the rest quality score below 6 as 0: poor quality. I will also perform feature selections (Lasso) to choose the most relavent characteristics and fit the model. In addition, I will split the data into training and test set to check the performance of my model. And I will use Logistic Regression and Naive Bayes for the classification.

#Import libraries
require(glmnet)

## Loading required package: glmnet

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-16

require(psych)

## Loading required package: psych

require(ROCR)

## Loading required package: ROCR

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

require(pROC)

## Loading required package: pROC

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following object is masked from 'package:glmnet':
## 
##     auc

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

require(e1071)

## Loading required package: e1071

Exploratory Analysis

# Load dataset
data = read.csv("winequality-white.csv", sep=";")
colnames(data) = c("fixed.acid", "vola.acid", "citric.acid", "sugar", "chlorides", "free.SO2", "tot.SO2", "density", "pH", "sulphates", "alcohol", "quality")
#head(data)
#str(data)
cat("Summary of the dataset:\n")
summary(data)

# Check the dimension of the data
cat("\n\nThe dataset contains", dim(data)[1], "attributes, and", dim(data)[2], "instances.\n")

# Check for na values
cat("\nThere is", sum(colSums(is.na(data))), "NA values in the dataset.")

# Table showing the counts for each score
cat("\n\nTable showing the counts for each quality score:\n")
table(data$quality)

# Assign quality score with binary representation
data$class = 0
data$class[data$quality >= 6] = 1
data$class = data$class

# Scale the dataset for ML
scaled_data = scale(data[, -(ncol(data)-1): -ncol(data)])

# Plot boxplot
par(mfrow = c(1, 2))
boxplot(data[, -ncol(data)],
  main = "Boxplots For Each Attribute",
  ylab = "Value",
  col = "lightsalmon1",
  border = "lightskyblue4",
  las = 2
  )
boxplot(scaled_data,
  main = "Boxplots For Scaled Attribute",
  ylab = "Value",
  col = "lightsalmon1",
  border = "lightskyblue4",
  las = 2
  )

## Summary of the dataset:
##    fixed.acid       vola.acid       citric.acid         sugar       
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides          free.SO2         tot.SO2         density      
##  Min.   :0.00900   Min.   :  2.00   Min.   :  9.0   Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00   1st Qu.:108.0   1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00   Median :134.0   Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31   Mean   :138.4   Mean   :0.9940  
##  3rd Qu.:0.05000   3rd Qu.: 46.00   3rd Qu.:167.0   3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00   Max.   :440.0   Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000  
## 
## 
## The dataset contains 4898 attributes, and 12 instances.
## 
## There is 0 NA values in the dataset.
## 
## Table showing the counts for each quality score:
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

From the boxplots, we can see that the attibutes free.SO2 and tot.SO2 have higher values than others, which may further affect the classification result, so I did a normalization on the data. After normalizing, the data seems to have the same level. Then, I will use Lasso to perform feature selection and compare the full model with the model chosen by Lasso.

Fit Logistic Regression Model

Prepare Training and Test Set

set.seed(810)
# Prepare dataset
scaled_df = data.frame(scaled_data, class = as.factor(data$class))
scaled_index = sample(1:nrow(scaled_df), nrow(scaled_df)*0.7)
train_scaled = scaled_df[scaled_index, ]
test_scaled = scaled_df[-scaled_index, ]

I use Logistic Regression model because, it is a predictive analysis tool that uses a logistic function to model a binary variable, which is suitable for my problem.

# Fit logistic regression model
fit_full = glm(class  ~ ., data = train_scaled, family = binomial)
probs_full = predict(fit_full, test_scaled, type="response")

# Confusion matrix
pred_full = ifelse(probs_full > 0.6, 1, 0)
table_full = table(pred_full, test_scaled$class)
table_full

# Function calculating the prediction error
predict_error = function(conf_mat){
  error = (conf_mat[1,2] + conf_mat[2,1])/(conf_mat[1,1] + conf_mat[1,2] + conf_mat[2,1] + conf_mat[2,2])
  accuracy = (conf_mat[1,1] + conf_mat[2,2])/(conf_mat[1,1] + conf_mat[1,2] + conf_mat[2,1] + conf_mat[2,2])
  return(error)
}
# Print our results
cat("\nThe prediction accuracy for Naive Bayes on test set is", 1 - predict_error(table_full))
cat("\nThe prediction error for Naive Bayes on test set is", predict_error(table_full))


# Plot the roc curve and calculate the auc
plot(roc(test_scaled$class, probs_full), print.auc = TRUE, col = "lightsalmon1")

##          
## pred_full   0   1
##         0 329 198
##         1 158 785
## 
## The prediction accuracy for Naive Bayes on test set is 0.7578231
## The prediction error for Naive Bayes on test set is 0.2421769

Feature Selection With Lasso

# Produce a matrix to the predictors
train_mat = model.matrix(class  ~ ., data = train_scaled)
test_mat = model.matrix(class  ~ ., data = test_scaled)

# Use CV to find lambda of lasso
lasso_cv = cv.glmnet(train_mat, as.numeric(train_scaled$class) , alpha = 1)
# Find the min lambda
lam = lasso_cv$lambda.1se

# Predict using test set
lasso_predict = predict(lasso_cv, s = lam, newx = test_mat)
cat("The test error obtained by Lasso is", mean((test_scaled$class - lasso_predict)^2), "\nLambda:", lam)

## Warning in Ops.factor(test_scaled$class, lasso_predict): '-' not meaningful
## for factors

#plot(lasso_cv, label = TRUE, xvar = "lambda", lwd = 3)
lasso = glmnet(train_mat, as.numeric(train_scaled$class), alpha = 1, lambda = lam)

# Print non-zero coefficient estimates
predict(lasso, s = lam, type = "coefficients")

# Fit logistic regression model chosen by Lasso
fit_lasso = glm(class ~ fixed.acid + vola.acid + sugar + free.SO2 + sulphates + alcohol, data = train_scaled, family = binomial)
probs_lasso = predict(fit_lasso, test_scaled, type="response")

# Confusion matrix
pred_lasso = ifelse(probs_lasso > 0.6, 1, 0)
table_lasso = table(pred_lasso, test_scaled$class)
table_lasso

# Print our results
cat("\nThe prediction accuracy for Naive Bayes on test set is", 1 - predict_error(table_lasso))
cat("\nThe prediction error for Naive Bayes on test set is", predict_error(table_lasso))


# Plot the roc curve and calculate the auc
plot(roc(test_scaled$class, probs_lasso), print.auc = TRUE, col = "lightskyblue4")
lines(roc(test_scaled$class, probs_full), col = "lightsalmon1")
legend("bottomright", c("ROC with lasso model", "ROC with full model"), lty = 1, lwd = 2, col = c("lightskyblue4", "lightsalmon1", bty="n"))

## The test error obtained by Lasso is NA 
## Lambda: 0.0191005713 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  1.665746490
## (Intercept)  .          
## fixed.acid  -0.011007662
## vola.acid   -0.099176399
## citric.acid  .          
## sugar        0.025969478
## chlorides   -0.004476576
## free.SO2     0.007972106
## tot.SO2      .          
## density      .          
## pH           .          
## sulphates    0.007091731
## alcohol      0.177665887
##           
## pred_lasso   0   1
##          0 328 199
##          1 159 784
## 
## The prediction accuracy for Naive Bayes on test set is 0.7564626
## The prediction error for Naive Bayes on test set is 0.2435374

Since I want to check if there exists certain variables that conttributes significantly more to the score quality group, I choose to perform Lasso cross validation to choose an appropiate lambda value and to identift important varibles. Because Lasso can produce a simpler and more interpretable model that involve only a subset of the predictors, and probably lead to better prediction accuracy. From the above results, we can see that the Lasso chose variables fixed.acid, vola.acid, sugar, free.SO2, sulphates, and alcohol. However, after fitting the model using only these variables, the lasso model accuracy rate and the AUC score (accuracy rate: 0.7564626, AUC: 0.808) on the test set are all smaller than that of the full model (0.7578231, AUC: 0.803), . This could indicate all the characteristics (variables) of wine are important for making a classification. And thus the full model is chosen.

# Print out model summary
summary(fit_full)

## 
## Call:
## glm(formula = class ~ ., family = binomial, data = train_scaled)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1233  -0.9029   0.4485   0.8033   2.5262  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.918857   0.045279  20.293  < 2e-16 ***
## fixed.acid  -0.015371   0.075404  -0.204  0.83848    
## vola.acid   -0.653935   0.049934 -13.096  < 2e-16 ***
## citric.acid -0.008387   0.043841  -0.191  0.84828    
## sugar        0.851144   0.169147   5.032 4.85e-07 ***
## chlorides   -0.014952   0.044004  -0.340  0.73402    
## free.SO2     0.165203   0.056508   2.924  0.00346 ** 
## tot.SO2     -0.066279   0.061282  -1.082  0.27946    
## density     -0.788205   0.265663  -2.967  0.00301 ** 
## pH           0.122887   0.066195   1.856  0.06339 .  
## sulphates    0.234242   0.049503   4.732 2.22e-06 ***
## alcohol      0.889148   0.140802   6.315 2.70e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4378.1  on 3427  degrees of freedom
## Residual deviance: 3470.2  on 3416  degrees of freedom
## AIC: 3494.2
## 
## Number of Fisher Scoring iterations: 5

From the full logistic regression model, the prediction accuracy rate is 75.78%, and Area Under ROC curve is 0.808, which is quite good. Since, it can correctly make classification more than 75% of the data.

Naive Bayes Classifier

Source: https://www.rdocumentation.org/packages/e1071/versions/1.7-1/topics/naiveBayes

The reason for me to choose Naive Bayes Classifier is that it uses Bayes Theroem and conditional probability to separate classe. It is said that Naive Bayes has a higher bias but lower variance compared to logistic regression. If the data set follows the bias then Naive Bayes will be a better classifier. That’s the reason for me to choose this classifier.

# Check for correlation
pairs.panels(data[,-(12:13)])

# Fit Naive Bayes model
nb = naiveBayes(class  ~ . - density - tot.SO2, data = train_scaled)
nb

# Predict on test data
pred_nb = predict(nb, test_scaled[, -12], type = "class")
table_nb <- table(pred_nb, test_scaled$class)
table_nb

# Print our results
cat("\nThe prediction accuracy for Naive Bayes on test set is", 1 - predict_error(table_nb))
cat("\nThe prediction error for Naive Bayes on test set is", predict_error(table_nb))

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.3363477 0.6636523 
## 
## Conditional probabilities:
##    fixed.acid
## Y          [,1]      [,2]
##   0  0.11894367 1.0876059
##   1 -0.06771118 0.9349593
## 
##    vola.acid
## Y         [,1]      [,2]
##   0  0.3317783 1.1561622
##   1 -0.1536519 0.8760109
## 
##    citric.acid
## Y          [,1]     [,2]
##   0 -0.02431495 1.192906
##   1 -0.02046955 0.909020
## 
##    sugar
## Y          [,1]      [,2]
##   0  0.11201939 1.0442663
##   1 -0.06324405 0.9455623
## 
##    chlorides
## Y         [,1]      [,2]
##   0  0.2709917 1.2434350
##   1 -0.1313394 0.8237788
## 
##    free.SO2
## Y           [,1]      [,2]
##   0 -0.011562001 1.2182192
##   1  0.002768244 0.8965903
## 
##    tot.SO2
## Y         [,1]     [,2]
##   0  0.2344816 1.124020
##   1 -0.1199278 0.920727
## 
##    density
## Y         [,1]      [,2]
##   0  0.3608480 0.8691990
##   1 -0.1825124 0.9695631
## 
##    pH
## Y          [,1]      [,2]
##   0 -0.10081024 0.9905222
##   1  0.06405155 1.0258994
## 
##    sulphates
## Y          [,1]      [,2]
##   0 -0.08179715 0.8859231
##   1  0.03492716 1.0354279
## 
##    alcohol
## Y         [,1]      [,2]
##   0 -0.5307048 0.7089245
##   1  0.2603359 1.0079970
## 
##        
## pred_nb   0   1
##       0 265 212
##       1 222 771
## 
## The prediction accuracy for Naive Bayes on test set is 0.7047619
## The prediction error for Naive Bayes on test set is 0.2952381

From the scatter plot showing the potential correlation, I found that sugar and density as well as free.SO2 and tot.SO2 seems to have a relatively stronger correlation, so I removed density and tot.SO2, also because Lasso have indicated them as non-important variables. And from the prediction result, we can see that the prediction accuracy for Naive Bayes Classifier is 0.7047619, which is lower than that of Logistic Regression, which is 0.7578231. Thus, the Logistic Regression model will all the variables has a better performance.

Conclusion

For this problem, which is to predict the white wine quality group, the quality variable was transformed into binary data, with quality score >= 6 set as 1, and quality score <6 set as 0. For better and more accuracy results, the data is scaled to have mean 0 and standard deviation of 1. By comparing the results of logistic model containing all variables (accuracy rate: 75.78%) with the model chosen by Lasso, the former one has a higher AUC score on test set, indicating a better performance. Then, the Naive Bayes Classifier was used to perform classification, however, the result (accuracy rate: 70.48%) does not outperform than the logistic model containing all variables. I would say a prediction accuracy rate of 75% is quite good for a classification problem of this kind.

Open questions: 1. It has been observed that there exist some quality score with extreme values, whether these extreme observarion will affect greatly on the prediction result is unknown, and can be done in future work. 2. Besides the variables in this dataset, I’m also wondering if the year, producing area, company of the white wine also affect the quality score. 3. I’m also wondering if other machine learning algorithms such as unsupervised learning method (eg, Random Forest, Bagging, Kmeans or Hierarchical clustering) would perform in terms of prediction accuracy.