
This project classifies wines as red or white using machine learning. Working with the wine quality dataset, we predict wine type from its physicochemical measurements and quality rating. We compare three widely used algorithms (decision tree, random forest, and support vector machine), examine the strengths and weaknesses of each, and identify the most effective model for this task.
Dataset Description
The wine quality dataset contains the following columns: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality, and type. The type column is the target variable, with 0 denoting red wine and 1 denoting white wine. The remaining columns serve as predictors.
Method
- Data Preprocessing: We begin by loading the wine quality dataset into a data frame, performing necessary data cleaning, and exploring the distribution of variables.
- Exploratory Data Analysis: We visualize the relationships among variables using scatterplots and boxplots to gain insights into their impact on wine type.
- Model Building:
  - Decision Tree: We fit a decision tree, tune its hyperparameters using grid search with cross-validation, prune the tree, and evaluate its performance.
  - Random Forest: We fit a random forest, an ensemble of decision trees. We use grid search to identify the best hyperparameters, train the model, and assess its accuracy.
  - Support Vector Machine (SVM): We fit an SVM classifier, use grid search to identify the best hyperparameters, train the model, and evaluate its predictive performance.
- Model Evaluation: We compare the algorithms using the confusion matrix, accuracy, precision, and F1 score.
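
The model-building and evaluation steps above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, which is an assumption: the project does not state which library or language was used. The data is a synthetic stand-in produced by `make_classification`, and the hyperparameter grids, the train/test split, and the names `candidates` and `results` are hypothetical, not the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data: 12 numeric features and a binary
# target (0 = red, 1 = white). Swap in the real feature matrix and type column.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search grids; the project's actual grids are not stated.
candidates = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        # ccp_alpha > 0 enables cost-complexity pruning of the fitted tree.
        {"max_depth": [3, 5, None], "ccp_alpha": [0.0, 0.01]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_features": ["sqrt", None]},
    ),
    "svm": (
        # SVMs are sensitive to feature scale, so standardize inside the pipeline.
        Pipeline([("scale", StandardScaler()), ("svc", SVC())]),
        {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validated grid search on the training set only.
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    # GridSearchCV refits the best configuration, so predict() uses it directly.
    y_pred = search.predict(X_test)
    results[name] = {
        "best_params": search.best_params_,
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.3f} precision={r['precision']:.3f} f1={r['f1']:.3f}")
```

Keeping the test set out of the grid search means the reported metrics reflect data the tuned models have not seen. The numbers printed for the synthetic data say nothing about the wine results.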
Insights
- Most variables appeared to be strongly related to the target variable (type), so no feature selection was performed and all features were used for prediction.
- All three algorithms performed very well. The SVM had the highest accuracy (99.5%), followed by the random forest (99.4%). The decision tree had the lowest accuracy (98.5%).
- The features provided are sufficient for machine learning algorithms to predict wine colour with high accuracy. In this case, the more complex algorithms (SVM and random forest) achieved the highest accuracy.
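
For reference, the sketch below shows how accuracy, precision, and F1 score are derived from the four cells of a binary confusion matrix. It uses plain Python. The function name `classification_metrics` and the label lists are made up for illustration and are not the project's predictions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return confusion-matrix counts and derived metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1,
    }


# Made-up labels for illustration only (1 = white, 0 = red).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can hide poor performance on the minority class when the classes are imbalanced, which is one reason to report precision and F1 alongside it.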