Question: Given the aroma/bitterness/etc of a cup of coffee, can we predict if it will be a ‘good’ cup of coffee as rated by professional coffee tasters?

Check out the Jupyter Notebook to follow along.

We’ll bring in the data, do preliminary exploration, transform it for analysis, separate it into training/test sets, determine which model results in the highest accuracy predictions, and predict the rating a new cup of coffee might receive.

Info from The Coffee Institute, data scraped by James LeDoux.


Let’s start with our initial imports. This will bring in everything we’ll need for the project, including ways to manipulate data (pandas), visualize data (matplotlib), and do ML analysis (sklearn)

#importing
##data tools
import pandas as pd
pd.set_option('display.max_columns', 500)
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
##ML tools
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
%matplotlib inline

Next let’s get the data and take a quick look at it.


coffeeRaw = pd.read_csv('arabica_data_cleaned.csv')
coffeeRaw.head()
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region Producer Number.of.Bags Bag.Weight In.Country.Partner Harvest.Year Grading.Date Owner.1 Variety Processing.Method Aroma Flavor Aftertaste Acidity Body Balance Uniformity Clean.Cup Sweetness Cupper.Points Total.Cup.Points Moisture Category.One.Defects Quakers Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
0 1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela METAD PLC 300 60 kg METAD Agricultural Development plc 2014 April 4th, 2015 metad plc NaN Washed / Wet 8.67 8.83 8.67 8.75 8.50 8.42 10.0 10.0 10.0 8.75 90.58 0.12 0 0.0 Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0
1 2 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela METAD PLC 300 60 kg METAD Agricultural Development plc 2014 April 4th, 2015 metad plc Other Washed / Wet 8.75 8.67 8.50 8.58 8.42 8.42 10.0 10.0 10.0 8.58 89.92 0.12 0 0.0 Green 1 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0
2 3 Arabica grounds for health admin Guatemala san marcos barrancas "san cristobal cuch NaN NaN NaN NaN 1600 - 1800 m NaN NaN 5 1 Specialty Coffee Association NaN May 31st, 2010 Grounds for Health Admin Bourbon NaN 8.42 8.50 8.42 8.42 8.33 8.42 10.0 10.0 10.0 9.25 89.75 0.00 0 0.0 NaN 0 May 31st, 2011 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 1600.0 1800.0 1700.0
3 4 Arabica yidnekachew dabessa Ethiopia yidnekachew dabessa coffee plantation NaN wolensu NaN yidnekachew debessa coffee plantation 1800-2200 oromia Yidnekachew Dabessa Coffee Plantation 320 60 kg METAD Agricultural Development plc 2014 March 26th, 2015 Yidnekachew Dabessa NaN Natural / Dry 8.17 8.58 8.42 8.42 8.50 8.25 10.0 10.0 10.0 8.67 89.00 0.11 0 0.0 Green 2 March 25th, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1800.0 2200.0 2000.0
4 5 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela METAD PLC 300 60 kg METAD Agricultural Development plc 2014 April 4th, 2015 metad plc Other Washed / Wet 8.25 8.50 8.25 8.50 8.42 8.33 10.0 10.0 10.0 8.58 88.83 0.12 0 0.0 Green 2 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

Ok. That’s a lot of info, far more than we need really. Remember, our goal is to build something that can predict a value using only information we might have about a cup of coffee right in front of us. We won’t know the elevation, processing method, etc. So let’s trim this down to a more useful dataset..

coffee = coffeeRaw[['Aroma','Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance',
                     'Total.Cup.Points']].copy()
coffee.head()
Aroma Flavor Aftertaste Acidity Body Balance Total.Cup.Points
0 8.67 8.83 8.67 8.75 8.50 8.42 90.58
1 8.75 8.67 8.50 8.58 8.42 8.42 89.92
2 8.42 8.50 8.42 8.42 8.33 8.42 89.75
3 8.17 8.58 8.42 8.42 8.50 8.25 89.00
4 8.25 8.50 8.25 8.50 8.42 8.33 88.83

Much better.

Real quick, let’s make sure there’s no weirdness in our data.

coffee.info()
coffee.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1311 entries, 0 to 1310
Data columns (total 7 columns):
Aroma               1311 non-null float64
Flavor              1311 non-null float64
Aftertaste          1311 non-null float64
Acidity             1311 non-null float64
Body                1311 non-null float64
Balance             1311 non-null float64
Total.Cup.Points    1311 non-null float64
dtypes: float64(7)
memory usage: 71.8 KB





Aroma               0
Flavor              0
Aftertaste          0
Acidity             0
Body                0
Balance             0
Total.Cup.Points    0
dtype: int64

Yay, no blanks!

Pre-Processing & Basic Analysis

Since all we really want to algorithmically determine is if a certain cup is ‘good’ or ‘bad’, let’s bin our data into those categories.

#preprocessing into Good/Bad coffee
bins = [-1, 85, 100] #anything scoring 85 or above is good coffee™
labels = ['bad', 'good']
coffee['goodCoffee'] = pd.cut(coffee['Total.Cup.Points'], bins = bins, labels = labels)
coffee = coffee.drop(columns=['Total.Cup.Points'])
coffee['goodCoffee'].unique()
[good, bad]
Categories (2, object): [bad < good]
coffee['goodCoffee'].value_counts()
bad     1215
good      96
Name: goodCoffee, dtype: int64
#transform 'good' and 'bad' to 1 & 0
encoder = LabelEncoder()
coffee['goodCoffee'] = encoder.fit_transform(coffee['goodCoffee'])
coffee.head()
Aroma Flavor Aftertaste Acidity Body Balance goodCoffee
0 8.67 8.83 8.67 8.75 8.50 8.42 1
1 8.75 8.67 8.50 8.58 8.42 8.42 1
2 8.42 8.50 8.42 8.42 8.33 8.42 1
3 8.17 8.58 8.42 8.42 8.50 8.25 1
4 8.25 8.50 8.25 8.50 8.42 8.33 1

We can see we have way more bad coffee than good coffee. (Sad) Let’s quickly graph this.

# Plot the data
plt.xkcd()
plt.bar(coffee['goodCoffee'].unique(), coffee['goodCoffee'].value_counts(),
        width=0.75, align='center', color=mcolors.XKCD_COLORS)
plt.xticks([0, 1], ['Good', 'Not Good'])
plt.yticks(coffee['goodCoffee'].value_counts())

plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

plt.ylabel('# of entries')
plt.title('HOW MUCH OF THIS COFFEE SUCKS?')

plt.show()

png1

# and a pie chart
plt.xkcd()
plt.pie(coffee['goodCoffee'].value_counts(), explode=(0, 0.1), labels=(['Not Good', 'Good']), autopct='%1.1f%%',
        shadow=True, startangle=90, colors=mcolors.XKCD_COLORS)
plt.title('HOW MUCH OF THIS COFFEE SUCKS?')

plt.show()

png2

Finally, let’s finish getting our data ready for analysis by splitting it into training/test sets and standardizing the values to remove a bit of bias from the system.

#seperate feature/response vars
X = coffee.drop(columns=['goodCoffee'])
y = coffee['goodCoffee']
#train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#standardize values
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Test the models

Random Forest Classifier

#better for mid-size datasets
rfc = RandomForestClassifier(n_estimators = 200)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)
#how well did model perform?
print(classification_report(y_test, pred_rfc))
print(confusion_matrix(y_test, pred_rfc))
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       244
           1       0.74      0.74      0.74        19

    accuracy                           0.96       263
   macro avg       0.86      0.86      0.86       263
weighted avg       0.96      0.96      0.96       263

[[239   5]
 [  5  14]]

SVM Classifier

#better for smaller datasets
clf = svm.SVC()
clf.fit(X_train, y_train)
pred_clf = clf.predict(X_test)
#how well did model perform?
print(classification_report(y_test, pred_clf))
print(confusion_matrix(y_test, pred_clf))
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       244
           1       0.78      0.74      0.76        19

    accuracy                           0.97       263
   macro avg       0.88      0.86      0.87       263
weighted avg       0.97      0.97      0.97       263

[[240   4]
 [  5  14]]

Neural Network (Multi-Layer Percepitron Classifier)

#better for huge amounts of data
mlpc= MLPClassifier(hidden_layer_sizes=(11, 11, 11), max_iter=500)
mlpc.fit(X_train, y_train)
pred_mlpc = mlpc.predict(X_test)
print(classification_report(y_test, pred_mlpc))
print(confusion_matrix(y_test, pred_mlpc))
              precision    recall  f1-score   support

           0       0.99      0.98      0.98       244
           1       0.76      0.84      0.80        19

    accuracy                           0.97       263
   macro avg       0.87      0.91      0.89       263
weighted avg       0.97      0.97      0.97       263

[[239   5]
 [  3  16]]

What’s best?

from sklearn.metrics import accuracy_score
acc_rfc = accuracy_score(y_test, pred_rfc)
acc_clf = accuracy_score(y_test, pred_clf)
acc_mlpc = accuracy_score(y_test, pred_mlpc)

print(f"Random Forest was {acc_rfc:.2%} accurate")
print(f"SVM was {acc_clf:.2%} accurate")
print(f"Neural Network was {acc_mlpc:.2%} accurate")
Random Forest was 96.20% accurate
SVM was 96.58% accurate
Neural Network was 96.96% accurate

All our models preformed about the same, with the neural net just a little bit better than everyone else. We’ll implement that model. This is doubley nice because now we get to say “I implemented a neural network” which sounds very fancy indeed.

Predicting future values

#a reminder what our data looks like
coffee[90:100]
Aroma Flavor Aftertaste Acidity Body Balance goodCoffee
90 8.42 8.00 7.42 8.00 7.92 7.92 1
91 7.58 7.83 7.83 7.92 7.83 8.17 1
92 7.83 8.00 7.75 7.75 7.83 7.83 1
93 8.08 8.17 7.92 8.00 8.08 8.00 1
94 7.67 8.00 7.83 8.00 7.92 7.83 1
95 7.50 7.92 8.25 7.83 7.92 7.75 1
96 8.17 7.92 7.75 7.75 7.67 7.75 0
97 8.00 7.92 7.75 7.92 7.75 7.83 0
98 7.83 8.00 7.58 7.92 7.92 7.83 0
99 7.83 8.00 7.83 7.75 7.67 7.83 0
#---------------
#A theoretical cup of coffee
new_Aroma = 8
new_Flavor = 7.5
new_Aftertaste = 8
new_Acidity = 8
new_Body = 7.9
new_Balance = 8
#---------------
X_new = [[new_Aroma, new_Flavor, new_Aftertaste, new_Acidity, new_Body, new_Balance]]
X_new = sc.transform(X_new)
y_new = mlpc.predict(X_new)
print(y_new)
if y_new[0] == 1:
    print("it's good!")
else:
    print ("it's not good!")
[1]
it's good!