
scikit-learn: Building a multi class classification ensemble

For the Kaggle Spooky Author Identification competition I wanted to combine multiple classifiers into an ensemble, and found scikit-learn's VotingClassifier, which does exactly that.

We need to predict the probability that a sentence is written by one of three authors so the VotingClassifier needs to make a ‘soft’ prediction. If we only needed to know the most likely author we could have it make a ‘hard’ prediction instead.
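
To make the difference concrete, here is a minimal sketch on a toy three-class dataset (my own example, not from the competition): with voting="soft" the ensemble averages each member's predict_proba and can itself return probabilities, while voting="hard" takes a majority vote over predicted labels.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy three-class problem, purely for illustration
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=1)
estimators = [("lr", LogisticRegression(max_iter=1000)), ("gnb", GaussianNB())]

# 'soft': average the members' class probabilities
soft = VotingClassifier(estimators, voting="soft").fit(X, y)
print(soft.predict_proba(X[:2]))   # one row per sample, one column per class

# 'hard': majority vote over predicted labels, no probabilities
hard = VotingClassifier(estimators, voting="hard").fit(X, y)
print(hard.predict(X[:2]))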

We start with two classifiers which generate different n-gram based features. The code for those is as follows:

from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Unigram + bigram counts fed into Multinomial Naive Bayes
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])

# Plain unigram counts fed into logistic regression
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])
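
The two pipelines are deliberately different: the first feeds unigram and bigram counts into Multinomial Naive Bayes, while the second feeds plain unigram counts into logistic regression. The hope is that models trained on different feature sets make somewhat uncorrelated mistakes, which is what makes averaging them worthwhile.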

We can combine those classifiers together like this:

classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
]

# 'soft' voting averages the predict_proba output of each classifier
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])

Now it’s time to test our ensemble. I got the code for the test function from Sohier Dane’s tutorial.

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

Y_COLUMN = "author"
TEXT_COLUMN = "text"

def test_pipeline(df, nlp_pipeline):
    y = df[Y_COLUMN].copy()
    X = pd.Series(df[TEXT_COLUMN])
    # shuffle=True is needed for random_state to have any effect
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    losses = []
    accuracies = []
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        nlp_pipeline.fit(X_train, y_train)
        losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
        accuracies.append(metrics.accuracy_score(y_test, nlp_pipeline.predict(X_test)))
    print("kfolds log losses: {0}, mean log loss: {1}, mean accuracy: {2}".format(
        str([str(round(x, 3)) for x in sorted(losses)]),
        round(np.mean(losses), 3),
        round(np.mean(accuracies), 3)))

train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
test_pipeline(train_df, mixed_pipe)

Let’s run the script:

kfolds log losses: ['0.388', '0.391', '0.392', '0.397', '0.398'], mean log loss: 0.393, mean accuracy: 0.849

Looks good.
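
For reference, the log loss we are optimising is the negative mean log-probability the model assigned to the true author, so lower is better and 0 would mean perfect confidence in every correct answer. A tiny worked example with made-up numbers:

import numpy as np
from sklearn import metrics

# Two sentences, three authors; the probabilities given to the true
# authors are 0.7 and 0.2, so the loss is -(log(0.7) + log(0.2)) / 2 ≈ 0.983
y_true = ["EAP", "MWS"]
probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2]])
print(metrics.log_loss(y_true, probs, labels=["EAP", "HPL", "MWS"]))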

I’ve actually got several other classifiers as well, but I’m not sure which ones should be part of the ensemble. In a future post we’ll look at how to use GridSearchCV to work that out.
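
As a rough preview of the idea (my own sketch, not the code from that future post): because a weight of 0 effectively silences a member of a soft-voting ensemble, GridSearchCV can search over the VotingClassifier's weights parameter to find which combination of classifiers actually helps.

from sklearn.model_selection import GridSearchCV

# Hypothetical weight grid over the two classifiers defined above:
# (1, 0) uses only ngram_pipe, (0, 1) only unigram_log_pipe, and so on
params = {"voting__weights": [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]}

grid = GridSearchCV(mixed_pipe, params, scoring="neg_log_loss", cv=5)
grid.fit(train_df[TEXT_COLUMN], train_df[Y_COLUMN])
print(grid.best_params_, grid.best_score_)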

Published on Web Code Geeks with permission by Mark Needham, partner at our WCG program. See the original article here: scikit-learn: Building a multi class classification ensemble

Opinions expressed by Web Code Geeks contributors are their own.
