Home » Python » Python: scikit-learn – Training a classifier with non numeric features

About Mark Needham

Mark Needham

Python: scikit-learn – Training a classifier with non numeric features

Following on from my previous posts on training a classifier to pick out the speaker in sentences of HIMYM transcripts the next thing to do was train a random forest of decision trees to see how that fared.

I’ve used scikit-learn for this before so I decided to use that. However, before building a random forest I wanted to check that I could build an equivalent decision tree.

I initially thought that scikit-learn’s DecisionTree classifier would take in data in the same format as nltk’s so I started out with the following code:

import json
import nltk
import collections
from himymutil.ml import pos_features
from sklearn import tree
from sklearn.cross_validation import train_test_split
with open('data/import/trained_sentences.json', 'r') as json_file:
    json_data = json.load(json_file)
tagged_sents = []
for sentence in json_data:
    tagged_sents.append([(word['word'], word['speaker']) for word in sentence['words']])
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    sentence_pos = nltk.pos_tag(untagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, sentence_pos, i), tag) )
clf = tree.DecisionTreeClassifier()
train_data, test_data = train_test_split(featuresets, test_size=0.20, train_size=0.80)
>>> train_data[1]
({'word': u'your', 'word-pos': 'PRP$', 'next-word-pos': 'NN', 'prev-word-pos': 'VB', 'prev-word': u'throw', 'next-word': u'body'}, False)
>>> clf.fit([item[0] for item in train_data], [item[1] for item in train_data])
Traceback (most recent call last):
  File '<stdin>', line 1, in <module>
  File '/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/tree/tree.py', line 137, in fit
    X, = check_arrays(X, dtype=DTYPE, sparse_format='dense')
  File '/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/utils/validation.py', line 281, in check_arrays
    array = np.asarray(array, dtype=dtype)
  File '/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/numpy/core/numeric.py', line 460, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number

In fact, the classifier can only deal with numeric features so we need to translate our features into that format using DictVectorizer.

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X = vec.fit_transform([item[0] for item in featuresets]).toarray()
>>> len(X)
>>> len(X[0])
>>> vec.get_feature_names()[10:15]
['next-word-pos=EX', 'next-word-pos=IN', 'next-word-pos=JJ', 'next-word-pos=JJR', 'next-word-pos=JJS']

We end up with one feature for every key/value combination that exists in featuresets.

I was initially confused about how to split up training and test data sets but it’s actually fairly easy – train_test_split allows us to pass in multiple lists which it splits along the same seam:

vec = DictVectorizer()
X = vec.fit_transform([item[0] for item in featuresets]).toarray()
Y = [item[1] for item in featuresets]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, train_size=0.80)

Next we want to train the classifier which is a couple of lines of code:

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)

I wrote the following function to assess the classifier:

import collections
import nltk
def assess(text, predictions_actual):
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (prediction, actual) in enumerate(predictions_actual):
    speaker_precision = nltk.metrics.precision(refsets[True], testsets[True])
    speaker_recall = nltk.metrics.recall(refsets[True], testsets[True])
    non_speaker_precision = nltk.metrics.precision(refsets[False], testsets[False])
    non_speaker_recall = nltk.metrics.recall(refsets[False], testsets[False])
    return [text, speaker_precision, speaker_recall, non_speaker_precision, non_speaker_recall]

We can call it like so:

predictions = clf.predict(X_test)
assessment = assess('Decision Tree', zip(predictions, Y_test))
>>> assessment
['Decision Tree', 0.9459459459459459, 0.9210526315789473, 0.9970134395221503, 0.9980069755854509]

Those values are in the same ball park as we’ve seen with the nltk classifier so I’m happy it’s all wired up correctly.

The last thing I wanted to do was visualise the decision tree that had been created and the easiest way to do that is export the classifier to DOT format and then use graphviz to create an image:

with open('/tmp/decisionTree.dot', 'w') as file:
    tree.export_graphviz(clf, out_file = file, feature_names = vec.get_feature_names())
dot -Tpng /tmp/decisionTree.dot -o /tmp/decisionTree.png

The decision tree is quite a few levels deep so here’s part of it:


The full script is on github if you want to play around with it.

Do you want to know how to develop your skillset to become a Web Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!


1. Building web apps with Node.js

2. HTML5 Programming Cookbook

3. CSS Programming Cookbook

4. AngularJS Programming Cookbook

5. jQuery Programming Cookbook

6. Bootstrap Programming Cookbook


and many more ....



Leave a Reply

Your email address will not be published. Required fields are marked *


Want to take your WEB dev skills to the next level?

Grab our programming books for FREE!

Here are some of the eBooks you will get:

  • PHP Programming Cookbook
  • jQuery Programming Cookbook
  • Bootstrap Programming Cookbook
  • Building WEB Apps with Node.js
  • CSS Programming Cookbook
  • HTML5 Programming Cookbook
  • AngularJS Programming Cookbook