
Author Archives: Mark Needham

Yelp: Reverse geocoding businesses to extract detailed location information

I’ve been playing around with the Yelp Open Dataset and wanted to extract more detailed location information for each business. This is an example of the JSON representation of one business:

    $ cat dataset/business.json | head -n1 | jq
    {
      "business_id": "FYWN1wneV18bWNgQjJ2GNg",
      "name": "Dental by Design",
      "neighborhood": "",
      "address": "4855 E Warner Rd, Ste B9",
      "city": "Ahwatukee",
      "state": "AZ",
      "postal_code": ...


Leaflet: Fit polyline in view

I’ve been playing with the Leaflet.js library over the Christmas holidays to visualise running routes drawn onto a map as a Polyline, and I wanted to zoom the map just enough to fit all of the points in view.

Prerequisites

We have the following HTML to define the div that will contain the map:

    <div id="container">
      <div id="map" style="width: 100%; height: ...


scikit-learn: Building a multi class classification ensemble

For the Kaggle Spooky Author Identification competition I wanted to combine multiple classifiers into an ensemble, and found the VotingClassifier, which does exactly that. We need to predict the probability that a sentence was written by each of three authors, so the VotingClassifier needs to make a ‘soft’ prediction. If we only needed to know the most likely author we ...
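As a rough illustration of what a ‘soft’ prediction means, the sketch below averages the per-class probabilities from several classifiers, assuming equal weights. It is plain Python rather than scikit-learn's implementation, and the function name, the author labels and the probability values are all made up for the example.

```python
def soft_vote(probabilities_per_classifier):
    """Average the per-class probabilities across classifiers (equal weights)."""
    n_classifiers = len(probabilities_per_classifier)
    # zip(*rows) groups the probabilities for the same class together
    return [sum(column) / n_classifiers
            for column in zip(*probabilities_per_classifier)]


# Hypothetical labels and probabilities for one sentence, one row per classifier
authors = ["author_a", "author_b", "author_c"]
predictions = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]

averaged = soft_vote(predictions)
most_likely = authors[max(range(len(averaged)), key=averaged.__getitem__)]

print(averaged)
print(most_likely)
```

The averaged list is what a probability-based competition metric would score; taking the largest entry recovers a single most likely author.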


Python: Learning about defaultdict’s handling of missing keys

While reading the scikit-learn code I came across a bit of code that I didn’t understand for a while but in retrospect is quite neat. This is the code snippet that intrigued me:

    vocabulary = defaultdict()
    vocabulary.default_factory = vocabulary.__len__

Let’s quickly see how it works by adapting an example from scikit-learn:

    >>> from collections import defaultdict
    >>> vocabulary = defaultdict() ...
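To make the mechanism concrete, here is a minimal sketch. The token list is made up for illustration and is not taken from scikit-learn. A defaultdict calls default_factory with no arguments whenever a key is missing, so pointing it at the dictionary's own __len__ hands each unseen key the next free integer index.

```python
from collections import defaultdict

vocabulary = defaultdict()
# On a missing key, the factory returns the current size of the dictionary,
# which is evaluated before the new key is stored
vocabulary.default_factory = vocabulary.__len__

# Hypothetical tokens; "the" appears twice on purpose
tokens = ["the", "cat", "sat", "on", "the", "mat"]
indices = [vocabulary[token] for token in tokens]

print(indices)           # [0, 1, 2, 3, 0, 4]
print(dict(vocabulary))  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
```

A repeated token is already a key, so it returns its existing index instead of triggering the factory again.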


Python: Combinations of values on and off

In my continued exploration of Kaggle’s Spooky Authors competition, I wanted to run a GridSearch turning on and off different classifiers to work out the best combination. I therefore needed to generate combinations of 1s and 0s enabling different classifiers, e.g. if we had 3 classifiers we’d generate these combinations:

    0 0 1
    0 1 0
    1 0 0
    1 ...
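One possible way to generate such on/off combinations is itertools.product, sketched below. This is not necessarily the approach the post goes on to use. The helper name on_off_combinations is hypothetical, and skipping the all-off row is an assumption based on the listing starting at 0 0 1. The rows also come out in a different order from the listing.

```python
from itertools import product


def on_off_combinations(n):
    """Return every on (1) / off (0) combination of n items, skipping all-off."""
    return [combo for combo in product([0, 1], repeat=n) if any(combo)]


combos = on_off_combinations(3)
for combo in combos:
    print(*combo)
```

For n classifiers this yields 2**n - 1 rows, so the number of combinations to search doubles with each classifier added.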


scikit-learn: Creating a matrix of named entity counts

I’ve been trying to improve my score on Kaggle’s Spooky Author Identification competition, and my latest idea was building a model which used named entities extracted using the polyglot NLP library. We’ll start by learning how to extract entities from a sentence using polyglot, which isn’t too tricky:

    >>> from polyglot.text import Text
    >>> doc = "My name is David ...


Python: polyglot – ModuleNotFoundError: No module named ‘icu’

I wanted to use the polyglot NLP library that my colleague Will Lyon mentioned in his analysis of Russian Twitter Trolls, but I ran into installation problems, which I thought I’d share in case anyone else experiences the same issues. I started by trying to install polyglot:

    $ pip install polyglot
    ImportError: No module named 'icu'

Hmmm I’m not sure what ...


Python 3: TypeError: unsupported format string passed to numpy.ndarray.__format__

This post explains how to work around a change in how Python string formatting works for numpy arrays between Python 2 and Python 3. I’ve been going through Kevin Markham’s scikit-learn Jupyter notebooks and ran into a problem on the Cross Validation one, which was throwing this error when attempting to print the KFold example:

    Iteration Training set observations Testing ...
