A couple of days ago I was scraping the UK parliament constituencies from Wikipedia in preparation for the Graph Connect hackathon and had got to the point where I had an array with one entry per column in the table.
import requests
from bs4 import BeautifulSoup
from soupselect import select

page = open("constituencies.html", 'r')
soup = BeautifulSoup(page.read())
...
Archives for Mark Needham
Python: matplotlib hangs and shows nothing (Mac OS X)
I’ve been playing around with some of the matplotlib demos recently and discovered that simply copying one of the examples didn’t actually work for me. I was following the bar chart example and had the following code: When I execute this script from the command line it just hangs and I don’t see anything at all. Via a combination of ...
Python: Simplifying the creation of a stop word list with defaultdict
I’ve been playing around with topic models again and recently read a paper by David Mimno which suggested the following heuristic for working out which words should go onto the stop list: A good heuristic for identifying such words is to remove those that occur in more than 5-10% of documents (most common) and those that occur fewer than 5-10 ...
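A minimal sketch of the document-frequency side of that heuristic, assuming each document has already been tokenised into a list of words. The names `find_stop_words`, `max_doc_fraction` and `min_doc_count` are hypothetical, and the lower threshold in particular is an assumed reading, since the quoted heuristic is cut off in the excerpt.

```python
from collections import defaultdict


def find_stop_words(documents, max_doc_fraction=0.10, min_doc_count=5):
    """Return words that appear in too many or too few documents.

    documents: list of token lists (hypothetical input format).
    max_doc_fraction: words found in more than this share of documents are flagged.
    min_doc_count: words found in fewer than this many documents are flagged
                   (an assumption; the excerpt truncates this part).
    """
    doc_frequency = defaultdict(int)
    for tokens in documents:
        # set() so a word counts once per document, however often it repeats
        for word in set(tokens):
            doc_frequency[word] += 1

    total = len(documents)
    return {
        word
        for word, count in doc_frequency.items()
        if count / total > max_doc_fraction or count < min_doc_count
    }
```

Using `defaultdict(int)` means there is no need to check whether a word has been seen before incrementing its count.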
Python: Checking any value in a list exists in a line of text
I’ve been doing some log file analysis to see what Cypher queries were being run on a Neo4j instance and I wanted to narrow down the lines I looked at to only those with mutating operations, i.e. those containing the words MERGE, DELETE, SET or CREATE. Here’s an example of the text file I was parsing: ...
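One way to express that filter is with the built-in `any` over a generator expression. This is only a sketch: the helper name `is_mutating` and the sample log lines are hypothetical and not taken from the original file.

```python
MUTATING_KEYWORDS = ["MERGE", "DELETE", "SET", "CREATE"]


def is_mutating(line, keywords=MUTATING_KEYWORDS):
    """True if any keyword appears as a substring of the line."""
    return any(keyword in line for keyword in keywords)


# Hypothetical log lines, made up for illustration
log_lines = [
    "MATCH (n:Person) RETURN n",
    "MERGE (p:Person {name: 'Mark'})",
    "MATCH (n) DELETE n",
]
mutating_lines = [line for line in log_lines if is_mutating(line)]
```

Note that this is a case-sensitive substring check, so a line containing a word such as OFFSET would also match on SET; a regular expression with word boundaries would be stricter.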
Python/Neo4j: Finding interesting computer sciency people to follow on Twitter
At the beginning of this year I moved from Neo4j’s field team to the dev team, and since the code we write there is much lower level than I’m used to, I thought I should find some people to follow on Twitter whom I can learn from. My technique for finding some of those people was to pick a person from ...
Python: Streaming/Appending to a file
I’ve been playing around with Twitter’s API (via the tweepy library) and due to the rate limiting it imposes I wanted to stream results to a CSV file rather than waiting until my whole program had finished. I wrote the following program to simulate what I was trying to do: The program will run ...
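The general idea of writing rows out as they arrive can be sketched with the standard `csv` module. This is not the program from the post; the function name `stream_rows` and its arguments are hypothetical.

```python
import csv


def stream_rows(path, rows):
    """Append rows to a CSV file, flushing after each one."""
    # "a" appends instead of truncating; newline="" is what the csv module expects
    with open(path, "a", newline="") as csv_file:
        writer = csv.writer(csv_file)
        for row in rows:
            writer.writerow(row)
            # hand the row to the operating system straight away so it shows up
            # in the file before the program finishes
            csv_file.flush()
```

Because the file is opened in append mode, calling `stream_rows` again with the same path adds to the existing rows rather than replacing them.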
Python: scikit-learn – Training a classifier with non numeric features
Following on from my previous posts on training a classifier to pick out the speaker in sentences of HIMYM transcripts, the next thing to do was train a random forest of decision trees to see how that fared. I’ve used scikit-learn for this before so I decided to use that. However, before building a random forest I wanted to check ...
Python/pandas: Column value in list
I’ve been using Python’s pandas library while exploring some CSV files and although for the most part I’ve found it intuitive to use, I had trouble filtering a data frame based on checking whether a column value was in a list. A subset of one of the CSV files I’ve been working with looks like this: ...
Python: Find the highest value in a group
In my continued playing around with a How I Met Your Mother data set, I needed to find out the last episode that happened in a season so that I could use it in a chart I wanted to plot. I had this CSV file containing each of the episodes: I started out by ...
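As a rough illustration of finding the highest value per group with the standard library only, the sketch below keeps a running maximum for each season. The function name `last_episode_per_season` and the sample rows are hypothetical stand-ins for the CSV data, and this is not necessarily the approach the post goes on to take.

```python
from collections import defaultdict

# Hypothetical (season, episode number) pairs standing in for the CSV rows
episodes = [(1, 1), (1, 2), (1, 22), (2, 1), (2, 24), (2, 5)]


def last_episode_per_season(rows):
    """Map each season to the highest episode number seen for it."""
    # defaultdict(int) starts every season at 0, which assumes episode
    # numbers are positive
    highest = defaultdict(int)
    for season, episode in rows:
        highest[season] = max(highest[season], episode)
    return dict(highest)


last_episodes = last_episode_per_season(episodes)
```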