Python: Regex – matching foreign characters/unicode letters

I’ve been back in the land of screen scraping this week, extracting data from the Game of Thrones wiki, and needed to write a regular expression to pull out characters and actors. Here are some examples of the format of the data: Peter Dinklage as Tyrion Lannister Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer) Filip Lozić as Young ...
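As a rough illustration of the idea in this excerpt, here is a minimal sketch that matches Unicode letters with Python 3's `re` module. It relies on the fact that `\w` is Unicode-aware for `str` patterns, so `[^\W\d_]` matches any letter, including ones like `ć`. The names `NAME`, `PATTERN` and `parse_credit` are made up for this example and are not taken from the original post, which may well have used a different approach.

```python
import re

# [^\W\d_] means "a word character that is not a digit or underscore",
# i.e. any Unicode letter when the pattern is a str (Python 3).
NAME = r"[^\W\d_]+(?:[ '\-][^\W\d_]+)*"

# Hypothetical pattern for lines of the form "<actor> as <character>".
PATTERN = re.compile(rf"^(?P<actor>{NAME}) as (?P<character>{NAME})")


def parse_credit(line):
    """Return (actor, character), or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.group("actor"), match.group("character")


print(parse_credit("Peter Dinklage as Tyrion Lannister"))
print(parse_credit("Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)"))
print(bool(re.fullmatch(NAME, "Filip Lozić")))
```

The character name stops at the first character that is not a letter, space, apostrophe or hyphen, which is why the parenthesised credit is dropped in the second example.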

Python: Squashing ‘duplicate’ pairs together

As part of a data cleaning pipeline I had pairs of IDs of duplicate addresses that I wanted to group together. I couldn’t immediately work out how to solve the problem, so I simplified it into pairs of letters, i.e. A B (A is the same as B), B C (B is the same as C), C D ... ...
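One common way to group pairs like these is a union-find (disjoint set) structure. The sketch below is an assumption about how the problem could be tackled, not necessarily the solution the post arrives at; `group_pairs` is a hypothetical name, and the extra `E F` pair is added only to show a second group.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Merge pairs that share a member into groups using union-find."""
    parent = {}

    def find(item):
        # Register unseen items as their own root.
        parent.setdefault(item, item)
        while parent[item] != item:
            # Path halving: point the item at its grandparent as we walk up.
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    return list(groups.values())


pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
print(sorted(sorted(group) for group in group_pairs(pairs)))
# [['A', 'B', 'C', 'D'], ['E', 'F']]
```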

Python: Parsing a JSON HTTP chunking stream

I’ve been playing around with meetup.com’s API again and this time wanted to consume the chunked HTTP RSVP stream and filter RSVPs for events I’m interested in. I use Python for most of my hacking these days, and if HTTP requests are required, the requests library is my first port of call. I started out with the following script: import ...
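The awkward part of a chunked stream is that a chunk boundary can fall in the middle of a JSON document. The following is a minimal, standard-library-only sketch of one way to handle that: buffer the incoming text and decode as many complete documents as possible. It assumes the chunks have already been decoded to `str` and that each document is a JSON object; `iter_json_documents` is a hypothetical helper, not code from the original post, and the sample chunks are invented.

```python
import json


def iter_json_documents(chunks):
    """Yield complete JSON documents from an iterable of text chunks."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                document, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # Document is incomplete; wait for the next chunk.
                break
            yield document
            buffer = buffer[end:]


# Invented chunks: the second document is split across two chunks.
chunks = ['{"event": 1}\n{"ev', 'ent": 2}\n']
print(list(iter_json_documents(chunks)))
# [{'event': 1}, {'event': 2}]
```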

Neo4j: Loading JSON documents with Cypher

One of the questions I get asked most often is how to load JSON documents into Neo4j, and although Cypher doesn’t have a ‘LOAD JSON’ command we can still get JSON data into the graph. Michael shows how to do this from various languages in this blog post and I recently wanted to load a JSON document that I ...

Python: Extracting Excel spreadsheet into CSV files

I’ve been playing around with the Road Safety open data set and the download comes with several CSV files and an Excel spreadsheet containing the legend. There are 45 sheets in total and each of them looks like this: I wanted to create a CSV file for each sheet so that I can import the data set into Neo4j using ...

Python: Converting WordPress posts in CSV format

Over the weekend I wanted to look into the WordPress data behind this blog (very meta!) and wanted to get the data in CSV format so I could do some analysis in R. I found a couple of WordPress CSV plugins but unfortunately I couldn’t get any of them to work and ended up working with the raw XML data ...

Python: Refactoring to iterator

Over the last week I’ve been building a set of scripts to scrape the events from the Bayern Munich/Barcelona game and I’ve ended up with a few hundred lines of nested for statements, if statements and mutated lists. I thought it was about time I did a bit of refactoring. The following is a function which takes in a match ...

Python: Selecting certain indexes in an array

A couple of days ago I was scraping the UK parliament constituencies from Wikipedia in preparation for the Graph Connect hackathon and had got to the point where I had an array with one entry per column in the table: import requests from bs4 import BeautifulSoup from soupselect import select page = open("constituencies.html", 'r') soup = BeautifulSoup(page.read()) ...
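For picking specific positions out of a list like that, a list comprehension over the wanted indexes is usually enough. This is a small sketch with made-up row values; `select_indexes` is a hypothetical helper and the post itself may settle on a different technique.

```python
def select_indexes(row, indexes):
    """Return the entries of row at the given positions, in that order."""
    return [row[i] for i in indexes]


# Invented row standing in for one table row with one entry per column.
row = ["col0", "col1", "col2", "col3", "col4"]
print(select_indexes(row, [0, 2, 4]))
# ['col0', 'col2', 'col4']
```

`operator.itemgetter(0, 2, 4)(row)` gives the same values as a tuple, which can be handy when the same selection is applied to many rows.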

