Python

Python: Converting WordPress posts in CSV format

Over the weekend I wanted to look into the WordPress data behind this blog (very meta!) and wanted to get the data in CSV format so I could do some analysis in R.

2015-07-07_06-59-02

I found a couple of WordPress CSV plugins but unfortunately I couldn’t get any of them to work and ended up working with the raw XML data that WordPress produces when you ‘export’ a blog.

I had the problem of the export being incomplete which I ‘solved’ by importing the posts in two parts of a few years each.

I then spent quite a few hours struggling to get the data into shape using R’s rvest library but eventually decided to do the scraping using Python’s beautifulsoup and save it to a CSV file for analysis in R.

The structure of the XML that we want to extract is as follows:

<rss version="2.0"
	xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:wp="http://wordpress.org/export/1.2/"
>
...
    <channel>
		<item>
		<title>First thoughts on Ruby...</title>
		<link>http://www.markhneedham.com/blog/2006/08/29/first-thoughts-on-ruby/</link>
		<pubDate>Tue, 29 Aug 2006 13:31:05 +0000</pubDate>
...

I wrote the following script to parse the files:

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser
 
import csv
 
def read_page(page):
    return BeautifulSoup(open(page, 'r').read())
 
with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])
 
    for row in select(read_page("part2.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])
 
    for row in select(read_page("part1.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])

We end up with a CSV file that looks like this:

$ head -n 10 posts.csv
title,date
Functional C#: Writing a 'partition' function,2010-02-01 23:34:02+00:00
Coding: Wrapping/not wrapping 3rd party libraries and DSLs,2010-02-02 23:54:21+00:00
Functional C#: LINQ vs Method chaining,2010-02-05 18:06:28+00:00
F#: function keyword,2010-02-07 02:54:13+00:00
Willed vs Forced designs,2010-02-08 22:48:05+00:00
Functional C#: Extracting a higher order function with generics,2010-02-08 23:17:47+00:00
Javascript: File encoding when using string.replace,2010-02-10 00:02:02+00:00
F#: Inline functions and statically resolved type parameters,2010-02-10 23:06:14+00:00
Javascript: Passing functions around with call and apply,2010-02-12 20:18:02+00:00

Let’s quickly look over the data in R and check it’s being correctly exported:

require(dplyr)
require(lubridate)
 
df = read.csv("posts.csv")
 
> df %>% count()
Source: local data frame [1 x 1]
 
     n
1 1501

So we’ve exported 1501 posts. Let’s cross check with the WordPress dashboard:

2015-07-07_07-06-02

We’ve gained two extra posts! A bit more exploration of the WordPress dashboard reveals that there are actually 2 draft posts lying around.

We probably want to remove those from the export and luckily there’s a ‘status’ tag for each post that we can check. We want to make sure it doesn’t have the value ‘draft’:

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser
 
import csv
 
def read_page(page):
    return BeautifulSoup(open(page, 'r').read())
 
with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])
 
    for row in select(read_page("part2.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])
 
    for row in select(read_page("part1.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])

I also had to check if that tag actually existed since there were a couple of posts which didn’t have it but had been published. If we check the resulting CSV file in R we can see that we’ve now got all the posts:

> df = read.csv("posts.csv")
> df %>% count()
Source: local data frame [1 x 1]
 
     n
1 1499

Now we’re ready to test a couple of hypotheses that I have but that’s for another post!

Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Back to top button