Scraping DBPedia for Fun and Corpora

One of the main challenges in procedural text generation is obtaining big enough corpora to produce surprising results. Hand-writing corpora is a good approach, but sometimes too time-consuming or unlikely to produce surprising enough results.

Another common approach is the use of machine learning to make use of unstructured data as a corpus. Markov chains and neural networks have their uses, but they’re not for every application either.

The third approach (Which Emily Short aligns with the principle of Beeswax) is scraping open access data. Wikipedia editors have done a lot of work structuring information about the world, and that data exists in a surprisingly machine-friendly format, assuming one knows how to coax it out.

Writing ad-hoc web scraping scripts is a valid and useful technique, but there’s a more convenient (well, for a certain value of convenient”) alternative: SPARQL queries.

DBPedia is a semantic web” collation of wikipedia, joining together Wikipedia’s information into a databade of machine-friendly relationships. It uses RDF as a format, which can be queried through SPARQL.

SPARQL is a query language for RDF databases. For those of you with database experience, this is similar to the much more broadly used SQL language used to manage relational databases. For those of you without database experience (like yours truly), you can rest assured that RDF and SPARQL are totally unlike relational databases or key-value storages, so you’re on the same footing as the MongoDB nerds.

RDF, or resource description framework, is a format for describing metadata. I realise your eyes are glazing over by now but bear with me. An RDF database, like DBPedia, is a big unordered pile of triples.

A triple is essentially a statement in the subject-predicate-object form we’re used to from English. However, all three components can be resources, ie web URLs that represent something — in the case of DBPedia, Wikipedia pages or ontologies” that are used as predicates. The same resource can be the object in one triple and the subject in another, forming a web of interconnected statements which can be searched.

A SPARQL query is a series of conditions, such as find me the names of British sailing ships launched after 1820, with their launch dates”. SPARQL is a language for expressing that.

SPARQL is also clumsy, not very intuitive even for technologists from outside the database realm, and obscurely documented. So this is my attempt at wresting it out of the hands of dedicated data nerds. I’ll be going step by step until we have a list of British sailing ships launched in the 19th century, in the form of a JSON file that looks like this:

{
"name": "HMS Plantagenet",
"launched": 1801
}

Making queries

You can use dedicated software to talk to SPARQL endpoints (ie, the servers that receive and respond to queries for a database) but DBPedia has a number of web interfaces to endpoints that are very convenient, such as this one.

First, a note about prefixes: In reality, every part of a triple is either a resource (Ie, a URL) or a literal value (String, number, or date). But typing out fully qualified URLs by hand gets tiresome fast. As such, SPARQL queries often start with a list of prefixes, shorthand for naming resources in specific domains. The DBPedia web query interface comes with a preloaded list of prefixes, and we’ll mostly be using that.

So when I write dbo:Ship, what that really means is <http://dbpedia.org/ontology/Ship>; when using a literal URL in SPARQL, we enclose it in angle brackets. Note that these names are case-sensitive, even though the keywords in the SPARQL language themselves aren’t. So let’s start with a simple query:

select distinct ?ship
where { ?ship rdf:type dbo:Ship }

This will get us a long list of every ship in Wikipedia, which unfortunately also includes things such as ship classes — so you’ll find specific U-Boats listed alongside models of U-Boats. Wikipedia’s data is often messy and noisy, and going one step at a time helps in not missing anything as you filter data.

Let’s go over this line by line, since SPARQL is probably unfamiliar even to programmers.

select distinct ?ship

This first select statement tells the database what we are looking for, that is, the columns in the table we’ll get as a result. For now, we’re looking only at ships; eventually we’ll want to connect ship names to ship launch dates. This isn’t as simple as finding a list of triples; it’s essentially finding a list of paths through the database that satisfy the particular query, since the name (a literal value) and the date (another literal value) are not in fact directly connected to each other, but rather are both objects of two different predicates with the same subject, the resource for a given ship.

?ship is a variable; variables in SPARQL are prefixed with ?, because the W3C designed this thing therefore using a character that was already in common use as a variable sigil was out of the question.

where { ?ship rdf:type dbo:Ship }

The where statement contains a list of conditions that have to be fulfilled for a valid path to be found. This one simply states that we’re looking for ?ship where every possible value of ?ship relates to dbo:Ship via the rdf:type predicate.

rdf:type is a commonly-used predicate used to mean is a”; dbo:Ship is an ontology, one of many objects created in DBPedia for the purpose of acting as categories. I’ll talk about how to figure out what resources to reference at the end of this tutorial.

We can add another column to our table:

select distinct ?ship ?propulsion
where {
?ship rdf:type dbo:Ship .
?ship dbp:shipPropulsion ?propulsion
}

Note the . used as a separator between statements. This won’t refine the search, but it’ll give us a table of ships with their propulsion methods. This is useful for finding out how that’s specified in the data. Looking over the entries, we find that both Sail” and Sails” are often used to denote a sailing vessel. We don’t need our corpora to be totally perfectly comprehensive (Wikipedia scraping won’t get you that anyway), so let’s just consider that our qualification.

select distinct ?ship
where {
?ship rdf:type dbo:Ship .
?ship dbp:shipPropulsion "Sails"@en
}

"Sails"@en is a string literal. Strings in RDF come with a specified language, so just Sails wouldn’t match; we need the language tag (@en) in there. This is only half the equation, though; Sails” isn’t Sail”; curse Wikipedia editors for their inconsistency.

Here’s how we look up both together:

select distinct ?ship
where {
?ship rdf:type dbo:Ship .
{ ?ship dbp:shipPropulsion "Sails"@en } union
{ ?ship dbp:shipPropulsion "Sail"@en }
}

union is a SPARQL operator. It means a set union, of course, and it’s infix, because why would the syntax make sense. This gets us all the sailing ships, at last.

By looking at the data, we can find the right names to use in order to further select only British ships:

select distinct ?ship
where {
?ship rdf:type dbo:Ship .
{ ?ship dbp:shipPropulsion "Sails"@en } union
{ ?ship dbp:shipPropulsion "Sail"@en } .
?ship dbo:country dbr:United_Kingdom_of_Great_Britain_and_Ireland
}

Finally, we want to know when those ships were launched, and filter out the ones that were launched before or after the 19th century:

select distinct ?ship ?launched
where {
?ship rdf:type dbo:Ship .
{ ?ship dbp:shipPropulsion "Sails"@en } union
{ ?ship dbp:shipPropulsion "Sail"@en } .
?ship dbo:country dbr:United_Kingdom_of_Great_Britain_and_Ireland .
?ship dbo:shipLaunch ?launched .
filter (
?launched > xsd:dateTime('1820-1-1') &&
?launched < xsd:dateTime('1900-1-1')
)
}

Note how we can have two variables in a predicate: ?ship dbo:shipLaunch ?launched. This lets us traverse the network of triples, going arbitrarily far and deep across the relationships; it’s possible to ask elaborate questions such as Football players under 25 who play for countries that took part in WWII”, because we can draw indirect relationships like that.

The contents of the filter statement should make sense to people with some programming familiarity; the one notable thing is that to write out a date literal, we use a function to create it from a string. Simply writing 1820-1-1” wouldn’t work.

Now we have a table of ships (that is, web resources representing ships) and their launch dates. But we want a table of ships’ names and their launch dates, information that we can actually use. For neatness’ sake, we’ll also sort the results by date:

select distinct ?ship ?name ?launched
where {
?ship rdf:type dbo:Ship .
{ ?ship dbp:shipPropulsion "Sails"@en } union
{ ?ship dbp:shipPropulsion "Sail"@en } .
?ship dbo:country dbr:United_Kingdom_of_Great_Britain_and_Ireland .
?ship dbo:shipLaunch ?launched .
filter (
?launched > xsd:dateTime('1820-1-1') &&
?launched < xsd:dateTime('1900-1-1')
) .
?ship dbp:shipName ?name
}
order by asc(?launched)

asc means ascending, of course. At this point, we can change the results format” setting on the web interface to JSON and download a nice machine-readable JSON file.

The JSON includes a lot of metadata we don’t need, but it’s easy to clean that up with a simple script. You can use whatever tool you like for this; I wrote a dirty ES6 script that runs on babel-node:

import jetpack from 'fs-jetpack'
jetpack.write('ships-cleaned.json',
jetpack.read('ships.json', 'json')
.results
.bindings
.map(entry => ({
name: entry.name.value,
launched: entry.launched.value.split('-')[0]
})
)
)

You can see the final result in this gist.

Finding Resources

Here’s the problem with SPARQL: Even if you know the syntax and semantics of it, you don’t necessarily know what resources to use in queries, which is to say the right names to express the relationships you want to search for.

So far, the best way I’ve found of figuring this out is by using the DBPedia faceted browser. With it, you can search for the DBPedia resources that are counterparts to wikipedia pages, and see how their relationships are structured and what predicates are used. For instance, when I started writing this example, I first looked at the page for the HMS Trafalgar, which is where I found out how the different relationships are structured in the data: dbo:country used to express country of origin, for instance, and that ships have rdf:type to dbo:Ship. Some experimentation is required to get useful queries, and I’m still myself figuring out how to best use this tool.

Now go out there and make some twitter bots.