Quotations Book Example
From n² wiki
I thought I'd have a play with loading some RDF into a store. These are my working notes. Looking around, I found that Quotations Book had released an RSS 1.0 feed of their data under a Creative Commons Attribution licence. (Since I did this, Tom Morris has performed a conversion of the data to RDF, which is also available directly from Quotations Book.)
Examining the Data
The Quotations Book is presented as a big RSS 1.0 feed, formatted for use with Google Base. I was only interested in the RSS items, which look like the following (I've included the root rdf:RDF element too so you can see the namespaces):
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:g="http://base.google.com/ns/1.0"
    xmlns:c="http://base.google.com/cns/1.0">
  <item rdf:about="http://quotationsbook.com/quote/43114">
    <link>http://quotationsbook.com/quote/43114</link>
    <title>Quote by Jessel, George on Perseverance</title>
    <c:quote type="string">If you haven't struck oil in the first three minutes - stop boring!</c:quote>
    <c:quote_link type="url">http://quotationsbook.com/quote/43114</c:quote_link>
    <c:author type="string">Jessel, George</c:author>
    <c:author_link type="url">http://quotationsbook.com/author/3811</c:author_link>
    <c:subject type="string">Perseverance</c:subject>
    <c:subject_link type="url">http://quotationsbook.com/subject/perseverance</c:subject_link>
  </item>
</rdf:RDF>
Ummm. The first problem was that those items aren't valid RDF at all. Those type attributes really mess things up. So, I needed to map this to some valid RDF.
Also, the QuotationsBook export uses a Google namespace for the quotes, and I'm pretty sure that namespace isn't designed for expressing quotations data. So I took the liberty of mapping the QuotationsBook data to a new quotation schema containing a single class, "Quotation", and a property called "quote". I wanted to include information about the CC attribution licence, so I included a dc:rights element. Because Platform stores can't do that yet, but it is planned. I used dc:creator to denote the name of the person who uttered the quotation, but I could have included some FOAF to properly link the quotation to a page about that person. I also used foaf:isPrimaryTopicOf for the link back to the page on QuotationsBook.com. Finally, I used dc:subject to denote the classification of the quotation. It's worth comparing this with Tom Morris' version at the Quotations Book site, which uses dc:subject and dc:creator but with resources as values rather than literals. My preference is to use literals with those properties.
I wanted my quotes to look like this:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:q="http://purl.org/vocab/quotation/schema#" >
  <q:Quotation>
    <q:quote>If you haven't struck oil in the first three minutes - stop boring!</q:quote>
    <dc:rights>Published by QuotationsBook.com Limited under a creative commons attribution licence.</dc:rights>
    <foaf:isPrimaryTopicOf rdf:resource="http://quotationsbook.com/quote/43114" />
    <dc:creator>Jessel, George</dc:creator>
    <dc:subject>Perseverance</dc:subject>
  </q:Quotation>
</rdf:RDF>
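The mapping from feed fields to this target shape can also be sketched with an XML library, so that escaping is handled automatically. This is only an illustration in Python, not what I actually ran (the conversion script further down is Perl and regex based). The function name quotation_rdf and its arguments are made up for the example; the namespaces and the rights text are the ones shown above.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"
Q = "http://purl.org/vocab/quotation/schema#"

RIGHTS = ("Published by QuotationsBook.com Limited under a "
          "creative commons attribution licence.")

# Ask ElementTree to serialise with the same prefixes as the target document.
for prefix, uri in (("rdf", RDF), ("dc", DC), ("foaf", FOAF), ("q", Q)):
    ET.register_namespace(prefix, uri)


def quotation_rdf(quote, author, subject, link):
    """Return one q:Quotation wrapped in rdf:RDF (hypothetical helper)."""
    root = ET.Element(f"{{{RDF}}}RDF")
    item = ET.SubElement(root, f"{{{Q}}}Quotation")
    ET.SubElement(item, f"{{{Q}}}quote").text = quote
    ET.SubElement(item, f"{{{DC}}}rights").text = RIGHTS
    ET.SubElement(item, f"{{{FOAF}}}isPrimaryTopicOf",
                  {f"{{{RDF}}}resource": link})
    ET.SubElement(item, f"{{{DC}}}creator").text = author
    ET.SubElement(item, f"{{{DC}}}subject").text = subject
    # Text such as "&" is escaped during serialisation.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(quotation_rdf(
        "If you haven't struck oil in the first three minutes - stop boring!",
        "Jessel, George",
        "Perseverance",
        "http://quotationsbook.com/quote/43114",
    ))
```

Each quotation has no rdf:about of its own, so it ends up as a blank node, as in the target document above.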
Configuring the Store
Knowing this, I created a field/predicate map so my store would know how to index my RDF. I mapped three properties to three field names:
- http://purl.org/vocab/quotation/schema#quote to quote
- http://purl.org/dc/elements/1.1/creator to creator
- http://purl.org/dc/elements/1.1/subject to subject
That means I'll be able to search the data using short names like subject:love. Here's what it looked like:
<rdf:RDF
    xmlns:frm="http://schemas.talis.com/2006/frame/schema#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:bf="http://schemas.talis.com/2006/bigfoot/configuration#" >
  <bf:FieldPredicateMap rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/fpmaps/default">
    <rdfs:label>Quotations book field/predicate map</rdfs:label>
    <frm:mappedDatatypeProperty>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/fpmaps/default#quote">
        <frm:property rdf:resource="http://purl.org/vocab/quotation/schema#quote"/>
        <frm:name>quote</frm:name>
      </rdf:Description>
    </frm:mappedDatatypeProperty>
    <frm:mappedDatatypeProperty>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/fpmaps/default#author">
        <frm:property rdf:resource="http://purl.org/dc/elements/1.1/creator"/>
        <frm:name>creator</frm:name>
      </rdf:Description>
    </frm:mappedDatatypeProperty>
    <frm:mappedDatatypeProperty>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/fpmaps/default#subject">
        <frm:property rdf:resource="http://purl.org/dc/elements/1.1/subject"/>
        <frm:name>subject</frm:name>
      </rdf:Description>
    </frm:mappedDatatypeProperty>
  </bf:FieldPredicateMap>
</rdf:RDF>
I used cURL from the command line to PUT this to the right place in my store.
I also set up a query profile, so that I could use simple keyword searching in my store. I weighted the quote field as 5, and the creator and subject as 2 each. That means any searches that aren't prefixed with a field name will favour the quote field over the other two. Once again I used cURL from the command line to PUT this to the right place.
<rdf:RDF
    xmlns:frm="http://schemas.talis.com/2006/frame/schema#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:bf="http://schemas.talis.com/2006/bigfoot/configuration#" >
  <bf:QueryProfile rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/queryprofiles/default">
    <bf:fieldWeight>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/queryprofiles/default#quote">
        <frm:name>quote</frm:name>
        <bf:weight>5</bf:weight>
      </rdf:Description>
    </bf:fieldWeight>
    <bf:fieldWeight>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/queryprofiles/default#creator">
        <frm:name>creator</frm:name>
        <bf:weight>2</bf:weight>
      </rdf:Description>
    </bf:fieldWeight>
    <bf:fieldWeight>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/queryprofiles/default#subject">
        <frm:name>subject</frm:name>
        <bf:weight>2</bf:weight>
      </rdf:Description>
    </bf:fieldWeight>
  </bf:QueryProfile>
</rdf:RDF>
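I did both PUTs with cURL, but the same request can be built from a script. The following is a rough Python sketch, not the command I ran. It assumes that "the right place" is the URI given in the rdf:about of each configuration document, and that the body is sent as application/rdf+xml (the content type the loading script below uses for its POSTs). The helper name build_put is invented, and authentication is left out entirely.

```python
import urllib.request

# Assumption: the PUT targets are the rdf:about URIs of the two documents above.
FPMAP_URL = ("http://api.talis.com/stores/iand-dev2/"
             "indexes/default/fpmaps/default")
QUERY_PROFILE_URL = ("http://api.talis.com/stores/iand-dev2/"
                     "indexes/default/queryprofiles/default")


def build_put(url, rdfxml):
    """Build, but do not send, a PUT request carrying an RDF/XML body."""
    return urllib.request.Request(
        url,
        data=rdfxml.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/rdf+xml"},
    )


if __name__ == "__main__":
    # "fpmap.rdf" is a placeholder file name for the field/predicate map.
    with open("fpmap.rdf", encoding="utf-8") as handle:
        request = build_put(FPMAP_URL, handle.read())
    print(request.get_method(), request.full_url)
    # Sending it would need urllib.request.urlopen(request) plus credentials,
    # which this sketch deliberately omits.
```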
Converting and Loading
Now for the conversion. This time around I was using Perl so my first thought was to use the expat streaming XML parser to parse the file. However, a short way into developing this approach I realised that the file was not valid XML either! I got errors like not well-formed (invalid token) at line 54475, column 127, byte 3782711. OK, this intrepid Perl hacker is never daunted... so I turned to the trusty regex. Despite the other problems, at least the data was regular. I dealt with the bogus entities in the XML with a bit of regex to clean up truncated entities and unescaped ampersands.
Here's the code I used, followed by a few notes about it.
#!/usr/bin/perl -w
use strict;

# A script to convert Quotations Book data to a series of well-formed
# RDF/XML documents and POST them to a store in batches.

use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->credentials( 'api.talis.com:80', 'bigfoot', 'user' => 'password' );

my $filename   = 'quotes_full.xml';
my $batch_size = 1000;

# Opening tag shared by every batch document
my $header = qq~<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:q="http://purl.org/vocab/quotation/schema#" >\n~;

# Clean up truncated entities and unescaped ampersands
sub clean {
    my ($text) = @_;
    $text =~ s/&#\S*\s/ /g;
    $text =~ s/\s&\s/ &amp; /g;
    return $text;
}

# Close the current batch document and POST it to the metabox
sub post_batch {
    my ($rdfxml, $count) = @_;
    $rdfxml .= "</rdf:RDF>\n";
    print "POSTING $count quotes (" . length($rdfxml) . " characters) to metabox\n";
    my $req = HTTP::Request->new(POST => 'http://api.talis.com/stores/iand-dev2/meta');
    $req->content_type('application/rdf+xml');
    $req->content($rdfxml);
    my $res = $ua->request($req);
    print $res->as_string;
}

open my $data, '<', $filename or die "Cannot open $filename: $!";

my $rdfxml      = $header;
my $quote_count = 0;

while (my $line = <$data>) {
    if ($line =~ m~<item\s+rdf:about="(.+?)">~) {
        $rdfxml .= qq~<q:Quotation>\n~;
        $rdfxml .= qq~<dc:rights>Published by QuotationsBook.com Limited under a creative commons attribution licence.</dc:rights>\n~;
    }
    elsif ($line =~ m~<c:quote[^>_]+>([^<]+)<~) {
        # [^>_] stops this branch from matching c:quote_link
        $rdfxml .= '<q:quote>' . clean($1) . "</q:quote>\n";
    }
    elsif ($line =~ m~<c:author[^>_]+>([^<]+)<~) {
        $rdfxml .= '<dc:creator>' . clean($1) . "</dc:creator>\n";
    }
    elsif ($line =~ m~<c:subject[^>_]+>([^<]+)<~) {
        $rdfxml .= '<dc:subject>' . clean($1) . "</dc:subject>\n";
    }
    elsif ($line =~ m~<c:quote_link[^>]+>([^<]+)<~) {
        $rdfxml .= qq~<foaf:isPrimaryTopicOf rdf:resource="$1" />\n~;
    }
    elsif ($line =~ m~</item>~) {
        $rdfxml .= "</q:Quotation>\n";
        $quote_count++;
        if ($quote_count % $batch_size == 0) {
            post_batch($rdfxml, $quote_count);
            $rdfxml = $header;
        }
    }
}
close $data;

# POST any final partial batch left over after the loop
post_batch($rdfxml, $quote_count) if $quote_count % $batch_size != 0;
I batched the conversion up into groups of 1000 quotes and POSTed each one into the metabox of my store. This was mainly so I could catch any further conversion errors, and also so I could check on progress as it went. Performing authenticated POSTs to my store was easy, just a matter of the following:
my $ua = LWP::UserAgent->new;
$ua->credentials( 'api.talis.com:80', 'bigfoot', 'user' => 'password' );
and LWP::UserAgent took care of all the authentication details.
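For anyone who wants to try the entity cleanup in isolation, here is a rough Python equivalent of the two substitutions. It is a sketch rather than the code I ran, and it assumes the second substitution is meant to escape a bare ampersand as &amp;amp; (which is what "unescaped ampersands" implies). The function name clean_text is made up.

```python
import re


def clean_text(text):
    """Approximate the two regex substitutions used in the Perl script."""
    # Drop anything that starts like a numeric character reference, up to
    # and including the next whitespace character, leaving a single space.
    text = re.sub(r"&#\S*\s", " ", text)
    # Escape an ampersand that stands alone between whitespace.
    text = re.sub(r"\s&\s", " &amp; ", text)
    return text


if __name__ == "__main__":
    print(clean_text("Tom & Jerry"))       # bare ampersand gets escaped
    print(clean_text("wait&#82 and see"))  # truncated reference is dropped
```

Note that the first pattern is blunt: it also throws away a perfectly good reference such as &amp;#233; when it is followed by whitespace. That was acceptable for this data set, but it is not a general-purpose fix.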
Exploring the Results
Now that the data was in my store, I could experiment with some of the searching. (These examples use iand-dev2 as the store name, which is where I did this experimentation. I might move it to a more permanent store if people think it's useful, so the URIs may change.)
- Searching for quotes about love (http://api.talis.com/stores/iand-dev2/items?query=love) reveals 1441 results
- Love and death: http://api.talis.com/stores/iand-dev2/items?query=love+death
- Computers: http://api.talis.com/stores/iand-dev2/items?query=computers
- Quotes by Einstein: http://api.talis.com/stores/iand-dev2/items?query=creator:einstein
And sparqling:
- Quotes by Sophia Loren: http://api.talis.com/stores/iand-dev2/services/sparql?query=prefix+dc%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E%0D%0Adescribe+%3Fs%0D%0Awhere+%7B%3Fs+dc%3Acreator+%22Loren%2C+Sophia%22+%7D
Ummm. Yeah, long URI :) The original query was this:
prefix dc: <http://purl.org/dc/elements/1.1/>
describe ?s
where {?s dc:creator "Loren, Sophia" }
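The long URI doesn't have to be built by hand. Here is a small Python sketch that encodes the query above into the same URI. It assumes only what that URI already shows: the query goes in a query parameter, and the lines are joined with CRLF (that is what %0D%0A decodes to). The function name sparql_url is invented for the example.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "http://api.talis.com/stores/iand-dev2/services/sparql"

QUERY_LINES = [
    "prefix dc: <http://purl.org/dc/elements/1.1/>",
    "describe ?s",
    'where {?s dc:creator "Loren, Sophia" }',
]


def sparql_url(lines, endpoint=SPARQL_ENDPOINT):
    """Join the query lines with CRLF and form-encode them into a GET URI."""
    query = "\r\n".join(lines)
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(sparql_url(QUERY_LINES))
```

urlencode turns spaces into + and percent-encodes everything else that needs it, which matches the form encoding used in the URI above.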
Facetting:
- A league table of who is quoted most often on the subject of love: http://api.talis.com/stores/iand-dev2/services/facet?query=subject%3Alove&fields=creator&top=10&output=html (hmmm interesting results)
- And the same for death: http://api.talis.com/stores/iand-dev2/services/facet?query=subject%3Adeath&fields=creator&top=10&output=html (the usual depressives :-) )
- What are the top topics for Socrates: http://api.talis.com/stores/iand-dev2/services/facet?query=creator%3Asocrates&fields=subject&top=10&output=html (Love, Life, Wisdom, Death, Education)
- And for Kennedy: http://api.talis.com/stores/iand-dev2/services/facet?query=creator%3Akennedy&fields=subject&top=10&output=html (Politics, Peace, Voting, Freedom, Change)
- And for Woody Allen: http://api.talis.com/stores/iand-dev2/services/facet?query=creator%3Aallen+creator%3Awoody&fields=subject&top=10&output=html (Death, Love, Sex, Marriage, Punishment)
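All of the facet URIs above follow one pattern: a fielded query, the field to facet on, how many values to return, and an output format. A minimal Python sketch of that pattern follows. The parameter names query, fields, top and output are taken straight from the URIs above; the helper name facet_url and its defaults are mine.

```python
from urllib.parse import urlencode

FACET_ENDPOINT = "http://api.talis.com/stores/iand-dev2/services/facet"


def facet_url(query, facet_field, top=10, output="html"):
    """Build a facet URI: search with `query`, then group by `facet_field`."""
    params = {
        "query": query,          # e.g. "subject:love"
        "fields": facet_field,   # e.g. "creator"
        "top": top,
        "output": output,
    }
    return FACET_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Who is quoted most often on the subject of love?
    print(facet_url("subject:love", "creator"))
    # What are the top topics for Woody Allen?
    print(facet_url("creator:allen creator:woody", "subject"))
```

Swapping the field in the query with the field being faceted on is all it takes to go from "who talks most about this subject" to "what does this person talk about most".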

