Quotations Book Example

From n² wiki

I thought I'd have a play with loading some RDF into a store. These are my working notes. Looking around, I found that Quotations Book had released an RSS 1.0 feed of their data under a Creative Commons Attribution licence. (Since I did this, Tom Morris has performed a conversion of the data to RDF, which is also available direct from Quotations Book.)

Examining the Data

The Quotations Book is presented as a big RSS 1.0 feed, formatted for use with Google Base. I was only interested in the RSS items, which look like the following (I've included the root rdf:RDF element too so you can see the namespaces):

<rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns="http://purl.org/rss/1.0/"
        xmlns:g="http://base.google.com/ns/1.0"
        xmlns:c="http://base.google.com/cns/1.0">
 
 
  <item rdf:about="http://quotationsbook.com/quote/43114">
    <link>http://quotationsbook.com/quote/43114</link>
    <title>Quote by Jessel, George on Perseverance</title>
    <c:quote type="string">If you haven't struck oil in the first three minutes - stop boring!</c:quote>
    <c:quote_link type="url">http://quotationsbook.com/quote/43114</c:quote_link>
    <c:author type="string">Jessel, George</c:author>
    <c:author_link type="url">http://quotationsbook.com/author/3811</c:author_link>
    <c:subject type="string">Perseverance</c:subject>
    <c:subject_link type="url">http://quotationsbook.com/subject/perseverance</c:subject_link>
  </item>
</rdf:RDF>

Ummm. The first problem was that those items aren't valid RDF at all. Those type attributes really mess things up. So, I needed to map this to some valid RDF.

Also, the QuotationsBook export uses a Google namespace for the quotes, and I'm pretty sure that namespace isn't designed for expressing quotations data. So I took the liberty of mapping the QuotationsBook data to a new quotation schema containing a single class, "Quotation", and a property called "quote". I wanted to include information about the CC attribution licence, so I included a dc:rights element. Because Platform stores can't do that yet but it is planned. I used dc:creator to denote the name of the person who uttered the quotation, but I could have included some FOAF to properly link the quotation to a page about that person. I also used foaf:isPrimaryTopicOf for the link back to the page on QuotationsBook.com. Finally, I used dc:subject to denote the classification of the quotation.

It's worth comparing this with Tom Morris' version at the Quotations Book site, which uses dc:subject and dc:creator but with resources as values rather than literals. My preference is to use literals with those properties.

I wanted my quotes to look like this:

<rdf:RDF 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:q="http://purl.org/vocab/quotation/schema#"
>
  <q:Quotation>
    <q:quote>If you haven't struck oil in the first three minutes - stop boring!</q:quote>
    <dc:rights>Published by QuotationsBook.com Limited under a creative commons attribution licence.</dc:rights>
    <foaf:isPrimaryTopicOf rdf:resource="http://quotationsbook.com/quote/43114" />
    <dc:creator>Jessel, George</dc:creator>
    <dc:subject>Perseverance</dc:subject>
  </q:Quotation>
</rdf:RDF>
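
As a cross-check on the mapping above, here is a minimal Python sketch (not the Perl I actually ran) that builds the same target shape with the standard library's ElementTree. The function name quotation_to_rdfxml and the dictionary keys are my own illustrative choices, not part of any schema or API.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"
Q = "http://purl.org/vocab/quotation/schema#"
RIGHTS = ("Published by QuotationsBook.com Limited under a "
          "creative commons attribution licence.")

# Ask ElementTree to use the same prefixes as the hand-written example.
for prefix, uri in (("rdf", RDF), ("dc", DC), ("foaf", FOAF), ("q", Q)):
    ET.register_namespace(prefix, uri)


def quotation_to_rdfxml(item):
    """Serialise one quotation (a dict with hypothetical keys) as RDF/XML."""
    root = ET.Element(f"{{{RDF}}}RDF")
    quotation = ET.SubElement(root, f"{{{Q}}}Quotation")
    ET.SubElement(quotation, f"{{{Q}}}quote").text = item["quote"]
    ET.SubElement(quotation, f"{{{DC}}}rights").text = RIGHTS
    # The link back to the QuotationsBook page is a resource, not a literal.
    ET.SubElement(quotation, f"{{{FOAF}}}isPrimaryTopicOf",
                  {f"{{{RDF}}}resource": item["quote_link"]})
    ET.SubElement(quotation, f"{{{DC}}}creator").text = item["author"]
    ET.SubElement(quotation, f"{{{DC}}}subject").text = item["subject"]
    return ET.tostring(root, encoding="unicode")


example = {
    "quote": "If you haven't struck oil in the first three minutes - stop boring!",
    "quote_link": "http://quotationsbook.com/quote/43114",
    "author": "Jessel, George",
    "subject": "Perseverance",
}
print(quotation_to_rdfxml(example))
```

Because the text goes through an XML serialiser, characters such as a bare ampersand are escaped automatically, which is one reason to prefer this over string concatenation when the input is already clean.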

Configuring the Store

Knowing this, I created a field/predicate map so my store would know how to index my RDF. I mapped three properties to three field names: q:quote to quote, dc:creator to creator, and dc:subject to subject.

That means I'll be able to search the data using short names like subject:love. Here's what it looked like:

<rdf:RDF
    xmlns:frm="http://schemas.talis.com/2006/frame/schema#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:bf="http://schemas.talis.com/2006/bigfoot/configuration#" > 
 
  <bf:FieldPredicateMap rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/fpmaps/default">
    <rdfs:label>Quotations book field/predicate map</rdfs:label>
    <frm:mappedDatatypeProperty>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/fpmaps/default#quote">
        <frm:property rdf:resource="http://purl.org/vocab/quotation/schema#quote"/>
        <frm:name>quote</frm:name>
      </rdf:Description>
    </frm:mappedDatatypeProperty>
 
    <frm:mappedDatatypeProperty>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/fpmaps/default#author">
        <frm:property rdf:resource="http://purl.org/dc/elements/1.1/creator"/>
        <frm:name>creator</frm:name>
      </rdf:Description>
    </frm:mappedDatatypeProperty>
 
    <frm:mappedDatatypeProperty>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/fpmaps/default#subject">
        <frm:property rdf:resource="http://purl.org/dc/elements/1.1/subject"/>
        <frm:name>subject</frm:name>
      </rdf:Description>
    </frm:mappedDatatypeProperty>
  </bf:FieldPredicateMap>
</rdf:RDF>

I used cURL from the command line to PUT this to the right place in my store.
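
For anyone who would rather script this step than use cURL, here is a rough Python sketch that only builds the PUT request and does not send it. I am assuming the target is the same URI as the map's rdf:about attribute and that the body is sent as application/rdf+xml (the content type the loading script uses for its POSTs); build_put_request is an illustrative name, and the body shown is an empty placeholder rather than the real map.

```python
import urllib.request

# Assumption: the map is PUT to the URI given in its own rdf:about attribute.
FPMAP_URI = ("http://api.talis.com/stores/iand-dev2/"
             "indexes/default/fpmaps/default")


def build_put_request(uri, rdfxml):
    """Build (but do not send) a PUT request carrying an RDF/XML body."""
    return urllib.request.Request(
        uri,
        data=rdfxml.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/rdf+xml"},
    )


# Placeholder body; in practice this would be the field/predicate map above.
placeholder = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
request = build_put_request(FPMAP_URI, placeholder)
print(request.get_method(), request.full_url)
```

Actually sending it would mean passing the request to an opener configured with the store's credentials, which I have left out here.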

I also set up a query profile, so that I could use simple keyword searching in my store. I weighted the quote field as 5, and the creator and subject as 2 each. That means any searches that aren't prefixed with a field name will favour the quote field over the other two. Once again I used cURL from the command line to PUT this to the right place.

<rdf:RDF
    xmlns:frm="http://schemas.talis.com/2006/frame/schema#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:bf="http://schemas.talis.com/2006/bigfoot/configuration#" > 
  <bf:QueryProfile rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/queryprofiles/default">
    <bf:fieldWeight>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/queryprofiles/default#quote">
        <frm:name>quote</frm:name>
        <bf:weight>5</bf:weight>
      </rdf:Description>
    </bf:fieldWeight>
    <bf:fieldWeight>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/queryprofiles/default#creator">
        <frm:name>creator</frm:name>
        <bf:weight>2</bf:weight>
      </rdf:Description>
    </bf:fieldWeight>
    <bf:fieldWeight>
      <rdf:Description rdf:about="http://api.talis.com/stores/iand-dev2/indexes/default/queryprofiles/default#subject">
        <frm:name>subject</frm:name>
        <bf:weight>2</bf:weight>
      </rdf:Description>
    </bf:fieldWeight>
  </bf:QueryProfile>
</rdf:RDF>
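
To give a feel for what those weights mean, here is a deliberately crude toy model in Python. It is not how the store actually ranks results (I don't know the details of its scoring); it just adds up the weight of every field that contains the search term. The second record is a made-up placeholder.

```python
# Field weights copied from the query profile above.
WEIGHTS = {"quote": 5, "creator": 2, "subject": 2}


def toy_score(record, term):
    """Sum the weights of every field whose text contains the term."""
    term = term.lower()
    return sum(weight for field, weight in WEIGHTS.items()
               if term in record.get(field, "").lower())


records = [
    {"quote": "If you haven't struck oil in the first three minutes - stop boring!",
     "creator": "Jessel, George", "subject": "Perseverance"},
    # Hypothetical record that mentions the term only in its subject.
    {"quote": "A placeholder quotation", "creator": "Example, Author",
     "subject": "Oil"},
]

for record in records:
    print(toy_score(record, "oil"), record["subject"])
```

With an unprefixed search for oil, the record that matches in the quote field scores 5 and the one that matches only in the subject scores 2, which is the "favour the quote field" effect in miniature.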

Converting and Loading

Now for the conversion. This time around I was using Perl so my first thought was to use the expat streaming XML parser to parse the file. However, a short way into developing this approach I realised that the file was not valid XML either! I got errors like not well-formed (invalid token) at line 54475, column 127, byte 3782711. OK, this intrepid Perl hacker is never daunted... so I turned to the trusty regex. Despite the other problems, at least the data was regular. I dealt with the bogus entities in the XML with a bit of regex to clean up truncated entities and unescaped ampersands.

Here's the code I used, followed by a few notes about it.

#!/usr/bin/perl -w
use strict;
 
# a script to convert quotations book data to a series of well-formed rdf files
 
use LWP::UserAgent;
use HTTP::Request;   # used explicitly below to build each POST
my $ua = LWP::UserAgent->new;
 
$ua->credentials(
  'api.talis.com:80',
  'bigfoot',
  'user' => 'password'
);
 
my $filename = 'quotes_full.xml';
 
my @quotes = ();
my $rdfxml = qq~<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:q="http://purl.org/vocab/quotation/schema#"
>\n~;
my $batch_size = 1000;
 
# parenthesise open so that "or die" applies to open itself, not to $filename
open(DATA, '<', $filename) or die "Cannot open $filename: $!";
 
my $quote_count = 0;
my $line;
 
while ($line = <DATA>) {
  if ($line =~ m~<item\s+rdf:about="(.+?)">~) {
    $rdfxml .= qq~<q:Quotation>\n~;
    $rdfxml .= qq~<dc:rights>Published by QuotationsBook.com Limited under a creative commons attribution licence.</dc:rights>~;
  }
  elsif ($line =~ m~<c:quote[^>_]+>([^<]+)<~) {
    my $match = $1;
    $match =~ s/&#\S*\s/ /g;
    $match =~ s/\s&\s/ &amp; /g;
    $rdfxml .= qq~<q:quote>$match</q:quote>\n~;
  }
  elsif ($line =~ m~<c:author[^>_]+>([^<]+)<~) {
    my $match = $1;
    $match =~ s/&#\S*\s/ /g;
    $match =~ s/\s&\s/ &amp; /g;
    $rdfxml .= qq~<dc:creator>$match</dc:creator>\n~;
  }
  elsif ($line =~ m~<c:subject[^>_]+>([^<]+)<~) {
    my $match = $1;
    $match =~ s/&#\S*\s/ /g;
    $match =~ s/\s&\s/ &amp; /g;
    $rdfxml .= qq~<dc:subject>$match</dc:subject>\n~;
  }
  elsif ($line =~ m~<c:quote_link[^>]+>([^<]+)<~) {
    $rdfxml .= qq~<foaf:isPrimaryTopicOf rdf:resource="$1" />\n~;
  }
  elsif ($line =~ m~</item>~) {
    $rdfxml .= '</q:Quotation>';
    $quote_count++;
    if ( $quote_count % $batch_size == 0) {
      $rdfxml .= '</rdf:RDF>';
      print "POSTING $quote_count quotes (" . length($rdfxml) . " characters) to metabox\n";
 
      my $req = HTTP::Request->new(POST => 'http://api.talis.com/stores/iand-dev2/meta');
      $req->content_type('application/rdf+xml');
      $req->content($rdfxml);
      my $res = $ua->request($req);
      print $res->as_string;
 
      $rdfxml = qq~<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:foaf="http://xmlns.com/foaf/0.1/"
        xmlns:q="http://purl.org/vocab/quotation/schema#"
      >\n~;
    }
  }
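}
 
# POST any remaining quotes that did not fill a complete batch
if ($quote_count % $batch_size != 0) {
  $rdfxml .= '</rdf:RDF>';
  print "POSTING final batch, $quote_count quotes in total\n";
  my $req = HTTP::Request->new(POST => 'http://api.talis.com/stores/iand-dev2/meta');
  $req->content_type('application/rdf+xml');
  $req->content($rdfxml);
  print $ua->request($req)->as_string;
}
close DATA;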
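
The two clean-up substitutions are the fiddly part, so here they are restated as a small Python sketch that can be tried on sample strings. It is a straight port of the two s/// lines in the Perl; clean_text is just an illustrative name.

```python
import re


def clean_text(text):
    """Apply the same two substitutions as the Perl script."""
    # Drop anything that starts like a numeric character reference ("&#")
    # and runs up to and including the next whitespace character.
    text = re.sub(r"&#\S*\s", " ", text)
    # Escape an ampersand that stands alone between whitespace.
    text = re.sub(r"\s&\s", " &amp; ", text)
    return text


print(repr(clean_text("rock &#82 roll")))
print(repr(clean_text("salt & pepper")))
```

Note that the first pattern is blunt: it also throws away a well-formed reference, plus any non-space characters glued to it, so "it&#8217;s fine" comes out as "it fine". That was an acceptable loss for this experiment, but it is worth knowing about.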

I batched the conversion up into groups of 1000 quotes and POSTed each one into the metabox of my store. This was mainly so I could catch any further conversion errors and also so I could check on progress as it went. Performing secure POSTs to my store was easy, just a matter of the following:

my $ua = LWP::UserAgent->new;
 
$ua->credentials(
  'api.talis.com:80',
  'bigfoot',
  'user' => 'password'
);

and LWP::UserAgent took care of all the authentication details.
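
The batching idea itself is independent of Perl. Here is a minimal Python sketch of grouping any stream of items into fixed-size batches; the helper name batches is my own. This version also yields a final short batch, so a trailing group of fewer than 1000 items is not lost.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 2500 stand-in items split into groups of 1000.
sizes = [len(batch) for batch in batches(range(2500), 1000)]
print(sizes)  # [1000, 1000, 500]
```

Each batch would then be wrapped in its own rdf:RDF element and POSTed separately, so a failure only costs one group rather than the whole load.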

Exploring the Results

Now that the data was in my store, I could experiment with some of the searching. (These examples use iand-dev2 as a store name, which is where I did this experimentation. I might move it to a more permanent store if people think it's useful, so the URIs will change.)

And sparqling:

Ummm. Yeah, long URI :) The original query was this:

prefix dc: <http://purl.org/dc/elements/1.1/>
describe ?s
where {?s dc:creator "Loren, Sophia" }
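
The long URI is simply this query URL-encoded into a query string. Here is a small Python sketch of that encoding. The endpoint path is an assumption on my part and should be checked against the store's documentation; the query parameter name comes from the SPARQL protocol, and sparql_url is an illustrative name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint path for illustration only; check the store's docs.
SPARQL_ENDPOINT = "http://api.talis.com/stores/iand-dev2/services/sparql"

QUERY = """prefix dc: <http://purl.org/dc/elements/1.1/>
describe ?s
where {?s dc:creator "Loren, Sophia" }"""


def sparql_url(endpoint, query):
    """Return a GET URL with the query percent-encoded as ?query=..."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_url(SPARQL_ENDPOINT, QUERY)
print(url)

# Decoding the query string gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```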

Facetting:
