Building SchemaCache

From n² wiki


Bigfoot Fundamentals

Schema-cache is an application built on top of a Bigfoot RDF store, which it interacts with via a REST interface. You retrieve information with GET, save or change data with POST, and upload configuration files with PUT.

MetaBox

The MetaBox is the part of the store where you keep the RDF data. You can either load data straight into the MetaBox by POSTing RDF/XML to the /meta URI of your store, or use ChangeSets to add or remove triples from the store.

The URI of the schema-cache store is http://api.talis.com/stores/schema-cache

So the schema-cache’s MetaBox URI is http://api.talis.com/stores/schema-cache/meta

If you want to perform versioned changes to the data in the MetaBox, you post a ChangeSet document to http://api.talis.com/stores/schema-cache/meta/changesets .

ChangeSets

ChangeSets are RDF documents that describe the changes you wish to make to the data in the store. The document consists of reified triples, and statements that declare whether they should be removed from, or added to, the MetaBox of your store.

So to get the RDF into the store, you can:

  1. POST an RDF/XML document with an application/rdf+xml mimetype to /meta – this will save the RDF directly to the store.
  2. POST a ChangeSet document with an application/vnd.talis.changeset+xml mimetype to /meta – this will perform the additions and removals to the store, but will not save the changeset document itself to the store. It’s unversioned.
  3. POST a ChangeSet document with an application/vnd.talis.changeset+xml mimetype to /meta/changesets – this will perform the additions and removals, and will also save the changeset document to the MetaBox – it’s versioned. The reason you would want the ChangeSet saved to the MetaBox as well is that this enables you to undo the changes made.
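The three options differ only in the target path and the Content-Type, which can be summarised in a small lookup. This is a minimal sketch: the function name metabox_target and the option keys are made up for illustration and are not part of the Bigfoot client library. It only describes the request and does not send anything.

```php
<?php
// Hypothetical helper: maps each loading option to the path and
// Content-Type described in the list above.
function metabox_target(string $storeUri, string $option): array
{
    $options = [
        'rdf'                 => ['/meta', 'application/rdf+xml'],
        'changeset'           => ['/meta', 'application/vnd.talis.changeset+xml'],
        'versioned-changeset' => ['/meta/changesets', 'application/vnd.talis.changeset+xml'],
    ];
    if (!isset($options[$option])) {
        throw new InvalidArgumentException("Unknown option: $option");
    }
    [$path, $contentType] = $options[$option];
    return [
        'method'       => 'POST',
        'url'          => rtrim($storeUri, '/') . $path,
        'content_type' => $contentType,
    ];
}

$target = metabox_target('http://api.talis.com/stores/schema-cache', 'versioned-changeset');
echo $target['url'], "\n";
```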

In schema-cache, I loaded all the data by simply POSTing it straight to /meta, because I was POSTing it all at once. I would have preferred to use ChangeSets, and would like to use them for updating the schemas and ontologies, and for letting users import new schemas (and update old ones manually). At the moment, however, a ChangeSet can only affect one resource at a time, and Bigfoot only accepts one ChangeSet per HTTP request, so this would be a bit too slow (hopefully this restriction will be removed in the near future).

Querying the Data

Once you’ve got the data in, there are two ways to query it:

/items

Do a GET on /items with some query parameters, e.g. /items?query=John+Locke&max=20&offset=50. This will return an RSS 1.0 feed containing the results as a sequence of items, and information about the query (such as the total number of results found). These queries are performed on the indexes of the data (you have to configure the store to index on certain properties that you want to query on), and are very fast. The order of the results is determined by the quality of the text match, and by the weight given to certain properties (matches within labels and titles, for instance, are more important than those in comments).
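As a rough illustration of how such a URL can be put together in PHP, the sketch below uses http_build_query(), which encodes the space in the search text as "+". The function name items_url is hypothetical; the parameter names query, max and offset are the ones in the example URL.

```php
<?php
// Hypothetical helper: builds an /items query URL for a store.
function items_url(string $storeUri, string $query, int $max = 20, int $offset = 0): string
{
    $params = ['query' => $query, 'max' => $max, 'offset' => $offset];
    // Pass the separator explicitly so the result does not depend on php.ini.
    return rtrim($storeUri, '/') . '/items?' . http_build_query($params, '', '&');
}

echo items_url('http://api.talis.com/stores/schema-cache', 'John Locke', 20, 50), "\n";
```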

The schema-cache site uses this for the /Search section, to perform fast text matching. It also uses this to provide RSS feeds for the Schemas, Classes and Properties sections of the site (the queries can simply be constraints on the rdf:type property).

/services/sparql

A more powerful (though slower) way of querying the data is via the SPARQL endpoint. Again, you do a GET with query parameters, and get back either RDF/XML or SPARQL XML, depending on the kind of query you did.

In schema-cache, I used DESCRIBE and CONSTRUCT queries in preference to SELECTs because they both return RDF/XML (whereas a SELECT query returns a table of results in XML). I like to get back RDF because I can deal with all RDF in the same way (it all has the same shape), and because the data keeps its own semantics all the way to the template (which means less work and more flexibility for me, as I don’t need to hard-code the semantics into my variables).
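A DESCRIBE request of this kind might be built as in the sketch below. The function name sparql_url is hypothetical, and the parameter name "query" is assumed from the SPARQL protocol rather than taken from the Bigfoot documentation.

```php
<?php
// Hypothetical helper: builds a GET URL for the store's SPARQL endpoint.
// The "query" parameter name is an assumption based on the SPARQL protocol.
function sparql_url(string $storeUri, string $sparql): string
{
    return rtrim($storeUri, '/') . '/services/sparql?'
        . http_build_query(['query' => $sparql], '', '&');
}

$describe = 'DESCRIBE <http://www.gnowsis.org/ont/kissology#kissed>';
$url = sparql_url('http://api.talis.com/stores/schema-cache', $describe);
echo $url, "\n";
```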

Schema-cache application design

I approached the design of the application by deciding what URLs I wanted it to have, and what I wanted to show at them:

/Importer a form for POSTing the uri of a schema to be imported
/Schemas for listing and searching through the schemas and ontologies
/Classes for listing the owl and rdfs classes
/Properties for listing the properties
/Search for doing text search on the data
/Res?uri= for viewing the details of each individual resource
/Inverse?uri= for listing the resources that link to a particular resource
/services/sparql a SPARQL interface to the data – but as well as returning RDF, it can also serve the results as HTML hyperlinking into the application (in the case of DESCRIBE and CONSTRUCT queries at least), and as JSON and JSONP
/services/json-query an experimental query interface which takes in an rdf json template, translates it into a sparql query, and serves the results (again, either as HTML, JSON, JSONP, or RDF/XML). I thought this might be handy for javascript applications built on top of the data, as they can create and manipulate a query object more readily than a query string. Time will tell I suppose.

The file structure looks like this:

.htaccess

dispatch.php

/config

  • mimetypes.php: contains a key => value array of file extensions => mimetypes. The mimetype of each web page served is determined by the value that corresponds to the file extension of the template used.
  • namespaces.php: a key => value array of prefix => namespace URIs used at various places in the application (SPARQL queries, templates, converting RDF/XML to JSON and Turtle).
  • routes.php: an associative array for configuring the class, templates, and parameters needed to serve responses in each URL space.

/lib

This is where I put the libraries and components I used. /arc contains the arc RDF/XML parser, /json contains a json serialiser/deserialiser class, /bigfoot contains a bigfoot client library for interacting with bigfoot over HTTP, and convertors.php is a library I wrote for converting various kinds of data into various other kinds of data.

  • /arc
  • /bigfoot
  • /json
  • convertors.php

/requests

These are classes that return responses to requests; they have methods that correspond to the HTTP methods they support (eg: the class may have a GET method which is called when the HTTP request method is GET ).

  • item.php
  • collection.php
  • contentbox.php
  • importer.php
  • jsonquery.php
  • sparqler.php

/templates

These are the templates that the application uses to construct the web pages it serves. The file extensions used here occur as keys in the mimetypes array in /config/mimetypes.php – the Content-type the page will be served with is the value that corresponds to that key.

eg: $mimetypes['html'] == 'text/html', so if the file extension is html, the web page will be served with a “Content-type: text/html” header.
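The lookup can be sketched as follows. Only the html entry comes from the text above; the json and rdf entries and the function name content_type_header are assumptions for illustration.

```php
<?php
// Stand-in for /config/mimetypes.php. Only 'html' is taken from the
// text; the other entries are assumed.
$mimetypes = [
    'html' => 'text/html',
    'json' => 'application/json',
    'rdf'  => 'application/rdf+xml',
];

// Hypothetical helper: returns the Content-type header for a template file.
function content_type_header(string $templateFile, array $mimetypes): string
{
    $extension = pathinfo($templateFile, PATHINFO_EXTENSION);
    if (!isset($mimetypes[$extension])) {
        throw new RuntimeException("No mimetype configured for .$extension");
    }
    return 'Content-type: ' . $mimetypes[$extension];
}

echo content_type_header('item.html', $mimetypes), "\n";
```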

Templates can have ‘parent templates’, that they are included in. Schema-cache uses main.html as the parent template for html pages; it contains all the boiler-plate (header, footer, navigation).

  • collection.html
  • default.json
  • default.jsonp
  • default.rdf
  • import.html
  • inverse.html
  • item.html
  • json-query.html
  • main.html
  • sparql.html

Overview of a Request -> Response

The process of the application serving a response to a request goes like this:

  1. The .htaccess file intercepts all the requests with a mod_rewrite, and redirects them to dispatch.php
  2. dispatch.php includes mimetypes.php, namespaces.php, and routes.php, matches the requested URI against the URL regexes in routes.php, creates a new Request object (of the class defined in routes.php), and calls the method on that object that has the same name as the HTTP method of the request.
  3. So if the request is a GET, $request->GET will be called, which will query the Bigfoot store for data, and pass that data to a template to create a web page. The web page will be returned by the GET method, and printed to screen by dispatch.php
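The steps above can be sketched roughly as below. This is not the actual dispatch.php: the dispatch function, the stub Item class, the "#" regex delimiter, and the 404/405 strings are all assumptions made to keep the example self-contained.

```php
<?php
// Stub request class standing in for one in /requests; the real
// classes query the store and render a template.
class Item
{
    private $params;

    public function __construct(array $params)
    {
        $this->params = $params;
    }

    public function GET(): string
    {
        return 'item page for ' . $this->params['uri'];
    }
}

// Hypothetical dispatcher: finds the first route whose regex matches the
// path, builds the request object from the remaining parameters, and
// calls the method named after the HTTP method.
function dispatch(array $routes, string $path, string $httpMethod): string
{
    foreach ($routes as $regex => $config) {
        if (!preg_match('#^' . $regex . '#', $path)) {
            continue;
        }
        $class = $config['class'];
        unset($config['class']);
        $request = new $class($config);
        if (!method_exists($request, $httpMethod)) {
            return '405 Method Not Allowed';
        }
        return $request->$httpMethod();
    }
    return '404 Not Found';
}

$routes = [
    '/Res/*' => ['class' => 'Item', 'template' => 'item', 'uri' => 'http://example.org/thing'],
];
echo dispatch($routes, '/Res', 'GET'), "\n";
```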

The Routing

This is probably the most interesting bit of the application. The routing is configured by an associative array that tells the application (schema-cache) which PHP class to use, with which template and parameters, for each URL.

In the associative array, the keys are the regular expressions for the URLs that the application will serve. The value of each URL key is another key => value array. There needs to be a ‘class’ key which corresponds to the name of a class and file in the /requests folder. When a URL is requested that matches the regex key, an object will be instantiated from that class, and all the other key => values will be passed as parameters to that object’s constructor.

NB: If you group parts of the regular expression, you can put the back-references into the parameter values.
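The text does not say what syntax the back-references take, so the sketch below assumes placeholders of the form $1, $2 in string parameter values. The helper name fill_backreferences and the example route are hypothetical and are not taken from schema-cache's routes.php.

```php
<?php
// Hypothetical helper: replaces $1, $2, ... in string parameter values
// with the groups captured when the route regex matched the path.
function fill_backreferences(array $params, array $matches): array
{
    foreach ($params as $key => $value) {
        if (!is_string($value)) {
            continue;
        }
        $params[$key] = preg_replace_callback(
            '/\$(\d+)/',
            function (array $m) use ($matches) {
                // Leave the placeholder untouched if no such group exists.
                return $matches[(int) $m[1]] ?? $m[0];
            },
            $value
        );
    }
    return $params;
}

// Illustrative route only.
$regex  = '/Schemas/([a-z]+)';
$params = ['template' => 'collection', 'prefix' => '$1'];

preg_match('#^' . $regex . '#', '/Schemas/foaf', $matches);
$filled = fill_backreferences($params, $matches);
print_r($filled);
```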

The configuration for URLs of pages that display a record of a resource is like this:


    '/Res/*' => array(
        'class' => 'Item',
        'template' => 'item',
        // the uri query parameter, or the string 'false' when it is absent
        'uri' => isset($_GET['uri']) ? trim($_GET['uri']) : 'false',
        'store_uri' => "http://api.talis.com/stores/schema-cache",
    ),

So when http://schemacache.test.talis.com/Res?uri=http%3A%2F%2Fwww.gnowsis.org%2Font%2Fkissology%23kissed is requested with a GET, an object will be instantiated from the Item class found in /requests/item.php and these parameters:


    array(
        'template' => 'item',
        'uri' => 'http://www.gnowsis.org/ont/kissology#kissed',
        'store_uri' => "http://api.talis.com/stores/schema-cache",
    ),

The reason I do it like this, defining the URL spaces here and routing all the requests through dispatch.php, rather than have the URLs map directly to a PHP script to serve pages in that URL space (perhaps with some mod_rewriting for prettier URLs), is that it gives me a bit more flexibility in combining different request classes and templates, keeps my application’s URL spaces configuration in one file, and hopefully encourages modularity as the application expands. It also gives me a single central point where I can intercept the requests and responses, and process them in PHP.

So what does the Item class do?

Request Classes

I decided that there were at least two basic kinds of pages that I wanted to serve: pages that display a (description of a) thing, and pages that display a list of things.

So I wrote an Item class which has $uri, $template, and $store_uri properties. The constructor sets the properties from the parameters passed to it, and calls in the Bigfoot client library from /lib/bigfoot, instantiating a BigfootStore object at Item->$store using the $store_uri property. My Item class can now interact with the schema-cache store (at http://api.talis.com/stores/schema-cache) through the Item->$store object.

The Item::GET method (which is called by dispatch.php when the http request is a GET) retrieves an RDF description of the item the web page is about by doing a SPARQL DESCRIBE query on the uri passed to the Item object by the routes.php configuration (http://www.gnowsis.org/ont/kissology#kissed).

I use a DESCRIBE query because it gets me back everything known about that resource, and it gives it to me with the semantics intact. Unlike with a (SPARQL or SQL) SELECT query I don’t need to know what the query was to understand the semantics and structure of the results. One advantage of this is that my template that presents the data can be more loosely coupled with the class that retrieves the data.

So having retrieved the data, I need to get it out of the RDF/XML file, and into my template. To do this, I use one of the convertors from my library in /lib/convertors.php, calling RDFXML::to_resources($rdfxml_results) to get an associative array of RDF resources that looks like this:

     'http://www.w3.org/2002/07/owl#Ontology' => 
       array (
         'http://www.w3.org/2000/01/rdf-schema#label' => 
         array (
           0 => 
           array (
             'type' => 'literal',
             'value' => 'Ontology',
             'lang' => '',
           ),
         ),
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' => 
         array (
           0 => 
           array (
             'type' => 'uri',
             'value' => 'http://www.w3.org/2000/01/rdf-schema#Class',
           ),
         ),
       )    
 

This is the same basic structure as the proposed RDF/JSON format.

I find the following things useful about this structure:

  • Uniformity. However many property values there are, of whatever sort, I can access them in the same way – I don’t need to check if it’s an array or a single value.
  • Easy to find resources – because they are all in the same place, on the same level, identified by their URIs.
  • Easy to find properties of a resource, because they are all under the one key.
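A short sketch of what that uniformity buys in practice: because every property holds a list of value arrays, one loop covers single and multiple values alike. The helper name property_values is hypothetical; the data is a cut-down copy of the structure shown above.

```php
<?php
// A cut-down copy of the structure shown above.
$resources = [
    'http://www.w3.org/2002/07/owl#Ontology' => [
        'http://www.w3.org/2000/01/rdf-schema#label' => [
            ['type' => 'literal', 'value' => 'Ontology', 'lang' => ''],
        ],
        'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' => [
            ['type' => 'uri', 'value' => 'http://www.w3.org/2000/01/rdf-schema#Class'],
        ],
    ],
];

// Hypothetical helper: returns all values of one property of one
// resource, or an empty array when either is missing.
function property_values(array $resources, string $subject, string $property): array
{
    $values = [];
    foreach ($resources[$subject][$property] ?? [] as $object) {
        $values[] = $object['value'];
    }
    return $values;
}

print_r(property_values(
    $resources,
    'http://www.w3.org/2002/07/owl#Ontology',
    'http://www.w3.org/2000/01/rdf-schema#label'
));
```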

item.html – the template for /Res?uri=something

In the template, I want to display the properties of the resource as a definition list of key => values, so I get the properties of the current resource by looking up the resource’s uri in the resources array. I can then iterate over the properties, to get their qnames, and then over the array of values of each property.

Now, I want something a bit friendlier to display than just a qname or a uri – I want human-readable labels for the properties I am displaying, so I have a function that takes in the resources array, gathers up all the property qnames, queries for their rdfs:label and returns an associative array of qnames => human-readable labels. In theory anyway. In practice, unfortunately, most schema authors (including the authors of RDFS!) have simply used the fragment identifier of the term as the label, camel-casing, underscores and all, so the results are not as presentable as one might hope.

When I iterate over the property values, if the value is a uri, then I want to display it as a link. If schema-cache knows about that uri, I want that link to point to another page in the schema-cache application, showing a description of the resource with that uri. If schema-cache doesn’t have any more information about that URI, I want the link just to point at the URI itself.

This is why the query that fetches the data for the page isn’t just a simple DESCRIBE, but a query to retrieve something like a Labelled Concise Bounded Description [1] – which is to say, I query for a description of the resource, and for the labels (if they exist) of any resources it links to. So when I come across a URI, I can check if it is a key in my $resources array. If it is, I know that schema-cache has some information about it I can display, and I link to a page within schema-cache about that resource – and I can use the rdfs:label for the link text.
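The link decision described here might look like the sketch below. The helper name link_for is hypothetical, and it only looks at rdfs:label, whereas the real query also brings in other label-like properties (see note [1]).

```php
<?php
// Hypothetical helper: returns [href, link text] for a URI value.
// URIs that are keys in $resources link to the application's /Res
// page; unknown URIs link straight to the URI itself.
function link_for(string $uri, array $resources): array
{
    $labelProperty = 'http://www.w3.org/2000/01/rdf-schema#label';
    if (!isset($resources[$uri])) {
        return [$uri, $uri];
    }
    // Fall back to the URI as link text when no label came back.
    $text = $resources[$uri][$labelProperty][0]['value'] ?? $uri;
    return ['/Res?uri=' . urlencode($uri), $text];
}

$resources = [
    'http://www.gnowsis.org/ont/kissology#kissed' => [
        'http://www.w3.org/2000/01/rdf-schema#label' => [
            ['type' => 'literal', 'value' => 'kissed', 'lang' => ''],
        ],
    ],
];

print_r(link_for('http://www.gnowsis.org/ont/kissology#kissed', $resources));
print_r(link_for('http://example.org/elsewhere', $resources));
```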

Originally, I also wanted to display links to resources that linked in to the current resource – this worked in much the same way – my query was extended a bit, and I had a function to give me back an array of the resources that linked to the current page’s resource. However, I found it a bit confusing to look at because I couldn’t get reverse labels for the properties. For example, if the current page was describing a (foaf) Person, and there was a Schema resource linking to the Person resource as the creator of that resource, I wanted to link back to the Schema resource indicating that this Person had created it. Unfortunately I couldn’t come up with a clear and reliable solution to this.

The other problem with this is that, while most resources have a relatively modest number of other resources they link to, some (for example, classes and properties commonly sub-classed, such as rdfs:Class or wordnet:Person ) are linked to by too many other resources to show comfortably on one web page at a time.

In the end, I moved these links from other resources, to their own page, where they are clearly separate from the links to other resources, and provided a facility to page through the results.

Collection class – for requests of lists of things

I wanted a class that could display lists and search results of queries on certain types of resources, to drive a listing of schemas, of classes and properties. So the Collection class extends the Item class, but queries for any resources that match the given conditions. These query conditions can be specified as parameters to the class. So for instance, in routes.php, the entry for /Classes/ says to use the Collection class, and pass in an ‘rdftype’ parameter of


    array($namespaces['rdfs'].'Class', $namespaces['owl'].'Class'),

which modifies the SPARQL query so that the retrieved list items are either an rdfs:Class or an owl:Class.
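The actual query the Collection class builds is not shown here, so the following is only one plausible way to turn such an ‘rdftype’ array into a SPARQL constraint, using a UNION of alternatives. The helper name rdftype_pattern and the ?item variable are assumptions.

```php
<?php
// Hypothetical helper: turns a list of class URIs into a SPARQL group
// pattern that matches resources having any of the given rdf:types.
function rdftype_pattern(array $typeUris, string $var = '?item'): string
{
    $rdfType = '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>';
    $alternatives = [];
    foreach ($typeUris as $typeUri) {
        $alternatives[] = "{ $var $rdfType <$typeUri> }";
    }
    return implode(' UNION ', $alternatives);
}

// Stand-in for the relevant entries of /config/namespaces.php.
$namespaces = [
    'rdfs' => 'http://www.w3.org/2000/01/rdf-schema#',
    'owl'  => 'http://www.w3.org/2002/07/owl#',
];

$pattern = rdftype_pattern([$namespaces['rdfs'] . 'Class', $namespaces['owl'] . 'Class']);
echo 'SELECT ?item WHERE { ', $pattern, ' }', "\n";
```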

Results Paging

Other restrictions can be defined by query string parameters. If there is a ?q= in the url query string, a FILTER is added to the query, which does a regex match restriction based on the value of q. The OFFSET and LIMIT have a default of 0 and 20 respectively, but can be over-ridden by url parameters offset=10&limit=30; so I can have a simple pager snippet I include at the bottom of the collection.html template which displays a Next link if the number of results is the same as the offset + the limit, and a Previous link if the offset is greater than zero.
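A simple pager of this kind can be sketched as below. It takes one reading of the rule above: a page that comes back full (as many rows as the limit) probably has a successor, so a Next link is offered. The function name pager_links and the returned array shape are assumptions, not the actual snippet from collection.html.

```php
<?php
// Hypothetical helper: returns the offsets for the Previous and Next
// links, or null where the link should not be shown.
function pager_links(int $offset, int $limit, int $rowsOnPage): array
{
    return [
        // Previous only makes sense once we have moved past the start.
        'previous' => $offset > 0 ? max(0, $offset - $limit) : null,
        // Assumption: a full page means there may be more results.
        'next'     => $rowsOnPage === $limit ? $offset + $limit : null,
    ];
}

print_r(pager_links(0, 20, 20));
print_r(pager_links(10, 30, 7));
```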

Unfortunately, with the current SPARQL interface, a more sophisticated pager (in the style of, eg. Google), indicating the total number of results, and linking to all the available pages of the results, is not practical. This is because there is no easy way to get the actual total number of results for a query, without returning all the results, which would in many cases be extremely slow.

With an /items query, you can get the total number of results as one of the properties of the RSS 1.0 feed you get back. The /items query is also often quite a bit quicker, and the results are ranked more intelligently. However, you (currently) need to define in your store’s field/predicate mapping what properties you want to index and search by before you load in the data. The /items query is also less powerful than the SPARQL interface. In this application, I used /items for a ‘site-wide’ search, but SPARQL for driving the pages listing the various sorts of things the application is about (Schemas, Classes and Properties).

/Search

Because I wanted the /Search page to use the /items query interface to the store, I created a new request class called Contentbox. (As well as data, Bigfoot stores can also store binary content, in the content box; /items is the query interface to the indices of the content box, as well as the indexed resources of the MetaBox.)

This class simply passes the user’s query to the /items API, and receives the results as an RSS 1.0 Sequence of Items. I then convert the RSS (which is RDF/XML) into the associative array structure described above, and extract the result items (by filtering on those with an rdf:type of rss:item ), and pass that to the template.
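Extracting the result items from the converted array could look like this sketch. The helper name resources_of_type and the sample data are made up for illustration; the RSS 1.0 namespace is http://purl.org/rss/1.0/.

```php
<?php
// Hypothetical helper: keeps only the resources that have the given
// rdf:type among their type values.
function resources_of_type(array $resources, string $typeUri): array
{
    $rdfType = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type';
    $matching = [];
    foreach ($resources as $uri => $properties) {
        foreach ($properties[$rdfType] ?? [] as $object) {
            if ($object['value'] === $typeUri) {
                $matching[$uri] = $properties;
                break;
            }
        }
    }
    return $matching;
}

// Made-up sample in the resources-array shape: one channel, one item.
$rdfType = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type';
$feed = [
    'http://example.org/feed' => [
        $rdfType => [['type' => 'uri', 'value' => 'http://purl.org/rss/1.0/channel']],
    ],
    'http://example.org/result/1' => [
        $rdfType => [['type' => 'uri', 'value' => 'http://purl.org/rss/1.0/item']],
    ],
];

$items = resources_of_type($feed, 'http://purl.org/rss/1.0/item');
print_r(array_keys($items));
```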

The nice thing here is that the uniformity and self-descriptive nature of RDF lets me de-couple the template from the code that handles user data and retrieves the data. So I can reuse the collection.html template from the /Classes /Schemas and /Properties pages for the /Search page because all that the collection.html template does is display a list of links to resource descriptions (which are displayed at the /Res page), a simple pager, and a search box. So for situations where a page can use the simpler /items query, it is easy to configure it to use the Contentbox class instead of the Collection class – and the template can remain the same.

Alternate formats.

I want to expose the data that drives the pages of the application, in JSON as well as RDF. This is quite easy because the data structure is a generic one.

In dispatch.php (which handles the incoming requests to the application) I check the ‘request uri’ for anything that looks like a filename extension (eg: .html, .json, .rdf), remove it, and pass it as a parameter to the request class. When the request class goes to render the page, it looks for a file in /templates with the template name (also passed as a parameter in $urls) + the file extension. If it doesn’t find one, it will then look for a file with the given extension, and the name ‘default’. Thus I can have various specific ways of rendering resources as HTML (or XML, or anything), but for the formats I can render generically, such as RDF/XML and RDF/JSON, I just need one template for each format.
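The fallback from a specific template to a ‘default’ one can be sketched like this. The function name resolve_template is hypothetical, and the demonstration creates a throwaway directory rather than using the real /templates folder.

```php
<?php
// Hypothetical helper: looks for "<name>.<extension>" first, then for
// "default.<extension>", and returns null if neither exists.
function resolve_template(string $dir, string $name, string $extension): ?string
{
    foreach ([$name, 'default'] as $candidate) {
        $path = "$dir/$candidate.$extension";
        if (is_file($path)) {
            return $path;
        }
    }
    return null;
}

// Throwaway directory with one specific and one generic template.
$dir = sys_get_temp_dir() . '/templates_' . uniqid();
mkdir($dir);
touch("$dir/item.html");
touch("$dir/default.json");

$html = resolve_template($dir, 'item', 'html');  // specific template
$json = resolve_template($dir, 'item', 'json');  // falls back to default
$xml  = resolve_template($dir, 'item', 'xml');   // nothing to render with

echo basename($html), ' ', basename($json), "\n";

// Clean up the throwaway files.
unlink("$dir/item.html");
unlink("$dir/default.json");
rmdir($dir);
```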

The mime-type sent will be the value from the $mimetypes array in /config/mimetypes.php where the key corresponds to the file extension.

For (a lazy man’s) RSS feeds of various types of data in my store, I can simply put links in the <head> of the document pointing to an /items query on my store, constrained with a :type restriction corresponding to the rdf:type of the type of data I want to provide a feed of. Again, you need to ensure that the field/predicate map in your store’s configuration indexes on rdf:type to be able to do this.

What’s Different About Web Development with Bigfoot and Semantic Web Technologies

Security

With SQL based web-development, a cagey mind-set is the healthy norm. Every user input might be a SQL injection attack. Some developers go so far as to choose deliberately obscure table and column names, in an attempt to limit the damage if some malicious user data should make it through.

When developing on Bigfoot, you can afford to open up a little bit. For example, the Schema Cache site contains its own SPARQL endpoint, which simply passes queries to the ‘real’ SPARQL endpoint (which is also public), but will optionally render the results as HTML (where the results of a DESCRIBE or CONSTRUCT hyperlink into the application), RDF/XML, JSON, or JSONP. Also, adding a ‘sparql’ parameter (of any value) to the query string of any SPARQL driven page will redirect to the Schema-Cache SPARQL interface, with the query displayed in the input textarea. I found this quite useful for debugging and optimising queries in development, and don’t see any reason to disable it in the live version.

Access to sensitive data, however, should be constrained in your store’s configuration – for instance, usernames and passwords should be stored in a private “named graph” that only the store admin can view (querying on these named graphs must be done through the authenticated /multisparql API, and you will probably still want to exercise caution when your application is accessing this part of the store).

Aside from that, concerns about SQL injection and highly expensive queries are Bigfoot’s problem, rather than the developer’s – if you want to provide read access to the public part of your data, you can simply point to your store’s SPARQL endpoint (/yourstore/services/sparql).

Perhaps even write access (for certain kinds of application at least) need not be so jealously guarded by your application – if versioned write access to your store (through /meta/changesets) is open, and abused by a vandal or spammer, you can in theory always use the changesets to roll back their changes. The caveats here, though, are that you have to construct the changesets to do the rollbacks yourself (Bigfoot does not provide an API for this, at least not yet); that if the malicious user changes a changeset, the process of rolling back will be greatly complicated; and that if write access is not filtered by your application, you then have no way to block access to those who have betrayed your trust!

And of course, you must still be careful to filter user data that you will render again in the page you send out (to guard against cross-site scripting attacks).
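In PHP this usually comes down to escaping with htmlspecialchars() at the point of output. A minimal sketch follows; the wrapper name h and the search-box markup are made up for illustration.

```php
<?php
// Hypothetical wrapper: escapes text for safe inclusion in HTML,
// including inside quoted attribute values.
function h(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// A value that would inject script if echoed back unescaped.
$q = '<script>alert(1)</script>';
echo '<input name="q" value="' . h($q) . '">', "\n";
```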

Less Code, Greater Modularity

The uniform structure and self-describing nature of RDF make it easier to write more generic code (and consequently, less code), and by holding your data all in the same uniform structure, it can more easily be passed between the different components of your application. Those components can then be plugged together in various ways, with less customisation or configuration required – as is the case, for example, in the /Search, /Classes, /Properties, and /Schemas pages which can all share the same template.

Openness and Open World Assumptions

Another benefit of using RDF to drive your application is that it is easy to bring in more RDF from other sources and mix it with your own, either adding it to your store, or mixing it in at the application level. You have two ways of dealing with this RDF, which may have an unpredictable structure:

  • only use the bits you recognise – the results listings pages do this – they only take the URI and a known label property of each resource.
  • use everything. The /Res page does this. It will display all the properties it can find in the store of any resource URI you give it. This is important for this page because schema authors might annotate their schemas and terms with all kinds of different properties – perhaps from auxiliary vocabularies like VANN. So the page has to be able to show any property without knowing about all the properties in advance (which would be impossible). As described above, I also query the store again for the labels of the properties I need to show, in the (often vain!) hope of getting back human readable labels to display next to the property values.

With semantic web development, you may also have to change your assumptions about the logic you apply to the data. For example, in another Talis Platform application, the Silkworm Directory, resources often have a foaf:homepage property – and sometimes different resources will have the same foaf:homepage. My initial response to that was “oh that’s wrong – the form shouldn’t validate if the foaf:homepage isn’t unique”, but Ian Davis pointed out that, especially if some of the data was to come from the web, rather than direct user input into the system, I might be mistaken in making the Closed World Assumption that the silkworm data was invalid – it might simply be lacking data. For instance:


    <#a> foaf:homepage <http://google.com> .
    <#b> foaf:homepage <http://google.com> .

might be true, but missing the data: <#a> owl:sameAs <#b> .

Not that the Open World Assumption will necessarily be the right one to make for all applications, but it’s certainly worth bearing in mind.


There’s nothing particularly special about rdf:type

This is another part of the nebulous nature of RDF.

In relational databases, and Object-Oriented code, it’s really important what type of thing something is – something only has one type, and that determines what properties it has. With RDF, a resource might be of one, many, or no types at all, and you cannot generally assume (unless your application strictly validates all input to the store) that because a resource is of a particular type, that it will have any other particular properties.

Sometimes what sort of resource you are looking for is defined, not by a value of its rdf:type property, but the existence of another property belonging to it, or pointing to it. For instance, what really defines if something is an RDF ontology/schema/vocabulary is not whether it has an owl:Ontology value for its rdf:type property, but whether it is pointed to by the rdfs:isDefinedBy property of some other resource.

[1] My query is a little different because I know that some resources do not have rdfs:labels: schemas tend to have dc:titles instead, and people have foaf:name. So I also bring in those properties if they exist.
