Steve's Software Salad: July 2012

Recently I've been exploring RDF and SPARQL in some depth. Today I spent several hours converting my personal digital library from a shell/XML-based system to an RDF/SPARQL system.

First a little background.

Several years ago I started recording and storing metadata about papers and documents I downloaded from the Web to make it much easier to find something I had read previously. Search engines are good, but sometimes they return too much, and I don't want to waste time wading through long lists of results.

The system I originally worked out used Dublin Core and XML to store information about documents. Each document had one XML file describing it. To find something I would just cd to the directory and use grep to find the documents.

This system worked reasonably well. It usually took only a few seconds to copy a template and edit the data for a document. An example file:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF SYSTEM "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd">
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <rdf:Description rdf:about="content/000091">
    <dc:contributor></dc:contributor>
    <dc:coverage></dc:coverage>
    <dc:creator>Ken Thompson</dc:creator>
    <dc:date>1984-08</dc:date>
    <dc:description>
 Ken Thompson describes self-reproducing programs and the problem of trusting software.
    </dc:description>
    <dc:format>text/html</dc:format>
    <dc:identifier>000091</dc:identifier>
    <dc:identifier></dc:identifier>
    <dc:language>English</dc:language>
    <dc:publisher>ACM</dc:publisher>
    <dc:relation></dc:relation>
    <dc:rights></dc:rights>
    <dc:source>http://cm.bell-labs.com/who/ken/trust.html</dc:source>
    <dc:subject>
 UNIX; self-reproducing code; compilers; 
    </dc:subject>
    <dc:title>Reflections on Trusting Trust</dc:title>
    <dc:type>Text</dc:type>
  </rdf:Description>
</rdf:RDF>

The problem with this approach is that grep only matches text line by line. It cannot restrict a search to a particular field, such as creator or subject, or combine conditions across fields, so anything beyond a simple keyword lookup is out of reach.
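
The grep-based lookup can be sketched roughly as follows. The function name libgrep is an assumption made for illustration, and the default directory is simply the metadata path used later in this post; the exact commands I typed at the time are not recorded here.

```shell
# Directory holding one XML metadata file per document (override via META_DIR).
META_DIR="${META_DIR:-/media/SKR-LIBRARY/Library/meta}"

# libgrep is a hypothetical helper: list the metadata files that mention a
# pattern. -l prints only file names, -i ignores case, -E enables extended
# regular expressions.
libgrep() {
    grep -liE -- "$1" "$META_DIR"/*.xml
}
```

For example, `libgrep acm` lists every record published by ACM along with any record that merely mentions ACM in its description, because grep sees lines of text rather than fields.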

I had a little time on my hands, so I thought I would use RDF and SPARQL to make a better system. A few hours of work and I have a solution. It's not something that will scale to huge numbers of documents without additional work, but it works for me at this time.

Documents in my personal digital library are now represented in a single RDF file written in Turtle syntax. The record above becomes:

</000091>
    dc:creator "Ken Thompson" ;
    dc:date "1984-08" ;
    dc:description """
 Ken Thompson describes self-reproducing programs and the problem of trusting software.
    """ ;
    dc:format "text/html" ;
    dc:identifier "000091" ;
    dc:language "English" ;
    dc:publisher "ACM" ;
    dc:source "http://cm.bell-labs.com/who/ken/trust.html" ;
    dc:subject """
 UNIX; self-reproducing code; compilers; 
    """ ;
    dc:title "Reflections on Trusting Trust" ;
    dc:type "Text" .

Dave Beckett's Redland RDF software and SPARQL queries provide the means to transform all of the records from the original XML into Turtle syntax. The software is available in the Ubuntu packages redland-utils, rasqal-utils, and raptor2-utils.

I used the following command to convert all of the existing records into Turtle and concatenate them together into one file:

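# Convert each XML record to Turtle, one output file per record
for f in /media/SKR-LIBRARY/Library/meta/*.xml
do
    roqet -D "$f" -r turtle ../libquery.rq > "$(basename "$f" .xml).ttl"
done
# Concatenate the per-record files into a single library file
cat *.ttl > library.ttl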
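
The query file libquery.rq is not reproduced in this post. A minimal sketch of what such a file might contain, assuming it only needs to copy each statement and skip the empty placeholder elements left over from the template, is shown below. It is a guess for illustration, not the query actually used, and it does not rewrite the subject URIs.

```shell
# Write a hypothetical libquery.rq into the current directory.
# The real query used for the conversion is not shown in this post.
cat > libquery.rq <<'EOF'
# Copy every statement whose object is not an empty string.
CONSTRUCT { ?doc ?p ?o }
WHERE {
  ?doc ?p ?o .
  FILTER (str(?o) != "")
}
EOF
```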

After a quick emacs session to remove the repeated @prefix and @base lines, I now have all of my records in a single file and can query it using SPARQL.
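
The manual cleanup could probably be scripted as well. The sketch below assumes that every per-record file declares the same directives, each on its own line starting with "@prefix " or "@base ", so keeping the first copy of each is enough. The function name dedupe_directives is made up for this example.

```shell
# dedupe_directives is a hypothetical filter: keep the first copy of each
# @prefix/@base line and drop exact repeats. All other lines pass through
# unchanged, even if they are duplicated.
dedupe_directives() {
    awk '!/^@(prefix|base) / || !seen[$0]++'
}
```

Piping the concatenated records through `dedupe_directives` before writing library.ttl would leave one set of directives at the top of the file.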

The last piece I need is a convenient query mechanism, so I wrote a shell script that builds and runs the SPARQL query:

#!/bin/bash
#
# libsearch - search the library for a regexp

# exactly one argument (the regexp) is required
if [ $# -ne 1 ]; then
   echo "usage:  libsearch regexp" >&2
   exit 1
fi

# data source
LIBRARYDATA=library.ttl

QRY=\
"PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> \
PREFIX dc:   <http://purl.org/dc/elements/1.1/> \
SELECT DISTINCT ?doc ?title \
WHERE {{ ?doc dc:creator ?a;      dc:title ?title.\
           FILTER (regex( ?a, \"$1\", \"sim\" ))} \
  UNION { ?doc dc:description ?d; dc:title ?title.\
           FILTER (regex( ?d, \"$1\", \"sim\" ))} \
  UNION { ?doc dc:subject ?s;     dc:title ?title.\
           FILTER (regex( ?s, \"$1\", \"sim\" ))} \
  UNION { ?doc dc:title ?title.                   \
           FILTER (regex( ?title, \"$1\", \"sim\" ))} \
}"

roqet -qE -D "$LIBRARYDATA" -r table -e "$QRY" 2>/dev/null

Each UNION branch must bind both ?doc and ?title so that both variables appear in the results. The UNION clauses combine matches from four Dublin Core elements: creator, description, subject and title. The DISTINCT keyword removes the duplicates that arise when a document matches in more than one element.
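
Because each branch names its own Dublin Core element, the same pattern can be narrowed to a single field, which grep could not do. The following is a sketch only: build_field_query is a hypothetical helper that is not part of libsearch, and it does no escaping of quotes or backslashes in the regexp.

```shell
# build_field_query is a hypothetical helper: print a SPARQL query that
# searches one Dublin Core element.
#   $1 = element name (creator, subject, description, ...)
#   $2 = regexp, matched case-insensitively
build_field_query() {
    printf '%s\n' \
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>" \
        "SELECT DISTINCT ?doc ?title" \
        "WHERE { ?doc dc:$1 ?v; dc:title ?title." \
        "        FILTER (regex(?v, \"$2\", \"i\")) }"
}

# Where roqet is installed, the query can be run the same way libsearch does:
#   roqet -qE -D library.ttl -r table -e "$(build_field_query creator thompson)"
```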

We can now search for documents:

Library$ libsearch nasa
-----------------------------------------------------------------------------
| doc                         | title                                       |
=============================================================================
| uri<http://kb.local/000075> | string("NASA Systems Engineering Handbook") |
-----------------------------------------------------------------------------
Library$

It's not the prettiest display, but it gets the job done.
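
If the table decoration becomes annoying, a small filter could strip it. This sketch assumes the two-column table layout shown above, with a URI in the first column and a plain string in the second; pretty_results is a name invented for the example.

```shell
# pretty_results is a hypothetical filter: reduce each data row of the
# two-column result table to "URI  title". Header and rule lines do not
# match the pattern and are dropped.
pretty_results() {
    sed -nE 's/^\| *uri<([^>]*)> *\| *string\("(.*)"\) *\|$/\1  \2/p'
}
```

It would be used as `libsearch nasa | pretty_results`.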

Wednesday, July 25, 2012

Personal digital library using RDF and SPARQL