Steve's Software Salad

Wednesday, July 25, 2012

Personal digital library using RDF and SPARQL

Recently I've been exploring RDF and SPARQL in some depth. Today I spent several hours converting my personal digital library from a shell/XML-based system to an RDF/SPARQL system.

First a little background.

Several years ago I started documenting and storing metadata about papers and documents I downloaded from the Web, to make it much easier to find something I had read previously. Search engines are good, but sometimes they find too much stuff and I don't want to waste time wading through long lists of results.

The system I originally worked out used Dublin Core and XML to store information about documents. Each document had one XML file describing it. To find something I would just cd to the directory and use grep to find the documents.
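The lookup described above amounts to a case-insensitive grep over the metadata directory. A minimal sketch, using a scratch directory and two cut-down records (META_DIR and the trimmed sample files are stand-ins for illustration, not the real library):

```shell
#!/bin/bash
# Sketch of the grep-based lookup. META_DIR is a scratch directory that
# stands in for the real metadata directory; the records are cut down.
META_DIR=$(mktemp -d)

cat >"$META_DIR/000091.xml" <<'EOF'
<dc:creator>Ken Thompson</dc:creator>
<dc:title>Reflections on Trusting Trust</dc:title>
EOF

cat >"$META_DIR/000075.xml" <<'EOF'
<dc:title>NASA Systems Engineering Handbook</dc:title>
EOF

# -i ignores case, -l prints only the names of the matching files
matches=$(cd "$META_DIR" && grep -il 'thompson' *.xml)
echo "$matches"
```

Because grep works line by line, a match says nothing about which element it came from unless the pattern spells out the tag as well.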

This system worked reasonably well. It usually took only a few seconds to copy a template and edit the data for a document. An example file:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF SYSTEM "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd">
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <rdf:Description rdf:about="content/000091">
    <dc:contributor></dc:contributor>
    <dc:coverage></dc:coverage>
    <dc:creator>Ken Thompson</dc:creator>
    <dc:date>1984-08</dc:date>
    <dc:description>
 Ken Thompson describes self-reproducing programs and the problem of trusting software.
    </dc:description>
    <dc:format>text/html</dc:format>
    <dc:identifier>000091</dc:identifier>
    <dc:identifier></dc:identifier>
    <dc:language>English</dc:language>
    <dc:publisher>ACM</dc:publisher>
    <dc:relation></dc:relation>
    <dc:rights></dc:rights>
    <dc:source>http://cm.bell-labs.com/who/ken/trust.html</dc:source>
    <dc:subject>
 UNIX; self-reproducing code; compilers; 
    </dc:subject>
    <dc:title>Reflections on Trusting Trust</dc:title>
    <dc:type>Text</dc:type>
  </rdf:Description>
</rdf:RDF>

The problem with this approach is that searching relies on a plain grep, which makes anything beyond simple pattern matching impractical.

I had a little time on my hands, so I thought I would use RDF and SPARQL to make a better system. A few hours of work and I have a solution. It's not something that will scale to huge numbers of documents without additional work, but it works for me at this time.

Documents in my personal digital library are now represented in a single RDF file encoded in Turtle syntax. The record above becomes:

</000091>
    dc:creator "Ken Thompson" ;
    dc:date "1984-08" ;
    dc:description """
 Ken Thompson describes self-reproducing programs and the problem of trusting software.
    """ ;
    dc:format "text/html" ;
    dc:identifier "000091" ;
    dc:language "English" ;
    dc:publisher "ACM" ;
    dc:source "http://cm.bell-labs.com/who/ken/trust.html" ;
    dc:subject """
 UNIX; self-reproducing code; compilers; 
    """ ;
    dc:title "Reflections on Trusting Trust" ;
    dc:type "Text" .

Dave Beckett's Redland RDF software, driven by a SPARQL query, provides the means to transform all of the records from the original XML to the Turtle syntax. This software is provided in the Ubuntu packages redland-utils, rasqal-utils, and raptor2-utils.

I used the following command to convert all of the existing records into Turtle and concatenate them together into one file:

for f in /media/SKR-LIBRARY/Library/meta/*.xml;
do roqet -D "$f" -r turtle ../libquery.rq >"$(basename "$f" .xml).ttl";
done
cat *.ttl >library.ttl

After a quick Emacs session to remove the repeated @prefix and @base lines, I now have all of my records in a single file and can query it using SPARQL.
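The cleanup of repeated directives could also be scripted. A minimal sketch, assuming each @prefix or @base directive sits on a line of its own; dedupe_directives is a hypothetical helper name, and the sample data (including the http://kb.local/ base) is made up to resemble the concatenated file:

```shell
#!/bin/bash
# Keep the first copy of each @prefix/@base line and drop later repeats.
# Every other line passes through untouched.
dedupe_directives() {
    awk '/^@(prefix|base)/ { if (seen[$0]++) next } { print }' "$@"
}

# Made-up sample standing in for the concatenated library.ttl
sample=$(mktemp)
cat >"$sample" <<'EOF'
@base <http://kb.local/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
</000091> dc:title "Reflections on Trusting Trust" .
@base <http://kb.local/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
</000075> dc:title "NASA Systems Engineering Handbook" .
EOF

cleaned=$(dedupe_directives "$sample")
echo "$cleaned"
```

Only exact duplicates are removed, so a later directive that differs from the earlier ones is kept and would need a look by hand.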

The last element I need for this is a convenient query mechanism, so I created a shell script to create and execute the SPARQL query:

#!/bin/bash
#
# libsearch - search the library for a regexp

if [ $# -ne 1 ]; then
   echo "usage:  libsearch regexp"
   exit 1
fi

# data source
LIBRARYDATA=library.ttl

QRY=\
"PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> \
PREFIX dc:   <http://purl.org/dc/elements/1.1/> \
SELECT DISTINCT ?doc ?title \
WHERE {{ ?doc dc:creator ?a;      dc:title ?title.\
           FILTER (regex( ?a, \"$1\", \"sim\" ))} \
  UNION { ?doc dc:description ?d; dc:title ?title.\
           FILTER (regex( ?d, \"$1\", \"sim\" ))} \
  UNION { ?doc dc:subject ?s;     dc:title ?title.\
           FILTER (regex( ?s, \"$1\", \"sim\" ))} \
  UNION { ?doc dc:title ?title.                   \
           FILTER (regex( ?title, \"$1\", \"sim\" ))} \
}"

roqet -qE -D $LIBRARYDATA -r table -e "$QRY" 2>/dev/null

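Each UNION branch has to bind both ?doc and ?title so that both variables appear in the results. The UNION clauses combine matches from four Dublin Core elements: creator, description, subject and title. The "sim" flags make the regex treat the value as a single string that may span several lines and match without regard to case. The DISTINCT keyword removes the duplicates that arise when a document matches in more than one element.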
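The script drops $1 straight into the query text, so a pattern containing a double quote or a backslash would produce a malformed SPARQL string literal. A minimal sketch of escaping the pattern first; sparql_escape is a hypothetical helper, not part of the script above:

```shell
#!/bin/bash
# Escape a pattern for use inside a double-quoted SPARQL string literal.
# Backslashes are doubled first, then double quotes get a backslash.
sparql_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build one FILTER clause from an awkward pattern
pattern=$(sparql_escape 'trusting "trust"')
printf '%s\n' "FILTER (regex( ?title, \"$pattern\", \"sim\" ))"
```

In libsearch this would mean setting a variable such as PATTERN=$(sparql_escape "$1") before building QRY and using it in place of $1.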

We can now search for documents:

Library$ libsearch nasa
-----------------------------------------------------------------------------
| doc                         | title                                       |
=============================================================================
| uri<http://kb.local/000075> | string("NASA Systems Engineering Handbook") |
-----------------------------------------------------------------------------
Library$

It's not the prettiest display, but it gets the job done.
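The table could be tidied with a small filter. A rough sketch, assuming the output keeps the uri<...> and string("...") layout shown above and that titles contain no | character; tidy_table is a hypothetical helper name:

```shell
#!/bin/bash
# Reduce roqet's table output to "uri: title" lines.
# Rule lines and the header row are skipped; only rows whose first
# column starts with uri< are printed.
tidy_table() {
    awk -F'|' '
        /^\| uri</ {
            doc = $2; title = $3
            gsub(/^ *uri<|> *$/, "", doc)             # strip uri< ... >
            gsub(/^ *string\("|"\) *$/, "", title)    # strip string(" ... ")
            print doc ": " title
        }'
}

# Sample copied from the search output above
sample=$(mktemp)
cat >"$sample" <<'EOF'
-----------------------------------------------------------------------------
| doc                         | title                                       |
=============================================================================
| uri<http://kb.local/000075> | string("NASA Systems Engineering Handbook") |
-----------------------------------------------------------------------------
EOF

result=$(tidy_table <"$sample")
echo "$result"
```

With something like this in place, the last line of libsearch could pipe roqet's output through tidy_table.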

Thursday, January 19, 2012

Scala and Lift

I'm working on a small discrete event simulation program in Scala. I originally wrote it as a command line program, but I would like to turn it into a web-based project. I've used the Lift web framework in the past for this type of project, so I decided to use it again. Since I last used it, Lift has moved from Maven to sbt as its build tool.

The four examples provided with the latest version include a bundled copy of sbt, but it is a relatively old version, as is the version of Jetty they use. I'd like to use an updated version of both.

I immediately ran into problems, but I was able to solve them after a few hours of research and experimentation.

First, I had to add the following lines to the build.sbt file located in the main project directory:

seq(webSettings :_*)

libraryDependencies ++= Seq (
"net.liftweb" %% "lift-webkit" % "2.4-M4" % "compile",
"org.eclipse.jetty" % "jetty-webapp" % "8.1.0.RC4" % "container"
)

This pulls in recent versions of Lift and Jetty. I tried moving build.sbt into the project/ directory, but got the following error:

skr@nb00:Farm$ sbt update ~container:start
/home/skr/Software/Farm/project/build.sbt:7: error: not found: value webSettings
seq(webSettings :_*)
^
[error] Type error in expression
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q

So I moved build.sbt back up to the main directory.

I also had to add a plugins.sbt file to the project/ directory, containing:

// plugins

libraryDependencies <+=
sbtVersion(v => "com.github.siasia" %% "xsbt-web-plugin" % (v+"-0.2.10"))

Once I made these changes, everything worked as advertised.
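The layout that ended up working can be summarized in a few lines of shell. This sketch only recreates the two files from this post in a scratch directory (PROJECT_DIR is a stand-in for the real project root) to show where each one goes:

```shell
#!/bin/bash
# Recreate the working layout: build.sbt in the project root,
# plugins.sbt under project/. PROJECT_DIR is a scratch directory.
PROJECT_DIR=$(mktemp -d)
mkdir -p "$PROJECT_DIR/project"

# build.sbt belongs in the project root
cat >"$PROJECT_DIR/build.sbt" <<'EOF'
seq(webSettings :_*)

libraryDependencies ++= Seq (
"net.liftweb" %% "lift-webkit" % "2.4-M4" % "compile",
"org.eclipse.jetty" % "jetty-webapp" % "8.1.0.RC4" % "container"
)
EOF

# plugins.sbt belongs under project/
cat >"$PROJECT_DIR/project/plugins.sbt" <<'EOF'
libraryDependencies <+=
sbtVersion(v => "com.github.siasia" %% "xsbt-web-plugin" % (v+"-0.2.10"))
EOF

# List the resulting files
(cd "$PROJECT_DIR" && find . -type f | sort)
```

The plugin has to be declared under project/ so that webSettings is already defined when sbt reads build.sbt in the root, which is consistent with the "not found: value webSettings" error above.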

Software Salad - What's in a name?

Why call a blog about software, 'Software Salad'?

I often get comments in the cafeteria about the salads I create for lunch. The cafeteria charges by the container used to construct the salad, not by weight or number of ingredients. So it pays to 'pile it on', especially if I'm hungry.

I like a wide variety of ingredients in my salad; generally, the less processed, the better. I use ingredients from A to Z: artichoke hearts, broccoli, cabbage, carrots, eggs, peppers, spinach, sunflower seeds, tomatoes, zucchini and more. The light, fluffy stuff (lettuce and other greens) goes on the plate first and the heavier items last, since they will compress the greens. When I'm done, I've created a work of art, as some coworkers have commented.

So what has this to do with software?

Modern software development has become much like building a salad. We have hundreds of ingredients (languages, libraries, and tool sets) to choose from. You need a plan in order to build a masterpiece.

I plan to use this blog to document my building of 'Software Salads'.