The Seahawk MOBY Data Rules File

What is the Seahawk MOBY Data Rules File?

Seahawk discovers services for selected data primarily through rules (or templates, if you prefer) that map text strings and XML data to MOBY data ontology objects. These rules are defined in a file, so that Seahawk's data mapping capabilities can be extended simply by editing a text file.

Sections

What is the layout of the rules file?
How do I add a regular expression rule?
How do I add an XPath rule?
How do I build complex objects?
How do I override the default rules file?
How do I add/remove rules programmatically (in a running application)?

What is the layout of the rules file?

The default rules file can be seen here.

The rules file is an XML file. It has a fairly simple primary structure:

The root element is mappings
mappings can have 0 or more prefix children, followed by 0 or more object children

The prefix elements globally define namespace mappings to be used in the file's XPath rules.

<prefix value="tigr">http://www.bioxml.info/dtd/tigrxml.dtd</prefix>

This allows one to write XPath statements later on such as ./tigr:TU, instead of using the rather unwieldy ./*[namespace-uri()="http://www.bioxml.info/dtd/tigrxml.dtd" and local-name()="TU"] every time.
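The equivalence of the two forms can be checked with the standard javax.xml.xpath API. The following is a minimal sketch, not Seahawk code: the class name PrefixDemo, its helper methods, and the tiny sample document are made up for illustration, and the NamespaceContext stands in for whatever Seahawk does internally with prefix elements.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PrefixDemo {
    static final String TIGR_NS = "http://www.bioxml.info/dtd/tigrxml.dtd";

    /** Parses a small, made-up document whose elements are in the TIGR namespace. */
    static Document sampleDoc() throws Exception {
        String xml = "<ASSEMBLY xmlns=\"" + TIGR_NS + "\"><TU/><TU/></ASSEMBLY>";
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true); // required for namespace-uri() and prefixes to work
        return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Counts the nodes matched by expr, evaluated relative to the root element. */
    static int count(String expr, boolean bindPrefix) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        if (bindPrefix) {
            // Plays the role of <prefix value="tigr">...</prefix> in the rules file
            xp.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "tigr".equals(prefix) ? TIGR_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return TIGR_NS.equals(uri) ? "tigr" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return TIGR_NS.equals(uri)
                        ? Collections.singletonList("tigr").iterator()
                        : Collections.<String>emptyIterator();
                }
            });
        }
        NodeList hits = (NodeList) xp.evaluate(
            expr, sampleDoc().getDocumentElement(), XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        String longForm = "./*[namespace-uri()='" + TIGR_NS + "' and local-name()='TU']";
        System.out.println(count("./tigr:TU", true)); // short form, prefix bound
        System.out.println(count(longForm, false));   // long form, no prefix needed
    }
}
```

Both expressions select the same two TU elements; the prefixed form simply moves the namespace URI out of every expression and into one declaration.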

The object elements represent templates for MOBY objects to construct. Tags nested inside the object tag fill in the various MOBY object fields. The simplest MOBY objects have just a (namespace, identifier) attribute pair. For example, the following file defines one rule, to build a MOBY object in the NCBI global identifier namespace ("NCBI_gi" in MOBY's namespace ontology):

<?xml version="1.0"?>
<mappings>

  <object>
    <regex>(?:GI|gi)[:|](\d+)</regex>
    <namespace>
      <ns value="NCBI_gi">$1</ns>
    </namespace>
  </object>

</mappings>

More details on the regular expression syntax are presented below.

How do I add a regular expression rule?

In the previous example, the object element had a regex child. Whenever Seahawk is asked to generate MOBY objects from a java.lang.String, this regular expression will be evaluated. Suppose the string in question is:

This gene is cross-referenced to GI:78045557

Since the regex matches, a new MOBY object will be created. The namespace of the object is filled in using the namespace element. Each ns child must specify a MOBY object namespace in the value attribute, and the text contents of the element will be used as the MOBY object ID. Any $# in the text (where # is a group number) will be replaced with the corresponding parenthesized capture group from the regex match. For example, the namespace rule in the example is

    <namespace>
      <ns value="NCBI_gi">$1</ns>
    </namespace>

in which $1 will be replaced with the digit group "78045557" captured by (\d+) in the match of (?:GI|gi)[:|](\d+). This yields a MOBY object of the form

<MOBY:Object namespace="NCBI_gi" id="78045557"/>
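The substitution can be reproduced with plain java.util.regex. This is a minimal sketch rather than Seahawk's own code: the class RegexRuleDemo and its methods are hypothetical, and it assumes the rule is applied with "find" semantics (a match anywhere in the string), as the example sentence suggests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexRuleDemo {
    // The regex from the rule; backslashes are doubled inside a Java string literal
    static final Pattern RULE = Pattern.compile("(?:GI|gi)[:|](\\d+)");

    /** Replaces $1, $2, ... in the template with the matcher's capture groups. */
    static String expand(String template, Matcher m) {
        String out = template;
        // Go from the highest group down so "$1" never clobbers part of "$10"
        for (int i = m.groupCount(); i >= 1; i--) {
            String g = m.group(i);
            out = out.replace("$" + i, g == null ? "" : g);
        }
        return out;
    }

    /** Returns the MOBY object XML for the first match, or null if the rule does not apply. */
    static String buildObject(String text) {
        Matcher m = RULE.matcher(text);
        if (!m.find()) {
            return null;
        }
        String id = expand("$1", m); // "$1" is the text content of the ns element
        return "<MOBY:Object namespace=\"NCBI_gi\" id=\"" + id + "\"/>";
    }

    public static void main(String[] args) {
        System.out.println(buildObject("This gene is cross-referenced to GI:78045557"));
        // prints <MOBY:Object namespace="NCBI_gi" id="78045557"/>
    }
}
```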

Multiple Namespaces

Because a MOBY object can have only one (namespace, id) pair, any additional ns rules will be stored in the MOBY CrossReference Information Block (CRIB).

Special Predefined Character Classes

For convenience and regular expression readability, Seahawk supports two extra predefined character classes:

\N: any primary IUPAC nucleotide characters [acgtunxACGTUNX]
\P: any primary IUPAC amino acid characters [ARNDCQEGHILKMFPSTWYVBZXarndcqeghilkmfpstwyvbz*]
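These classes are not part of java.util.regex (where \N and \P have other meanings), so a rule that uses them has to be rewritten before it can be compiled as a standard pattern. One way to emulate this is a textual expansion, sketched below. The class SeahawkClasses and its compile method are hypothetical, and this is an assumption about how the classes could be emulated, not a description of Seahawk's implementation.

```java
import java.util.regex.Pattern;

public class SeahawkClasses {
    // Character sets copied from the list above
    static final String NUCLEOTIDES = "[acgtunxACGTUNX]";
    static final String AMINO_ACIDS = "[ARNDCQEGHILKMFPSTWYVBZXarndcqeghilkmfpstwyvbz*]";

    /**
     * Expands \N and \P into standard character classes, then compiles the result.
     * Naive on purpose: it would also rewrite an escaped backslash followed by N or P,
     * and standard constructs such as \P{L}.
     */
    static Pattern compile(String seahawkRegex) {
        String expanded = seahawkRegex
            .replace("\\N", NUCLEOTIDES)
            .replace("\\P", AMINO_ACIDS);
        return Pattern.compile(expanded);
    }

    public static void main(String[] args) {
        System.out.println(compile("\\N+").matcher("ACGTN").matches()); // nucleotides only
        System.out.println(compile("\\P+").matcher("MKV*").matches());  // amino acids and stop
        System.out.println(compile("\\N+").matcher("MKV").matches());   // not nucleotides
    }
}
```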

How do I add an XPath rule?

XPath rules are used to build MOBY Objects from in-memory XML Document Object Models (DOMs). In contrast to the regular expressions, which pick salient substrings from a simple text character sequence, XPath rules search the highly structured data of a DOM. XPath rules are consequently considerably more flexible and powerful.

<?xml version="1.0"?>
<mappings>
<prefix value="agave">http://www.bioxml.info/dtd/agave.dtd</prefix>

<!-- a base object in the Gene Ontology (GO) namespace -->
<object>
  <!-- find AGAVE gene elements with a GO classification child element -->
  <xpath>self::agave:gene//agave:classification[@system='GO']</xpath>
  <namespace>
    <!-- find the id attribute of the above result -->
    <ns value="GO">./@id</ns>
  </namespace>
</object>
</mappings>

Note that the "./@id" in the namespace rule is itself an XPath statement. It is evaluated relative to each result of the xpath rule.
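The two-step evaluation (first the xpath rule, then the ns expression against each result) can be sketched with the standard javax.xml.xpath API. The class XPathRuleDemo, the SAMPLE document, and its attribute values are made up for illustration, and the sketch assumes the rule is evaluated with a gene element as the context node, as the self:: axis implies.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathRuleDemo {
    static final String AGAVE_NS = "http://www.bioxml.info/dtd/agave.dtd";

    // Made-up, AGAVE-like fragment: one gene with two classifications
    static final String SAMPLE =
        "<gene xmlns=\"" + AGAVE_NS + "\">"
        + "<classification system=\"GO\" id=\"GO:0005524\"/>"
        + "<classification system=\"EC\" id=\"2.7.1.1\"/>"
        + "</gene>";

    /** Returns the IDs the rule would assign in the GO namespace. */
    static List<String> goIds(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        Node gene = doc.getElementsByTagNameNS(AGAVE_NS, "gene").item(0);

        XPath xp = XPathFactory.newInstance().newXPath();
        // Equivalent of the <prefix value="agave"> declaration
        xp.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "agave".equals(prefix) ? AGAVE_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return AGAVE_NS.equals(uri) ? "agave" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return AGAVE_NS.equals(uri)
                    ? Collections.singletonList("agave").iterator()
                    : Collections.<String>emptyIterator();
            }
        });

        // Step 1: the <xpath> rule, evaluated with the gene element as context
        NodeList hits = (NodeList) xp.evaluate(
            "self::agave:gene//agave:classification[@system='GO']",
            gene, XPathConstants.NODESET);

        // Step 2: the <ns> expression, evaluated relative to each step-1 result
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            ids.add(xp.evaluate("./@id", hits.item(i)));
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(goIds(SAMPLE)); // prints [GO:0005524]
    }
}
```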

How do I build complex objects?

In order to construct non-primitive MOBY data ontology objects, we must be able to specify information besides the namespace and id. This is done using the object's child elements datatype and member. For example, FastA-formatted data:

>DVU0035
ATGGACAGCTACATCGTTCGCGGCATCCTCATCGGCGGTTCCGTCGGGGTCATCGCCGCG
CTTCTCGGCTTCAGCGACAGTATCCCCCGCGCCTTCGGCGTAGGCATGGTGGGCGGCTTC
TTCGCAGGCATCACACTCGAAAGCAGGCGCCGCAAACGCCCTTCCGGGAAGTAG

This data is represented in MOBY as a FASTA_NA object with one member, "contents":

<MOBY:FASTA_NA namespace="unknown" id="DVU0035">
  <String articleName="contents" namespace="" id="">
>DVU0035
ATGGACAGCTACATCGTTCGCGGCATCCTCATCGGCGGTTCCGTCGGGGTCATCGCCGCG
CTTCTCGGCTTCAGCGACAGTATCCCCCGCGCCTTCGGCGTAGGCATGGTGGGCGGCTTC
TTCGCAGGCATCACACTCGAAAGCAGGCGCCGCAAACGCCCTTCCGGGAAGTAG
  </String>
</MOBY:FASTA_NA>

Data Types

The MOBY data type is specified simply by adding a datatype element to the object:

<datatype value="FASTA_NA"/>

Data Members

In order to populate the "contents" member, a member element is added to the object, so in total we have the rule:

<object>
  <!-- \N is a Seahawk-specific regex character class for DNA/RNA characters -->
  <regex>(>(\S*)[^\n]*(?:\n\N+)+)</regex>
  <namespace>
    <ns value="unknown">$2</ns>
  </namespace>

  <datatype value="FASTA_NA"/>
  <member value="contents">$1</member>
</object>
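To see which parts of the input end up in $1 and $2, the regex can be tried with java.util.regex once \N is written out as a standard character class. This is a minimal sketch under that assumption; the class FastaRuleDemo and its method are hypothetical and not part of Seahawk.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FastaRuleDemo {
    // The rule's regex with \N spelled out as [acgtunxACGTUNX]
    static final Pattern FASTA =
        Pattern.compile("(>(\\S*)[^\\n]*(?:\\n[acgtunxACGTUNX]+)+)");

    /** Returns {id ($2), record text ($1)} for the first FASTA record, or null if none. */
    static String[] idAndContents(String text) {
        Matcher m = FASTA.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(2), m.group(1) };
    }

    public static void main(String[] args) {
        String fasta = ">DVU0035\nATGGACAGCTAC\nCTTCTCGGCTTC\n";
        String[] parts = idAndContents(fasta);
        System.out.println(parts[0]); // prints DVU0035
        System.out.println(parts[1]); // header line plus both sequence lines
    }
}
```

Group 2 stops at the first whitespace after ">", so a header with a trailing description still yields just the identifier, while group 1 spans the header and every following line made up solely of nucleotide characters.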

Sanity Checks

Seahawk will check that all members for a given MOBY datatype are specified (e.g. DNASequence must have both "sequenceLength" and "sequenceString" member rules); otherwise an exception will be thrown. This ensures that, at run-time, the rules are up-to-date with respect to the MOBY Data Ontology.
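The idea behind such a check can be sketched as a set comparison. This is an illustration only: the class MemberCheck is hypothetical, the hard-coded REQUIRED table stands in for the member definitions that would really come from the MOBY Data Ontology, and the exception type is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MemberCheck {
    // Stand-in for the ontology: datatype name -> member names it declares
    static final Map<String, Set<String>> REQUIRED = new HashMap<>();
    static {
        REQUIRED.put("DNASequence",
            new HashSet<>(Arrays.asList("sequenceLength", "sequenceString")));
        REQUIRED.put("FASTA_NA", new HashSet<>(Arrays.asList("contents")));
    }

    /** Throws if the rule does not supply every member the datatype declares. */
    static void check(String datatype, Set<String> ruleMembers) {
        Set<String> missing = new TreeSet<>(
            REQUIRED.getOrDefault(datatype, Collections.<String>emptySet()));
        missing.removeAll(ruleMembers);
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                "Rule for " + datatype + " is missing member(s): " + missing);
        }
    }

    public static void main(String[] args) {
        // Complete rule: passes silently
        check("DNASequence",
            new HashSet<>(Arrays.asList("sequenceLength", "sequenceString")));
        try {
            // Incomplete rule: only one of the two members is given
            check("DNASequence", Collections.singleton("sequenceString"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```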

How do I override the default rules file?

The default rules loaded by Seahawk come from the resource "docs/mobyBuilderRules.xml". This resource is in the Seahawk JAR file, and it could be updated using the Java Development Kit (JDK) JAR utilities. An easier way to change the default mappings file is to set the system property seahawk.rules. For example, on the command line:

java -Dseahawk.rules=path/to/newrules.xml -jar seahawk.jar

or in your program code, before creating the MobyClient:

System.setProperty(MobyClient.RESOURCE_SYSTEM_PROPERTY, 
                  "path/to/rules.xml");

How do I add/remove rules programmatically (in a running application)?

At runtime, rules can be added to or removed from Seahawk as follows:

import java.net.URL;
import ca.ucalgary.bluejay.services.MobyClient;

MobyClient client = new MobyClient();

// example:  Add another XML rules file
client.addMappingsFromURL(new URL("file:///foo/bar/rules.xml"));

// example: Add a regex for the gene ontology, which always has 7 digits as its ID
client.addRegexMapping("GO:\\d{7}", new String[]{"GO"});  // backslash must be doubled in a Java string literal

// example: Add an XPath rule to get the gene ontology id attribute from a 
// classified gene in an AGAVE document
String xpath = "self::agave:gene//agave:classification[@system='GO']/@id";
client.addNamespaceContext("agave", "http://www.bioxml.info/dtd/agave.dtd");  // == prefix XML element
client.addXPathMapping(xpath, new String[]{"GO"});

// example: delete the mapping we just added
client.deleteXPathMapping(xpath);

Paul Gordon

Last modified: Tue Apr 25 20:31:58 MDT 2006