Seahawk discovers services for selected data primarily through rules (or templates if you prefer) mapping text strings and XML data to MOBY data ontology objects. These rules are defined in a file, so that Seahawk's data mapping capabilities can be extended simply by editing a text file.
The default rules file can be seen here.
The rules file is an XML file. It has a fairly simple primary structure:
The root element is mappings. mappings can have 0 or more prefix children,
followed by 0 or more object children.
The prefix elements globally define namespace mappings to be used in the
file's XPath rules. For example, declaring
<prefix value="tigr">http://www.bioxml.info/dtd/tigrxml.dtd</prefix>
allows one to write XPath statements later on such as
./tigr:TU
instead of using the rather unwieldy
./*[namespace-uri()="http://www.bioxml.info/dtd/tigrxml.dtd" and local-name()="TU"]
every time.
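The equivalence of the two XPath forms can be checked with the JDK's own XPath engine. The following is a minimal sketch, not Seahawk code: the class name PrefixSketch, the helper countMatches and the sample document are assumptions made for illustration, and only the namespace URI and the two expressions come from the text above.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PrefixSketch {
    static final String TIGR_NS = "http://www.bioxml.info/dtd/tigrxml.dtd";

    /** Counts the nodes that expression selects, evaluated from the root element of xml. */
    static int countMatches(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, prefixes cannot resolve
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Plays the role of the <prefix value="tigr"> element: binds "tigr" to the URI.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "tigr".equals(prefix) ? TIGR_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return TIGR_NS.equals(uri) ? "tigr" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    String prefix = getPrefix(uri);
                    return prefix == null
                            ? Collections.<String>emptyIterator()
                            : Collections.singletonList(prefix).iterator();
                }
            });
            NodeList hits = (NodeList) xpath.evaluate(
                    expression, doc.getDocumentElement(), XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document; real TIGR XML is far richer.
        String xml = "<root xmlns='" + TIGR_NS + "'><TU/><TU/><other/></root>";
        System.out.println(countMatches(xml, "./tigr:TU"));
        System.out.println(countMatches(xml,
                "./*[namespace-uri()='" + TIGR_NS + "' and local-name()='TU']"));
    }
}
```

Both calls select the same TU elements; the prefixed form is simply shorter to write and read.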
The object elements represent templates for MOBY objects to construct.
Tags nested inside the
object tag fill in the various MOBY object fields. The simplest
MOBY objects have just a (namespace, identifier) attribute pair. For example,
the following file
defines one rule, to build a MOBY object in the NCBI global identifier namespace
("NCBI_gi" in MOBY's namespace ontology):
<?xml version="1.0"?>
<mappings>
<object>
<regex>(?:GI|gi)[:|](\d+)</regex>
<namespace>
<ns value="NCBI_gi">$1</ns>
</namespace>
</object>
</mappings>
More details on the regular expression syntax are presented below.
In the previous example, the object element had a regex child.
Whenever Seahawk is asked to generate Moby objects from a java.lang.String,
this regular expression will be evaluated. Suppose the string in question is:
This gene is cross-referenced to GI:78045557
Since the regex (details below) matches, a new Moby object will be created. The namespace of the object is filled in using the
namespace element. Each ns child must specify a MOBY
object namespace in the value attribute, and the text contents of the element
will be used as the MOBY object ID. Any $# in the text will be replaced with the corresponding parenthesized
group from the regex match. For example, the namespace rule in the example is
<namespace>
<ns value="NCBI_gi">$1</ns>
</namespace>
in which $1 will be replaced with the digit group "78045557" from the match of
(?:GI|gi)[:|](\d+). This yields a MOBY object of the form
<MOBY:Object namespace="NCBI_gi" id="78045557"/>
Because a MOBY object can have only one (namespace, id) pair, any additional ns
rules will be stored in the MOBY CrossReference Information Block (CRIB).
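The substitution step described above can be sketched with java.util.regex alone. This is an illustration under stated assumptions, not Seahawk's implementation: the class RegexRuleSketch and its methods are hypothetical, and it assumes the rule is applied as a search anywhere in the string, which is consistent with the example but not confirmed by the text.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexRuleSketch {
    // The regex from the example rule, escaped for a Java string literal.
    static final Pattern RULE = Pattern.compile("(?:GI|gi)[:|](\\d+)");
    // Matches $1, $2, ... references inside an ns template.
    static final Pattern GROUP_REF = Pattern.compile("\\$(\\d+)");

    /** Replaces each $n in template with capture group n of match. */
    static String fill(String template, Matcher match) {
        Matcher ref = GROUP_REF.matcher(template);
        StringBuffer out = new StringBuffer();
        while (ref.find()) {
            String group = match.group(Integer.parseInt(ref.group(1)));
            // quoteReplacement keeps '$' or '\' in the captured text literal.
            ref.appendReplacement(out, Matcher.quoteReplacement(group == null ? "" : group));
        }
        ref.appendTail(out);
        return out.toString();
    }

    /** Builds the base MOBY object XML if the rule matches somewhere in text. */
    static Optional<String> toMobyXml(String text, String namespace, String idTemplate) {
        Matcher match = RULE.matcher(text);
        if (!match.find()) { // assumption: search semantics, not whole-string match
            return Optional.empty();
        }
        String id = fill(idTemplate, match);
        return Optional.of("<MOBY:Object namespace=\"" + namespace + "\" id=\"" + id + "\"/>");
    }

    public static void main(String[] args) {
        String text = "This gene is cross-referenced to GI:78045557";
        System.out.println(toMobyXml(text, "NCBI_gi", "$1").orElse("no match"));
    }
}
```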
For convenience and regular expression readability, Seahawk supports two extra predefined character classes:
\N: any primary IUPAC nucleotide character, [acgtunxACGTUNX]
\P: any primary IUPAC amino acid character, [ARNDCQEGHILKMFPSTWYVBZXarndcqeghilkmfpstwyvbz*]
XPath rules are used to build MOBY Objects from in-memory XML Document Object Models (DOMs). In contrast to the regular expressions, which pick salient substrings from a simple text character sequence, XPath rules search the highly structured data of a DOM. XPath rules are consequently considerably more flexible and powerful. For example:
<?xml version="1.0"?>
<mappings>
<prefix value="agave">http://www.bioxml.info/dtd/agave.dtd</prefix>
<!-- a base object in the Gene Ontology (GO) namespace -->
<object>
<!-- find AGAVE gene elements with a GO classification child element -->
<xpath>self::agave:gene//agave:classification[@system='GO']</xpath>
<namespace>
<!-- find the id attribute of the above result -->
<ns value="GO">./@id</ns>
</namespace>
</object>
</mappings>
Note that the "./@id" in the namespace rule is another XPath statement.
Its context is the results of the xpath rule.
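The two-stage evaluation (the xpath rule first, then the ns expression relative to each result) can be sketched with the JDK XPath API. The class XPathRuleSketch, the method goIds and the sample AGAVE fragment are assumptions for illustration; the sketch also assumes the rule is evaluated with the document's root element as the context node, which the text above does not specify.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathRuleSketch {
    static final String AGAVE_NS = "http://www.bioxml.info/dtd/agave.dtd";

    /** Returns the id found by "./@id" for every node the xpath rule selects. */
    static List<String> goIds(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "agave".equals(prefix) ? AGAVE_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return AGAVE_NS.equals(uri) ? "agave" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    String prefix = getPrefix(uri);
                    return prefix == null
                            ? Collections.<String>emptyIterator()
                            : Collections.singletonList(prefix).iterator();
                }
            });

            // Stage 1: the <xpath> rule, evaluated from the root element (assumption).
            NodeList hits = (NodeList) xpath.evaluate(
                    "self::agave:gene//agave:classification[@system='GO']",
                    doc.getDocumentElement(), XPathConstants.NODESET);

            // Stage 2: the <ns> expression, evaluated relative to each stage 1 result.
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                ids.add(xpath.evaluate("./@id", hits.item(i)));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample fragment; the id values are placeholders.
        String xml = "<gene xmlns='" + AGAVE_NS + "'>"
                + "<classification system='GO' id='GO:0008150'/>"
                + "<classification system='EC' id='1.1.1.1'/>"
                + "</gene>";
        System.out.println(goIds(xml));
    }
}
```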
In order to construct non-primitive MOBY data ontology objects, we must be able to
specify information besides the namespace and id. This is done using the
object's child elements datatype and member.
For example, FastA-formatted data:
>DVU0035
ATGGACAGCTACATCGTTCGCGGCATCCTCATCGGCGGTTCCGTCGGGGTCATCGCCGCG
CTTCTCGGCTTCAGCGACAGTATCCCCCGCGCCTTCGGCGTAGGCATGGTGGGCGGCTTC
TTCGCAGGCATCACACTCGAAAGCAGGCGCCGCAAACGCCCTTCCGGGAAGTAG
This data is represented in MOBY as a FASTA_NA with one member: "contents"
<MOBY:FASTA_NA namespace="unknown" id="DVU0035">
<String articleName="contents" namespace="" id="">
>DVU0035
ATGGACAGCTACATCGTTCGCGGCATCCTCATCGGCGGTTCCGTCGGGGTCATCGCCGCG
CTTCTCGGCTTCAGCGACAGTATCCCCCGCGCCTTCGGCGTAGGCATGGTGGGCGGCTTC
TTCGCAGGCATCACACTCGAAAGCAGGCGCCGCAAACGCCCTTCCGGGAAGTAG
</String>
</MOBY:FASTA_NA>
The MOBY data type is specified simply by adding a datatype element to
the object:
<datatype value="FASTA_NA"/>
In order to populate the "contents" member, a member element is added to the
object, so in total we have the rule:
<object>
<!-- \N is a Seahawk-specific regex character class for DNA/RNA characters -->
<regex>(>(\S*)[^\n]*(?:\n\N+)+)</regex>
<namespace>
<ns value="unknown">$2</ns>
</namespace>
<datatype value="FASTA_NA"/>
<member value="content">$1</member>
</object>
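To see how the regex in this rule behaves, the Seahawk-specific \N class can be expanded by hand into the standard character class it stands for, [acgtunxACGTUNX], and the result run through java.util.regex. The sketch below is an assumption-laden illustration: FastaRuleSketch and its methods are hypothetical, and the naive textual expansion is not how Seahawk necessarily handles \N and \P (for instance, it would mangle a standard \P{...} property class).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FastaRuleSketch {
    static final String NUCLEOTIDE_CLASS = "[acgtunxACGTUNX]";
    static final String AMINO_ACID_CLASS =
            "[ARNDCQEGHILKMFPSTWYVBZXarndcqeghilkmfpstwyvbz*]";

    /** Naively rewrites the \N and \P shorthands into standard character classes. */
    static String expand(String seahawkRegex) {
        return seahawkRegex
                .replace("\\N", NUCLEOTIDE_CLASS)
                .replace("\\P", AMINO_ACID_CLASS);
    }

    /** Returns a matcher positioned on the first FASTA record in text, or null if none. */
    static Matcher firstRecord(String seahawkRegex, String text) {
        Matcher m = Pattern.compile(expand(seahawkRegex)).matcher(text);
        return m.find() ? m : null;
    }

    public static void main(String[] args) {
        // The rule's regex, escaped for a Java string literal.
        String rule = "(>(\\S*)[^\\n]*(?:\\n\\N+)+)";
        // Shortened version of the DVU0035 record, for brevity.
        String fasta = ">DVU0035\nATGGACAGCTAC\nCTTCTCGGC";
        Matcher m = firstRecord(rule, fasta);
        if (m != null) {
            System.out.println("id ($2): " + m.group(2));
            System.out.println("record ($1):\n" + m.group(1));
        }
    }
}
```

Group 2 captures the identifier after ">" and feeds the ns rule, while group 1 captures the whole record and feeds the member rule.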
Seahawk will check that all members for a given MOBY datatype are specified (e.g. DNASequence must have both "sequenceLength" and "sequenceString" member rules); otherwise an exception will be thrown. This ensures that at run-time, the rules are up-to-date with respect to the Moby Data Ontology.
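The completeness check can be pictured as a set difference between the members a datatype requires and the members a rule supplies. The following sketch is purely illustrative: MemberCheckSketch, its hard-coded REQUIRED_MEMBERS table (standing in for a lookup in the Moby Data Ontology) and the choice of IllegalArgumentException are all assumptions, since the text does not say which exception Seahawk throws or how it queries the ontology.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class MemberCheckSketch {
    // Hypothetical stand-in for the ontology: datatype name -> required member names.
    static final Map<String, Set<String>> REQUIRED_MEMBERS = new HashMap<>();
    static {
        REQUIRED_MEMBERS.put("DNASequence",
                new TreeSet<>(Arrays.asList("sequenceLength", "sequenceString")));
    }

    /** Throws if the rule omits any member that the datatype requires. */
    static void check(String datatype, Set<String> ruleMembers) {
        Set<String> missing = new TreeSet<>(
                REQUIRED_MEMBERS.getOrDefault(datatype, Collections.<String>emptySet()));
        missing.removeAll(ruleMembers);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "Rule for " + datatype + " lacks member rules: " + missing);
        }
    }

    public static void main(String[] args) {
        // Complete rule: passes silently.
        check("DNASequence", new HashSet<>(Arrays.asList("sequenceLength", "sequenceString")));
        try {
            // Incomplete rule: sequenceLength is missing.
            check("DNASequence", Collections.singleton("sequenceString"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```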
The default rules loaded by Seahawk come from the resource "docs/mobyBuilderRules.xml".
This resource is in the Seahawk JAR file, and it could be updated using the
Java Development Kit (JDK) JAR utilities. An easier way to change the default mappings
file is to set the system property seahawk.rules. For example, on the command line:
java -Dseahawk.rules=path/to/newrules.xml -jar seahawk.jar
or in your program code, before creating the MobyClient:
System.setProperty(MobyClient.RESOURCE_SYSTEM_PROPERTY,
"path/to/rules.xml");
Rules can also be added or removed at run time through the MobyClient API:
import java.net.URL;
import ca.ucalgary.bluejay.services.MobyClient;
MobyClient client = new MobyClient();
// example: Add another XML rules file
client.addMappingsFromURL(new URL("file:///foo/bar/rules.xml"));
// example: Add a regex for the gene ontology, which always has 7 digits as its ID
client.addRegexMapping("GO:\\d{7}", new String[]{"GO"});
// example: Add an XPath rule to get the gene ontology id attribute from a
// classified gene in an AGAVE document
String xpath = "self::agave:gene//agave:classification[@system='GO']/@id";
client.addNamespaceContext("agave", "http://www.bioxml.info/dtd/agave.dtd"); // == prefix XML element
client.addXPathMapping(xpath, new String[]{"GO"});
// example: delete the mapping we just added
client.deleteXPathMapping(xpath);