Seahawk discovers services for selected data primarily through rules (or templates if you prefer) mapping text strings and XML data to MOBY data ontology objects. These rules are defined in a file, so that Seahawk's data mapping capabilities can be extended simply by editing a text file.
The default rules file can be seen here.
The rules file is an XML file. It has a fairly simple primary structure: a mappings root element, which can have 0 or more prefix children, followed by 0 or more object children.
The prefix elements globally define namespace mappings to be used in the file's XPath rules. For example, the declaration

    <prefix value="tigr">http://www.bioxml.info/dtd/tigrxml.dtd</prefix>

allows one to write XPath statements later on such as

    ./tigr:TU

instead of using the rather unwieldy

    ./*[namespace-uri()="http://www.bioxml.info/dtd/tigrxml.dtd" and local-name()="TU"]

every time.
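The effect of a prefix mapping can be reproduced with the standard JDK XPath API. The following is a minimal sketch, not Seahawk's own code: the class name PrefixDemo and the sample XML document are invented for illustration, and only the tigr prefix and its URI come from the text above. It evaluates the short prefixed expression and the longer namespace-uri()/local-name() form against the same document.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PrefixDemo {
    static final String TIGR_NS = "http://www.bioxml.info/dtd/tigrxml.dtd";

    // Counts the nodes an XPath expression selects, relative to the root element
    // of a small sample document (hypothetical, invented for this sketch).
    static int count(String expr) throws Exception {
        String xml = "<ASSEMBLY xmlns='" + TIGR_NS + "'><TU/><TU/></ASSEMBLY>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for namespace-based matching
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));

        XPath xp = XPathFactory.newInstance().newXPath();
        // Plays the role of <prefix value="tigr">...</prefix> in the rules file
        xp.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "tigr".equals(prefix) ? TIGR_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        });
        NodeList hits = (NodeList) xp.evaluate(
            expr, doc.getDocumentElement(), XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Both expressions select the same TU elements
        System.out.println(count("./tigr:TU"));
        System.out.println(count(
            "./*[namespace-uri()='" + TIGR_NS + "' and local-name()='TU']"));
    }
}
```

Both calls select the same two TU elements of the sample document, which is why declaring the prefix once keeps the later rules short.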
The object elements represent templates for MOBY objects to construct. Tags nested inside the object tag fill in the various MOBY object fields. The simplest MOBY objects have just a (namespace, identifier) attribute pair. For example, the following file defines one rule, to build a MOBY object in the NCBI global identifier namespace ("NCBI_gi" in MOBY's namespace ontology):

    <?xml version="1.0"?>
    <mappings>
      <object>
        <regex>(?:GI|gi)[:|](\d+)</regex>
        <namespace>
          <ns value="NCBI_gi">$1</ns>
        </namespace>
      </object>
    </mappings>

More details on the regular expression syntax are presented below.
In the previous example, the object element had a regex child. Whenever Seahawk is asked to generate Moby objects from a java.lang.String, this regular expression will be evaluated. Suppose the string in question is:

    This genes is cross-referenced to GI:78045557

Since the regex (details below) matches, a new Moby object will be created. The namespace of the object is filled in using the namespace element. Each ns child must specify a MOBY object namespace in the value attribute, and the text contents of the element will be used as the MOBY object ID. Any $n reference (e.g. $1) in the text will be replaced with the corresponding parenthesized group from the regex match. For instance, the namespace rule in the example is

    <namespace>
      <ns value="NCBI_gi">$1</ns>
    </namespace>

where $1 will be replaced with the digit group "78045557" from the match of

    (?:GI|gi)[:|](\d+)

This yields a MOBY object of the form

    <MOBY:Object namespace="NCBI_gi" id="78045557"/>
Because a MOBY object can have only one (namespace, id) pair, any additional ns rules will be stored in the MOBY CrossReference Information Block (CRIB).
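The group substitution described above can be illustrated with plain java.util.regex. This is a rough sketch of the idea rather than Seahawk's implementation: the class GiRuleDemo and its methods are hypothetical names, and the sketch only handles single-digit group references such as $1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GiRuleDemo {
    // The regex and ns template from the example rule
    static final Pattern RULE = Pattern.compile("(?:GI|gi)[:|](\\d+)");
    static final String NS_TEMPLATE = "$1";

    // Replaces each $n in the template with group n of the given match.
    static String fillTemplate(Matcher match, String template) {
        Matcher ref = Pattern.compile("\\$(\\d)").matcher(template);
        StringBuffer out = new StringBuffer();
        while (ref.find()) {
            String group = match.group(Integer.parseInt(ref.group(1)));
            // A group that did not participate in the match becomes empty text
            ref.appendReplacement(out,
                Matcher.quoteReplacement(group == null ? "" : group));
        }
        ref.appendTail(out);
        return out.toString();
    }

    // Returns the object ID the rule would produce, or null if the rule
    // does not match anywhere in the text.
    static String objectIdFor(String text) {
        Matcher m = RULE.matcher(text);
        return m.find() ? fillTemplate(m, NS_TEMPLATE) : null;
    }

    public static void main(String[] args) {
        String text = "This genes is cross-referenced to GI:78045557";
        System.out.println(objectIdFor(text)); // prints 78045557
    }
}
```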
For convenience and regular expression readability, Seahawk supports two extra predefined character classes:

\N : any primary IUPAC nucleotide character, [acgtunxACGTUNX]
\P : any primary IUPAC amino acid character, [ARNDCQEGHILKMFPSTWYVBZXarndcqeghilkmfpstwyvbz*]

XPath rules are used to build MOBY Objects from in-memory XML Document Object Models (DOMs). In contrast to the regular expressions, which pick salient substrings from a simple text character sequence, XPath rules search the highly structured data of a DOM. XPath rules are consequently considerably more flexible and powerful.
    <?xml version="1.0"?>
    <mappings>
      <prefix value="agave">http://www.bioxml.info/dtd/agave.dtd</prefix>
      <!-- a base object in the Gene Ontology (GO) namespace -->
      <object>
        <!-- find AGAVE gene elements with a GO classification child element -->
        <xpath>self::agave:gene//agave:classification[@system='GO']</xpath>
        <namespace>
          <!-- find the id attribute of the above result -->
          <ns value="GO">./@id</ns>
        </namespace>
      </object>
    </mappings>

Note that the "./@id" in the namespace rule is another XPath statement. Its context is the result of the xpath rule.
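The two-stage evaluation (an xpath rule selecting nodes, then an ns rule evaluated relative to each selected node) can be sketched with the JDK XPath API. The class AgaveGoDemo and the sample document are assumptions made for illustration; real AGAVE documents are more deeply nested, and Seahawk's internal evaluation may differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AgaveGoDemo {
    static final String AGAVE_NS = "http://www.bioxml.info/dtd/agave.dtd";

    // Applies the xpath rule to the root element, then the relative ns rule
    // ("./@id") to every node the first rule selected.
    static List<String> goIds(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));

        XPath xp = XPathFactory.newInstance().newXPath();
        // Equivalent of the <prefix value="agave"> declaration
        xp.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "agave".equals(prefix) ? AGAVE_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        });

        NodeList hits = (NodeList) xp.evaluate(
            "self::agave:gene//agave:classification[@system='GO']",
            doc.getDocumentElement(), XPathConstants.NODESET);

        List<String> ids = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            // The context node here is the classification element found above
            ids.add(xp.evaluate("./@id", hits.item(i)));
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, heavily simplified AGAVE fragment
        String xml = "<gene xmlns='" + AGAVE_NS + "'>"
                   + "<classification system='GO' id='GO:0003677'/>"
                   + "<classification system='EC' id='2.7.7.7'/>"
                   + "</gene>";
        System.out.println(goIds(xml)); // prints [GO:0003677]
    }
}
```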
In order to construct non-primitive MOBY data ontology objects, we must be able to specify information besides the namespace and id. This is done using the object's child elements datatype and member.
For example, FastA-formatted data:

    >DVU0035
    ATGGACAGCTACATCGTTCGCGGCATCCTCATCGGCGGTTCCGTCGGGGTCATCGCCGCG
    CTTCTCGGCTTCAGCGACAGTATCCCCCGCGCCTTCGGCGTAGGCATGGTGGGCGGCTTC
    TTCGCAGGCATCACACTCGAAAGCAGGCGCCGCAAACGCCCTTCCGGGAAGTAG
This data is represented in MOBY as a FASTA_NA with one member, "contents":

    <MOBY:FASTA_NA namespace="unknown" id="DVU0035">
      <String articleName="contents" namespace="" id="">
    >DVU0035
    ATGGACAGCTACATCGTTCGCGGCATCCTCATCGGCGGTTCCGTCGGGGTCATCGCCGCG
    CTTCTCGGCTTCAGCGACAGTATCCCCCGCGCCTTCGGCGTAGGCATGGTGGGCGGCTTC
    TTCGCAGGCATCACACTCGAAAGCAGGCGCCGCAAACGCCCTTCCGGGAAGTAG
      </String>
    </MOBY:FASTA_NA>
The MOBY data type is specified simply by adding a datatype element to the object:

    <datatype value="FASTA_NA"/>
In order to populate the "contents" member, a member element is added to the object, so in total we have the rule:

    <object>
      <!-- \N is a Seahawk-specific regex character class for DNA/RNA characters -->
      <regex>(>(\S*)[^\n]*(?:\n\N+)+)</regex>
      <namespace>
        <ns value="unknown">$2</ns>
      </namespace>
      <datatype value="FASTA_NA"/>
      <member value="content">$1</member>
    </object>
Seahawk will check that all members for a given MOBY datatype are specified (e.g. DNASequence must have both "sequenceLength" and "sequenceString" member rules); otherwise an exception will be thrown. This ensures that at run-time, the rules are up-to-date with respect to the Moby Data Ontology.
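To see how the FASTA_NA rule above carves up its input, the regex can be tried with java.util.regex once \N is expanded to the nucleotide class listed earlier. The sketch below assumes a simple textual substitution of \N; Seahawk's actual handling of \N may differ, and the class name FastaRuleDemo and the short sample sequence are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FastaRuleDemo {
    // Assumption: \N is emulated by textual substitution with the nucleotide
    // class. This naive replace would also rewrite an escaped backslash
    // followed by N, which is acceptable for a sketch only.
    static Pattern compileRule(String seahawkRegex) {
        String expanded = seahawkRegex.replace("\\N", "[acgtunxACGTUNX]");
        return Pattern.compile(expanded);
    }

    // Returns {id, whole FASTA record}, i.e. groups $2 and $1 of the rule,
    // or null when the text contains no FASTA nucleotide record.
    static String[] idAndRecord(String text) {
        // Java-escaped form of the rule's regex: (>(\S*)[^\n]*(?:\n\N+)+)
        Pattern rule = compileRule("(>(\\S*)[^\\n]*(?:\\n\\N+)+)");
        Matcher m = rule.matcher(text);
        return m.find() ? new String[] { m.group(2), m.group(1) } : null;
    }

    public static void main(String[] args) {
        String fasta = ">DVU0035\nATGGACAGCTAC\nCTTCTCGGCTTC\n";
        String[] parts = idAndRecord(fasta);
        System.out.println("id: " + parts[0]);
        System.out.println("record:\n" + parts[1]);
    }
}
```

Group 2 supplies the object id through the ns rule, while group 1, the entire record, supplies the text of the member rule.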
The default rules loaded by Seahawk come from the resource "docs/mobyBuilderRules.xml". This resource is in the Seahawk JAR file, and it could be updated using the Java Development Kit (JDK) JAR utilities. An easier way to change the default mappings file is to set the system property seahawk.rules. For example, on the command line:

    java -Dseahawk.rules=path/to/newrules.xml -jar seahawk.jar

or in your program code, before creating the MobyClient:

    System.setProperty(MobyClient.RESOURCE_SYSTEM_PROPERTY, "path/to/rules.xml");
    import java.net.URL;
    import ca.ucalgary.bluejay.services.MobyClient;

    MobyClient client = new MobyClient();

    // example: Add another XML rules file
    client.addMappingsFromURL(new URL("file:///foo/bar/rules.xml"));

    // example: Add a regex for the gene ontology, which always has 7 digits as its ID
    // (the backslash must be doubled inside a Java string literal)
    client.addRegexMapping("GO:\\d{7}", new String[]{"GO"});

    // example: Add an XPath rule to get the gene ontology id attribute from a
    // classified gene in an AGAVE document
    String xpath = "self::agave:gene//agave:classification[@system='GO']/@id";
    client.addNamespaceContext("agave", "http://www.bioxml.info/dtd/agave.dtd"); // == prefix XML element
    client.addXPathMapping(xpath, new String[]{"GO"});

    // example: delete the mapping we just added
    client.deleteXPathMapping(xpath);