This report concerns the messaging layer of the Moby project, that point at which semantic information is exchanged between the data consumer (the biologist or client process) and the data provider (the model organism system database).
There are a wide variety of possible messaging systems. Here are a number of prominent ones listed in rough chronological order.
For the purposes of this assessment, I ignored 1, 2, 3 and 4. I rejected custom messaging alternatives 1 and 2 because they represent fallback positions that we should consider only if none of the standard solutions meet our requirements. Exchange of messages via ASN.1 streams and Microsoft DCOM can rightly be treated as legacy solutions that have found important places in niche applications but are no longer acceptable as solutions for enabling the exchange of semantic information across administrative domains. SOAP arises from, and supersedes, XML-RPC, and so the two are folded together. This leaves REST, CORBA, and SOAP.
I will now consider these in chronological order.
REST stands for REpresentational State Transfer, and is a term coined in Roy Fielding's graduate thesis to describe a style of information architecture that had already become the de facto standard for the World Wide Web. According to Fielding, REST is suited for scalable applications in which relatively large hypermedia representations of information resources are exchanged within the "anarchic network."
Probably the most innovative aspect of REST is its use of URIs to address each resource. I will use a DAS-like application (DAS is the Distributed Sequence Annotation System, used to distribute annotations on a genome) as an exemplar of this style. In DAS, a URI can be used to identify a particular segment of a genome:
http://my.site/das/d-melanogaster/r3.1/2R
This identifies the genome of Drosophila melanogaster, assembly release version 3.1, chromosome arm 2R. To fetch the list of features from this region, one would issue a GET request on the following URL:
GET http://my.site/das/d-melanogaster/r3.1/2R/features
To address an individual feature named "exon00001", one refers to this URL:
GET http://my.site/das/d-melanogaster/r3.1/2R/features/exon00001
To add a new feature to the chromosome, one issues a PUT:
PUT http://my.site/das/d-melanogaster/r3.1/2R/features/exon00002
Updates and deletes are handled similarly.
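As a rough illustration of the pattern above, the following Python sketch dispatches on the request method alone and treats the URI path as an opaque key into an in-memory store. The FeatureStore class, its handle() method, the status codes chosen, and the feature coordinates are all assumptions made for this example; none of them is part of DAS.

```python
class FeatureStore:
    """Toy in-memory stand-in for a DAS-like server (hypothetical, not the DAS API)."""

    def __init__(self):
        # Maps a resource path such as
        # "/das/d-melanogaster/r3.1/2R/features/exon00001" to its representation.
        self.resources = {}

    def handle(self, method, path, body=None):
        """Return (status, payload) for one request, dispatching on the method."""
        if method == "GET":
            if path.endswith("/features"):
                # A collection URI lists the names of the resources beneath it.
                prefix = path + "/"
                names = sorted(p[len(prefix):] for p in self.resources
                               if p.startswith(prefix))
                return 200, names
            if path in self.resources:
                return 200, self.resources[path]
            return 404, None
        if method == "PUT":
            # PUT creates the resource or replaces it, so add and update coincide.
            created = path not in self.resources
            self.resources[path] = body
            return (201 if created else 200), body
        if method == "DELETE":
            if path not in self.resources:
                return 404, None
            del self.resources[path]
            return 204, None
        return 405, None  # method not supported by this sketch


store = FeatureStore()
base = "/das/d-melanogaster/r3.1/2R/features"
# Coordinates below are invented for the example.
store.handle("PUT", base + "/exon00001", {"type": "exon", "start": 100, "end": 250})
store.handle("PUT", base + "/exon00002", {"type": "exon", "start": 400, "end": 520})
print(store.handle("GET", base))
```

Because the dispatcher never inspects the payload, the same code would serve any kind of resource, which is the genericity that REST is meant to buy.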
REST is elegant because it allows very generic software to be written. For example, caching code does not have to know anything about the contents of the data it caches, and fetching code can simply hand off the data it receives to the appropriate helper application. However, it is unclear to me how REST can be used to handle transformative tasks. For example, for the task of transforming genes into GO_terms, should the task be represented as a method on the gene:
GET http://my.site/das/d-melanogaster/genes/notch/GO_terms
or as a hierarchy of tasks:
GET http://my.site/transformations/gene/gene2go?gene=notch
In one sense, everyone is using REST. In another sense, nobody is. The main exemplar of a fully RESTful Web service is WebDAV, an implementation of the Distributed Authoring and Versioning (DAV) protocol. Beyond WebDAV, there are precious few "pure REST" applications out there. There are many almost-REST services, but a variety of common practices, such as the use of cookies, interfere with REST by confusing the semantics of stateless information transfer operations.
DAS/1 is among these "almost REST" services. More or less accidentally, it follows some of the REST conventions but it does other things that are discouraged by REST, such as using POST to mean GET.
The security of REST messages is limited to whatever HTTP can provide, which ranges from horribly insecure cleartext passwords (Basic authentication) to a sophisticated session-based public key encryption system (SSL). Because each REST message is specified by a distinct combination of URL and request method, Web server-based access and authentication controls can be applied directly to REST messages, allowing fine-grained control over who accesses data and what manipulations they are allowed to perform on it.
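To make the point about fine-grained control concrete, here is a minimal Python sketch of an access check keyed on the request method and a URL prefix, roughly the kind of rule a Web server configuration expresses. The rule table, the role names, and the is_allowed() function are hypothetical and chosen only for illustration.

```python
# Each rule is (HTTP method, URL path prefix, roles allowed). Hypothetical policy:
# anyone may read, but only curators may change D. melanogaster data.
RULES = [
    ("GET", "/das/", {"anonymous", "curator"}),
    ("PUT", "/das/d-melanogaster/", {"curator"}),
    ("DELETE", "/das/d-melanogaster/", {"curator"}),
]


def is_allowed(role, method, path):
    """Allow the request if some rule matches the method, the path prefix and the role."""
    return any(method == rule_method and path.startswith(prefix) and role in roles
               for rule_method, prefix, roles in RULES)


path = "/das/d-melanogaster/r3.1/2R/features/exon00002"
print(is_allowed("anonymous", "GET", path))   # reading is open to everyone
print(is_allowed("anonymous", "PUT", path))   # writing is not
print(is_allowed("curator", "PUT", path))
```

The check needs no knowledge of the message body, only of the method and the URL.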
There is significant infrastructural support for REST services. It's the web!
CORBA is a remote method call (RMC) protocol that uses binary-encoded objects and an object request, lookup and serialization infrastructure called an ORB.
Another key feature of CORBA is its ability to support legacy applications in C++, Java or Perl. In theory, one can take existing library code written to support a local application, define a public interface to it in IDL, and then link a small CORBA application wrapper to it, thereby turning it into a network service. Client code written to access the local library can now operate on the remote service with no other source code changes. In practice, I have found this process less than transparent because the architecture of a network server has fundamental constraints that are different from those of a local application, and rarely does a converted application perform in a satisfactory way.
The Life Sciences committee of the Object Management Group (OMG) has been hard at work for several years developing IDLs for the life sciences. However, due to the rapid change of the field, the IDLs that are being ratified now have little relationship to the MOBY use cases, and therefore are not as valuable as one would hope.
To contrast CORBA to REST, let us consider casting DAS in CORBA terms. The first step would be to write an IDL with some of the following declarations:
// Forward declarations, so that the types below can refer to each other.
interface Segment;
interface Feature;
typedef sequence<Feature> FeatureArray;

interface DataSource {
    void setGenome(in string new_source);
    string getGenome();
    void setVersion(in string new_version);
    string getVersion();
    Segment getSegment(in string lsid);
};

interface Segment {
    DataSource getSource();
    string getReference();
    long getStart();    // IDL has no "integer" type; "long" is its 32-bit integer
    long getEnd();
    FeatureArray getFeatureSet();
};

interface Feature : Segment {
    string getType();
    FeatureArray getSubFeatures();
};
A CORBA DAS service would provide an API like the following:
data_source = MYCORBA.get_an_object_somehow('urn:lsid:biodas.org:provider/das');
data_source.setGenome('urn:lsid:www.taxonomy.org:taxa/dmelanogaster');
data_source.setVersion('r3.1');
segment = data_source.getSegment('urn:lsid:my.site:chromosomes/2R');
features = segment.getFeatureSet();
if (features[0].getType() == "transcript")
    exons = features[0].getSubFeatures();
The process of fetching a list of exon features becomes a set of method calls. Objects are identified by an arbitrary naming system that is unrelated to the Web's URI system. For fun, I've used the LSID system, but in fact any opaque identifier would do here. The get_an_object_somehow() call is a stand-in for a series of CORBA object directory lookup calls, which are outside the scope of this discussion.
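The call sequence above can be tried out without an ORB. The sketch below uses plain Python classes as stand-ins for the stubs an IDL compiler would generate; nothing here talks to the network. The addSegment() helper, the constructor signatures, and the coordinates are assumptions introduced only to populate the example.

```python
class Segment:
    """Local stand-in for the Segment interface."""

    def __init__(self, source, reference, start, end, features=()):
        self._source, self._reference = source, reference
        self._start, self._end = start, end
        self._features = list(features)

    def getSource(self): return self._source
    def getReference(self): return self._reference
    def getStart(self): return self._start
    def getEnd(self): return self._end
    def getFeatureSet(self): return list(self._features)


class Feature(Segment):
    """Local stand-in for the Feature interface, which extends Segment."""

    def __init__(self, source, reference, start, end, feature_type, subfeatures=()):
        super().__init__(source, reference, start, end)
        self._type = feature_type
        self._subfeatures = list(subfeatures)

    def getType(self): return self._type
    def getSubFeatures(self): return list(self._subfeatures)


class DataSource:
    """Local stand-in for the DataSource interface."""

    def __init__(self):
        self._genome, self._version, self._segments = "", "", {}

    def setGenome(self, new_source): self._genome = new_source
    def getGenome(self): return self._genome
    def setVersion(self, new_version): self._version = new_version
    def getVersion(self): return self._version
    def getSegment(self, lsid): return self._segments[lsid]

    def addSegment(self, lsid, segment):
        # Hypothetical helper, not in the IDL: lets the example load test data.
        self._segments[lsid] = segment


data_source = DataSource()
data_source.setGenome("urn:lsid:www.taxonomy.org:taxa/dmelanogaster")
data_source.setVersion("r3.1")
exon = Feature(data_source, "2R", 100, 250, "exon")
transcript = Feature(data_source, "2R", 100, 900, "transcript", [exon])
data_source.addSegment("urn:lsid:my.site:chromosomes/2R",
                       Segment(data_source, "2R", 1, 1000, [transcript]))

segment = data_source.getSegment("urn:lsid:my.site:chromosomes/2R")
features = segment.getFeatureSet()
exons = features[0].getSubFeatures() if features[0].getType() == "transcript" else []
print([e.getType() for e in exons])
```

In a real deployment each of these method calls would be a round trip through the ORB, which is worth keeping in mind when reading the performance remarks later in this report.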
Unlike SOAP, CORBA is strongly tied to the underlying transport layer at one side, and to the directory service at the other. It is also very much integrated with the syntax and feature set of IDL. This has usage implications. For example, the Internet Inter-ORB Protocol (IIOP) is the only sanctioned object exchange protocol for TCP/IP. IIOP currently provides synchronous message exchange only: after initiating a method call, a client process must wait until it is complete. Asynchronous communications are precluded, at least until a new version of CORBA becomes available. Similarly, the object discovery service is tightly bound to the CORBA package, and cannot easily be mixed and matched.
The CORBA IDL is a powerful and expressive interface language that includes many of the features of object-oriented languages. It was designed during the days when C++ reigned, and this heritage shows: it provides the basic C++ types, including characters, strings, integers, floats, unions, references and enums, as well as aggregations of these types, including structs and arrays. The pointer type is not available, for good reasons.
IDL supports class inheritance including multiple inheritance, but does not have a straightforward mapping to C++'s method scoping rules, such as protected methods. The multiple inheritance rules also forbid inheriting the same method name from two base classes, something that will prevent multiple inheritance from being much use in large collaborative projects where such collisions are frequent. IDL also has the concept of an object "attribute", something akin to an instance variable, but some texts recommend against using them.
CORBA supports both application- and system-defined exceptions. Application-level exceptions are explicitly declared in the IDL interface definition and are mapped onto whatever language-specific exception-handling mechanism CORBA is bound to. Because of the need to support non-object oriented languages like C, exception types cannot be inherited or extended.
Security is provided via a CORBA Security Service, which provides an API for object-level access control. The API is "technology neutral," which means that it is up to ORB implementors to find the best way to implement the API. The dominant solution seems to be to run CORBA on top of SSL/TLS, but I do not have the complete picture of how widespread such implementations are.
At one point, CORBA was going to be the saviour of bioinformatics. It was heavily promoted by the EBI and by a number of biotech/biopharm companies. It has found a niche in certain LAN applications, but has not achieved any significant use for public servers. I do not have a good sampling of opinions as to why it has failed, but Ewan Birney, an early and strong proponent of CORBA, now cites "performance problems" as a major factor.
CORBA never had the support of Microsoft, and no longer has the support of IBM or Sun.
Software support is good if you are on a Linux machine running the Gnome desktop environment. Gnome made the big leap five years ago and committed to a completely CORBAized architecture. Therefore the CORBA libraries, development tools, and other infrastructural elements are preinstalled on such machines. As far as I can tell from my personal experience, CORBA supports Gnome on the desktop well, but it has not provided the interoperability win or "killer app" that one might hope for.
Netscape and Mozilla both include a freeware ORB and IIOP, allowing components of those browsers to send and receive CORBA objects. The Mozilla ORB appears to be different from the one that comes with Gnome, at least insofar as it comes with a different IDL compiler and a slightly different IDL syntax.
The Java runtime up to J2SE v1.4 includes an ORB, and the standard Java library has a full set of CORBA bindings. However, Sun clearly intends to deprecate its CORBA support. The Java Web Services FAQ, marketing literature, and white papers are exclusively devoted to SOAP/XML, and references to CORBA are now buried in the technical documentation.
Microsoft has long been antagonistic to CORBA, and spent the 90s actively promoting DCOM (under a variety of names, including ActiveX) as an alternative framework for network services. Microsoft operating systems require the installation of third-party ORBs in order to participate in CORBA-based services.
SOAP initially stood for Simple Object Access Protocol. However it isn't particularly simple, and it has little to do with accessing objects, so recent reference works have tended to use the acronym on its own. In a nutshell, SOAP is a Remote Procedure Call (RPC) protocol which uses XML for its messages.
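To show what "RPC with XML messages" amounts to, the Python sketch below wraps a procedure call in a SOAP 1.1 envelope and unpacks it again, using only the standard library. The service namespace urn:example:das and the getSegment call are hypothetical; a real service would define them in its WSDL, and a real binding would also handle typed parameters and headers.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:das"                          # hypothetical service namespace


def build_request(method, params):
    """Serialise a procedure call as Envelope > Body > <method> > <param>..."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_request(xml_text):
    """Recover the method name and string parameters from a request envelope."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    call = body[0]                          # first child of Body is the call element
    method = call.tag.split("}", 1)[1]      # strip the "{namespace}" prefix
    params = {child.tag: child.text for child in call}
    return method, params


request = build_request("getSegment", {"lsid": "urn:lsid:my.site:chromosomes/2R"})
print(request)
print(parse_request(request))
```

The envelope is plain text, so it can be carried by any transport that can move a string.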
SOAP is positioned very much in the same niche as CORBA. In theory, one can take legacy applications, flip a compiler switch, and have them act as SOAP clients and servers. This is because each language provides bindings that map its fundamental data types and method call conventions into language-independent XML encodings.
I would repeat the DAS example here, but the interface definition would be very much larger and harder to understand in XML/WSDL format. The application-level code, however, would be similar, if not identical to the CORBA example.
In contrast to CORBA, which is tightly coupled to its transport layer, interface language, and directory service, SOAP takes a modular approach. The SOAP transport framework can run on top of stateless synchronous protocols such as HTTP, stateful synchronous protocols such as FTP, stateless asynchronous protocols such as Jabber, and delayed stateless asynchronous protocols such as SMTP. In theory services can be described using a variety of data definition languages, although in practice XSD and WSDL dominate. Resource discovery is outside of core SOAP, but is provided by the separate UDDI specification.
SOAP does not formally support inheritance, nor does it, to be honest, truly support an object-oriented API. It is up to the language binding to serialize and deserialize native objects in such a way as to simulate the exchange of objects and the invocation of method calls on them. XSD provides for inheritance, but it is an extremely data-centric type of inheritance that requires some oddities, such as restating the contents of the base class in the derived classes, that interfere with OO design. I do not understand how inheritance works in WSDL, and have been unable to determine from my readings whether a SOAP service that provides a derived object class can communicate correctly with a client that expects to receive and manipulate the superclass.
SOAP has a formal exception-handling mechanism that maps onto the exception system of the currently bound language. As in CORBA, the exceptions form a simple enumerated list without inheritance. However, the list of exceptions can be extended by application developers, and it does not seem impossible for a language binding to impose inheritance on the system.
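As one possible reading of how a binding might impose inheritance, the Python sketch below parses a SOAP 1.1 Fault and raises a subclass of a common exception chosen by the fault code. The class names, the code-to-class table, and the sample reply are assumptions made for this example; this is not how any particular SOAP library behaves.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


class SoapFault(Exception):
    """Base class for all faults; the hierarchy is imposed by this sketch, not by SOAP."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


class ClientFault(SoapFault):
    """The request itself was at fault."""


class ServerFault(SoapFault):
    """The server failed while processing a valid request."""


# Hypothetical mapping from the local part of <faultcode> to an exception class.
FAULT_CLASSES = {"Client": ClientFault, "Server": ServerFault}


def raise_for_fault(xml_text):
    """Raise the matching exception if the reply carries a Fault; otherwise do nothing."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    fault = body.find(f"{{{SOAP_ENV}}}Fault") if body is not None else None
    if fault is None:
        return
    code = fault.findtext("faultcode", default="")
    message = fault.findtext("faultstring", default="")
    local_code = code.split(":")[-1]        # "soap:Client" becomes "Client"
    raise FAULT_CLASSES.get(local_code, SoapFault)(code, message)


FAULT_REPLY = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <soap:Fault>
      <faultcode>soap:Client</faultcode>
      <faultstring>Unknown segment</faultstring>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>"""

try:
    raise_for_fault(FAULT_REPLY)
except SoapFault as exc:    # catching the base class also catches ClientFault
    print(type(exc).__name__, exc.message)
```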
SOAP security can be achieved by running SOAP sessions across SSL/TLS. This provides a very coarse-grained access control based on the identities of the server and client, and not the fine-grained access control needed to provide selective access to individual objects and method calls. A number of proposed extensions to SOAP add this fine-grained access control. The one that is furthest along is a straightforward application of the XML digital signature syntax to SOAP. Interestingly, one of SOAP's strongest selling points is that when used on top of HTTP it will go through firewalls, which typically pass port 80 traffic. Thus its ability to circumvent firewall security is a feature, not a bug.
I have tried SOAP in my own applications and find that it works fine for simple to moderately complex applications. Because of its transparency, programmers can be tricked into performing foolish operations. For example, in a local application it makes sense to create lots of large complex objects and then invoke method calls on them. In SOAP, every method call requires the object to be marshalled (serialized along with all its subobjects), transmitted across the wire, and unserialized by the server. The whole process is repeated on the way back, making the application slow. Just as is the case with CORBA, awareness of the strengths and weaknesses of network-based software must inform the design of services from the very beginning.
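The cost described above is easy to see in a toy model. The Python sketch below counts the bytes and round trips for a "chatty" design, which ships the whole object with every small call, against a coarse-grained design that asks one question per object. JSON stands in for SOAP marshalling, and the CountingChannel class, the method names, and the data are all invented for the example.

```python
import json


def marshal(obj):
    """Serialise an object with all its sub-objects (JSON as a stand-in for SOAP XML)."""
    return json.dumps(obj).encode("utf-8")


class CountingChannel:
    """Pretend network link that only records how much traffic it was asked to carry."""

    def __init__(self):
        self.bytes_sent = 0
        self.round_trips = 0

    def call(self, payload):
        self.bytes_sent += len(marshal(payload))
        self.round_trips += 1


# A segment holding 200 made-up features.
segment = {
    "reference": "2R",
    "features": [{"id": f"exon{i:05d}", "start": i * 100, "end": i * 100 + 50}
                 for i in range(200)],
}

# Chatty design: one remote call per feature, each marshalling the whole segment.
chatty = CountingChannel()
for feature in segment["features"]:
    chatty.call({"object": segment, "method": "getType", "feature": feature["id"]})

# Coarse-grained design: a single call that answers for every feature at once.
coarse = CountingChannel()
coarse.call({"object": segment, "method": "getAllTypes"})

print(chatty.round_trips, chatty.bytes_sent)
print(coarse.round_trips, coarse.bytes_sent)
```

The traffic in the chatty design grows with the product of object size and call count, which is why the interface granularity has to be settled early.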
In my hands, SOAP does not work well in applications that transfer extremely large amounts of data. For example, the genome-size data streams that DAS generates rapidly outgrow the DOM data structures of SOAP/1.1 and earlier libraries, which expect objects to fit in memory. SOAP 1.2 fixes this by allowing for incremental event-based (SAX) parsing of messages, but this weakens the procedure-call API by exposing the developer to the innards of object marshalling and unmarshalling.
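For readers unfamiliar with the event-based style, here is a minimal Python sketch using the standard xml.sax module. It tallies feature types from a stream without ever building the whole document in memory. The SEGMENT and FEATURE element names and their attributes are only loosely DAS-like and are assumptions of this example, not the DAS or SOAP wire format.

```python
import io
import xml.sax


class FeatureCounter(xml.sax.ContentHandler):
    """Tally FEATURE elements by type as the parser reports each start tag."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing is retained beyond the tally.
        if name == "FEATURE":
            kind = attrs.get("type", "unknown")
            self.counts[kind] = self.counts.get(kind, 0) + 1


def count_feature_types(stream):
    """Parse a file-like object incrementally and return {feature type: count}."""
    handler = FeatureCounter()
    xml.sax.parse(stream, handler)
    return handler.counts


SAMPLE = b"""<SEGMENT id="2R">
  <FEATURE id="transcript00001" type="transcript"/>
  <FEATURE id="exon00001" type="exon"/>
  <FEATURE id="exon00002" type="exon"/>
</SEGMENT>"""

print(count_feature_types(io.BytesIO(SAMPLE)))
```

Note how the application code now deals in tags and attributes rather than in objects and method calls, which is the loss of abstraction mentioned above.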
Many people are talking about SOAP, but toy examples and proofs of principle far outnumber production applications. This applies both to biological and non-biological domains. My greatest success with SOAP has been a database application that tracks the merges and splits in gene names. The operations in this application are lightweight and require very little data transfer, and a server written in Perl communicates very nicely with clients written in Java and C. However, the application remains a toy. In production I connect to the database over a socket using the database's SQL API. One issue is speed, but another is that I do not have confidence in the Perl SOAP library. Undoubtedly I would be more enthusiastic if I were using the Java binding, which is more mature and better supported.
SOAP is receiving strong developer support from IBM and Sun. The level of support is not even across languages. It is very good for Java and C#, pretty good for C++, good for Perl (although I don't trust the library to be bug-free), and poor for Python.
Microsoft's .NET architecture, as far as I can tell from the marketing literature and sometimes contradictory Internet commentary, is SOAP, WSDL and UDDI with a set of Microsoft APIs at the front end and a runtime intermediate language in the middle. Provided that Microsoft does not follow its traditional practice of embracing and modifying open standards, the future of SOAP and its associated technologies looks assured.
REST is a collection of software architecture patterns that is suitable for developing highly-tuned web-based services. However, it is very much a do-it-yourself proposition that is hard to compare directly to either CORBA or SOAP. If we were to consider adopting REST for use in MOBY, we would have to define the following:
A REST-based MOBY would be tied for the conceivable future to the HTTP protocol and to the HTTP security architecture. We would be unable to run MOBY on top of asynchronous or delayed protocols such as instant messaging or SMTP.
CORBA is a very complete package that provides everything from the transport layer to the resource discovery system. Details of the CORBA messaging system are well-hidden from the application developer. This simplifies the development process, but makes it harder to tune performance. The major downside to CORBA is that it has been abandoned as a technology by the vendors that once promoted it, and has been effectively pithed by Microsoft's SOAP-based .NET architecture. If we were to run MOBY on CORBA, we would be hitching our cart to a half-dead horse caught in quicksand. I do not recommend this course of action.
SOAP provides a messaging system that is mostly transparent to the applications developer. It does not have a tightly-coupled transport layer and resource discovery system, thereby providing the flexibility to mix and match these components. We could run MOBY services off an e-mail server; thereby reintroducing the sorely-missed batch BLAST and sequence retrieval services of my graduate school days.
Unlike CORBA, SOAP is well supported by the industry, and absent an underhanded move by Microsoft is likely to dominate web services over the next half decade. At the same time, it is a work in progress, and we can expect to reimplement MOBY a few times as SOAP evolves.
If we were to run MOBY on SOAP (as Mark's prototype does!) we would have to define the following:
An unresolved concern of mine is whether SOAP truly supports object-oriented interface design. I don't know if this will eventually turn into a requirement, but a nice feature for MOBY to have would be the ability to write clients that access and manipulate the base class of an object. If a simple client that understands the "SimpleSequence" class tries to retrieve data from a more sophisticated MOBY service that returns "SuperDuperSequence" objects, will the client be able to invoke methods on the derived class? I don't see this being enforced in any real way in the SOAP specification, and the textbooks and Internet sources are curiously silent on this topic.
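One way a language binding could make the simple client work is to decode only the fields of the base class it knows and ignore the rest. The Python sketch below illustrates that idea under loud assumptions: the field names, the decode_as() helper, and the dictionary standing in for a deserialised SOAP message are all hypothetical, and nothing in the SOAP specification is claimed to require this behaviour.

```python
from dataclasses import dataclass, fields


@dataclass
class SimpleSequence:
    """The base class that the simple client understands (field names are assumed)."""
    id: str
    residues: str

    def length(self):
        return len(self.residues)


def decode_as(cls, message):
    """Build cls from a decoded message, ignoring any fields cls does not declare."""
    known = {f.name for f in fields(cls)}
    missing = known - set(message)
    if missing:
        raise ValueError(f"message lacks required fields: {sorted(missing)}")
    return cls(**{name: value for name, value in message.items() if name in known})


# What a richer service might send for a "SuperDuperSequence" (invented content).
super_duper_message = {
    "id": "notch",
    "residues": "ATGCAATCG",
    "annotations": ["GO_term_placeholder"],
    "quality_scores": [40, 40, 38],
}

seq = decode_as(SimpleSequence, super_duper_message)
print(seq.id, seq.length())
```

This only covers reading data. Whether the client could invoke behaviour defined by the derived class on the server side remains the open question raised above.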
I think we can make a plausible argument for either SOAP or REST as the messaging layer for MOBY.