Protocols

Date:: March 13, 2003
Author:: Damian Gessler
Version:: 1.0

Overview

A protocol is a set of conventions or rules that allow the orderly and sensible exchange of information. When two parties or software agents agree on a protocol, they agree on an encoding and decoding scheme for the representation and interpretation of data. There are hundreds of communication protocols (see, for example, Javvin's map of communication protocols). A common organization of communication protocols is the OSI-RM (Open Systems Interconnection Reference Model), which was developed by the ISO in 1978. This multi-layer model creates a protocol stack:

Protocol Layer	Description	Example
7. Application	Interface to user processes	HTTP, FTP, SMTP
6. Presentation	Architecture-independent data encoding	PP
5. Session	Establishes connection between processes on different hosts; handles security and creation of the sessions.	NetBIOS/IP
4. Transport	Point to point connection between hosts	TCP[1], UDP
3. Network	Packet routing and multicast	IP
2. Data link	Low-level framing and error correction. MAC addresses are at this level.	IEEE protocols
1. Physical	Electrical and mechanical connections to the network	Ethernet, Token ring

Protocol Layers (Reconstructed from queries at: http://foldoc.doc.ic.ac.uk/foldoc/index.html)

For MOBY, we will be concerned exclusively with the Application layer; that is, a new protocol that uses HTTP, FTP, etc. to handle the exchange of data under a shared syntax and semantics. This may be accomplished either by extending an existing protocol (for example, using extension headers in HTTP) or by layering on top of an existing application layer (for example, using SOAP's binding to HTTP, SMTP, etc.). A benefit of the former approach is that HTTP is functionally rich yet simple; it is well documented and broadly adopted, which makes it relatively easy to implement a set of extension standards amongst compliant web servers. HTTP's clean use of headers and payloads, MIME-types, and its set of mature helper applications make it an excellent platform with which to rapidly develop a prototypical model or even a new protocol[2].

Yet a tight dependency on HTTP has its limitations: HTTP is a stateless, synchronous protocol (see below), and both properties can be constraining in a web service environment. For example, see IBM's proposal for a "reliable" HTTPR protocol to be layered on top of HTTP[3]. Just as SOAP[4] can be layered on top of HTTP, it can be layered on top of HTTPR, so the use of a separate messaging protocol like SOAP offers an additional layer of encapsulation. Messaging layers like SOAP that can layer on top of both synchronous and asynchronous protocols may offer elegant benefits, since asynchronous protocols are increasingly considered essential for web services.

Protocol Properties

The decision between extending a protocol and layering an additional one on top may be aided by "grading" protocols (including a new MOBY protocol) on how well they satisfy desirable properties. Ideally, protocols are scalable, efficient, simple, and extensible. Additionally, protocols may operate without the need for a maintained open connection (asynchronous), or they may maintain an interaction between sender and receiver (synchronous); they may send a series of self-contained requests (stateless), or they may set and use information across requests (stateful). Desirable properties may sometimes be antagonistic: for example, a compact binary encoding may increase a protocol's efficiency but reduce its simplicity; similarly, the construction of state information may increase per-request efficiency at the cost of connection overheads. The protocol designer should consult the use cases and technical requirements of the protocol to balance these conflicting properties.

Property	Description
Scalability	Properties are preserved upon massive scaling.
Efficiency	Bandwidth is conserved by compactly sending relevant, and only relevant, information.
Simplicity	Simple tasks are done simply, complicated tasks can be decomposed into simple tasks.
Extensibility	The protocol can adapt to address unforeseen changes.
Synchronicity	Asynchronous: does not wait for reply from receiver (e.g., ANTP[5], BEEP[6])[7]. Synchronous: waits (blocks) for a reply (e.g., HTTP, FTP).
State	Stateless: requests are self-contained (e.g., HTTP, IP, NFS, UDP). Stateful: connection or session information is used over the packet stream or across multiple requests within a session (e.g., FTP, SMTP, TCP).

Desirable Protocol Properties (Based in part on: http://mappa.mundi.net/features/mtr/properties.html)

HTTP Evaluated

Because of the ubiquity of HTTP as an Application layer for the web, the remainder of this document discusses the above six properties as they relate to HTTP and contrasting protocols. HTTP is a connection-based, client-server protocol; that means it is based on the model that one computer (a client) sends a request to another computer (the server), and the server then sends a response back to the client. The client-server model places distinctly different software demands on the client and the server, as is easily seen by noting that merely accessing the web (as a client) is not the same as publishing on the web (as a server). This basic design can be contrasted with peer-to-peer architectures, where all parties access and publish data. HTTP is synchronous: there is no way within the protocol itself to get a response without first sending a request. Both the request and the response may contain arbitrary data in the message body. This differs from single-direction protocols like SMTP (Simple Mail Transfer Protocol) for email, where the protocol does not support anything other than acknowledgements and status messages as a response.

There are only two types of messages in HTTP: requests and responses. Both are specified in a simple, human-readable format, with requests consisting of a start line specifying a method (such as GET or POST), a series of headers (specifying details about the request), and in some cases a body or payload for data. Responses are equally simple, with a status code in the start line, a series of headers, and also a body. A useful header is the content-type, which allows a binding of the data to specific applications, thus allowing clients a simple lookup mechanism for custom data handling.

	Request	Response
Start line	GET /resource/file.txt HTTP/1.1	HTTP/1.1 200 OK
Headers	Accept: text/*	Content-length: 14
		Content-type: text/plain
<blank line>
Body		Hello, world!

Based on Figure 1-7 of Gourley, D. and B. Totty 2002 HTTP: The Definitive Guide. O'Reilly & Associates. Sebastopol, CA. p. 10.
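The message layout in the table above can be sketched in a few lines of Python. This is only an illustrative sketch, not a conforming HTTP implementation: the helper names build_request and parse_message and the HANDLERS lookup table are hypothetical, and the body is assumed to end in a newline so that it matches the table's Content-length of 14.

```python
def build_request(method, path, headers, body=""):
    """Serialize a request: start line, headers, blank line, optional body."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_message(raw):
    """Split a raw message into (start_line, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")  # the blank line ends the headers
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return start_line, headers, body


# The request and response from the table above.
request = build_request("GET", "/resource/file.txt", {"Accept": "text/*"})
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-length: 14\r\n"
    "Content-type: text/plain\r\n"
    "\r\n"
    "Hello, world!\n"  # assumption: trailing newline accounts for the 14th byte
)

start_line, headers, body = parse_message(response)
version, status, reason = start_line.split(" ", 2)

# Hypothetical content-type lookup: bind each media type to a handler.
HANDLERS = {"text/plain": lambda text: text.rstrip("\n")}
handler = HANDLERS.get(headers["content-type"], lambda text: text)
handled = handler(body)
```

The content-type lookup at the end is the "simple lookup mechanism" mentioned above: a client needs only a table from media type to handler to dispatch custom data.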

The web and HTTP rely heavily on the concept of data as resources, and these resources are referenced by Uniform Resource Identifiers, or URIs. URIs may be of two types, either URLs (Uniform Resource Locators) or URNs (Uniform Resource Names). URLs specify a location via a scheme (or protocol, such as http:// or ftp://), a server, and a path within the server (e.g., http://www.server.org/path/file.htm). In dereferencing the URL, the server's name is mapped to a unique electronic address (IP address) by a network of DNS (Domain Name System) lookup servers. URNs are location-independent names for resources and thus are not tied to IP addresses (e.g., urn:nameSpaceID:nameSpaceString). This makes dereferencing the URN problematic (where is it?), and thus, in the absence of URN-URL mapping servers, URLs are used to the virtual exclusion of URNs. URIs are highly relevant to MOBY, since early thinking on MOBY has not exploited the application of MOBY objects as URI-identified resources.
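The difference between the two kinds of URI can be seen by splitting the two example identifiers above with Python's standard urllib.parse module. This is a small sketch; the describe helper and the "locatable" flag are hypothetical names used only for illustration.

```python
from urllib.parse import urlparse


def describe(uri):
    """Break a URI into its parts and note whether it names a server."""
    parts = urlparse(uri)
    return {
        "scheme": parts.scheme,
        "server": parts.netloc,  # empty for a URN: there is no host to look up
        "path": parts.path,
        "locatable": bool(parts.netloc),
    }


url_info = describe("http://www.server.org/path/file.htm")
urn_info = describe("urn:nameSpaceID:nameSpaceString")

# A URN's remainder is a namespace identifier plus a namespace-specific string.
namespace_id, _, namespace_string = urn_info["path"].partition(":")
```

The URL yields a server name that DNS can map to an IP address; the URN yields only a namespace and a name, which is why it cannot be dereferenced without some additional mapping service.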

Scalability

HTTP, as the de facto protocol of the web, is massively scalable, a clear requirement for a web architecture. While most protocols aim for scalability, this is not always immediately attainable. For example, packet forwarding in wireless protocols can cause severe performance degradation upon scaling[8].

Efficiency

While greater efficiency is always desirable, HTTP on top of TCP/IP is considered reasonably efficient given its breadth of application. As reported in Touch et al. 1996, HTTP was developed (in part) to address inefficiencies in FTP[9],[10]. And while there are performance issues with HTTP over TCP/IP, Touch et al.'s analysis did not support any particular pathology with HTTP for low bandwidth users. Additionally, increased performance was an important advancement of HTTP 1.1 over previous versions (www.w3.org/Protocols/HTTP/Performance/Pipeline.html). For one source of work in preventing the World Wide Web from becoming the World Wide Wait, see http://www.w3.org/Protocols/HTTP/Performance/Overview.html and http://www.w3.org/Protocols/NL-PerfNote.html. Of course, performance is not the same as efficiency, and efficiency often comes at the price of readability and generality: the latter two are noted characteristics of HTTP.

Simplicity

HTTP is a reasonably powerful and mature protocol, yet it is also relatively simple in implementation and operation for the programmer. There are only eight methods (HEAD, GET, POST, PUT, DELETE, TRACE, CONNECT, and OPTIONS) in the HTTP 1.1 specification, and a simple, invariant model of request/response, each consisting of a human-readable format of start line, headers, and body. Contrast this with CORBA, which is complex in both design and implementation.

Extensibility

HTTP can be extended in both its methods and headers. New methods, such as LOCK, UNLOCK, COPY, and MOVE, are used by WebDAV (Web Distributed Authoring and Versioning) in the area of web publishing and collaboration. Method extensibility is not dynamic; i.e., there is no specification in the protocol for generic web servers to "understand" a new method by invoking some process. Method extensibility thus relies on the adoption of new compliant servers[11]. Similarly, servers may introduce new headers. Currently, HTTP extensibility is ad hoc; the W3C is examining ways to develop an extension framework (for example, by using SOAP or URI-prefixed extensions; see www.w3.org/Protocols/HTTP/ietf-http-ext and ftp://ftp.isi.edu/in-notes/rfc2774.txt).
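The dependence of method extensibility on compliant servers can be sketched with Python's standard http.server and http.client modules. This is a toy under stated assumptions: the do_LOCK handler below does not implement WebDAV's actual LOCK semantics, the X-Example-Hint header is a hypothetical extension header, and the server runs only on the local loopback interface.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class ExtensionHandler(BaseHTTPRequestHandler):
    """Toy server that understands one extension method and one extension header."""

    def do_LOCK(self):
        # http.server dispatches on the method name, so defining do_LOCK
        # is what makes this particular server "compliant" with LOCK.
        hint = self.headers.get("X-Example-Hint", "")  # hypothetical header
        body = f"locked {self.path} hint={hint}".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def send(port, method, path, headers=None):
    """Send one request and return (status, body)."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, path, headers=headers or {})
    resp = conn.getresponse()
    result = (resp.status, resp.read().decode())
    conn.close()
    return result


server = HTTPServer(("127.0.0.1", 0), ExtensionHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

lock_result = send(server.server_port, "LOCK", "/doc.txt", {"X-Example-Hint": "demo"})
move_result = send(server.server_port, "MOVE", "/doc.txt")  # no do_MOVE is defined

server.shutdown()
server.server_close()
```

The LOCK request succeeds because this server was written to handle it, while MOVE is refused with a 501 (unsupported method) status: the client can send any method name it likes, but nothing in the protocol lets a generic server learn a new one.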

Synchronicity

While the web is asynchronous, HTTP is a synchronous protocol. It is synchronous because a single TCP/IP connection is kept open to receive a reply from the server. If that connection is broken, then the server cannot notify the client with a response as part of the HTTP protocol itself. Since most HTTP connections have a default timeout of 300 seconds, this means that HTTP is not well suited for batch processing. For example, submitters of a BLAST job may get notification by email that their job is complete: it cannot happen as a response to the initiating HTTP request unless the connection is kept open. Email is an asynchronous application: a one-way messaging service where the receiver need not be on-line at the time of transmission (though SMTP as a protocol is synchronous, since sender and receiver partake in a stateful transaction). Because many services have asynchronous demands, a synchronous protocol is considered non-optimal for web services.
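One common way to fit a long-running job into a synchronous request/response protocol is to return a ticket immediately and let the client ask for the result later. The sketch below simulates that pattern in memory, without any network; the JobService class, its submit and status methods, and the slow_job stand-in are all hypothetical names, and this is not a description of how MOBY or BLAST servers actually work.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


def slow_job(query):
    """Stand-in for a long-running batch job (hypothetical)."""
    time.sleep(0.2)
    return f"results for {query}"


class JobService:
    """Each method models one short, self-contained request/response exchange."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}

    def submit(self, query):
        # Reply at once with a ticket instead of holding the connection open.
        ticket = uuid.uuid4().hex
        self._jobs[ticket] = self._pool.submit(slow_job, query)
        return ticket

    def status(self, ticket):
        future = self._jobs[ticket]
        if not future.done():
            return ("pending", None)
        return ("done", future.result())

    def close(self):
        self._pool.shutdown(wait=True)


service = JobService()
ticket = service.submit("ACGT")

# The client polls with separate short requests until the job has finished.
while True:
    state, result = service.status(ticket)
    if state == "done":
        break
    time.sleep(0.05)

service.close()
```

No single exchange here outlives the job, so a connection timeout never comes into play; the cost is that the client must poll, or be notified through some other channel such as email.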

State

HTTP is stateless, meaning that the response to one request is not dependent on the content of previous requests. Contrast this to FTP, which keeps the state of the "current working directory" across get and put calls. The stateless nature of HTTP adds robustness across network interruptions, since new requests do not have to re-establish state information. Yet the stateless nature of HTTP complicates cross-session management (for example, multiple, discontinuous visits to a web site) and requires the ad hoc use of cookies and "fat URLs"[12] to mimic stateful connections. There are benefits to both stateful and stateless properties in a distributed environment, though statelessness is necessary as well as desirable on the web. The stateless property of HTTP is particularly suited to MOBY: just as multiple transactions within a web page are independent of each other, there is a natural extension of this to MOBY's emphasis on atomically tagged data vs. HTML documents as a basic unit of manipulation.
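The contrast between FTP's remembered working directory and HTTP's self-contained requests can be sketched as follows. Everything here is illustrative: the FILES table, the StatefulSession class, and the stateless_get function are hypothetical and model only the idea of state, not either protocol's actual commands.

```python
import posixpath

# Hypothetical server-side store, keyed by absolute path.
FILES = {
    "/pub/data/genes.txt": "gene list",
    "/pub/readme.txt": "read me",
}


class StatefulSession:
    """FTP-like: the server remembers a working directory between calls."""

    def __init__(self):
        self.cwd = "/"

    def cd(self, directory):
        self.cwd = posixpath.normpath(posixpath.join(self.cwd, directory))

    def get(self, name):
        # The meaning of this call depends on every earlier cd().
        return FILES[posixpath.join(self.cwd, name)]


def stateless_get(path):
    """HTTP-like: the request carries everything needed to answer it."""
    return FILES[path]


session = StatefulSession()
session.cd("pub")
session.cd("data")
stateful_result = session.get("genes.txt")

stateless_result = stateless_get("/pub/data/genes.txt")
```

If the stateful session is interrupted, both cd calls must be replayed before get can succeed again, whereas the stateless request can simply be resent. That is the robustness described above, bought at the price of repeating the full path in every request.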

MOBY will essentially encompass three protocols: one for syntactical messaging, one for semantic interpretation, and one for data/service discovery and mapping. Even if these properties are embedded in a single API instead of being delineated as separate "protocols," MOBY will have to specify the conventions or rules for the parsing of data, its meaning, and its mechanisms for service discovery. The design and construction of solutions to these tasks may be aided by assessing technologies against the properties of scalability, efficiency, simplicity, extensibility, synchronicity, and state.

[1] TCP (Transmission Control Protocol) is a reliable, byte-stream protocol: data is bundled into packets with a checksum and each packet is sequentially numbered. The byte-stream is guaranteed to be able to be reassembled in order, and the protocol allows the receiver to send requests back to the sender to resend corrupt packets. In this sense, TCP is reliable, since higher layered protocols can send and forget over TCP with respect to data transmission integrity. UDP (User Datagram Protocol) is a datagram (vs. a byte-stream) protocol: it is unreliable and connectionless, meaning that it does not employ receiver acknowledgements (though datagrams do have checksums), nor does it guarantee a sequential flow of datagrams across larger data streams. UDP is "datagram centric," and is thus appropriate where small amounts of data are being sent in low-overhead conditions. Because UDP does not handle data loss or corruption as part of the protocol per se, integrity is the responsibility of higher layers. See http://www.tcm.hut.fi/Studies/Tik-110.350/1997/Essays/udp.html and www.novell.com/documentation/lg/nw6p/index.html?page=/documentation/lg/nw6p/tcpipenu/data/h1a308vx.html.

[2] Roy Fielding, a noted web guru, proposes the deployment of a new web protocol via HTTP's Upgrade header field in "waka: A replacement for HTTP" available at www.apache.org/~fielding/waka/200211_fielding_apachecon.ppt.

[3] See www-106.ibm.com/developerworks/library/ws-phtt. "Reliability" is used with respect to failure recovery; e.g., multiple POSTs under HTTP may have side-effects such as duplicate orders in a shopping cart or replicate updates in a database, so failed connections while POSTing should not be naively resent. HTTPR addresses these types of issues in both synchronous and asynchronous settings.

[4] SOAP, Simple Object Access Protocol, is an XML-based messaging protocol for data exchange (www.w3.org/TR/SOAP). SOAP is an Application layer messaging protocol: the emphasis in SOAP is how messages are constructed (vs., for example, the Transport layer protocol TCP where the emphasis is in how data is delivered).

[5] Asynchronous Notification Transport Protocol; see http://simp.mitre.org/drafts/antp.html.

[6] Blocks Extensible Exchange Protocol; see http://www.beepcore.org/beepcore/home.jsp, http://www.clipcode.com/peer/beep_technical_whitepaper.htm, and www.ietf.org/rfc/rfc3080.txt.

[7] For interesting reading on how to use HTTP asynchronously, see Technical Whitepaper by Clipcode.com at www.clipcode.com/peer/http_async_notif.htm.

[8] Wu, J. and F. Dai 2003 A Generic Broadcast Protocol in Ad Hoc Networks Based on Self-Pruning. Accepted by the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), Apr. 2003, Nice, France. Available at www.cse.fau.edu/~fdai/pub/wu_dai_sp_ipdps_2003.pdf.

[9] See studies cited in Touch, J., J. Heidemann, and K. Obraczka 1996 Analysis of HTTP Performance. USC/Information Sciences Institute. Available at www.isi.edu/lsam/publications/http-perf.

[10] See also Litjens, R., M. Siler, and M. D. Spiller 1995 FTP versus HTTP: A Comparison of Two Mainstream Transfer Protocols. EE228A Fall 1995. Available at www-cad.eecs.berkeley.edu/~mds/research/1995/http.html.

[11] Though see some of the earlier work on PEP - an Extension Mechanism for HTTP http://www.w3.org/TR/WD-http-pep.html.

[12] URLs created dynamically with user-identifying information in the local resource component.