Lecture 05: Applications #3.1: HTML and HTTP Basics

The World Wide Web

Of all the Big Ideas in computer networking, the invention of the World Wide Web (also called the WWW, or just the Web) would have to be the biggest.

History:

1989: original proposal from Tim Berners-Lee at CERN for a "Web" of linked documents. Prototype followed soon after.
December 1991: First public demonstration.
February 1993: Mosaic (first alpha version) released by NCSA. First fully operational, multiplatform version released in September. Awareness of WWW project growing.
February 1994: We (Department of IT) start running a Web server on machine ironbark at Bendigo (first regional institute in Australia to do so, and in the first 10 nationally!) Rah, Rah!
Early 1995: Netscape Communications releases Netscape Navigator 1.1. The rest is, as they say, history.

WWW Architecture

Four key components:

Web browser software (eg IE, Mozilla, Camino, Netscape, Opera, Safari, iCab, OmniWeb, lynx, Amaya, Mosaic, or even (for the truly desperate) Emacs/W3 -- and this is by no means an exhaustive list!).
Web server software. The most popular server program is Apache -- this is what we run on ironbark and redgum. However, there are several other popular server packages, especially those from Microsoft.
A collection of "hyperlinked" documents (or pages) written in HTML (the HyperText Markup Language), as well as a great number of other object types (eg, images, sounds, video clips, etc).
The HyperText Transfer Protocol, HTTP. The browser uses HTTP to obtain Web documents, specified using a URL, from a server. For example, the "home page" of ironbark is:
```
http://ironbark.bendigo.latrobe.edu.au/index.html
```
This specifies the application protocol (HTTP) used to fetch the object, the domain name where it is located and the local filename of the object on that host (/index.html). The "magic" string :// doesn't mean anything in particular except to signify that it's a URL...
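
To see how a URL breaks into these parts, here is a minimal sketch using Python's standard urllib.parse module (Python is our choice for illustration; it is not part of the original notes). The URL does not name a port, so the usual HTTP default of 80 is filled in by hand.

```python
from urllib.parse import urlsplit

url = "http://ironbark.bendigo.latrobe.edu.au/index.html"
parts = urlsplit(url)

scheme = parts.scheme      # application protocol used to fetch the object
host = parts.hostname      # domain name where the object is located
path = parts.path          # local name of the object on that host
port = parts.port or 80    # no explicit port in the URL, so assume HTTP's default

print(scheme, host, port, path)
```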

Digression: HTML

Although it is not "core" knowledge in this unit, we really need to mention HTML.

HTML is a markup language -- documents are (in general) plain ASCII text files, with certain characters reserved to denote markup. Such languages have a long and venerable history in computing, starting with *roff, TeX and LaTeX, then SGML and subsequently XML.

The structure (or, to a somewhat lesser extent, the displayed appearance) of an HTML document (or Web page) is described using embedded formatting codes (or tags) intermingled with the information in the document.
In HTML, the markup tags are delimited by the special characters "<" and ">" -- the "less than" and "greater than" characters, often (rather clumsily IMHO) called "angle brackets". If either of these characters must appear as part of the actual data, they are written as &lt; and &gt; respectively.
HTML introduced a uniform, and revolutionary, way of specifying hyperlinks in a document, using the <A HREF="...some URL...">link text</A> structure.
Modern HTML standards have evolved to support incredibly complex document layouts (using the <TABLE> markup, style sheets, client-side scripting, etc), seamlessly mingling text and graphics into what has become an entirely new form of media.
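
The two ideas above that matter most to software, escaping the reserved characters and finding hyperlinks, can be sketched with Python's standard html and html.parser modules. The names make_link, LinkCollector and extract_links are made up for this example, and real pages need far more care than this.

```python
from html import escape
from html.parser import HTMLParser

def make_link(url, text):
    """Build an <A HREF=...> element, escaping the reserved characters."""
    return f'<A HREF="{escape(url, quote=True)}">{escape(text)}</A>'

class LinkCollector(HTMLParser):
    """Collect the HREF value of every <A> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links

link = make_link("http://ironbark.bendigo.latrobe.edu.au/index.html", "if a < b")
print(link)
print(extract_links(link))
```

Note how the "<" in the link text comes out as &lt;, so a browser will not mistake it for the start of a tag.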

If you're interested to see some very simple hand-crafted HTML, have a look at the document source for these lecture notes...

Hypertext Transfer Protocol (HTTP)

In Lecture #2, the World Wide Web was used to illustrate the idea of a layered communications architecture. In that lecture, the basic ideas of the original version (0.9, circa 1992) of HTTP were introduced.

To revise, in HTTP/0.9 the GET operation was used to obtain HTML "pages" from a server, eg: the "home page" of ironbark at URL http://ironbark.bendigo.latrobe.edu.au/index.html

We first establish a reliable (TCP) connection to the server process waiting at port 80 (HTTP) on ironbark.bendigo.latrobe.edu.au. We then send the single-line request (the first line below) and receive in response the HTML text (everything after it):

```
GET /index.html
<HTML>
<HEAD>
<TITLE>The Department of Information Technology at La Trobe University, Bendigo</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF">

<!-- ******** Department Header ***************-->
<IMG SRC="/gifs/irbkname.short.gif"  align="right" ALT="La Trobe University, Bendigo">
<font size="+2">La Trobe University, Bendigo</font>

    ..........etc
```
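
The exchange above can be sketched with a plain TCP socket. This is an illustration only: the function names http09_request and http09_fetch are made up, and a present-day server may well refuse an HTTP/0.9 request. Because an HTTP/0.9 response has no headers and no length, the client simply reads until the server closes the connection.

```python
import socket

def http09_request(path):
    """Build the single-line HTTP/0.9 request for `path`."""
    return f"GET {path}\r\n".encode("ascii")

def http09_fetch(host, path, port=80, timeout=10.0):
    """Send an HTTP/0.9 request and return every byte the server sends back."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(http09_request(path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # empty read: the server has closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

A call such as http09_fetch("ironbark.bendigo.latrobe.edu.au", "/index.html") would perform the request shown above, assuming that host is reachable and still speaks HTTP/0.9.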

HTTP/0.9 actually defined a few other operations besides GET. However, since HTTP/1.0 (RFC 1945) and HTTP/1.1 are now commonly used, we shall defer discussion of these other operations until we look at those versions.

HyperText Transfer Protocol, v1.0

The original (0.9) version of HTTP was not in use for very long, being quickly replaced by version 1.0. In its most basic form, a v1.0 GET request looks like:

```
GET /index.html HTTP/1.0<newline><newline>
```

The response from the server consists of a status line, then a number of plain text headers, followed by a blank line and then the requested data object. It's clearly a very similar format to an RFC 822 email message:

```
GET /index.html HTTP/1.0

HTTP/1.0 200 OK
Server: Netscape-Enterprise/3.5.1C
Date: Sun, 16 Mar 2004 11:48:39 GMT
Content-type: text/html
Last-modified: Fri, 14 Mar 2004 02:22:52 GMT
Content-length: 11378

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>

    ........(etc)
```
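
Taking such a response apart follows directly from its layout: status line, headers, blank line, data. Here is a minimal sketch; parse_response is a made-up name, the sample body is cut short, and a real client would have to cope with many more oddities than this does.

```python
def parse_response(raw):
    """Split a raw HTTP/1.0 response into (version, status, reason, headers, body)."""
    # Headers end at the first blank line; tolerate bare newlines as well as CR LF.
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        head, sep, body = raw.partition(b"\n\n")
    lines = head.decode("iso-8859-1").splitlines()
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()   # header names ignore case
    return version, int(status), reason, headers, body

sample = (
    b"HTTP/1.0 200 OK\r\n"
    b"Server: Netscape-Enterprise/3.5.1C\r\n"
    b"Content-type: text/html\r\n"
    b"Content-length: 11378\r\n"
    b"\r\n"
    b"<html>\n<head>\n"          # body truncated for the example
)
version, status, reason, headers, body = parse_response(sample)
print(version, status, reason, headers["content-type"])
```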

A Tour of the HTTP/1.0 Response Headers

HTTP/1.0 200 OK: An ordinary plain text status line -- note the "200-series" status.
Server: Netscape-Enterprise/3.5.1C, Date: Sun, 16 Mar 2004 11:48:39 GMT and Last-modified: Fri, 14 Mar 2004 02:22:52 GMT: Various entertaining bits of information. The "Last-modified:" header is very useful; see the HTTP/1.0 "Conditional-GET" and "HEAD" request types.
Content-length: 11378 and Content-type: text/html: These two headers follow (approximately) the MIME convention for describing the data contained in the "body" of the response -- in this case, 11378 bytes of ASCII text which should be interpreted as HTML by the browser. Note that the MIME email header "Content-Transfer-Encoding:" (used in MIME-encoded email messages) is not normally used in HTTP because the protocol is designed to handle "8-bit" data. That is, any data at all can be sent after the blank line which signifies the end of the response headers.
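
As a rough illustration of what a client can do with these headers, the sketch below turns the values from the example response into usable Python objects. The function name interpret and the idea of computing the page's "age" are ours, not part of HTTP; the date parsing comes from the standard email.utils module, since HTTP borrows its date format from email.

```python
from email.utils import parsedate_to_datetime

# Header values copied from the example response, keyed by lower-cased name.
headers = {
    "server": "Netscape-Enterprise/3.5.1C",
    "date": "Sun, 16 Mar 2004 11:48:39 GMT",
    "content-type": "text/html",
    "last-modified": "Fri, 14 Mar 2004 02:22:52 GMT",
    "content-length": "11378",
}

def interpret(headers):
    """Convert the textual header values into typed Python values."""
    # Anything after a ";" in Content-type is a parameter, not part of the type.
    media_type = headers["content-type"].split(";")[0].strip().lower()
    body_length = int(headers["content-length"])         # bytes after the blank line
    sent = parsedate_to_datetime(headers["date"])
    modified = parsedate_to_datetime(headers["last-modified"])
    return {
        "media_type": media_type,
        "body_length": body_length,
        "age_when_sent": sent - modified,                 # time since the last change
    }

info = interpret(headers)
print(info)
```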

More on the `GET` Request

HTTP/1.0 permits the GET request (and other HTTP request types, see later) to additionally send a series of optional Request Headers along with the request. For example, here's a typical request to ironbark, snarfed from the local network (with some cosmetic editing):

```
GET /index.html HTTP/1.0
Accept: image/gif, image/jpeg, */*
Host: ironbark.bendigo.latrobe.edu.au
User-Agent: Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)
Referer: http://bindi.bendigo.latrobe.edu.au/index.html
```

The request headers are terminated with a blank line -- hence the need for two newlines, as seen in the first slide of today's lecture. It's also possible for the request to contain a "message body", just like a response message -- we defer discussion of this until later.
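
Assembling such a request is just string-building. In this sketch build_request is a made-up helper and the User-Agent value is invented; lines are ended with CR LF, which is what the HTTP specification asks for, and the final empty line marks the end of the headers.

```python
def build_request(path, headers=None, method="GET", version="HTTP/1.0"):
    """Return the bytes of a request: request line, headers, then a blank line."""
    lines = [f"{method} {path} {version}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # One CR LF ends the last line, the second is the blank line.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

request = build_request(
    "/index.html",
    {
        "Accept": "image/gif, image/jpeg, */*",
        "Host": "ironbark.bendigo.latrobe.edu.au",
        "User-Agent": "toy-client/0.1",      # hypothetical client name
    },
)
print(request.decode("ascii"))
```

With no headers at all, the same function produces the bare request line followed by the blank line.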

Conditional-GET

Perhaps the most interesting optional request header is "If-modified-since:", which takes an HTTP standard GMT time/date string as its value.

For example, earlier we saw an HTTP response with the following header line:

```
Last-modified: Fri, 14 Mar 2004 02:22:52 GMT
```

The browser can cache this object (keep a local copy in case it's requested again soon) and use the local copy rather than fetching the whole object across the network again, which could cause unnecessary delays. To check that its copy is still current, the browser sends a request like:

```
GET /index.html HTTP/1.0
If-modified-since: Fri, 14 Mar 2004 02:22:52 GMT
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:0.9.4)
Host: ironbark.bendigo.latrobe.edu.au
....etc, as before
```
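
On the server side, answering such a request comes down to comparing two dates. The sketch below is one plausible way to make that decision, not a description of any particular server: conditional_status is a made-up name, it returns 200 for "send the object as usual" and 304 for "your copy is still good", and it assumes both dates are given in GMT.

```python
from email.utils import parsedate_to_datetime

def conditional_status(last_modified, if_modified_since=None):
    """Pick the status code for a GET that may carry If-modified-since."""
    if if_modified_since is None:
        return 200                       # ordinary GET: always send the object
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                       # unreadable date: behave like a plain GET
    modified = parsedate_to_datetime(last_modified)
    # Unchanged since the client's copy was made: no need to send it again.
    return 304 if modified <= since else 200

stamp = "Fri, 14 Mar 2004 02:22:52 GMT"
print(conditional_status(stamp, stamp))
print(conditional_status(stamp, "Thu, 13 Mar 2004 00:00:00 GMT"))
```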

If the requested page has not, in fact, been modified since the specified time, it won't be returned -- instead, a "304 Not Modified" response is sent, without a response body -- just the headers. We return to the topic of caching in the next lecture.

Other HTTP/1.0 Request Types

The HTTP 1.0 protocol is formally specified in terms of "methods," rather than simple commands. The available methods are:

GET: We've already seen this "request to read a generalised object". The object can be a Web "page" (HTML document), an image, a sound sample or a wide range of other types.
HEAD: A request to return the response header only, without the content. This can contain much useful information about the requested entity, without the need to actually load it -- eg, how big it is.
POST: Originally defined as a request to "append to a named resource" (eg, a Web page), this method is extensively used in CGI-based systems, see later.
PUT: Request to store an object (eg, Web page, image, etc). Has only ever been used experimentally.
DELETE: Delete the specified object. I'm unaware of this having ever been used, so we can ignore it.
LINK: Connect two existing resources. Likewise, never used.
UNLINK: Breaks an existing connection between two resources. Not used.
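
The difference between GET and HEAD is easy to see with a toy server running on the local machine. This sketch uses Python's standard http.server and http.client modules; ToyHandler, fetch, demo and the page content are all made up for the example. One caveat: the toy server answers with HTTP/1.0, as in these notes, but http.client labels its own requests HTTP/1.1.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<HTML><BODY>Hello from a toy server</BODY></HTML>\n"   # made-up content

class ToyHandler(BaseHTTPRequestHandler):
    """Serve the same page for every path; replies use HTTP/1.0 by default."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(PAGE)          # headers, then the object itself

    def do_HEAD(self):
        self._send_headers()            # same headers, but no object

    def log_message(self, format, *args):
        pass                            # keep the demo quiet

def fetch(method, port):
    """Issue one request and return (status, Content-Length header, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, "/index.html")
        response = conn.getresponse()
        return response.status, response.getheader("Content-Length"), response.read()
    finally:
        conn.close()

def demo():
    """Start the toy server on a free local port and try GET, then HEAD."""
    server = HTTPServer(("127.0.0.1", 0), ToyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return fetch("GET", port), fetch("HEAD", port)
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns the two results side by side: both carry the same status and Content-Length, but only the GET result has a non-empty body.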
