Lecture 05: Applications #3.1: HTML and HTTP Basics


The World Wide Web

Of all the Big Ideas in computer networking, the invention of the World Wide Web (also called the WWW, or just the Web) would have to be the biggest.

History:

1989
Original proposal from Tim Berners-Lee at CERN for a "Web" of linked documents. Prototype followed soon after.

December 1991
First public demonstration.

February 1993
Mosaic (first alpha version) released by NCSA. First fully operational, multiplatform version released in September. Awareness of WWW project growing.

February 1994
We (Department of IT) start running a Web server on the machine ironbark at Bendigo (the first regional institute in Australia to do so, and among the first 10 nationally!). Rah, Rah!

Early 1995
Netscape Communications releases Netscape Navigator 1.1. The rest is, as they say, history.


WWW Architecture

Four key components:

  1. Web browser software (eg IE, Mozilla, Camino, Netscape, Opera, Safari, iCab, OmniWeb, lynx, Amaya, Mosaic, or even (for the truly desperate) Emacs/W3 -- and this is by no means an exhaustive list!).

  2. Web server software. The most popular server program is apache -- this is what we run on ironbark and redgum. However, there are several other popular server packages, especially those from Microsoft.

  3. A collection of "hyperlinked" documents (or pages) written in HTML (the HyperText Markup Language), as well as a great number of other object types (eg, images, sounds, video clips, etc).

  4. The HyperText Transfer Protocol, HTTP. The browser uses HTTP to obtain Web documents, specified using a URL, from a server. For example, the "home page" of ironbark is:
    http://ironbark.bendigo.latrobe.edu.au/index.html
    
    This specifies the application protocol (HTTP) used to fetch the object, the domain name where it is located and the local filename of the object on that host (/index.html). The "magic" string :// doesn't mean anything in particular except to signify that it's a URL...
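
As a rough illustration of how a URL breaks down into the parts just described, here is a minimal sketch in Python (our choice of language, not something these notes prescribe) using the standard library function urllib.parse.urlsplit. The fallback to port 80 is the usual HTTP default and is an assumption of the sketch, since the URL itself names no port.

```python
from urllib.parse import urlsplit

# The example URL from the text above.
url = "http://ironbark.bendigo.latrobe.edu.au/index.html"

parts = urlsplit(url)

scheme = parts.scheme   # the application protocol, before the "://"
host = parts.hostname   # the domain name of the server
path = parts.path       # the local name of the object on that host

# No port appears in the URL, so parts.port is None here; we assume
# the default HTTP port, as a browser would.
port = parts.port or 80

print(scheme, host, port, path)
```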


Digression: HTML

Although it is not "core" knowledge in this unit, we really need to mention HTML.

HTML is a markup language -- documents are (in general) plain ASCII text files, with certain characters reserved to denote markup. Such languages have a long and venerable history in computing, starting with *roff, TeX, LaTeX and SGML, and subsequently XML.

If you're interested to see some very simple hand-crafted HTML, have a look at the document source for these lecture notes...


Hypertext Transfer Protocol (HTTP)

In Lecture #2, the World Wide Web was used to illustrate the idea of a layered communications architecture. In that lecture, the basic ideas of the original version (0.9, circa 1992) of HTTP were introduced.

To revise, in HTTP/0.9 the GET operation was used to obtain HTML "pages" from a server, eg: the "home page" of ironbark at URL http://ironbark.bendigo.latrobe.edu.au/index.html

We first establish a reliable (TCP) connection to the server process waiting at port 80 (HTTP) on ironbark.bendigo.latrobe.edu.au. We then send the single-line request (the first line below) and receive in response the HTML text (the lines that follow it):

GET /index.html
<HTML>
<HEAD>
<TITLE>The Department of Information Technology at La Trobe University, Bendigo</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF">

<!-- ******** Department Header ***************-->
<IMG SRC="/gifs/irbkname.short.gif"  align="right" ALT="La Trobe University, Bendigo">
<font size="+2">La Trobe University, Bendigo</font>

    ..........etc
HTTP 0.9 actually defined a few other operations besides GET. However, since HTTP/1.0 (RFC 1945) and HTTP/1.1 are now commonly used, we shall defer discussion of them.


HyperText Transfer Protocol, v1.0

The original (0.9) version of HTTP was not in use for very long, being quickly replaced by version 1.0. In its most basic form, a v1.0 GET request looks like:
GET /index.html HTTP/1.0<newline><newline>
The response from the server consists of a status line, then a number of plain text headers, followed by a blank line and then the requested data object. It's clearly a very similar format to an RFC822 email message:
GET /index.html HTTP/1.0

HTTP/1.0 200 OK
Server: Netscape-Enterprise/3.5.1C
Date: Sun, 16 Mar 2004 11:48:39 GMT
Content-type: text/html
Last-modified: Fri, 14 Mar 2004 02:22:52 GMT
Content-length: 11378

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>

    ........(etc)

A Tour of the HTTP/1.0 Response Headers

HTTP/1.0 200 OK
An ordinary plain text status line -- note the "200-series" status.

Server: Netscape-Enterprise/3.5.1C
Date: Sun, 16 Mar 2004 11:48:39 GMT
Last-modified: Fri, 14 Mar 2004 02:22:52 GMT
Various entertaining bits of information. The "Last-modified:" header is very useful, see the HTTP/1.0 "Conditional-GET" and "HEAD" request types.

Content-length: 11378
Content-type: text/html
These two headers follow (approximately) the MIME convention for identifying the type of data contained in the "body" of the response -- in this case, ASCII text which should be interpreted as HTML by the browser. Note that the MIME email header "Content-Transfer-Encoding:" (used in MIME-encoded email messages) is not normally used in HTTP because the protocol is designed to handle "8-bit" data. That is, any data at all can be sent after the blank line which signifies the end of the response headers.
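
To make the structure of a response concrete, the following is a minimal sketch, in Python, of splitting a raw HTTP/1.0 response into its status line, headers and body. The function name parse_response is hypothetical, and the sample response is a cut-down version of the one above: its body and Content-length have been shortened for illustration. The sketch assumes lines end with CRLF, which is what the protocol formally specifies.

```python
def parse_response(raw):
    """Split a raw HTTP/1.0 response (bytes) into its parts."""
    # The first blank line marks the end of the headers; everything
    # after it is the body, which may be arbitrary 8-bit data.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line: protocol version, numeric status code, reason phrase.
    version, code, reason = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        # Split on the first colon only: values such as dates contain colons.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()

    return version, int(code), reason, headers, body


# Illustrative response, with a shortened body (an assumption of this sketch).
raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Server: Netscape-Enterprise/3.5.1C\r\n"
    b"Date: Sun, 16 Mar 2004 11:48:39 GMT\r\n"
    b"Content-type: text/html\r\n"
    b"Content-length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)

version, code, reason, headers, body = parse_response(raw)
print(code, reason, headers["content-type"], len(body))
```

Note that the body is left as bytes: only the headers are decoded as text, which matches the point above that anything at all may follow the blank line.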


More on the GET Request

HTTP/1.0 permits the GET request (and other HTTP request types, see later) to additionally send a series of optional Request Headers along with the request. For example, here's a typical request to ironbark, snarfed from the local network (with some cosmetic editing):
GET /index.html HTTP/1.0
Accept: image/gif, image/jpeg, */*
Host: ironbark.bendigo.latrobe.edu.au
User-Agent: Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)
Referer: http://bindi.bendigo.latrobe.edu.au/index.html
The request headers are terminated with a blank line -- hence the need for two newlines, as seen in the first slide of today's lecture. It's also possible for the request to contain a "message body", just like a response message -- we defer discussion of this until later.
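
A minimal sketch of how such a request might be assembled in Python is given below. The helper name build_get_request is hypothetical, and the header values are simply those from the captured request above. The sketch ends each line with CRLF, which is the line terminator the protocol formally specifies.

```python
def build_get_request(path, headers=None):
    """Return the bytes of an HTTP/1.0 GET request for the given path."""
    lines = ["GET " + path + " HTTP/1.0"]
    for name, value in (headers or {}).items():
        lines.append(name + ": " + value)
    # Each line ends with CRLF, and one extra (blank) line terminates
    # the request headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# The most basic form: a request line followed directly by the blank line.
basic = build_get_request("/index.html")

# The same request with the optional headers from the example above.
full = build_get_request("/index.html", {
    "Accept": "image/gif, image/jpeg, */*",
    "Host": "ironbark.bendigo.latrobe.edu.au",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)",
    "Referer": "http://bindi.bendigo.latrobe.edu.au/index.html",
})

print(full.decode("ascii"))
```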


Conditional-GET

Perhaps the most interesting optional request header is "If-modified-since:", which takes an HTTP standard GMT time/date string as its value.

For example, earlier we saw an HTTP response with the following header line:

Last-modified: Fri, 14 Mar 2004 02:22:52 GMT
The browser can cache this object (keep a local copy in case it's requested again soon) and use the local copy instead of fetching it again over the network, which could cause unnecessary delays. To check that the cached copy is still current, the HTTP request would then look like:
GET /index.html HTTP/1.0
If-modified-since: Fri, 14 Mar 2004 02:22:52 GMT
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:0.9.4)
Host: ironbark.bendigo.latrobe.edu.au
....etc, as before
If the requested page has not, in fact, been modified since the specified time, it won't be returned -- instead, a "304 Not Modified" response is sent, without a response body -- just the headers. We return to the topic of caching in the next lecture.
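
The server's side of this decision can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name conditional_status is hypothetical, and the comparison shown is a plain "newer than" test on the two dates, not a description of what any particular server does. The dates are parsed with email.utils.parsedate_to_datetime from the standard library, which works here because HTTP dates use the same format as email dates.

```python
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return the status code a server might send for a GET request.

    last_modified     -- the object's modification time, as an HTTP date string
    if_modified_since -- value of the request header, or None if it is absent
    """
    if if_modified_since is None:
        # An ordinary GET: always return the object.
        return 200

    modified = parsedate_to_datetime(last_modified)
    since = parsedate_to_datetime(if_modified_since)

    # Only send the object again if it changed after the time the
    # browser quoted; otherwise answer "304 Not Modified".
    return 200 if modified > since else 304


stamp = "Fri, 14 Mar 2004 02:22:52 GMT"

print(conditional_status(stamp))                                    # no header
print(conditional_status(stamp, stamp))                             # cached copy is current
print(conditional_status("Mon, 15 Mar 2004 09:00:00 GMT", stamp))   # page has changed
```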


Other HTTP/1.0 Request Types

The HTTP 1.0 protocol is formally specified in terms of "methods," rather than simple commands. The available methods are:
GET
We've already seen this "request to read a generalised object". The object can be a Web "page" (HTML document), an image, a sound sample or a wide range of other types.
HEAD
A request to return the response header only, without the content. This can contain much useful information about the requested entity, without the need to actually load it -- eg, how big it is.
POST
Originally defined as a request to "append to a named resource" (eg, a Web page), this method is extensively used in CGI-based systems, see later.
PUT
Request to store an object (eg, Web page, image, etc). Has only ever been used experimentally.
DELETE
Delete the specified object. I'm unaware of this having ever been used, so we can ignore it.
LINK
Connect two existing resources. Likewise, never used.
UNLINK
Breaks an existing connection between two resources. Not used.
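
To see the difference between GET and HEAD in practice, here is a small self-contained Python sketch. Everything in it is an assumption made for illustration: it starts a throwaway server on the local machine using the standard http.server module, serves one hard-coded page (PAGE), and queries it with http.client. Note that http.client sends HTTP/1.1 requests, although the test server here replies with HTTP/1.0 responses.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>Hello</body></html>"  # hypothetical page content


class Handler(BaseHTTPRequestHandler):
    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_GET(self):
        # GET: headers followed by the object itself.
        self._send_headers()
        self.wfile.write(PAGE)

    def do_HEAD(self):
        # HEAD: exactly the same headers, but no body.
        self._send_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()


def fetch(method):
    """Send one request and return (status, Content-Length header, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    conn.request(method, "/index.html")
    resp = conn.getresponse()
    result = (resp.status, resp.getheader("Content-Length"), resp.read())
    conn.close()
    return result


get_result = fetch("GET")
head_result = fetch("HEAD")

server.shutdown()
server.server_close()

print(get_result)
print(head_result)
```

Both responses carry the same Content-Length header, so a client can learn how big the object is from the HEAD response alone, without transferring it.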



Copyright 2004 by Philip Scott, La Trobe University.