Lecture 06: Applications #3.2: HTTP

HTTP/1.0 Authentication

Most users of the Web will at some time have attempted to access a page and been presented with a dialog box asking for a user-ID and password.

An initial HTTP attempt to access a "password protected" Web page of this type (without providing suitable "authentication" information) will generate an HTTP error message together with a Web page which explains the nature of the error. Typically the response headers will contain:

HTTP/1.1 401 Authorization Required
Date: Wed, 17 Mar 2004 01:17:56 GMT
Server: Apache/1.2.6
WWW-Authenticate: Basic realm="ByPassword"
Last-Modified: Mon, 15 Mar 2004 00:43:51 GMT
....etc....

In HTTP/1.0, only the Basic authentication method was available, as used in this example.

Upon receiving this error, the Web browser will normally pop up a dialog box similar to the above, collect a user-ID and password from the user, and then retry the request with an additional "Authorization: " request header containing the additional information.
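As a rough sketch of the first half of this exchange, the following Python fragment inspects a response like the one above and decides whether a Basic challenge was issued. The function name `basic_realm_from_response` is made up for illustration, and the parsing is deliberately simplistic (it assumes a single challenge with only a realm parameter); real browsers handle many more cases.

```python
def basic_realm_from_response(response_head):
    """Return the realm if this is a 401 with a Basic challenge, else None.

    Hypothetical helper for illustration only; assumes one challenge
    of the form: Basic realm="..."
    """
    lines = response_head.strip().splitlines()
    status = int(lines[0].split()[1])          # "HTTP/1.1 401 ..." -> 401

    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:                                # skip lines that are not headers
            headers[name.strip().lower()] = value.strip()

    scheme, _, params = headers.get("www-authenticate", "").partition(" ")
    if status != 401 or scheme.lower() != "basic":
        return None
    key, _, realm = params.partition("=")
    return realm.strip('"') if key.strip().lower() == "realm" else None


# The response headers from the example above.
SAMPLE = """HTTP/1.1 401 Authorization Required
Date: Wed, 17 Mar 2004 01:17:56 GMT
Server: Apache/1.2.6
WWW-Authenticate: Basic realm="ByPassword"
Last-Modified: Mon, 15 Mar 2004 00:43:51 GMT
"""

realm = basic_realm_from_response(SAMPLE)
```

A client would typically show the realm string in its dialog box so that the user knows which set of credentials is being requested.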

The `Authorization` Request Header

The "Basic" form of authentication used in HTTP is slightly strange. It takes a user-ID string and a password string and concatenates them using a colon character as a separator. The resulting string is then encoded using the base64 scheme and included in a new request header.

Let's take as an example a page for which the username is "student" and the password is also "student" -- pretty typical :-). The concatenation is thus "student:student". We can use the Unix command-line base64 program mimencode to encode the data (it encodes to "c3R1ZGVudDpzdHVkZW50"), so that the request will look something like:

GET /subjects/int21cn/test/index.html HTTP/1.0
Authorization: Basic c3R1ZGVudDpzdHVkZW50
....etc....

This, of course, raises the obvious question -- why on earth do they do this? The obvious answer is "for security reasons" -- to deter casual network snoopers who might be observing traffic, watching for passing user-IDs and passwords. But base64 is an encoding, not encryption, and anyone who captures the header can reverse it. We are left wondering...
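To see how little protection this gives, here is a minimal Python sketch that builds the header for the "student"/"student" example and then recovers the credentials from it. The two function names are invented for this illustration, and the use of UTF-8 for the credentials is an assumption; only `base64.b64encode` and `base64.b64decode` are real library calls.

```python
import base64


def basic_authorization_header(user_id, password):
    """Build an 'Authorization: Basic ...' header line (illustrative helper)."""
    joined = f"{user_id}:{password}"                  # colon-separated pair
    token = base64.b64encode(joined.encode("utf-8"))  # assumption: UTF-8 bytes
    return "Authorization: Basic " + token.decode("ascii")


def decode_basic_credentials(header_line):
    """Recover (user_id, password) from such a header line."""
    token = header_line.split()[-1]                   # last word is the base64 text
    joined = base64.b64decode(token).decode("utf-8")
    user_id, _, password = joined.partition(":")      # split at the first colon
    return user_id, password


header = basic_authorization_header("student", "student")
recovered = decode_basic_credentials(header)
```

Decoding needs no key or secret of any kind, which is why Basic authentication is only as private as the connection it travels over.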

Cookies

Cookies are an extension to HTTP, originally developed at Netscape. In general, a server "sets" a cookie by sending an additional response header, for example:

HTTP/1.0 200 OK
Set-cookie: myname=myvalue
....etc...

A browser which is "cookie-enabled" will normally^[1] store this name/value pair, and future requests to the same server will contain an additional request header, thus:

GET /somefile.html HTTP/1.0
Cookie: myname=myvalue
....etc...

Cookies are extensively used in Web session management, which is discussed later in the unit.
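The round trip can be sketched in Python using the standard `http.cookies` module. The helper name `cookie_request_header` is hypothetical, and the sketch ignores everything a real browser would check (domain, path, expiry, user preferences); it only shows the name/value pair being echoed back.

```python
from http.cookies import SimpleCookie


def cookie_request_header(set_cookie_value):
    """Turn the value of a Set-cookie header into the Cookie header
    a browser might send back (illustrative helper, heavily simplified)."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)          # parses name=value plus any parameters
    pairs = [f"{name}={morsel.value}" for name, morsel in jar.items()]
    return "Cookie: " + "; ".join(pairs)


simple = cookie_request_header("myname=myvalue")
# Extra parameters affect how the cookie is stored, but are not sent back.
with_params = cookie_request_header("myname=myvalue; Path=/; Max-Age=3600")
```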

^[1] In fact, cookie operation is rather more complex than we discuss here -- for example, the "Set-cookie: " header can take several additional parameters (which affect how the cookie is interpreted), and the behaviour of browsers with respect to cookies can be changed by the end-user.

Digression: Forms in HTML

In HTML version 2, the idea of "forms" (and various related data structures) was introduced. These provided the basic technology for the recent explosion in "electronic storefronts" on the Web, as well as several other innovations.

A form in HTML is an area of a Web page which is used to gather input from a human user. The information which is gathered can then be returned to the page's owner using a SUBMIT action.

The form is, as expected, delimited by a <FORM> and </FORM> markup pair.

The <FORM> markup has two important attributes:

ACTION: specifies the action URL of this form. Typically this is the URL of an executable CGI program (see later).
METHOD: specifies the way in which the ACTION URL is accessed. There are two methods, GET and POST.

Example:

<FORM
ACTION="http://ironbark.bendigo.latrobe.edu.au/cgi-bin/myprog" METHOD="GET">

Form Elements

Data is collected in a form by the use of INPUT tags. Each INPUT tag has an associated TYPE attribute.

For example:

<INPUT TYPE="TEXT">

This INPUT type can take several further attributes, eg:

<INPUT TYPE="TEXT"  NAME="Name" MAXLENGTH="64" SIZE="20">

In a browser, this would be presented as a (scrollable) textbox, 20 characters wide (but able to accept 64 characters of input).

There are several other INPUT types:

TYPE="PASSWORD"
TYPE="CHECKBOX"
TYPE="RADIO"
TYPE="IMAGE"
TYPE="HIDDEN"
TYPE="SUBMIT"
TYPE="RESET"

Form Elements #2

There are two other markup tags used in forms:

SELECT: allows the user to select from an enumerated list of values. Each value is given by an OPTION markup tag, which can take a couple of extra attributes.
TEXTAREA: presents a multi-line text field into which the user can type information. It is specified as a number of ROWS and COLS and can have a NAME attribute and an initial value.
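To tie the pieces together, the following Python sketch walks over a small form and lists its ACTION, METHOD and named elements, using the standard `html.parser` module. The form itself, its field names, and the class name `FormFieldLister` are all invented for this illustration.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect the FORM attributes and the (name, kind) of each element."""

    def __init__(self):
        super().__init__()
        self.form = None
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        attrs = dict(attrs)
        if tag == "form":
            self.form = {
                "action": attrs.get("action"),
                "method": (attrs.get("method") or "GET").upper(),
            }
        elif tag == "input":
            kind = (attrs.get("type") or "text").lower()
            self.fields.append((attrs.get("name"), kind))
        elif tag in ("select", "textarea"):
            self.fields.append((attrs.get("name"), tag))


# A made-up form using the elements described above.
SAMPLE_FORM = """
<FORM ACTION="/cgi-bin/myprog" METHOD="POST">
<INPUT TYPE="TEXT" NAME="Name" MAXLENGTH="64" SIZE="20">
<INPUT TYPE="PASSWORD" NAME="Secret">
<SELECT NAME="Colour"><OPTION>Red<OPTION>Green</SELECT>
<TEXTAREA NAME="Comments" ROWS="4" COLS="40">Initial value</TEXTAREA>
<INPUT TYPE="SUBMIT" VALUE="Send">
</FORM>
"""

lister = FormFieldLister()
lister.feed(SAMPLE_FORM)
```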

URL Encoding

When form information is returned to the HTTP server, it is encoded into a format called (using MIME terminology):

application/x-www-form-urlencoded

...or simply "URL-encoded". In this format:

ASCII space characters (decimal 32) are (usually) replaced by the "+" character. This is a hangover from an older format and is normally, but not universally, used -- see next point.
Most (but not all) non-alphanumeric characters are encoded in hexadecimal format, thus: %HH, where the H characters are the two hexadecimal digits of the byte. Sometimes the space character is also sent in this format, as "%20", instead of as "+".
The fields of a form are encoded as name=value, with each name-value pair separated by the "&" (ampersand) character.
Fields with null values are (normally) not sent, nor are unselected CHECKBOXes and RADIO buttons.
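These rules can be tried out with Python's standard `urllib.parse` module, as in the short sketch below. The field names and values are arbitrary examples chosen to include a space, an ampersand and an equals sign.

```python
from urllib.parse import parse_qs, quote, urlencode

# Arbitrary example data.
fields = {"info1": "hello world", "info2": "a&b=c"}

# Default behaviour: spaces become "+", other special characters become %HH.
plus_style = urlencode(fields)

# Alternative: ask for spaces to be sent as %20 instead of "+".
percent_style = urlencode(fields, quote_via=quote)

# Decoding accepts either style and yields a list of values per name.
decoded_plus = parse_qs(plus_style)
decoded_percent = parse_qs(percent_style)
```

Note that the "&" and "=" inside the second value must be escaped, since otherwise they would be mistaken for the separators between fields and between names and values.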

More information

Some sites with good information on URL encoding include:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
http://www.freesoft.org/CIE/RFC/1738/4.htm

Submission Methods

The two ways in which form data can be returned to the server are METHOD=GET and METHOD=POST.

GET: This method is (according to the original specification) preferred if the submission of the form is not going to have a lasting effect on the global state of the universe -- that is, it does not have side effects. For example, it may query a database, returning the result as HTML. An HTTP GET request is issued to the ACTION URL specified in the <FORM> markup tag, with the urlencoded form information appended after a separating "?" character. This can generate very long URLs.
POST: This method was originally used where processing of the form was intended to have side effects, eg, updating the contents of a database. In this case, an HTTP POST transaction is performed. The "body" of the transaction contains the urlencoded form data, as a single long line of text. The POST transaction is directed at the URL specified in the ACTION attribute of the <FORM> tag.

In "real life", GET and POST methods are used pretty much interchangeably, depending on the programmer's or system designer's preference.
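The difference between the two methods is easiest to see in the requests themselves. The Python sketch below builds a bare-bones request of each kind for the same form data; the function names are invented here, and a real browser would send several more request headers than are shown.

```python
from urllib.parse import urlencode


def build_get_request(action_path, fields):
    """GET: the urlencoded data rides on the URL after a '?'."""
    return f"GET {action_path}?{urlencode(fields)} HTTP/1.0\r\n\r\n"


def build_post_request(action_path, fields):
    """POST: the urlencoded data is the body, after the blank line."""
    body = urlencode(fields)
    return (
        f"POST {action_path} HTTP/1.0\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
        f"{body}"
    )


fields = {"info1": "hello world", "info2": "42"}   # arbitrary example data
get_request = build_get_request("/subjects/int21cn/cgi/L06CGIa.cgi", fields)
post_request = build_post_request("/subjects/int21cn/cgi/L06CGIa.cgi", fields)
```

In both cases the encoded data is identical; only its position in the request changes.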

Form submission using `GET`

Here is a typical HTML form which you can use to enter some random data. When you click on the Submit button, you should pay close attention to two things:

Notice that the form data is appended to the URL, in URL-encoded form as described above.
The server's response (generated by a trivial CGI program on ironbark) shows the complete "QUERY_STRING" which was passed to it. Notice that it's exactly the same as the information which was "tacked onto" the URL after the ? character.

The HTML for our FORM looks like:

<FORM action="/subjects/int21cn/cgi/L06CGIa.cgi" method="GET">
info1: <INPUT type="text" name="info1" size="20"><br>
info2: <INPUT type="text" name="info2" size="20"><br>
<input type="submit" value="Submit">
<input type="reset" value="Clear Form">
</FORM>

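This is rendered in your Web browser as:

[Rendered form: text boxes labelled info1 and info2, followed by "Submit" and "Clear Form" buttons]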

Try it!

Form submission using `POST`

We revisit the same Form as the previous slide, except this time the submission method is changed to POST.

In this case, we're going to try something different -- the CGI program which is the target of this Form is going to show us the actual HTTP request as it was received^[2].

Again, try it.

^[2] Actually, it's a "reconstructed" version of the HTTP request: not all request headers are necessarily shown. But it's close enough for our purposes!

Common Gateway Interface (CGI)

CGI defines the (original) way in which form data was/is presented to an application program by the HTTP server. There are several newer standards than CGI, but it's still the "default" way of doing Web server-side programming. The examples on the previous slides use the CGI standard interface.

When a user clicks the SUBMIT button on a form, the HTTP server starts up the specified CGI program, and makes the form data available to it.

From a programming perspective, the difference between GET and POST is the way in which a CGI program receives the form data. If the method was GET, the information is usually obtained by examining the contents of an environment variable (usually called "QUERY_STRING") containing the URL-encoded form data. Other environment variables contain additional useful information.

If the method was POST, the CGI program usually receives the form data on its standard input stream, with any extra stuff obtained, as before, from environment variables.
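A CGI program's input handling might be sketched in Python as follows. The function name `read_form_data` is hypothetical. Besides QUERY_STRING, the sketch relies on two other environment variables defined by the CGI standard but not mentioned above: REQUEST_METHOD, and CONTENT_LENGTH (the size of the POST body). For simplicity it assumes the body is plain ASCII text read from a text stream.

```python
import io
import os
import sys
from urllib.parse import parse_qs


def read_form_data(environ=None, stdin=None):
    """Return the submitted form fields as a dict of name -> list of values."""
    environ = os.environ if environ is None else environ
    stdin = sys.stdin if stdin is None else stdin

    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # POST: the data arrives on standard input; read exactly as much
        # as the server says is there.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # GET: the data is whatever followed the '?' in the URL.
        raw = environ.get("QUERY_STRING", "")
    return parse_qs(raw)


# Simulated invocations, standing in for what the HTTP server would set up.
body = "info1=a+b&info2=c"
from_get = read_form_data(
    {"REQUEST_METHOD": "GET", "QUERY_STRING": body}, io.StringIO(""))
from_post = read_form_data(
    {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(body))},
    io.StringIO(body))
```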

CGI programs can, as a rule, be written in any language (compiled or interpreted) supported on the system running the HTTP server.

On Unix servers, they are commonly written in Perl, C or as Bourne shell (/bin/sh) scripts.

A CGI program (almost) always generates (to standard output) a Web page which is returned to the browser, in addition to any other effect.
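The output side can be sketched just as briefly. A CGI program writes at least a Content-Type header, then a blank line, then the page itself. The function name `render_cgi_response` and the layout of the page are invented for this illustration.

```python
import html


def render_cgi_response(fields):
    """Build the text a CGI program would write to standard output:
    a Content-Type header, a blank line, then an HTML page."""
    items = "\n".join(
        # Escape user-supplied text so it cannot inject markup into the page.
        f"<li>{html.escape(name)} = {html.escape(value)}</li>"
        for name, value in fields.items()
    )
    page = f"<html><body><ul>\n{items}\n</ul></body></html>\n"
    return "Content-Type: text/html\r\n\r\n" + page


response = render_cgi_response({"info1": "<b>bold?</b>", "info2": "plain"})
# A real CGI program would finish with: sys.stdout.write(response)
```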
