Content Headers (Practical mod

16.2. Content Headers

The following sections describe the HTTP headers that specify the type and length of the content, and the version of the content being sent. Note that in this section we often use the term message. This term is used to describe the data that comprises the HTTP headers along with their associated content; the content is the actual page, image, file, etc.

16.2.1. Content-Type Header

Most CGI programmers are familiar with Content-Type. Sections 3.7, 7.2.1, and 14.17 of the HTTP specification cover the details. mod_perl has a content_type( ) method to deal with this header:

$r->content_type("image/png");

Content-Type should be included in every set of headers, according to the standard, and Apache will generate one if your code doesn't. It will be whatever is specified in the relevant DefaultType configuration directive, or text/plain if none is active.

16.2.2. Content-Length Header

According to section 14.13 of the HTTP specification, the Content-Length header is the number of octets (8-bit bytes) in the body of a message. If the length can be determined prior to sending, it can be very useful to include it. The most important reason is that KeepAlive requests (when the same connection is used to fetch more than one object from the web server) work only with responses that contain a Content-Length header. In mod_perl we can write:

$r->header_out('Content-Length', $length);

When using Apache::File, the additional set_content_length( ) method, which is slightly more efficient than the above, becomes available to the Apache class. In this case we can write:

$r->set_content_length($length);

The Content-Length header can have a significant impact on caches by invalidating cache entries, as the following extract from the specification explains:

The response to a HEAD request MAY be cacheable in the sense that the information 
contained in the response MAY be used to update a previously cached entity from that 
resource.  If the new field values indicate that the cached entity differs from the 
current entity (as would be indicated by a change in Content-Length, Content-MD5, 
ETag or Last-Modified), then the cache MUST treat the cache entry as stale.

It is important not to send an erroneous Content-Length header in a response to either a GET or a HEAD request.

16.2.3. Entity Tags

An entity tag (ETag) is a validator that can be used instead of, or in addition to, the Last-Modified header; it is a quoted string that can be used to identify different versions of a particular resource. An entity tag can be added to the response headers like this:

$r->header_out("ETag","\"$VERSION\"");

mod_perl offers the $r->set_etag( ) method if we have use( ) ed Apache::File. However, we strongly recommend that you don't use the set_etag( ) method! set_etag( ) is meant to be used in conjunction with a static request for a file on disk that has been stat( ) ed in the course of the current request. It is inappropriate and dangerous to use it for dynamic content.

By sending an entity tag we are promising the recipient that we will not send the same ETag for the same resource again unless the content is "equal" to what we are sending now.

The pros and cons of using entity tags are discussed in section 13.3 of the HTTP specification. For mod_perl programmers, that discussion can be summed up as follows.

There are strong and weak validators. Strong validators change whenever a single bit changes in the response; i.e., when anything changes, even if the meaning is unchanged. Weak validators change only when the meaning of the response changes. Strong validators are needed for caches to allow for sub-range requests. Weak validators allow more efficient caching of equivalent objects. Algorithms such as MD5 or SHA are good strong validators, but what is usually required when we want to take advantage of caching is a good weak validator.

HTTP Range Requests

It is possible in web clients to interrupt the connection before the data transfer has finished. As a result, the client may have partial documents or images loaded into its memory. If the page is reentered later, it is useful to be able to request the server to return just the missing portion of the document, instead of retransferring the entire file.

There are also a number of web applications that benefit from being able to request the server to give a byte range of a document. As an example, a PDF viewer would need to be able to access individual pages by byte range—the table that defines those ranges is located at the end of the PDF file.

In practice, most of the data on the Web is represented as a byte stream and can be addressed with a byte range to retrieve a desired portion of it.

For such an exchange to happen, the server needs to let the client know that it can support byte ranges, which it does by sending the Accept-Ranges header:
Accept-Ranges: bytes
The server will send this header only for documents for which it will be able to satisfy the byte-range request—e.g., for PDF documents or images that are only partially cached and can be partially reloaded if the user interrupts the page load.

The client requests a byte range using the Range header:
Range: bytes=0-500,5000-
Because of the architecture of the byte-range request and response, the client is not limited to attempting to use byte ranges only when this header is present. If a server does not support the Range header, it will simply ignore it and send the entire document as a response.

A Last-Modified time, when used as a validator in a request, can be strong or weak, depending on a couple of rules described in section 13.3.3 of the HTTP standard. This is mostly relevant for range requests, as this quote from section 14.27 explains:

If the client has no entity tag for an entity, but does have a Last-Modified date, it 
MAY use that date in an If-Range header.

But it is not limited to range requests. As section 13.3.1 states, the value of the Last-Modified header can also be used as a cache validator.

The fact that a Last-Modified date may be used as a strong validator can be pretty disturbing if we are in fact changing our output slightly without changing its semantics. To prevent this kind of misunderstanding between us and the cache servers in the response chain, we can send a weak validator in an ETag header. This is possible because the specification states:

If a client wishes to perform a sub-range retrieval on a value for which it has only 
a Last-Modified time and no opaque validator, it MAY do this only if the Last-
Modified time is strong in the sense described here.

In other words, by sending an ETag that is marked as weak, we prevent the cache server from using the Last-Modified header as a strong validator.

An ETag value is marked as a weak validator by prepending the string W/ to the quoted string; otherwise, it is strong. In Perl this would mean something like this:

$r->header_out('ETag',"W/\"$VERSION\"");

Consider carefully which string is chosen to act as a validator. We are on our own with this decision:

... only the service author knows the semantics of a resource well enough to select 
an appropriate cache validation mechanism, and the specification of any validator 
comparison function more complex than byte-equality would open up a can of worms.  
Thus, comparisons of any other headers (except Last-Modified, for compatibility with 
HTTP/1.0) are never used for purposes of validating a cache entry.

If we are composing a message from multiple components, it may be necessary to combine some kind of version information for all these components into a single string.

If we are producing relatively large documents, or content that does not change frequently, then a strong entity tag will probably be preferred, since this will give caches a chance to transfer the document in chunks.