CGI Programming Guide - [Chapter 2] 2.2 Using Environment Variables

2.2 Using Environment Variables

Much of the most crucial information needed by CGI applications is made available via UNIX environment variables. Programs can access this information as they would any environment variable (e.g., via the %ENV associative array in Perl).

This section concentrates on showing examples of some of the more typical uses of environment variables in CGI programs. First, however, Table 2.1 shows a full list of environment variables available for CGI.

Table 2.1: List of CGI Environment Variables
Environment Variable	Description
GATEWAY_INTERFACE	The revision of the Common Gateway Interface that the server uses.
SERVER_NAME	The server's hostname or IP address.
SERVER_SOFTWARE	The name and version of the server software that is answering the client request.
SERVER_PROTOCOL	The name and revision of the information protocol the request came in with.
SERVER_PORT	The port number of the host on which the server is running.
REQUEST_METHOD	The method with which the information request was issued.
PATH_INFO	Extra path information passed to a CGI program.
PATH_TRANSLATED	The translated version of the path given by the variable PATH_INFO.
SCRIPT_NAME	The virtual path (e.g., /cgi-bin/program.pl) of the script being executed.
DOCUMENT_ROOT	The directory from which Web documents are served.
QUERY_STRING	The query information passed to the program. It is appended to the URL with a "?".
REMOTE_HOST	The remote hostname of the user making the request.
REMOTE_ADDR	The remote IP address of the user making the request.
AUTH_TYPE	The authentication method used to validate a user.
REMOTE_USER	The authenticated name of the user.
REMOTE_IDENT	The user making the request. This variable will only be set if the NCSA IdentityCheck flag is enabled and the client machine supports the RFC 931 identification scheme (ident daemon).
CONTENT_TYPE	The MIME type of the query data, such as "text/html".
CONTENT_LENGTH	The length of the data (in bytes or the number of characters) passed to the CGI program through standard input.
HTTP_FROM	The email address of the user making the request. Most browsers do not support this variable.
HTTP_ACCEPT	A list of the MIME types that the client can accept.
HTTP_USER_AGENT	The browser the client is using to issue the request.
HTTP_REFERER	The URL of the document that the client was viewing before it requested the CGI program.

We'll use examples to demonstrate how these variables are typically used within a CGI program.

About This Server

Let's start with a simple program that displays various information about the server, such as the CGI and HTTP revisions used and the name of the server software.

#!/usr/local/bin/perl
print "Content-type: text/html", "\n\n";
print "<HTML>", "\n";
print "<HEAD><TITLE>About this Server</TITLE></HEAD>", "\n";
print "<BODY><H1>About this Server</H1>", "\n";
print "<HR><PRE>", "\n";
print "Server Name:      ", $ENV{'SERVER_NAME'}, "\n";
print "Running on Port:  ", $ENV{'SERVER_PORT'}, "\n";
print "Server Software:  ", $ENV{'SERVER_SOFTWARE'}, "\n";
print "Server Protocol:  ", $ENV{'SERVER_PROTOCOL'}, "\n";
print "CGI Revision:     ", $ENV{'GATEWAY_INTERFACE'}, "\n";
print "<HR></PRE>", "\n";
print "</BODY></HTML>", "\n";
exit (0);

Let's go through this program step by step. The first line is very important. It instructs the server to use the Perl interpreter located in the /usr/local/bin directory to execute the CGI program. Without this line, the server won't know how to run the program, and will display an error stating that it cannot execute the program.

Once the CGI script is running, the first thing it needs to generate is a valid HTTP header, ending with a blank line. The header generally contains a content type, also known as a MIME type. In this case, the content type of the data that follows is text/html.
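The header rule can be captured in a small helper. This is only a minimal sketch: the subroutine name cgi_header is our own invention, not part of Perl or of any server. It shows that the second newline is what produces the blank line ending the header.

```perl
use strict;
use warnings;

# Hypothetical helper: build a minimal CGI header for the given MIME type.
# The second "\n" produces the blank line that ends the header.
sub cgi_header {
    my ($content_type) = @_;
    return "Content-type: $content_type\n\n";
}

# The header must be printed before any of the document itself.
print cgi_header("text/html");
print "<HTML><BODY>Hello</BODY></HTML>", "\n";
```

Run from the command line, this prints the Content-type line, an empty line, and then the HTML.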

After the MIME content type is output, we can go ahead and display output in HTML. We send the information directly to standard output, which is read and processed by the server, and then sent to the client for display. Five environment variables are output, consisting of the server name (the IP name or address of the machine where the server is running), the port the server is running on, the server software, and the HTTP and CGI revisions. In Perl, you can access the environment variables through the %ENV associative array, keyed by name.

A typical output of this program might look like this:

<HTML>
<HEAD><TITLE>About this Server</TITLE></HEAD>
<BODY><H1>About this Server</H1>
<HR><PRE>
Server Name:      bu.edu
Running on Port:  80
Server Software:  NCSA/1.4.2
Server Protocol:  HTTP/1.0
CGI Revision:     CGI/1.1
<HR></PRE>
</BODY></HTML>
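If you run a program like this by hand, outside the server, most of these variables will not be set and %ENV returns an undefined value for them. Here is a small sketch of one way to cope with that. The helper name env_or and the "(not set)" placeholder are our own choices, not part of CGI.

```perl
use strict;
use warnings;

# Hypothetical helper: return the named environment variable, or a
# placeholder when it is not set (for example, when run by hand).
sub env_or {
    my ($name, $default) = @_;
    return defined $ENV{$name} ? $ENV{$name} : $default;
}

# Print the same five variables as the program above, one per line.
foreach my $name ('SERVER_NAME', 'SERVER_PORT', 'SERVER_SOFTWARE',
                  'SERVER_PROTOCOL', 'GATEWAY_INTERFACE') {
    printf "%-18s %s\n", "$name:", env_or($name, "(not set)");
}
```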

Check the Client Browser

Now, let's look at a slightly more complicated example. One of the more useful items that the server passes to the CGI program is the client (or browser) name. We can put this information to good use by checking the browser type, and then displaying either a text or graphic document.

Different Web browsers support different HTML tags and different types of information. If your CGI program generates an inline image, you need to keep in mind that some browsers support <IMG> extensions that others don't, that some browsers support JPEG images as well as GIF images, and that some browsers (notably, Lynx and the old www client) don't support images at all. Using the HTTP_USER_AGENT environment variable, you can determine which browser is being used, and with that information you can fine-tune your CGI program to generate output that is optimized for that browser.

Let's build a short program that delivers a different document depending on whether the browser supports graphics. First, identify the browsers that you know don't support graphics. Then get the name of the browser from the HTTP_USER_AGENT variable:

#!/usr/local/bin/perl
$nongraphic_browsers = 'Lynx|CERN-LineMode';
$client_browser  = $ENV{'HTTP_USER_AGENT'};

The variable $nongraphic_browsers contains a list of the browsers that don't support graphics. Each browser is separated by the "|" character, which represents alternation in the regular expression we use later in the program. In this instance, there are only two browsers listed, Lynx and www. ("CERN-LineMode" is the string the www browser uses to identify itself.)

The HTTP_USER_AGENT environment variable contains the name of the browser. All environment variables that start with HTTP represent information that is sent by the client. The server adds the prefix and sends this data with the other information to the CGI program.
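To make the naming rule concrete, here is a sketch of the mapping as the CGI specification describes it: the header name is converted to uppercase, hyphens become underscores, and HTTP_ is prepended. The subroutine name header_to_env is hypothetical. The server does this work for you, so a CGI program never needs such a routine.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the CGI variable name for a request header.
sub header_to_env {
    my ($header) = @_;
    my $name = uc $header;    # convert the header name to uppercase
    $name =~ tr/-/_/;         # hyphens become underscores
    return "HTTP_$name";      # the server adds the HTTP_ prefix
}

foreach my $header ('User-Agent', 'Referer', 'Accept', 'From') {
    print $header, " => ", header_to_env($header), "\n";
}
```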

Now identify the files that you intend to return depending on whether the browser supports graphics:

$graphic_document = "full_graphics.html";
$text_document = "text_only.html";

The variables $graphic_document and $text_document contain the names of the two documents that we will use.

The next thing to do is simply to check if the browser name is included in the list of non-graphic browsers.

if ($client_browser =~ /$nongraphic_browsers/) {
    $html_document = $text_document;
} else {
    $html_document = $graphic_document;
}

The conditional checks whether the client browser is one that we know does not support graphics. If it is, the variable $html_document will contain the name of the text-only version of the HTML file. Otherwise, it will contain the name of the version of the HTML document that contains graphics.
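You can try the same test away from the server by wrapping it in a subroutine and feeding it a few browser names. This is only a sketch: the subroutine name document_for is ours, and the sample User-Agent strings are illustrative rather than taken from real logs.

```perl
use strict;
use warnings;

# Hypothetical helper: choose a document name from a User-Agent string.
sub document_for {
    my ($client_browser) = @_;
    my $nongraphic_browsers = 'Lynx|CERN-LineMode';
    $client_browser = '' unless defined $client_browser;   # header may be absent
    return ($client_browser =~ /$nongraphic_browsers/)
        ? "text_only.html"
        : "full_graphics.html";
}

# Illustrative User-Agent strings (assumed, not from real logs).
foreach my $agent ('Lynx/2.8', 'CERN-LineMode/2.15', 'Mozilla/1.1N') {
    print $agent, " => ", document_for($agent), "\n";
}
```

Note that a browser that sends no User-Agent at all is treated as a graphic browser here. Whether that is the right default depends on your audience.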

Finally, print the partial header and open the file. (We need to get the document root from the DOCUMENT_ROOT variable and prepend it to the filename, so the Perl program can locate the document in the file system.)

print "Content-type: text/html", "\n\n";
$document_root = $ENV{'DOCUMENT_ROOT'};
$html_document = join ("/", $document_root, $html_document);
if (open (HTML, "<" . $html_document)) {
    while (<HTML>) {
        print;
    }
    close (HTML);
} else {
    print "Oops! There is a problem with the configuration on this system!", "\n";
    print "Please inform the Webmaster of the problem. Thanks!", "\n";
}
exit (0);

If the filename stored in $html_document can be opened for reading (as specified by the "<" character), the while loop iterates through the file and displays it. The open command creates a handle, HTML, which is then used to access the file. During the while loop, as Perl reads a line from the HTML file handle, it places that line in its default variable $_. The print statement without any arguments displays the value stored in $_. After the entire file is displayed, it is closed. If the file cannot be opened, an error message is output.
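On newer versions of Perl (5.6 and later, which is an assumption about your installation), the same read loop is often written with the three-argument form of open and a lexical file handle. The sketch below uses a hypothetical subroutine, slurp_document, and a temporary file as a stand-in for the HTML document so that it runs without a Web server.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return a file's contents, or undef if it
# cannot be opened for reading.
sub slurp_document {
    my ($path) = @_;
    open(my $fh, '<', $path) or return;   # three-argument open, read mode
    my $contents = '';
    while (my $line = <$fh>) {            # read one line at a time
        $contents .= $line;
    }
    close($fh);
    return $contents;
}

# Stand-in document so the sketch runs without a Web server.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print $tmp_fh "<HTML><BODY>Text only</BODY></HTML>\n";
close($tmp_fh);

my $document = slurp_document($tmp_name);
if (defined $document) {
    print $document;
} else {
    print "Oops! There is a problem with the configuration on this system!", "\n";
}
```

Keeping the mode ("<") separate from the filename means that unusual characters in the filename cannot be mistaken for part of the mode.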

Restricting Access for Specified Domains

Suppose you have a set of HTML documents: one for users in your IP domain (e.g., bu.edu), and another one for users outside of your domain. Why would anyone want to do this, you may ask? Say you have a document containing internal company phone numbers, meeting schedules, and other company information. You certainly don't want everyone on the Internet to see this document. So you need to set up some type of security to keep your documents away from prying eyes.

You can configure most servers to restrict access to your documents according to what domain the user connects from. For example, under the NCSA server, you can list the domains that are allowed or denied access to certain directories by editing the access.conf configuration file. However, you can also control domain-based access in a CGI script. The advantage of using a CGI script is that you don't have to turn away other domains, just send them different documents. Let's look at a CGI program that performs pseudo authentication:

#!/usr/local/bin/perl
$host_address = 'bu\.edu';
$ip_address = '128\.197';

These two variables hold the IP domain name and address that are considered local. In other words, users in this domain can access the internal information. The period is "escaped" in both of these variables (by placing a "\" before the character), because the variables will be interpolated in a regular expression later in this program. The "." character has a special significance in a regular expression; it is used to match any character other than a newline.
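To see why the escaping matters, compare an unescaped pattern with an escaped one. In this sketch, Perl's built-in quotemeta function adds the backslash for us. The subroutine ends_with_pattern and the host name cs.buxedu are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: does $host end with the given pattern?
sub ends_with_pattern {
    my ($host, $pattern) = @_;
    return ($host =~ /$pattern$/) ? 1 : 0;
}

my $unescaped = 'bu.edu';
my $escaped   = quotemeta($unescaped);   # yields bu\.edu

# cs.buxedu is a made-up name that should NOT count as local.
foreach my $host ('cs.bu.edu', 'cs.buxedu') {
    printf "%-10s unescaped: %d  escaped: %d\n", $host,
        ends_with_pattern($host, $unescaped),
        ends_with_pattern($host, $escaped);
}
```

With the unescaped pattern, the "." happily matches the "x" in cs.buxedu. The escaped pattern accepts only a literal period.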

$remote_address = $ENV{'REMOTE_ADDR'};
$remote_host = $ENV{'REMOTE_HOST'};

The environment variable REMOTE_ADDR holds the numeric IP address of the remote user, while REMOTE_HOST holds the remote host's name. There are times when REMOTE_HOST will contain only the address, not the name (for example, if the DNS server does not have an entry for the host). In such a case, you can use the following snippet of code to convert an IP address to its corresponding name:

@subnet_numbers = split (/\./, $remote_address);        # the four decimal fields
$packed_address = pack ("C4", @subnet_numbers);         # packed into four bytes
($remote_host)  = gethostbyaddr ($packed_address, 2);   # 2 is the usual value of AF_INET

Don't worry about this code yet. We will discuss functions like these in Chapter 9, Gateways, Databases, and Search/Index Utilities. Now, let's continue with the rest of this program.

$local_users = "internal_info.html";
$outside_users = "general.html";
if (($remote_host =~ /\.$host_address$/) && ($remote_address =~ /^$ip_address/)) {
    $html_document = $local_users;
} else {
    $html_document = $outside_users;
}

The remote host is examined to see if it ends with the domain name, as specified by the $host_address variable, and the remote address is checked to make sure it starts with the domain address stored in $ip_address. Depending on the outcome of the conditional, the $html_document variable is set accordingly.
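Here is the same conditional wrapped in a subroutine so that it can be tried on a few host and address pairs. The name is_local is our own, and the sample hosts and addresses are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the two checks described above.
sub is_local {
    my ($remote_host, $remote_address) = @_;
    my $host_address = 'bu\.edu';
    my $ip_address   = '128\.197';
    $remote_host    = '' unless defined $remote_host;
    $remote_address = '' unless defined $remote_address;
    return (($remote_host =~ /\.$host_address$/) &&
            ($remote_address =~ /^$ip_address/)) ? 1 : 0;
}

# Invented host/address pairs.
my @visitors = (
    ['cs.bu.edu',          '128.197.10.5'],
    ['bu.edu.example.com', '128.197.10.5'],
    ['cs.bu.edu',          '10.0.0.1'],
    ['bu.edu',             '128.197.1.1'],
);
foreach my $visitor (@visitors) {
    my ($host, $address) = @$visitor;
    printf "%-20s %-14s %s\n", $host, $address,
        is_local($host, $address) ? "local" : "outside";
}
```

One thing this makes visible: because the pattern requires a period before the domain name, a machine named simply bu.edu is treated as an outsider. If that is not what you want, the pattern needs to allow for it.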

print "Content-type: text/html", "\n\n";
$document_root = $ENV{'DOCUMENT_ROOT'};
$html_document = join ("/", $document_root, $html_document);
if (open (HTML, "<" . $html_document)) {
    while (<HTML>) {
        print;
    }
    close (HTML);
} else {
    print "Oops! There is a problem with the configuration on this system!", "\n";
    print "Please inform the Webmaster of the problem. Thanks!", "\n";
}
exit (0);

The specified document is opened and the information stored within it is displayed.
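As an aside, the address-to-name conversion shown earlier in this section can also be written with Perl's Socket module, which supplies the AF_INET constant and an inet_aton function that does the packing. This is a sketch under assumptions: the subroutine name host_for_address is ours, and the address 128.197.27.7 is invented.

```perl
use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

# Hypothetical helper: look up the host name for a dotted-quad address.
# Returns undef if inet_aton cannot handle the address or if there is
# no reverse DNS entry for it.
sub host_for_address {
    my ($address) = @_;
    my $packed = inet_aton($address);    # same four bytes as pack("C4", ...)
    return undef unless defined $packed;
    my ($name) = gethostbyaddr($packed, AF_INET);
    return $name;
}

# Show that inet_aton packs an (invented) address into four bytes.
# No name lookup is made here, so this part runs without a network.
my $packed = inet_aton("128.197.27.7");
print join(".", unpack("C4", $packed)), "\n";
```

In a CGI program you would call host_for_address($ENV{'REMOTE_ADDR'}) and fall back to the numeric address if it returns an undefined value.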

User Authentication and Identification

In addition to domain-based security, most HTTP servers also support a more complicated method of security, known as user authentication. When configured for user authentication, specified files or directories are set up to allow access only by certain users. A user attempting to open the URLs associated with these files is prompted for a name and password.

The user name and password (which, incidentally, need have no relation to the user's real user name and password on any system) are checked by the server, and if legitimate, the user is allowed access. In addition to allowing the user access to the protected file, the server also maintains the user's name and passes it to any subsequent CGI programs that are called. The server passes the user name in the REMOTE_USER environment variable.

A CGI script can therefore use server authentication information to identify users.[1] This isn't what user authentication was meant for, but if the information is available, it can come in mighty handy. Here is a snippet of code that illustrates what you can do with the REMOTE_USER environment variable:

[1] The HTTP_FROM environment variable also carries information that can be used to identify a user (generally, the user's email address). However, this variable depends on the browser to make it available, and few browsers do, so HTTP_FROM is of limited use.

$remote_user = $ENV{'REMOTE_USER'};
if ($remote_user eq "jack") {
    print "Welcome Jack, how is Jack Manufacturing doing these days?", "\n";
} elsif ($remote_user eq "bob") {
    print "Hey Bob, how's the wife doing? I heard she was sick.", "\n";
}
.
.
.
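When the list of users grows, a chain of elsif tests becomes unwieldy. One possible alternative, sketched here, keeps the greetings in an associative array keyed by user name. The subroutine name greeting_for and the fallback messages are our own additions. REMOTE_USER is undefined when no authentication took place, so the sketch handles that case as well.

```perl
use strict;
use warnings;

# Hypothetical helper: look up a greeting for an authenticated user.
sub greeting_for {
    my ($remote_user) = @_;
    my %greetings = (
        'jack' => "Welcome Jack, how is Jack Manufacturing doing these days?",
        'bob'  => "Hey Bob, how's the wife doing? I heard she was sick.",
    );
    # REMOTE_USER is not set unless the server authenticated the user.
    return "Welcome, guest." unless defined $remote_user;
    return exists($greetings{$remote_user})
        ? $greetings{$remote_user}
        : "Welcome, $remote_user.";
}

print greeting_for($ENV{'REMOTE_USER'}), "\n";
```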

Server authentication does not provide complete security: Since the user name and password are sent unencrypted over the network, it's possible for a "snoop" to look at this data. For that reason, it's a bad idea to use your real login name and password for server authentication.
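To see how little protection this offers, assume the server uses the HTTP "Basic" authentication scheme, in which the browser sends the name and password joined by a colon and encoded in base64. That scheme is our assumption here, and the credentials below are invented. Base64 is an encoding, not encryption, so anyone who captures the header can reverse it without a key:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Invented credentials, for illustration only.
my $credentials = "jack:secret";

# What a browser would send after "Authorization: Basic "
# (assuming the Basic scheme). The "" suppresses the trailing newline.
my $on_the_wire = encode_base64($credentials, "");

# A snoop needs no key or password to reverse the encoding.
my $recovered = decode_base64($on_the_wire);

print "Sent:      ", $on_the_wire, "\n";
print "Recovered: ", $recovered, "\n";
```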

Where Did You Come From?

Companies that provide services on the Web often want to know from what server (or document) the remote users came. For example, say you visit the server located at https://www.cgi.edu, and then from there you go to https://www.flowers.com. A CGI program on www.flowers.com can actually determine that you were previously at www.cgi.edu.

How is this useful? For advertising, of course. If a company determines that 90% of all users that visit them come from a certain server, then they can perhaps work something out financially with the webmaster at that server to provide advertising. Also, if your site moves or the content at your site changes dramatically, you can help avoid frustration among your visitors by informing the webmasters at the sites referring to yours to change their links. Here is a simple program that displays this "referral" information:

#!/usr/local/bin/perl
print "Content-type: text/plain", "\n\n";
$remote_address = $ENV{'REMOTE_ADDR'};
$referral_address = $ENV{'HTTP_REFERER'};
print "Hello user from $remote_address!", "\n";
print "The last site you visited was: $referral_address. Am I a genius or what?", "\n";
exit (0);

The environment variable HTTP_REFERER, which is passed to the server by the client, contains the URL of the last document the user visited before accessing the current server.

Now for the caveats. There are three important things you need to remember before using the HTTP_REFERER variable:

First, not all browsers set this variable.
Second, if a user accesses your server first, right at startup, this variable will not be set.
Third, if someone accesses your site via a bookmark or just by typing in the URL, the referring document is meaningless. So if you are keeping some sort of count to determine where users are coming from, it won't be totally accurate.
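With these caveats in mind, a program that tallies referrals should expect HTTP_REFERER to be missing. The following sketch counts referring hosts and files missing values under "(unknown)". The subroutine name referring_host is ours, and the sample URLs stand in for values that would normally come from $ENV{'HTTP_REFERER'} on successive requests.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the host name from a referring URL.
# Returns undef when the value is missing or is not an http(s) URL.
sub referring_host {
    my ($referer) = @_;
    return undef unless defined $referer && length $referer;
    return ($referer =~ m{^https?://([^/:]+)}i) ? lc($1) : undef;
}

# Sample values; undef represents a request with no referring document.
my %referral_count;
foreach my $referer ('https://www.cgi.edu/index.html', undef,
                     'https://www.cgi.edu/perl/') {
    my $host = referring_host($referer);
    $host = '(unknown)' unless defined $host;
    $referral_count{$host}++;
}

foreach my $host (sort keys %referral_count) {
    print $host, ": ", $referral_count{$host}, "\n";
}
```

The "(unknown)" count gives you a rough idea of how much of your traffic the tally cannot account for.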