CGI Programming Guide - [Chapter 9] Gateways, Databases, and Search/Index Utilities

Imagine a situation where you have an enormous amount of data stored in a format that is foreign to a typical web browser, and you need to present this information on the Web and let users search through it. How would you accomplish such a task?

Many information providers on the Web find themselves in situations like this. Such a problem can be solved by writing a CGI program that acts as a gateway between the data and the Web. A simple gateway program was presented in Chapter 7, Advanced Form Applications. The pie graph program can read the ice cream data file and produce a graph illustrating the information contained within it. In this chapter, we will discuss gateways to UNIX programs, relational databases, and search engines.

9.1 UNIX Manual Page Gateway

Manual pages on a UNIX operating system provide documentation on the various software and utilities installed on the system. In this section, I will write a gateway that reads the requested manual page, converts it to HTML, and displays it (see Figure 9.1). We will let the standard utility for formatting manual pages, nroff, do most of the work. But this example is useful for showing what a little HTML can do to spruce up a document. The key technique you need is to examine the input expected by a program and the output that it generates, so that you can communicate with it.

Figure 9.1: Converting manual page to HTML

[Graphic: Figure 9-1]

Here is the form that is presented to the user:

<HTML>
<HEAD><TITLE>UNIX Manual Page Gateway</TITLE></HEAD>
<BODY>
<H1>UNIX Manual Page Gateway</H1>
<HR>
<FORM ACTION="/cgi-bin/manpage.pl" METHOD="POST">
<EM>What manual page would you like to see?</EM>
<BR>
<INPUT TYPE="text" NAME="manpage" SIZE=40>
<P>
<EM>What section is that manual page located in?</EM>
<BR>
<SELECT NAME="section" SIZE=1>
<OPTION SELECTED>1
<OPTION>2
<OPTION>3
<OPTION>4
<OPTION>5
<OPTION>6
<OPTION>7
<OPTION>8
<OPTION>Don't Know
</SELECT>
<P>
<INPUT TYPE="submit" VALUE="Submit the form">
<INPUT TYPE="reset"  VALUE="Clear all fields">
</FORM>
<HR>
</BODY></HTML>

This form will be rendered as shown in Figure 9.2.

Figure 9.2: UNIX manual page form

[Graphic: Figure 9-2]

On nearly all UNIX systems, manual pages are divided into eight or more sections (or subdirectories), located under one main directory--usually /usr/local/man or /usr/man. This form asks the user to provide the section number for the desired manual page.

The CGI program follows. The main program is devoted entirely to finding the right section, and the particular manual page. A subroutine invokes nroff on the page to handle the internal nroff codes that all manual pages are formatted in, then converts the nroff output to HTML.

#!/usr/local/bin/perl
$webmaster = "Shishir Gundavaram (shishir\@bu\.edu)";
$script = $ENV{'SCRIPT_NAME'};
$man_path = "/usr/local/man";
$nroff = "/usr/bin/nroff -man";

The program assumes that the manual pages are stored in the /usr/local/man directory. The nroff utility formats the manual page according to the directives found within the document. A typical unformatted manual page looks like this:

.TH EMACS 1 "1994 April 19"
.UC 4
.SH NAME
emacs \- GNU project Emacs
.SH SYNOPSIS
.B emacs
[
.I command-line switches
] [
.I files ...
]
.br
.SH DESCRIPTION
.I GNU Emacs
is a version of
.I Emacs,
written by the author of the original (PDP-10)
.I Emacs,
Richard Stallman.
.br
.
.
.

Once it is formatted by nroff, it looks like this:

EMACS(1)                 USER COMMANDS                   EMACS(1)
NAME
     emacs - GNU project Emacs
SYNOPSIS
     emacs [ command-line switches ] [ files ... ]
DESCRIPTION
     GNU Emacs is a version of Emacs, written by  the  author  of
     the original (PDP-10) Emacs, Richard Stallman.
.
.
.
Sun Release 4.1    Last change: 1994 April 19                   1

Now, let's continue with the program to see how this information can be further formatted for display on a web browser.

$last_line = "Last change:";

The $last_line variable contains the text that is found on the last line of each page in a manual. This variable is used to remove that line when formatting for the Web.

&parse_form_data (*FORM);
($manpage = $FORM{'manpage'}) =~ s/^\s*(.*)\b\s*$/$1/;
$section = $FORM{'section'};

The data in the form is parsed and stored. The parse_form_data subroutine is the one used initially in the last chapter. Leading and trailing spaces are removed from the information in the manpage field. The reason for doing this is so that the specified page can be found.

if ( (!$manpage) || ($manpage !~ /^[\w\+\-]+$/) ) {
    &return_error (500, "UNIX Manual Page Gateway Error",
                        "Invalid manual page specification.");

This block is very important! If a manual page was not specified, or if the information contains characters other than (A-Z, a-z, 0-9, _, +, -), an error message is returned. As discussed in Chapter 7, Advanced Form Applications, it is always important to check for shell metacharacters for security reasons.
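To see which inputs this check lets through, here is a minimal sketch of the same test written as a standalone helper in modern Perl 5 style. The name clean_manpage_name is hypothetical and not part of the gateway. The sketch also anchors the pattern with \A and \z instead of ^ and $, an assumption made here so that a trailing newline cannot slip past the test.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): returns the trimmed
# page name if it contains only A-Z, a-z, 0-9, "_", "+" and "-", else undef.
sub clean_manpage_name {
    my ($raw) = @_;
    return undef unless defined $raw;
    (my $name = $raw) =~ s/^\s+|\s+$//g;          # trim leading/trailing whitespace
    return $name =~ /\A[\w+\-]+\z/ ? $name : undef;
}

# Try a few inputs, including ones carrying shell metacharacters.
for my $input ("  emacs ", "g++", "ls; rm -rf /", "../etc/passwd", "") {
    my $name = clean_manpage_name($input);
    printf "%-18s => %s\n", "'$input'", defined $name ? "ok ($name)" : "rejected";
}
```

Anything containing a space, a semicolon, a slash, or a dot is rejected, which rules out both shell commands and attempts to climb out of the manual directory.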

} else {
    if ($section !~ /^\d+$/) {
        $section = &find_section ();
    } else {
        $section = &check_section ();
    }

If the section field consists of a number, the check_section subroutine is called to check the specified section for the particular manual page. If non-numerical information was passed, such as "Don't Know," the find_section subroutine iterates through all of the sections to determine the appropriate one. In the regular expression, "\d" stands for digit, "+" allows for one or more of them, and the "^" and "$" ensure that nothing but digits are in the string. To simplify this part of the search, we do not allow the "nonstandard" subsections some systems offer, such as 2v or 3m.

Both of these search subroutines return values upon termination. These return values are used by the code below to make sure that there are no errors.

    if ( ($section >= 1) && ($section <= 8) ) {
        &display_manpage ();
    } else {
        &return_error (500, "UNIX Manual Page Gateway Error",
                            "Could not find the requested document.");
    }
}
exit (0);

The find_section and check_section subroutines called above return a value of zero (0) if the specified manual page does not exist. This return value is stored in the section variable. If the information contained in section is in the range of 1 through 8, the display_manpage subroutine is called to display the manual page. Otherwise, an error is returned.

The find_section subroutine searches for a particular manual page in all the sections (from 1 through 8).

sub find_section
{
    local ($temp_section, $loop, $temp_dir, $temp_file);
    $temp_section = 0;
    for ($loop=1; $loop <= 8; $loop++) {
        $temp_dir  = join("", $man_path, "/man", $loop);
        $temp_file = join("", $temp_dir, "/", $manpage, ".", $loop);

find_section searches in the subdirectories called "man1," "man2," "man3," etc. Each manual page in a subdirectory is suffixed with the section number, such as "zmore.1" and "emacs.1." Thus, the first pass through the loop might join "/usr/local/man" with "man1" and "zmore.1" to make "/usr/local/man/man1/zmore.1", which is stored in the $temp_file variable.

        if (-e $temp_file) {
            $temp_section = $loop;
        }
    }

The -e switch returns TRUE if the file exists. If the manual page is found, the temp_section variable contains the section number.

    return ($temp_section);
}

The subroutine returns the value stored in $temp_section. If the specified manual page is not found, it returns zero.

The check_section subroutine checks the specified section for the particular manual page. If it exists, the section number passed to the subroutine is returned. Otherwise, the subroutine returns zero to indicate failure. Remember that you may have to modify this program to reflect the directories and filenames of manual pages on your system.

sub check_section
{
    local ($temp_section, $temp_file);
    $temp_section = 0;
    $temp_file    = join ("", $man_path, "/man", $section,
                              "/", $manpage, ".", $section);
    if (-e $temp_file) {
        $temp_section = $section;
    }
    return ($temp_section);
}
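Both subroutines build the same path, and find_section keeps looping after a match, so it reports the highest-numbered section that holds the page. The sketch below shows one possible variation, not the program's own code: the hypothetical helpers manpage_file and find_first_section share the path logic and stop at the first section found. It assumes the same man<N>/<page>.<N> directory layout described above.

```perl
use strict;
use warnings;

# Build the path implied by the layout above: <man_path>/man<N>/<page>.<N>
sub manpage_file {
    my ($man_path, $page, $section) = @_;
    return "$man_path/man$section/$page.$section";
}

# Return the lowest section (1 through 8) that holds the page, or 0 if none does.
sub find_first_section {
    my ($man_path, $page) = @_;
    for my $section (1 .. 8) {
        return $section if -e manpage_file($man_path, $page, $section);
    }
    return 0;
}

print manpage_file("/usr/local/man", "zmore", 1), "\n";   # prints /usr/local/man/man1/zmore.1
```

Whether the first or the last matching section is the better default depends on your system; returning the first one means a user command in section 1 wins over a same-named entry in a later section.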

The heart of this gateway is the display_manpage subroutine. It does not try to interpret the nroff codes in the manual page. Manual page style is complex enough that our best bet is to invoke nroff, which has always been used to format the pages. But there are big differences between the output generated by nroff and what we want to see on a web browser. The nroff utility produces output suitable for an old-fashioned line printer, which produced bold and underlined text by backspacing and reprinting. nroff also puts a header at the top of each page and a footer at the bottom, which we have to remove. Finally, we can ignore a lot of the blank space generated by nroff, both at the beginning of each line and in between lines.

The display_manpage subroutine starts by running the page through nroff. Then, the subroutine performs a few substitutions to make the page look good on a web browser.

sub display_manpage
{
    local ($file, $blank, $heading);
    $file = join ("", $man_path, "/man", $section, 
                      "/", $manpage, ".", $section);
    print "Content-type: text/html", "\n\n";
    print "<HTML>", "\n";
    print "<HEAD><TITLE>UNIX Manual Page Gateway</TITLE></HEAD>", "\n";
    print "<BODY>", "\n";
    print "<H1>UNIX Manual Page Gateway</H1>", "\n";
    print "<HR><PRE>";

The usual MIME header and HTML text are displayed.

    open (MANUAL, "$nroff $file |");

A pipe is opened to read the output of the nroff program. Whenever you open a pipe, it is critical to check that there are no shell metacharacters on the command line. Otherwise, a malicious user can execute commands on your machine! This is why we performed the check at the beginning of this program.
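As an extra layer of defense, Perl 5.8 and later accept a list form of open that runs the program directly, without passing the command line through a shell. The sketch below is an illustration, not the gateway's own code: read_command_output is a hypothetical helper, and the running Perl interpreter ($^X) stands in for nroff so that the example works on machines without nroff installed.

```perl
use strict;
use warnings;

# Run a command without a shell and return its output lines.
# With the list form of open, each argument reaches the program as-is.
sub read_command_output {
    my (@command) = @_;
    open(my $pipe, '-|', @command)
        or die "Cannot run $command[0]: $!";
    my @lines = <$pipe>;
    close($pipe);
    return @lines;
}

# In the gateway this might be called as (paths are the chapter's defaults):
#   my @page = read_command_output('/usr/bin/nroff', '-man', $file);
# Here a one-line Perl program echoes its first argument instead.
my @out = read_command_output($^X, '-e', 'print "arg: $ARGV[0]\n"', 'emacs; echo oops');
print @out;   # prints "arg: emacs; echo oops"
```

The semicolon arrives as part of a single argument and is never interpreted as a command separator. The input check at the top of the program is still worth keeping, since it also rejects names that could point outside the manual directory.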

    $blank = 0;

The blank variable keeps track of the number of consecutive empty lines in the document. When several blank lines appear in a row, only the first is printed.

    while (<MANUAL>) {
        next if ( (/^$manpage\(\w+\)/i) || (/\b$last_line/o) );

The while loop iterates through each line in the manual page. The next construct ignores the first and last lines of each page. For example, the first and last lines of each page of the emacs manual page look like this:

EMACS(1)                 USER COMMANDS                   EMACS(1)
.
.
.
Sun Release 4.1    Last change: 1994 April 19                   1

This is unnecessary information, and therefore we skip over it. The first test matches the page name, followed by a section in parentheses, at the very start of the line; the second matches the "Last change:" text anywhere in the line. The previous while statement stores the current line in Perl's default variable, $_. A regular expression without a corresponding variable name matches against the value stored in $_.

        if (/^([A-Z0-9_ ]+)$/) {
            $heading = $1;
            print "<H2>", $heading, "</H2>", "\n";

All manual pages consist of distinct headings such as "NAME," "SYNOPSIS," "DESCRIPTION," and "SEE ALSO," which are displayed as all capital letters. This conditional checks for such headings, stores them in the variable heading, and displays them as HTML level 2 headers. The heading is stored to be used later on.
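The heading test can be tried on its own with a small sketch like the one below. The helper name heading_text is hypothetical. One difference from the program is deliberate and is an assumption of this sketch: the pattern insists that the line start with a capital letter, because the character class in the original includes the space, so a line made up only of spaces would otherwise count as a heading.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the heading text if the line looks like a
# manual page heading (capitals, digits, "_" and spaces only), else undef.
sub heading_text {
    my ($line) = @_;
    return $line =~ /^([A-Z][A-Z0-9_ ]*)$/ ? $1 : undef;
}

for my $line ("SEE ALSO\n", "     emacs - GNU project Emacs\n", "   \n") {
    my $heading = heading_text($line);
    print defined $heading ? "<H2>$heading</H2>\n" : "(not a heading)\n";
}
```

Indented body text fails the test because it contains lowercase letters, and the page header fails because of the parentheses in "EMACS(1)".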

        } elsif (/^\s*$/) {
            $blank++;
            if ($blank < 2) {
                print;
            }

If the line consists entirely of whitespace, the subroutine increments the $blank variable. If the value of that variable is two or greater, the line is ignored. In other words, only the first blank line in a run of consecutive blank lines is printed.
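The counter logic is easier to follow when it is isolated from the heading handling. The following sketch is a simplified stand-in for the loop, under the assumption that lines are either blank or text: a hypothetical squeeze_blank_lines helper applies the same rule to a list of lines.

```perl
use strict;
use warnings;

# Keep the first blank line of each run and drop the rest.
sub squeeze_blank_lines {
    my (@lines) = @_;
    my $blank = 0;
    my @kept;
    for my $line (@lines) {
        if ($line =~ /^\s*$/) {
            $blank++;
            push @kept, $line if $blank < 2;   # only the first blank survives
        } else {
            $blank = 0;                        # a text line ends the run
            push @kept, $line;
        }
    }
    return @kept;
}

my @page = ("NAME\n", "\n", "\n", "\n",
            "     emacs - GNU project Emacs\n", "\n", "SYNOPSIS\n");
print squeeze_blank_lines(@page);
```

The three blank lines after "NAME" collapse into one, while the single blank line before "SYNOPSIS" is left alone.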

        } else {
        
            $blank = 0;
            s//&amp;/g       if (/&/);
            s//&lt;/g        if (/</);
            s//&gt;/g        if (/>/);

The blank variable is reset to zero, since this block is executed only if the line contains non-whitespace characters. The regular expressions replace the "&", "<", and ">" characters with their HTML equivalents, since these characters have a special meaning to the browser. Each substitution has an empty pattern, which in Perl reuses the last successfully matched regular expression, here the one in the trailing if.
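The same three replacements can be written with explicit patterns, which some readers find easier to follow. This is a minimal sketch using a hypothetical escape_html helper. The order matters: the ampersand has to be handled first, or the "&" inside "&lt;" and "&gt;" would itself be escaped.

```perl
use strict;
use warnings;

# Replace the three characters that are special to a browser.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # first, so the entities added below are left alone
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print escape_html("cat <file> | sort && echo done"), "\n";
# prints: cat &lt;file&gt; | sort &amp;&amp; echo done
```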

            if (/((_\010\S)+)/) {
                s//<B>$1<\/B>/g;
                s/_\010//g;
            }

All manual pages have text strings that are underlined for emphasis. The nroff utility creates an underlined effect by using the "_" and the "^H" (Control-H or \010) characters. Here is how the word "options" would be underlined:

_^Ho_^Hp_^Ht_^Hi_^Ho_^Hn_^Hs

The regular expression in the if statement searches for an underlined word and stores it in $1, as illustrated below.

[Graphic: Figure from the text]

This first substitution statement adds the <B> .. </B> tags to the string:

<B>_^Ho_^Hp_^Ht_^Hi_^Ho_^Hn_^Hs</B>

Finally, the "_^H" characters are removed to create:

<B>options</B>
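The two steps can be tried on a single line with a sketch like this one. The helper name underline_to_bold is hypothetical, and the test input is built by hand to imitate what nroff emits; "\010" is the backspace character.

```perl
use strict;
use warnings;

# Wrap each run of underscore-backspace-character triples in <B>..</B>,
# then strip the underscore-backspace pairs, leaving only the text.
sub underline_to_bold {
    my ($line) = @_;
    $line =~ s/((?:_\010\S)+)/<B>$1<\/B>/g;
    $line =~ s/_\010//g;
    return $line;
}

# Simulate nroff's overstriking for the word "options".
(my $word = "options") =~ s/(\S)/_\010$1/g;
my $raw = "emacs [ $word ] [ files ... ]";

print underline_to_bold($raw), "\n";   # prints: emacs [ <B>options</B> ] [ files ... ]
```

Because the substitution is global, every underlined word on the line gets its own pair of tags.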

Let's modify the line in one more way before we start to display the information:

            if ($heading =~ /ALSO/) {
                if (/([\w\+\-]+)\((\w+)\)/) {
                    # "&amp;" keeps the "&" in the HREF from being read as the start of an entity
                    s//<A HREF="$script\?manpage=$1&amp;section=$2">$1($2)<\/A>/g;
                }
            }

Most manual pages contain a "SEE ALSO" heading under which related software applications are listed. Here is an example:

SEE ALSO
     X(1), xlsfonts(1), xterm(1), xrdb(1)

The regular expression stores the command name in $1 and the manpage section number in $2, as seen below. Using this regular expression, we add a hypertext link to this program for each one of the listed applications. The query string contains the manual page title, as well as the section number.

[Graphic: Figure from the text]
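To see what the substitution produces, here is a sketch that applies it to a sample "SEE ALSO" line. The helper link_see_also is hypothetical and takes the script path as a parameter instead of reading it from SCRIPT_NAME. The sketch writes the separator between the two query fields as "&amp;", a choice made here so that the generated HREF is valid HTML; the browser turns it back into "&" when following the link.

```perl
use strict;
use warnings;

# Turn every "name(section)" reference into a link back to the gateway.
sub link_see_also {
    my ($line, $script) = @_;
    $line =~ s{([\w+\-]+)\((\w+)\)}
              {<A HREF="$script?manpage=$1&amp;section=$2">$1($2)</A>}g;
    return $line;
}

print link_see_also("     X(1), xterm(1)\n", "/cgi-bin/manpage.pl");
```

Each reference becomes a link whose query string names the page and its section, so following it runs the gateway again for the related page.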

The program continues as follows:

            print;
        }
    }
    print "</PRE><HR>", "\n";
    print "</BODY></HTML>", "\n";

    close (MANUAL);
}

Finally, the modified line is displayed. After all the lines in the file--or pipe--are read, it is closed. Figure 9.3 shows the output produced by this application.

Figure 9.3: Manual page gateway

[Graphic: Figure 9-3]

This particular gateway program concerned itself mostly with the output of the program it invoked (nroff). You will see in this chapter that you often have to expend equal effort (or even more effort) fashioning input in the way the existing program expects it. Those are the general tasks of gateways.
