Imagine a situation where you have an enormous amount of data
stored in a format that is foreign to a typical web browser, and
you need to present this information on the Web and allow
potential users to search through it.
How would you accomplish such a task?
Many information providers on the Web find themselves in situations
like this. Such a problem can be solved by writing a CGI program
that acts as a gateway between the data and the Web. A simple gateway
program was presented in Chapter 7, Advanced Form Applications.
The pie graph program can read the ice cream data file and produce
a graph illustrating the information contained within it. In this
chapter, we will discuss gateways to UNIX programs,
relational databases, and search engines.
Manual
pages on a UNIX operating system provide documentation
on the various software and utilities installed on the system. In
this section, I will write a gateway that reads the requested manual
page, converts it to HTML, and displays it (see Figure 9.1). We
will let the standard utility for formatting manual pages, nroff,
do most of the work. But this example is useful for showing what
a little HTML can do to spruce up a document.
The key technique you need is to examine the input expected by a
program and the output that it generates, so that you can communicate
with it.
Here is the form that is presented to the user:
<HTML>
<HEAD><TITLE>UNIX Manual Page Gateway</TITLE></HEAD>
<BODY>
<H1>UNIX Manual Page Gateway</H1>
<HR>
<FORM ACTION="/cgi-bin/manpage.pl" METHOD="POST">
<EM>What manual page would you like to see?</EM>
<BR>
<INPUT TYPE="text" NAME="manpage" SIZE=40>
<P>
<EM>What section is that manual page located in?</EM>
<BR>
<SELECT NAME="section" SIZE=1>
<OPTION SELECTED>1
<OPTION>2
<OPTION>3
<OPTION>4
<OPTION>5
<OPTION>6
<OPTION>7
<OPTION>8
<OPTION>Don't Know
</SELECT>
<P>
<INPUT TYPE="submit" VALUE="Submit the form">
<INPUT TYPE="reset" VALUE="Clear all fields">
</FORM>
<HR>
</BODY></HTML>
This form will be rendered as shown in Figure 9.2.
On nearly all UNIX systems, manual pages
are divided into eight or more sections (or subdirectories), located
under one main directory--usually /usr/local/man
or /usr/man. This form
asks the user to provide the section number for the desired manual
page.
The CGI program follows. The main program is devoted entirely
to finding the right section, and the particular manual page. A
subroutine invokes nroff
on the page to handle the internal nroff codes
that all manual pages are formatted in, then converts the nroff
output to HTML.
#!/usr/local/bin/perl
$webmaster = "Shishir Gundavaram (shishir\@bu\.edu)";
$script = $ENV{'SCRIPT_NAME'};
$man_path = "/usr/local/man";
$nroff = "/usr/bin/nroff -man";
The program assumes that the manual pages are stored in the
/usr/local/man directory. The nroff
utility formats the manual page according to the directives found
within the document. A typical unformatted manual page looks like
this:
.TH EMACS 1 "1994 April 19"
.UC 4
.SH NAME
emacs \- GNU project Emacs
.SH SYNOPSIS
.B emacs
[
.I command-line switches
] [
.I files ...
]
.br
.SH DESCRIPTION
.I GNU Emacs
is a version of
.I Emacs,
written by the author of the original (PDP-10)
.I Emacs,
Richard Stallman.
.br
.
.
.
Once it is formatted by nroff, it looks
like this:
EMACS(1) USER COMMANDS EMACS(1)
NAME
emacs - GNU project Emacs
SYNOPSIS
emacs [ command-line switches ] [ files ... ]
DESCRIPTION
GNU Emacs is a version of Emacs, written by the author of
the original (PDP-10) Emacs, Richard Stallman.
.
.
.
Sun Release 4.1 Last change: 1994 April 19 1
Now, let's continue with the program to see how this information
can be further formatted for display on a web browser.
$last_line = "Last change:";
The $last_line variable contains the
text that is found on the last line of each page in a manual. This
variable is used to remove that line when formatting for the Web.
&parse_form_data (*FORM);
($manpage = $FORM{'manpage'}) =~ s/^\s*(.*)\b\s*$/$1/;
$section = $FORM{'section'};
The data in the form is parsed and stored. The parse_form_data
subroutine is the one used initially in the last chapter. Leading
and trailing spaces are removed from the information in the manpage
field so that the name matches the filename of the page on disk.
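A minimal sketch of the trimming step, for experimenting outside the CGI environment. The helper name trim is hypothetical and not part of the original program. It uses two anchored substitutions instead of the single pattern shown above. One difference worth knowing: the pattern above relies on \b, so it leaves a value such as "g++ " unchanged, whereas this version strips the trailing space.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace from a string.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;    # remove leading whitespace
    $text =~ s/\s+$//;    # remove trailing whitespace
    return $text;
}

print "[", trim("  emacs  "), "]\n";    # prints [emacs]
print "[", trim("g++ "), "]\n";         # prints [g++]
```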
if ( (!$manpage) || ($manpage !~ /^[\w\+\-]+$/) ) {
&return_error (500, "UNIX Manual Page Gateway Error",
"Invalid manual page specification.");
This block is very important! If a manual page was not specified,
or if the information contains characters other than (A-Z, a-z,
0-9, _, +, -), an error message is returned. As discussed in Chapter 7, Advanced Form Applications, it is always important to check for shell metacharacters
for security reasons.
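The check can be tried on its own with a sketch like the following. The helper name is_safe_name and the sample inputs are illustrative assumptions, not part of the gateway. The sketch anchors the pattern with \A and \z rather than ^ and $, so a value with a trailing newline is rejected as well.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for names made of letters, digits,
# underscore, plus, and minus, the same character set the gateway accepts.
sub is_safe_name {
    my ($name) = @_;
    return 0 unless defined $name && length $name;
    return $name =~ /\A[\w+\-]+\z/ ? 1 : 0;
}

for my $candidate ("emacs", "g++", "ls; rm -rf /", "../etc/passwd", "") {
    printf "%-18s %s\n", "'$candidate'",
        is_safe_name($candidate) ? "accepted" : "rejected";
}
```

Only the first two candidates are accepted. The others contain spaces, semicolons, slashes, or dots, or are empty.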
} else {
if ($section !~ /^\d+$/) {
$section = &find_section ();
} else {
$section = &check_section ();
}
If the section field consists of a number,
the check_section subroutine is called to check
the specified section for the particular manual page. If non-numerical
information was passed, such as "Don't Know," the find_section
subroutine iterates through all of the sections to determine the
appropriate one. In the regular expression, "\d" stands for digit,
"+" allows for one or more of them, and the "^" and "$" ensure that
nothing but digits are in the string. To simplify this part of the
search, we do not allow the "nonstandard" subsections some systems
offer, such as 2v or 3m.
Both of these search subroutines return values upon termination.
These return values are used by the code below to make sure that
there are no errors.
if ( ($section >= 1) && ($section <= 8) ) {
&display_manpage ();
} else {
&return_error (500, "UNIX Manual Page Gateway Error",
"Could not find the requested document.");
}
}
exit (0);
The find_section and check_section
subroutines called above return a value of zero (0) if the specified
manual page does not exist. This return value is stored in the section
variable. If the information contained in section
is in the range of 1 through 8, the display_manpage
subroutine is called to display the manual page. Otherwise, an error
is returned.
The find_section subroutine searches
for a particular manual page in all the sections (from 1 through
8).
sub find_section
{
local ($temp_section, $loop, $temp_dir, $temp_file);
$temp_section = 0;
for ($loop=1; $loop <= 8; $loop++) {
$temp_dir = join("", $man_path, "/man", $loop);
$temp_file = join("", $temp_dir, "/", $manpage, ".", $loop);
find_section searches in the subdirectories
called "man1," "man2," "man3," etc. Each manual page in a
subdirectory is suffixed with the section number, such as "zmore.1"
and "emacs.1." Thus, the first pass through the loop might join
"/usr/local/man" with "/man1" and "/zmore.1" to make
"/usr/local/man/man1/zmore.1", which is stored in the $temp_file
variable.
if (-e $temp_file) {
$temp_section = $loop;
}
}
The -e file test operator returns true if the file
exists. If the manual page is found, the $temp_section
variable contains the section number.
return ($temp_section);
}
The subroutine returns the value stored in $temp_section.
If the specified manual page is not found, it returns zero.
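Note that the loop above keeps running after a match, so a page that exists in several sections is reported under the highest-numbered one. The following sketch shows one possible variant that stops at the first hit instead. The name find_first_section is hypothetical, and passing the directory as an argument (rather than reading the global $man_path) is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical variant of find_section: returns the first section (1-8)
# that holds the page, or 0 if no section does.
sub find_first_section {
    my ($man_path, $manpage) = @_;
    for my $section (1 .. 8) {
        my $file = join("", $man_path, "/man", $section,
                            "/", $manpage, ".", $section);
        return $section if -e $file;    # stop at the first hit
    }
    return 0;
}

# Usage, assuming pages live under /usr/local/man as in the text:
my $found = find_first_section("/usr/local/man", "emacs");
print $found ? "emacs is in section $found\n" : "emacs not found\n";
```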
The check_section subroutine checks the
specified section for the particular manual page. If it exists,
the section number passed to the subroutine is returned. Otherwise,
the subroutine returns zero to indicate failure. Remember that you
may have to modify this program to reflect the directories and filenames
of manual pages on your system.
sub check_section
{
local ($temp_section, $temp_file);
$temp_section = 0;
$temp_file = join ("", $man_path, "/man", $section,
"/", $manpage, ".", $section);
if (-e $temp_file) {
$temp_section = $section;
}
return ($temp_section);
}
The heart of this gateway is the display_manpage
subroutine. It does not try to interpret the nroff codes in the manual page. Manual
page style is complex enough that our best bet is to invoke nroff,
which has always been used to format the pages. But there are big
differences between the output generated by nroff
and what we want to see on a web browser. The nroff
utility produces output suitable for an old-fashioned line printer,
which produced bold and underlined text by backspacing and reprinting.
nroff also puts a header at the top of each
page and a footer at the bottom, which we have to remove. Finally,
we can ignore a lot of the blank space generated by nroff,
both at the beginning of each line and in between lines.
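To make the line-printer convention concrete, here is a small sketch that reduces overstruck text to plain characters. The helper name strip_overstrike and the sample strings are assumptions for illustration. The gateway itself does something different with these sequences, as shown later. The sketch relies on the traditional encoding, in which bold is a character, a backspace, and the same character again, and underlining is an underscore, a backspace, and the character.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce line-printer overstrikes to plain text by
# deleting every "character + backspace" pair, keeping the final character.
sub strip_overstrike {
    my ($line) = @_;
    $line =~ s/.\010//g;
    return $line;
}

my $bold      = "N\010NA\010AM\010ME\010E";    # "NAME", each letter struck twice
my $underline = "_\010f_\010i_\010l_\010e";    # "file", underlined
print strip_overstrike($bold), "\n";           # prints NAME
print strip_overstrike($underline), "\n";      # prints file
```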
The display_manpage subroutine starts
by running the page through nroff. Then, the
subroutine performs a few substitutions to make the page look good
on a web browser.
sub display_manpage
{
local ($file, $blank, $heading);
$file = join ("", $man_path, "/man", $section,
"/", $manpage, ".", $section);
print "Content-type: text/html", "\n\n";
print "<HTML>", "\n";
print "<HEAD><TITLE>UNIX Manual Page Gateway</TITLE></HEAD>", "\n";
print "<BODY>", "\n";
print "<H1>UNIX Manual Page Gateway</H1>", "\n";
print "<HR><PRE>";
The usual MIME header and HTML
text are displayed.
open (MANUAL, "$nroff $file |");
A pipe from the nroff program is opened so that the program's
output can be read. Whenever you open a pipe, it is critical to check that
there are no shell metacharacters on the command line. Otherwise,
a malicious user can execute commands on your machine! This is why
we performed the check at the beginning of this program.
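As an additional precaution, and not a replacement for the input check, a pipe can be opened without involving the shell at all. The sketch below assumes Perl 5.8 or later, where open accepts the command as a list of arguments. The helper name read_command is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return its output
# lines. The list form of open hands each argument to the program as-is.
sub read_command {
    my (@command) = @_;
    open(my $pipe, "-|", @command)
        or die "Cannot run $command[0]: $!";
    my @lines = <$pipe>;
    close($pipe);
    return @lines;
}

# In the gateway this might be called as (paths are those used in the text):
#   my @page = read_command("/usr/bin/nroff", "-man", $file);
# Here the Perl interpreter itself stands in for nroff so the sketch runs
# on systems without nroff installed.
my @out = read_command($^X, "-e", 'print "semicolons; are | harmless\n"');
print @out;
```

Because no shell parses the command line, characters such as ";" and "|" in an argument reach the program as ordinary text.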
The blank variable keeps track of the
number of consecutive empty lines in the document. When several
blank lines appear in a row, only the first one is kept.
while (<MANUAL>) {
next if ( (/^$manpage\(\w+\)/i) || (/\b$last_line/o) );
The while loop iterates through each
line in the manual page. The next construct
ignores the first and last lines of each page. For example, the
first and last lines of each page of the emacs
manual page look like this:
EMACS(1) USER COMMANDS EMACS(1)
.
.
.
Sun Release 4.1 Last change: 1994 April 19 1
This is unnecessary information, and therefore we skip over
it. The first pattern matches a line that begins with the page name
followed by a parenthesized section, and the second matches the footer
text stored in $last_line. The previous while
statement stores the current line in Perl's default variable, $_.
A regular expression without a corresponding variable name matches
against the value stored in $_.
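The header and footer filtering can be sketched in isolation as follows. The helper name body_lines and the sample lines are assumptions; the samples stand in for nroff output and were not produced by running it. The sketch wraps the interpolated values in \Q and \E, a precaution for page names that contain regular expression metacharacters such as "+".

```perl
use strict;
use warnings;

my $manpage   = "emacs";
my $last_line = "Last change:";

# Sample lines standing in for nroff output (not real program output).
my @lines = (
    "EMACS(1)                 USER COMMANDS                 EMACS(1)",
    "NAME",
    "     emacs - GNU project Emacs",
    "Sun Release 4.1    Last change: 1994 April 19             1",
);

# Hypothetical helper: keep only lines that are neither header nor footer.
sub body_lines {
    my ($page, $footer, @input) = @_;
    my @kept;
    for (@input) {                       # each line is aliased to $_
        next if /^\Q$page\E\(\w+\)/i;    # a bare pattern matches against $_
        next if /\b\Q$footer\E/;
        push @kept, $_;
    }
    return @kept;
}

print "$_\n" for body_lines($manpage, $last_line, @lines);
```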
if (/^([A-Z0-9_ ]+)$/) {
$heading = $1;
print "<H2>", $heading, "</H2>", "\n";
All manual pages consist of distinct headings such as "NAME,"
"SYNOPSIS," "DESCRIPTION," and "SEE ALSO," which are displayed as
all capital letters. This conditional checks for such headings,
stores them in the variable heading, and displays
them as HTML level 2 headers. The heading is
stored to be used later on.
} elsif (/^\s*$/) {
$blank++;
if ($blank < 2) {
print;
}
If the line consists entirely of whitespace, the subroutine
increments the $blank variable. Once the value
of that variable reaches two, the line is ignored. In other
words, only the first line in a run of consecutive blank lines is printed.
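The counting logic can be exercised separately with a sketch like this one. The helper name squeeze_blank_lines is hypothetical, and it works on a list of lines rather than on the pipe used by the gateway.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of blank lines to a single one.
sub squeeze_blank_lines {
    my (@input) = @_;
    my $blank = 0;
    my @output;
    for my $line (@input) {
        if ($line =~ /^\s*$/) {
            $blank++;
            push @output, $line if $blank < 2;    # keep only the first blank
        } else {
            $blank = 0;                           # a text line ends the run
            push @output, $line;
        }
    }
    return @output;
}

my @result = squeeze_blank_lines("NAME\n", "\n", "\n", "   \n",
                                 "emacs\n", "\n", "SYNOPSIS\n");
print scalar(@result), " lines kept\n";    # prints 5 lines kept
```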
} else {
$blank = 0;
s//&amp;/g if (/&/);
s//&lt;/g if (/</);
s//&gt;/g if (/>/);
The blank variable is reset to
zero, since this block is executed only if the line contains non-whitespace
characters. The regular expressions replace the "&", "<",
and ">" characters with their HTML equivalents ("&amp;", "&lt;", and "&gt;"),
since these characters have a special meaning to the browser.
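The escaping step can be sketched as a standalone helper. The name escape_html is hypothetical. The order of the substitutions matters: the ampersand is handled first, so the ampersands introduced by the other two replacements are not escaped a second time.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the three characters that are special in
# HTML text with their entity names. "&" must be handled first.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print escape_html("cc -o prog <file.c> && ./prog"), "\n";
```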
if (/((_\010\S)+)/) {
s//<B>$1<\/B>/g;
s/_\010//g;
}
All manual pages have text strings
that are underlined for emphasis. The nroff
utility creates an underlined effect by using the "_" and the "^H"
(Control-H or \010) characters. Here is how the word "options" would
be underlined:
_^Ho_^Hp_^Ht_^Hi_^Ho_^Hn_^Hs
The regular expression in the if statement
searches for an underlined word and stores it in $1.
The first substitution statement adds the <B> .. </B>
tags to the string:
<B>_^Ho_^Hp_^Ht_^Hi_^Ho_^Hn_^Hs</B>
Finally, the "_^H" characters are removed to create:
<B>options</B>
Let's modify the file in one more way before we start to display
the information:
if ($heading =~ /ALSO/) {
if (/([\w\+\-]+)\((\w+)\)/) {
s//<A HREF="$script\?manpage=$1&section=$2">$1($2)<\/A>/g;
}
}
Most manual pages contain a "SEE ALSO" heading under which
related software applications are listed. Here is an example:
SEE ALSO
X(1), xlsfonts(1), xterm(1), xrdb(1)
The regular expression stores the command name in $1
and the manpage section number in $2.
Using this regular expression, we add a hypertext link to
this program for each one of the listed applications. The query
string contains the manual page title, as well as the section number.
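A sketch of the link generation, written as a standalone helper so it can be tried on the example line. The name link_references is hypothetical, and the script path passed in is the one used by the form earlier in this section. Unlike the program's own statement, this sketch writes the ampersand in the URL as "&amp;", which is the safe spelling inside an HTML attribute; "sect" also happens to be the name of an HTML character entity.

```perl
use strict;
use warnings;

# Hypothetical helper: turn every "name(section)" reference in a line into
# a link back to the gateway. $script is the gateway's URL path.
sub link_references {
    my ($line, $script) = @_;
    $line =~ s{([\w+\-]+)\((\w+)\)}
              {<A HREF="$script?manpage=$1&amp;section=$2">$1($2)</A>}g;
    return $line;
}

print link_references("X(1), xlsfonts(1), xterm(1), xrdb(1)",
                      "/cgi-bin/manpage.pl"), "\n";
```

Each of the four references in the sample line becomes its own link, and the commas between them are left untouched.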
The program continues as follows:
print;
}
}
print "</PRE><HR>", "\n";
print "</BODY></HTML>", "\n";
close (MANUAL);
}
Finally, the modified line is displayed. After all the lines
in the file--or pipe--are read, it is closed.
Figure 9.3 shows the
output produced by this application.
This particular gateway program concerned itself mostly with
the output of the program it invoked (nroff).
You will see in this chapter that you often have to expend equal
effort (or even more effort) fashioning input in the way the existing
program expects it. Those are the general tasks of gateways.