One of the most useful CGI applications is a web server search/index
gateway. This allows a user to search all of the files on the server
for particular information. Here is a very simple gateway to do
just that. We rely on the UNIX command fgrep [1] to search all our
files, and then filter its output into something attractive and useful.
First, let's look at the form's front end:
<HTML>
<HEAD><TITLE>Search Gateway</TITLE></HEAD>
<BODY>
<H1>Search Gateway</H1>
<HR>
<FORM ACTION="/cgi-bin/search.pl" METHOD="POST">
What would you like to search for:
<BR>
<INPUT TYPE="text" NAME="query" SIZE=40>
<P>
<INPUT TYPE="submit" VALUE="Start Searching!">
<INPUT TYPE="reset" VALUE="Clear your form">
</FORM>
<HR>
</BODY>
</HTML>
Nothing fancy. The form contains just one field to hold the
search query. Now, here is the program:
#!/usr/local/bin/perl
$webmaster = "Shishir Gundavaram (shishir\@bu\.edu)";
$fgrep = "/usr/local/bin/fgrep";
$document_root = $ENV{'DOCUMENT_ROOT'};
The fgrep UNIX command
is used to perform the actual searching in the directory pointed
to by the variable document_root. fgrep
searches for fixed strings; in other words, wildcards and regular
expressions are not evaluated.
&parse_form_data (*SEARCH);
$query = $SEARCH{'query'};
The form data (just one field in this case) is decoded and stored in the
SEARCH associative array.
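The parse_form_data subroutine is not reproduced in this section. As a rough sketch only, assuming the form data arrives as a URL-encoded string, the decoding step might look something like the following. The name decode_form_string is hypothetical and is not the routine the program actually calls.

```perl
# Hypothetical helper (NOT the program's parse_form_data routine):
# decode a URL-encoded string such as "query=web+server" into a hash.
sub decode_form_string {
    my ($raw) = @_;
    my %form;

    foreach my $pair (split (/&/, $raw)) {
        my ($key, $value) = split (/=/, $pair, 2);
        $value = "" unless defined ($value);

        foreach ($key, $value) {
            tr/+/ /;                                      # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/pack ("C", hex ($1))/eg;  # %XX encodes one byte
        }
        $form{$key} = $value;
    }
    return %form;
}

%SEARCH = &decode_form_string ("query=web+server&lang=en%2Dus");
print $SEARCH{'query'}, "\n";    # prints: web server
```

The sample string and the lang field are made up for illustration; the real form sends only the query field.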
if ($query eq "") {
&return_error (500, "Search Error", "Please enter a search query.");
} elsif ($query !~ /^(\w+)$/) {
&return_error (500, "Search Error", "Invalid characters in query.");
} else {
If the query is empty, or contains any character other than a letter,
digit, or underscore (A-Z, a-z, 0-9, _), an error message is returned.
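To see which queries pass this test, here is a small sketch that applies the same two checks to a few made-up sample strings. The check_query name is hypothetical and exists only for this illustration.

```perl
# Hypothetical helper: classify a query the same way the conditional does.
sub check_query {
    my ($query) = @_;

    return "empty"   if ($query eq "");
    return "invalid" if ($query !~ /^(\w+)$/);
    return "ok";
}

foreach my $sample ("gateway", "cgi_bin", "web server", "a;b", "") {
    printf ("%-14s %s\n", "'$sample'", &check_query ($sample));
}
```

Because a space is not a word character, a multiple-word query such as "web server" is rejected. Note also that `$` matches just before a trailing newline, so a query ending in a newline would pass this pattern.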
print "Content-type: text/html", "\n\n";
print "<HTML>", "\n";
print "<HEAD><TITLE>Search Results</TITLE></HEAD>";
print "<BODY>", "\n";
print "<H1>Results of searching for: ", $query, "</H1>";
print "<HR>";
open (SEARCH, "$fgrep -A2 -B2 -i -n -s $query $document_root/* |");
A pipe is opened to read the output of the fgrep command.
We use the following command-line options:
- -A2 and -B2 display two lines after and two lines before each match
- -i indicates case insensitivity
- -n displays the line numbers
- -s instructs fgrep to suppress all error messages.
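The command above is passed to the shell as a single string. As an alternative sketch, the same options can be kept in a list. This assumes Perl 5.8 or later (where open accepts a list of arguments) and an fgrep that supports the -A and -B context options; build_fgrep_command and the document root path are hypothetical.

```perl
# Hypothetical helper: build the fgrep argument list instead of one string.
sub build_fgrep_command {
    my ($fgrep, $query, @files) = @_;
    return ($fgrep, "-A2", "-B2", "-i", "-n", "-s", $query, @files);
}

my $document_root = "/usr/local/etc/httpd/htdocs";   # assumed path, for illustration
my @command = &build_fgrep_command ("/usr/local/bin/fgrep", "gateway",
                                    glob ("$document_root/*"));
print join (" ", @command), "\n";

# With the list form of open, no shell is involved:
#   open (SEARCH, "-|", @command) || die "Cannot run fgrep: $!";
```

Since the query has already been restricted to word characters, the string form is workable here; the list form is simply one way to avoid shell interpretation altogether.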
Here is what the output format looks like:
/abc/cde/filename.abc-57-Previous, previous line
/abc/cde/filename.abc-58-Previous line
/abc/cde/filename.abc:59:Matched line
/abc/cde/filename.abc-60-Following line
/abc/cde/filename.abc-61-Following, following line
As you can see, a match normally produces five lines of output (fewer
near the start or end of a file, more when nearby matches share context
lines). fgrep prints the "--" boundary string between groups of output,
whether the groups come from different files or from separate matches
in the same file.
$count = 0;
$matches = 0;
%accessed_files = ();
Three important variables are initialized. The first one,
count, is used to keep track of the number
of lines returned per match. The matches variable
stores the number of different files that contain the specified
query. And finally, the accessed_files associative
array keeps track of the filenames that contain a match.
We could have used another grep
command that returned just filenames, and then our processing would
be much easier. But I want to display the actual text found, so
I chose more complicated output. Thus, I have to do a little fancy
parsing and text substitution to change the lines of fgrep
output into something that looks good on a web browser. What we
want to display is:
- The name of each file found, with
a hypertext link so the user can go directly to a file
- The text found with the search string highlighted
- A summary of the files found
The following code performs these steps.
while (<SEARCH>) {
if ( ($file, $type, $line) = m|^(/\S+)([\-:])\d+\2(.*)| ) {
The while loop iterates through the data
returned by fgrep. If a line resembles the
format presented above, this block of code is executed. The regular
expression is explained below.
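Reading the pattern from left to right: `^(/\S+)` captures the filename, `([\-:])` captures the separator, `\d+` skips the line number, `\2` requires the same separator again, and `(.*)` captures the text of the line. The following sketch applies the pattern to made-up sample lines; parse_fgrep_line is a hypothetical wrapper used only for this illustration.

```perl
# Hypothetical wrapper around the program's pattern. In list context it
# returns (file, type, text), or an empty list if the line does not match.
sub parse_fgrep_line {
    my ($text) = @_;
    return ($text =~ m|^(/\S+)([\-:])\d+\2(.*)|);
}

@sample = (
    "/abc/cde/filename.abc-58-Previous line",
    "/abc/cde/filename.abc:59:Matched line",
    "--",
);

foreach $entry (@sample) {
    if ( ($file, $type, $line) = &parse_fgrep_line ($entry) ) {
        $kind = ($type eq ":") ? "match" : "context";
        print "$kind: file=$file text=$line\n";
    } else {
        print "boundary or unrecognized line: $entry\n";
    }
}
```

A separator of ":" marks the matched line, while "-" marks a context line. The boundary string does not begin with a slash, so it fails the pattern.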
unless ($count) {
if ( defined ($accessed_files{$file}) ) {
next;
} else {
$accessed_files{$file} = 1;
}
$file =~ s/^$document_root\/(.*)/$1/;
$matches++;
print qq|<A HREF="/$file">$file</A><BR><BR>|;
}
If count is equal to zero (which means
we are either on line 1 or on the line right after the boundary),
the associative array is checked to see if an element exists for
the current filename. If it exists, the next statement skips the
rest of the loop body, and the while loop continues with the next
line of output. If not, the filename is recorded in the array, the
document root is stripped from the pathname, the matches variable
is incremented, and a hypertext link to the relative pathname of
the matched file is printed.
Remember, if there is more than one match per file, fgrep
returns the matched lines as separate entities (separated by the
"--" string). Since we want only one link per filename, the associative
array has to be used to "cache" the filename.
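The caching idea can be seen in isolation in the following sketch, which walks through a made-up sequence of filenames (one per group of output) and prints a link only the first time each name appears.

```perl
# Made-up sequence of filenames, one for each group of fgrep output.
@groups = ("index.html", "faq.html", "index.html", "faq.html", "news.html");

%accessed_files = ();
$matches = 0;

foreach $file (@groups) {
    next if ( defined ($accessed_files{$file}) );   # already linked once

    $accessed_files{$file} = 1;
    $matches++;
    print qq|<A HREF="/$file">$file</A>\n|;
}

print "Files with matches: $matches\n";    # prints: Files with matches: 3
```

In the program itself, the next statement also leaves count at zero, so the remaining lines of a repeated file's group appear to be skipped as well.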
$count++;
$line =~ s/<(([^>]|\n)*)>/&lt;$1&gt;/g;
The count variable is incremented so
that the next time through the loop, the previous block of code
will not be executed, and therefore a hypertext link will not be
created. Also, all HTML tags are "escaped" by
the regular expression illustrated below, so that they appear as
regular text when this dynamic document is displayed. If we did
not escape these tags, the browser would interpret them as regular
HTML statements, and display formatted output.
We could totally remove all tags by using:
$line =~ s/<(([^>]|\n)*)>//g;
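These substitutions act only on complete tags. As a broader sketch, each special character can be escaped individually, which also covers a stray "<" or "&" in the text. The escape_html name is hypothetical and is not part of the program.

```perl
# Hypothetical helper: escape the three characters HTML treats specially.
sub escape_html {
    my ($text) = @_;

    $text =~ s/&/&amp;/g;    # first, so the "&" in later entities is untouched
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print &escape_html ("<B>Tom & Jerry</B>"), "\n";
# prints: &lt;B&gt;Tom &amp; Jerry&lt;/B&gt;
```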
Let's continue with the program:
if ($line =~ /^[^A-Za-z0-9]*$/) {
next;
}
If a line contains no alphanumeric characters (A-Z, a-z, 0-9) at all,
as is the case for a blank line or a row of punctuation, the line will
not be displayed.
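A quick sketch shows how this test treats a few made-up sample lines. The is_displayable name is hypothetical.

```perl
# Hypothetical helper: true if the line has at least one letter or digit.
sub is_displayable {
    my ($line) = @_;
    return ($line =~ /^[^A-Za-z0-9]*$/) ? 0 : 1;
}

foreach $sample ("", "-----", "  * * *  ", "Chapter 9", "42") {
    $verdict = &is_displayable ($sample) ? "displayed" : "skipped";
    print "'$sample' is $verdict\n";
}
```

The first three samples are skipped, and the last two are displayed.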
if ($type eq ":") {
$line =~ s/($query)/<B>$1<\/B>/ig;
}
print $line, "<BR>";
Each line is printed, followed by a line break. For the matched line
(type ":"), the query is first emboldened using the <B> ... </B> HTML tags.
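The highlighting step can be tried on its own with the sketch below. The highlight name is hypothetical, and the \Q...\E quoting is an added precaution (not in the program) in case the query were ever allowed to contain regular expression metacharacters.

```perl
# Hypothetical helper: wrap every case-insensitive occurrence of the
# query in <B> ... </B> tags.
sub highlight {
    my ($line, $query) = @_;

    $line =~ s/(\Q$query\E)/<B>$1<\/B>/ig;
    return $line;
}

print &highlight ("The Gateway and the gateway", "gateway"), "\n";
# prints: The <B>Gateway</B> and the <B>gateway</B>
```

Capturing the matched text in $1 preserves the original capitalization of each occurrence.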
} else {
if ($count) {
print "<HR>";
$count = 0;
}
}
}
This else branch is executed when the line does not match the expected
format, as is the case for the boundary string. If any lines have been
printed for the current group, a horizontal rule is output and the
counter is reset to zero.
print "<P>", "<HR>";
print "Total number of files containing matches: ", $matches, "<BR>";
print "<HR>";
print "</BODY></HTML>", "\n";
close (SEARCH);
}
exit (0);
Finally, the total number of files that contained matches
to the query is displayed, as shown in Figure 9.11.
This is a very simple example of a search/index utility. It
can be quite slow if you need to search hundreds (or thousands)
of documents. However, there are numerous indexing engines (as well
as corresponding CGI gateways) that are extremely fast and powerful.
These include Swish and Glimpse. See Appendix E for information on where
to retrieve these packages.