Archie is a searchable index of the many
FTP sites (and their contents) throughout the world. You can use
an Archie client to search the index for specific files. In this
example, we will use Brendan Kehoe's Archie client software (version
1.3) to connect to an Archie server and search for user-specified
information. We could have written a client ourselves using
the socket library, but that would be a waste of time, since an excellent
one already exists. This Archie gateway is based on ArchiPlex, developed
by Martijn Koster.
#!/usr/local/bin/perl
$webmaster = "Shishir Gundavaram (shishir\@bu\.edu)";
$archie = "/usr/local/bin/archie";
$error = "CGI Archie Gateway Error";
$default_server = "archie.rutgers.edu";
$timeout_value = 180;
The archie variable contains the full
path to the Archie client. Make sure you have an Archie client with
this pathname on your local machine; if you do not have a client,
you have to telnet to a machine with a client and run this program
there.
The default server to search is stored in default_server. It is
preselected in the form, so it is the server used when the user
does not pick a different one.
Finally, timeout_value contains the number
of seconds after which the gateway will return an error message and
terminate. This is so that the user will not have to wait forever
for the search results.
%servers = (
'ANS Net (New York, USA)', 'archie.ans.net',
'Australia', 'archie.au',
'Canada', 'archie.mcgill.ca',
'Finland/Mainland Europe', 'archie.funet.fi',
'Germany', 'archie.th-darmstadt.de',
'Great Britain/Ireland', 'archie.doc.ic.ac.uk',
'Internic Net (New York, USA)', 'ds.internic.net',
'Israel', 'archie.ac.il',
'Japan', 'archie.wide.ad.jp',
'Korea', 'archie.kr',
'New Zealand', 'archie.nz',
'Rutgers University (NJ, USA)', 'archie.rutgers.edu',
'Spain', 'archie.rediris.es',
'Sweden', 'archie.luth.se',
'SURANet (Maryland, USA)', 'archie.sura.net',
'Switzerland', 'archie.switch.ch',
'Taiwan', 'archie.ncu.edu.tw',
'University of Nebraska (USA)', 'archie.unl.edu' );
Some of the Archie servers and their hostnames are stored in
an associative array, keyed by a descriptive name. We will create the
form for this gateway dynamically, listing all of the servers located
in this array.
$request_method = $ENV{'REQUEST_METHOD'};
if ($request_method eq "GET") {
&display_form ();
The form is created and displayed if this program was
accessed with a GET request, which is what the browser sends when the
user follows a link to the gateway or enters its URL.
} elsif ($request_method eq "POST") {
&parse_form_data (*FORM);
$command = &parse_archie_fields ();
All of the form data is decoded and stored in the FORM
associative array. The parse_archie_fields
subroutine uses the form data in constructing a query to be passed
to the Archie client.
$SIG{'ALRM'} = "time_to_exit";
alarm ($timeout_value);
To understand how the SIG array is used, you have to understand
that whenever a signal (such as an interrupt or break) arrives for
a program, the UNIX kernel asks, "What routine should I
call?" The routine that the program wants called is a signal handler.
Perl associates a handler with a signal in the SIG associative array.
As shown above, the traditional way to implement a time-out
is to arrange for an ALRM signal to be delivered after a
specified number of seconds. The first line says that when the alarm
is signaled, the time_to_exit subroutine should be
executed. The Perl alarm call on the second line schedules the
ALRM signal to be sent in the number of seconds represented by the
$timeout_value variable.
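The mechanism can be tried in isolation. The following is a minimal sketch, not part of the gateway: run_with_timeout is a hypothetical helper, and it uses the common eval/die idiom rather than the named handler shown above. A call to select stands in for the slow Archie search.

```perl
use strict;
use warnings;

# Run a code reference, giving up after $seconds. Returns 1 if the code
# finished in time and 0 if the alarm fired first.
# run_with_timeout is a hypothetical helper, not part of the gateway.
sub run_with_timeout {
    my ($seconds, $code) = @_;
    my $finished = eval {
        # This handler runs when the ALRM signal is delivered.
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($seconds);    # schedule the signal
        $code->();
        alarm(0);           # cancel the pending alarm
        1;
    };
    alarm(0);               # make sure no alarm is left pending
    return 1 if $finished;
    die $@ unless $@ eq "timeout\n";    # re-raise errors that are not ours
    return 0;
}

# One task that returns at once, and one that blocks for 5 seconds.
my $quick = run_with_timeout(2, sub { 1 });
my $slow  = run_with_timeout(1, sub { select(undef, undef, undef, 5) });
print "quick finished: $quick, slow finished: $slow\n";
```

Dying inside the handler unwinds the blocked call, which is why the second task is abandoned after about one second instead of five.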
open (ARCHIE, "$archie $command |");
$first_line = <ARCHIE>;
A pipe is opened to the Archie client. The command
variable contains a "query" that specifies various command-line
options, such as search type and Archie server address, as well
as the string to search for. The parse_archie_fields
subroutine makes sure that no shell metacharacters are specified,
since the command variable is "exposed" to
the shell.
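Another way to keep metacharacters harmless is to bypass the shell altogether. The sketch below assumes Perl 5.8 or later, where open accepts a list of arguments for a pipe. read_command is a hypothetical helper, and the running perl binary ($^X) stands in for the Archie client so that the example is self-contained.

```perl
use strict;
use warnings;

# Run a program and collect its output lines without involving a shell.
# With the list form of open, each argument reaches the program exactly
# as given, so characters such as ; or | have no special meaning.
# read_command is a hypothetical helper name.
sub read_command {
    my ($program, @args) = @_;
    open(my $pipe, '-|', $program, @args)
        or die "Cannot run $program: $!\n";
    my @lines = <$pipe>;
    close($pipe);
    chomp @lines;
    return @lines;
}

# Stand-in for the Archie client: a perl child that echoes back its
# first argument. A real gateway would pass the client path instead.
my @output = read_command($^X, '-e', 'print "$ARGV[0]\n"', 'emacs; ls');
print "client saw: $output[0]\n";
```

The child receives the string "emacs; ls" as a single argument; no second command is run. Validating the query is still worthwhile, since the client itself may treat some characters specially.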
if ($first_line =~ /(failed|Usage|WARNING|Timed)/) {
&return_error (500, $error,
"The archie client encountered a bad request.");
} elsif ($first_line =~ /No [Mm]atches/) {
&return_error (500, $error,
"There were no matches for <B>$FORM{'query'}</B>.");
}
If the first line of output from the Archie client contains either an
error or a "No Matches" string, the return_error
subroutine is called to return a more friendly (and verbose) message.
If there is no error, the first line is usually blank.
print "Content-type: text/html", "\n\n";
print "<HTML>", "\n";
print "<HEAD><TITLE>", "CGI Archie Gateway", "</TITLE></HEAD>", "\n";
print "<BODY>", "\n";
print "<H1>", "Archie search for: ", $FORM{'query'}, "</H1>", "\n";
print "<HR>", "<PRE>", "\n";
The usual type of header information is output. The following
lines of code parse the output from the Archie server, and create
hypertext links to the matched files. Here is the typical format
for the Archie server output. It lists each host where a desired
file (in this case, emacs) is found, followed
by a list of all publicly accessible directories containing a file
of that name. Files are listed in long format, so you can see how
old they are and what their sizes are.
Host amadeus.ireq-robot.hydro.qc.ca
Location: /pub
DIRECTORY drwxr-xr-x 512 Dec 18 1990 emacs
Host anubis.ac.hmc.edu
Location: /pub
DIRECTORY drwxr-xr-x 512 Dec 6 1994 emacs
Location: /pub/emacs/packages/ffap
DIRECTORY drwxr-xr-x 512 Apr 5 02:05 emacs
Location: /pub/perl/dist
DIRECTORY drwxr-xr-x 512 Aug 16 1994 emacs
Location: /pub/perl/scripts/text-processing
FILE -rwxrwxrwx 16 Feb 25 1994 emacs
We can enhance this output by putting in hypertext
links. That way, the user can open a connection to any of the hosts
with a click of a button and retrieve the file. Here is the code
to parse this output:
while (<ARCHIE>) {
if ( ($host) = /^Host (\S+)$/ ) {
$host_url = join ("", "ftp://", $host);
s|$host|<A HREF="$host_url">$host</A>|;
<ARCHIE>;
If the line starts with a "Host", the specified host is stored.
A URL to the host is created with the join function, using the
ftp scheme and the hostname--for example, if the hostname were ftp.ora.com,
the URL would be ftp://ftp.ora.com.
Finally, the blank line after this line is discarded.
} elsif (/^\s+Location:\s+(\S+)$/) {
$location = $1;
# \Q...\E makes names such as g++ match literally, not as regex syntax
s|\Q$location\E|<A HREF="${host_url}${location}">$location</A>|;
} elsif ( ($type, $file) = /^\s+(DIRECTORY|FILE).*\s+(\S+)/) {
s|$type|<I>$type</I>|;
s|\Q$file\E|<A HREF="${host_url}${location}/${file}">$file</A>|;
} elsif (/^\s*$/) {
print "<HR>";
}
print;
}
One subtle feature of regular expressions is shown here: They
are "greedy," eating up as much text as they can. The expression
(DIRECTORY|FILE).*\s+ means
match DIRECTORY or FILE, then
match as many characters as you can up to whitespace. There are
chunks of whitespace throughout the line, but the .* takes up everything
up to the last whitespace. This leaves just the word "emacs" to
match the final parenthesized expression (\S+).
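The effect is easy to see by running the same pattern next to a non-greedy variant. This is a small illustrative sketch; the sample line is modeled on the listing above, and the .*? form is shown only for contrast (the gateway does not use it).

```perl
use strict;
use warnings;

my $line = "      DIRECTORY drwxr-xr-x         512  Dec 18 1990   emacs";

# Greedy: .* runs to the last stretch of whitespace, so the second
# group captures the file name.
my ($type, $file) = $line =~ /^\s+(DIRECTORY|FILE).*\s+(\S+)/;

# Non-greedy: .*? stops at the first whitespace, so the second group
# captures the permission string instead.
my ($type2, $first) = $line =~ /^\s+(DIRECTORY|FILE).*?\s+(\S+)/;

print "greedy:     $type / $file\n";
print "non-greedy: $type2 / $first\n";
```

The first print shows DIRECTORY / emacs, and the second shows DIRECTORY / drwxr-xr-x.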
The rest of the lines are read and parsed in the same manner
and displayed (see Figure 10.1). If the line is empty, a horizontal rule is output--to indicate the end of each entry.
$SIG{'ALRM'} = "DEFAULT";
close (ARCHIE);
print "</PRE>";
print "</BODY></HTML>", "\n";
Finally, the handler for the ALRM signal is reset to its default, and
the file handle is closed.
} else {
&return_error (500, $error, "Server uses unspecified method");
}
exit (0);
Remember how we set the SIG array so that a signal would cause
the time_to_exit subroutine to run? Here it
is:
sub time_to_exit
{
close (ARCHIE);
&return_error (500, $error,
"The search was terminated after $timeout_value seconds.");
}
When this subroutine runs, it means that the 180 seconds that
were allowed for the search have passed, and that it is time to
terminate the script. Generally, the Archie server returns the matched
FTP sites and their files quickly, but there are times when it can
be queued up with requests. In such a case, it is wise to terminate
the script, rather than let the user wait for a long period of time.
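One detail is worth keeping in mind when adapting this handler: in Perl, closing a file handle that was opened as a pipe waits for the child process to exit. The following sketch shows one possible way to stop the child first. It assumes Perl 5.8 or later, first_line_or_timeout is a hypothetical helper, and a perl child that sleeps stands in for a slow Archie client.

```perl
use strict;
use warnings;

# Read the first line from a child process, giving up after $timeout
# seconds. Returns the line, or undef on a timeout (or if the child
# produced no output). first_line_or_timeout is a hypothetical name.
sub first_line_or_timeout {
    my ($timeout, @command) = @_;
    my $pid = open(my $pipe, '-|', @command)
        or die "Cannot run $command[0]: $!\n";

    my $line = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($timeout);
        my $first = <$pipe>;    # blocks until the child writes or exits
        alarm(0);
        $first;
    };
    my $error = $@;
    alarm(0);

    if ($error) {
        die $error unless $error eq "timeout\n";
        kill('TERM', $pid);     # stop the child so that close can return
        close($pipe);
        return;
    }
    close($pipe);
    return $line;
}

# Stand-in for a slow client: a perl child that sleeps for 30 seconds.
my $start  = time;
my $result = first_line_or_timeout(1, $^X, '-e', 'sleep 30');
my $waited = time - $start;
print defined $result ? "got a line\n" : "timed out\n";
```

Because open returns the process ID of the child for a piped open, the handler's caller can signal that process before closing the pipe, and the script returns after roughly one second rather than thirty.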
Now we have to build a command line that the Archie client recognizes.
This is the job of the parse_archie_fields subroutine:
sub parse_archie_fields
{
local ($query, $server, $type, $address, $status, $options);
$status = 1;
$query = $FORM{'query'};
$server = $FORM{'server'};
$type = $FORM{'type'};
if ($query !~ /^\w+$/) {
&return_error (500, $error,
"Search query contains invalid characters.");
If the query field contains non-alphanumeric characters (characters
other than A-Z, a-z, 0-9, _), an error message is output.
} else {
foreach $address (keys %servers) {
if ($server eq $address) {
$server = $servers{$address};
$status = 0;
}
}
The foreach loop iterates through the keys of the servers
associative array. If the user-specified server matches a name
contained in the array, the corresponding hostname is stored in the
server variable, and the status is set to zero.
if ($status) {
&return_error (500, $error, "Please select a valid archie host.");
A nonzero status indicates that the user specified an
invalid address for the Archie server.
} else {
if ($type eq "cs_sub") {
$type = "-c";
} elsif ($type eq "ci_sub") {
$type = "-s";
} else {
$type = "-e";
}
If the user selected "Case Sensitive Substring", the "-c"
switch is used. The "-s" switch indicates a "Case Insensitive Substring".
For any other value, including the default "Exact" selection, the "-e"
switch ("Exact Match") is used.
$options = "-h $server $type $query";
return ($options);
}
}
}
A string containing all of the options is created, and then
returned to the main program.
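The same checks can be written more compactly with hash lookups. The following is a sketch under stated assumptions, not the gateway's own code: build_archie_args and the switches table are hypothetical, the server table is a two-entry subset, the pattern uses \A and \z (slightly stricter than ^ and $), and the arguments are returned as a list instead of a single string.

```perl
use strict;
use warnings;

# Illustrative subset of the gateway's server table.
my %servers = (
    'Rutgers University (NJ, USA)' => 'archie.rutgers.edu',
    'Australia'                    => 'archie.au',
);

# Form values for the search type, mapped to client switches.
my %switches = ( cs_sub => '-c', ci_sub => '-s' );

# Return the client arguments as a list, or an empty list when the
# form data is unacceptable. build_archie_args is a hypothetical name.
sub build_archie_args {
    my (%form) = @_;
    my $query = defined $form{query}  ? $form{query}  : '';
    my $name  = defined $form{server} ? $form{server} : '';
    my $type  = defined $form{type}   ? $form{type}   : '';

    return () unless $query =~ /\A\w+\z/;        # only A-Z, a-z, 0-9, _
    return () unless exists $servers{$name};     # direct lookup, no loop

    # Anything other than the two substring types means exact match.
    my $switch = exists $switches{$type} ? $switches{$type} : '-e';
    return ('-h', $servers{$name}, $switch, $query);
}

my @args = build_archie_args(query => 'emacs', server => 'Australia',
                             type  => 'ci_sub');
my @bad  = build_archie_args(query => 'emacs; ls', server => 'Australia');
print "@args\n";
print scalar(@bad), " arguments for the rejected query\n";
```

The first call yields -h archie.au -s emacs, and the second yields an empty list because the query contains characters outside the permitted set.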
Our last task is a simple one--to create a form that allows
the user to enter a query, using the display_form subroutine. The
program creates the form dynamically because some information is
subject to change (i.e., the list of servers).
sub display_form
{
local ($archie);
print <<End_of_Archie_One;
Content-type: text/html

<HTML>
<HEAD><TITLE>Gateway to Internet Information Servers</TITLE></HEAD>
<BODY>
<H1>CGI Archie Gateway</H1>
<HR>
<FORM ACTION="/cgi-bin/archie.pl" METHOD="POST">
Please enter a string to search for: <BR>
<INPUT TYPE="text" NAME="query" SIZE=40>
<P>
What archie server would you like to use (<B>please</B>, be considerate
and use the one that is closest to you): <BR>
<SELECT NAME="server" SIZE=1>
End_of_Archie_One
foreach $archie (sort keys %servers) {
if ($servers{$archie} eq $default_server) {
print "<OPTION SELECTED>", $archie, "\n";
} else {
print "<OPTION>", $archie, "\n";
}
}
This loop iterates through the associative array and displays
all of the server names.
print <<End_of_Archie_Two;
</SELECT>
<P>
Please select a type of search to perform: <BR>
<INPUT TYPE="radio" NAME="type" VALUE="exact" CHECKED>Exact<BR>
<INPUT TYPE="radio" NAME="type" VALUE="ci_sub">Case Insensitive Substring<BR>
<INPUT TYPE="radio" NAME="type" VALUE="cs_sub">Case Sensitive Substring<BR>
<P>
<INPUT TYPE="submit" VALUE="Start Archie Search!">
<INPUT TYPE="reset" VALUE="Clear the form">
</FORM>
<HR>
</BODY>
</HTML>
End_of_Archie_Two
}
The dynamic form looks like that in Figure 10.2.
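The loop in display_form that emits the server list can be isolated and checked on its own. This is a minimal sketch: server_options is a hypothetical helper that returns the lines instead of printing them, and the three servers passed to it are an illustrative subset of the table.

```perl
use strict;
use warnings;

# Build one <OPTION> line per server, sorted by name, and mark the
# entry whose hostname equals the default as selected.
# server_options is a hypothetical helper name.
sub server_options {
    my ($default_host, %servers) = @_;
    my @lines;
    foreach my $name (sort keys %servers) {
        my $tag = $servers{$name} eq $default_host
                ? '<OPTION SELECTED>'
                : '<OPTION>';
        push @lines, "$tag$name";
    }
    return @lines;
}

my @options = server_options(
    'archie.rutgers.edu',
    'Sweden'                       => 'archie.luth.se',
    'Rutgers University (NJ, USA)' => 'archie.rutgers.edu',
    'Australia'                    => 'archie.au',
);
print "$_\n" for @options;
```

The entries come out in alphabetical order (Australia, Rutgers University, Sweden), with only the Rutgers line carrying SELECTED.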
This was a rather simple program because we did not have to
deal with the Archie server directly, but rather through a pre-existing
client. Now, we will look at an example that is a little bit more
complicated.