NNTP
(Network News Transfer Protocol) is the most popular protocol used
to transmit Usenet news over the Internet. It lets the receiving
(client) system tell the sending (server) system which newsgroups
to send, and which articles from each group. NNTP
accepts commands in a fairly simple format. It sends back a stream
of text consisting of the articles posted and occasional status
information.
This CGI gateway communicates with an NNTP
server directly by using socket I/O. The program displays lists
of newsgroups and articles for the user to choose from. You will
be able to read news from the specified newsgroups in a threaded
fashion (all the replies to each article are grouped together).
#!/usr/local/bin/perl
require "sockets.pl";
$webmaster = "Shishir Gundavaram (shishir\@bu\.edu)";
$error = "CGI NNTP Gateway Error";
%groups = ( 'cgi', 'comp.infosystems.www.authoring.cgi',
'html', 'comp.infosystems.www.authoring.html',
'images', 'comp.infosystems.www.authoring.images',
'misc', 'comp.infosystems.www.authoring.misc',
'perl', 'comp.lang.perl.misc' );
The groups associative array contains
a list of the newsgroups that will be displayed when the form is
dynamically created.
$all_groups = '(cgi|html|images|misc|perl)';
The all_groups variable contains a regular
expression listing all of the keys of the groups
associative array. This will be used to ensure that a valid newsgroup
is specified by the user.
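Because the pattern has to list exactly the keys of groups, it can also be derived from the array instead of being typed by hand. The following is a minimal sketch of that idea, assuming the same five groups; quotemeta is used only as a precaution in case a key ever contains a regular expression metacharacter.

```perl
use strict;
use warnings;

my %groups = ( 'cgi',    'comp.infosystems.www.authoring.cgi',
               'html',   'comp.infosystems.www.authoring.html',
               'images', 'comp.infosystems.www.authoring.images',
               'misc',   'comp.infosystems.www.authoring.misc',
               'perl',   'comp.lang.perl.misc' );

# Build "(key1|key2|...)" from the keys so the pattern and the array
# cannot drift apart. Sorting only makes the result predictable.
my $all_groups = '(' . join ('|', map { quotemeta ($_) } sort keys %groups) . ')';

print $all_groups, "\n";    # prints (cgi|html|images|misc|perl)
```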
$nntp_server = "nntp.bu.edu";
The NNTP server is set to "nntp.bu.edu".
If you do not want users from domains other than "bu.edu" to access
this form, you can set up a simple authentication scheme like this:
$allowed_domain = "bu.edu";
$remote_host = $ENV{'REMOTE_HOST'};
($remote_domain) = ($remote_host =~ /([^.]+\.[^.]+)$/);
if ($remote_domain ne $allowed_domain) {
&return_error (500, $error, "Sorry! You are not allowed to read news!");
}
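The regular expression used above extracts the last two components
of the host name (for example, "bu.edu" from "www.bu.edu"). If
REMOTE_HOST contains a numeric IP address instead of a name, the
expression yields the last two numbers of the address, so the
comparison fails and access is refused.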
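To see what the expression returns for different kinds of input, it can be tried on a few sample values. This is only a sketch: the host names and the address are made-up examples, and last_two_components is a hypothetical helper that wraps the same regular expression.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the last two dot-separated components
# of a host string, or undef if there are fewer than two.
sub last_two_components
{
    my ($host) = @_;
    my ($domain) = ($host =~ /([^.]+\.[^.]+)$/);
    return ($domain);
}

foreach my $host ('www.bu.edu', 'bu.edu', '128.197.27.7', 'localhost') {
    my $domain = last_two_components ($host);
    print $host, " -> ", defined ($domain) ? $domain : "(no match)", "\n";
}
```

The numeric address produces "27.7", which can never equal an allowed domain, and a name without a dot produces no match at all. In both cases the authentication check treats the client as an outsider.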
Or, you can allow multiple domains like this:
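$allowed_domains = '(bu\.edu|mit\.edu|perl\.com)';
$remote_host = $ENV{'REMOTE_HOST'};
# The dots are escaped and the match must start at a component
# boundary, so a host such as "notbu.edu" is refused.
if ($remote_host !~ /(^|\.)$allowed_domains$/o) {
&return_error (500, $error, "Sorry! You are not allowed to read news!");
}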
To continue with the program:
&parse_form_data (*NEWS);
$group_name = $NEWS{'group'};
$article_number = $NEWS{'article'};
There is no form front end to this CGI gateway. Instead, all
parameters are passed as query information (GET
method). If you access this application without a query, a document
listing all the newsgroups is displayed. Once you select a newsgroup
from this list, the program is invoked again, this time with a query
that specifies the newsgroup you want. For instance, if you want
the newsgroup whose key is "images", this query is passed to the
program:
https://some.machine/cgi-bin/nntp.pl?group=images
The groups associative array associates
the string "images" with the actual newsgroup name. This is a more
secure way of handling things--much like the way the Archie server
names were passed instead of the actual IP names in the previous
example. If the program receives a query like the one above, it
displays a list of the articles in the newsgroup. When the user
chooses an article, the query information will look like this:
https://some.machine/cgi-bin/nntp.pl?group=images&article=18721
This program will then display the article.
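The way such a query string turns into the group and article fields can be sketched as follows. This is not the parse_form_data subroutine used by the program; parse_query is a hypothetical stand-in, and the query is passed as a literal here, whereas a CGI program would read it from the QUERY_STRING environment variable.

```perl
use strict;
use warnings;

# Hypothetical helper: splits "key=value&key=value" into a hash and
# decodes the URL encoding of each key and value.
sub parse_query
{
    my ($query) = @_;
    my %fields;

    foreach my $pair (split (/&/, $query)) {
        next unless length ($pair);
        my ($key, $value) = split (/=/, $pair, 2);
        $value = '' unless defined ($value);
        foreach ($key, $value) {
            tr/+/ /;                                  # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr (hex ($1))/ge;    # decode %XX sequences
        }
        $fields{$key} = $value;
    }
    return (%fields);
}

my %NEWS = parse_query ('group=images&article=18721');
print "group = $NEWS{'group'}, article = $NEWS{'article'}\n";
```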
if ($group_name =~ /^$all_groups$/o) {
$selected_group = $groups{$group_name};
This block of code will be executed only if the group
field consists of a valid newsgroup name, as stored in all_groups.
The actual newsgroup name is stored in the selected_group
variable.
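How strictly the group field is matched matters, because the field is then used as a key into groups. A pattern anchored with ^ and $ accepts only the bare keys, while one bounded by \b also accepts a key embedded in a longer string. The short check below illustrates the difference; the test strings are invented.

```perl
use strict;
use warnings;

my $all_groups = '(cgi|html|images|misc|perl)';

foreach my $name ('images', 'cgi', 'not images', 'perl5', '') {
    # Word-boundary match: the key may appear anywhere in the string.
    my $bounded  = ($name =~ /\b$all_groups\b/) ? "yes" : "no";
    # Anchored match: the string must be exactly one of the keys.
    my $anchored = ($name =~ /^$all_groups$/)   ? "yes" : "no";
    printf ("%-14s bounded: %-3s  anchored: %s\n", "'$name'", $bounded, $anchored);
}
```

Only the anchored form rejects "not images", a value for which no entry exists in the groups array.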
&open_connection (NNTP, $nntp_server, "nntp") ||
&return_error (500, $error, "Could not connect to NNTP server.");
&check_nntp ();
A socket is opened to the NNTP server.
The server usually runs on port 119. The check_nntp
subroutine checks the header information that is output by the server
upon connection. If the server issues any error messages, the script
terminates.
($first, $last) = &set_newsgroup ($selected_group);
The NNTP server keeps track of all the
articles in a newsgroup by numbering them in ascending order, starting
at some arbitrary number. The set_newsgroup
subroutine returns the identification number for the first and last
articles.
if ($article_number) {
if (($article_number < $first) || ($article_number > $last)) {
&return_error (500, $error,
"The article number you specified is not valid.");
} else {
&show_article ($selected_group, $article_number);
}
If the user selected an article from the list that was dynamically
generated when a newsgroup is selected, this branch of code is executed.
The article number is checked to make sure that it lies within the
valid range. You might wonder why we need to check this, since the
list that is presented to the user is based on the range generated
by the set_newsgroup subroutine. The reason
for this is that the NNTP server lets articles
expire periodically, and articles are sometimes deleted by their
author. If sufficient time passes between the time the list is displayed
and the time the user makes a selection, the specified article number
could be invalid. In addition, I like to handle the possibility
that a user hardcoded a query.
} else {
&show_all_articles ($group_name, $selected_group, $first, $last);
}
If no article is specified, which happens when the user selects
a newsgroup from the main HTML document, the
show_all_articles subroutine is called to display
a list of all the articles for the selected newsgroup.
print NNTP "quit", "\n";
&close_connection (NNTP);
Finally, the quit command is sent to
the NNTP server, and the socket is closed.
} else {
&display_newsgroups ();
}
exit (0);
If this program is accessed without any query information,
or if the specified newsgroup is not among the list stored in the
groups associative array, the
display_newsgroups
subroutine is called to output the valid newsgroups.
The following print_header subroutine displays a MIME
header, and some HTML to display the title and
the header.
sub print_header
{
local ($title) = @_;
print "Content-type: text/html", "\n\n";
print "<HTML>", "\n";
print "<HEAD><TITLE>", $title, "</TITLE></HEAD>", "\n";
print "<BODY>", "\n";
print "<H1>", $title, "</H1>", "\n";
print "<HR>", "<BR>", "\n";
}
The print_footer subroutine outputs the webmaster's address.
sub print_footer
{
print "<HR>", "\n";
print "<ADDRESS>", $webmaster, "</ADDRESS>", "\n";
print "</BODY></HTML>", "\n";
}
The escape subroutine "escapes" all characters except for
letters, digits, underscores, and whitespace. The main reason for this
is so that "special" characters are displayed properly.
sub escape
{
local ($string) = @_;
$string =~ s/([^\w\s])/sprintf ("&#%d;", ord ($1))/ge;
return ($string);
}
For example, if an article in a newsgroup contains:
From: joe@test.net (Joe Test)
Subject: I can't get the <H1> headers to display correctly
The browser will actually interpret the "<H1>", and the
rest of the document will be messed up. This subroutine escapes
the text so that it looks like this:
From&#58; joe&#64;test&#46;net &#40;Joe Test&#41;
Subject&#58; I can&#39;t get the &#60;H1&#62; headers to display correctly
A web client can interpret any string in the form &#n;,
where n is the ASCII code of the character. This
might slow down the display slightly, but it is much safer than
escaping specific characters only.
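The effect is easy to check on a string that contains markup. The sketch below repeats the escape subroutine (written with my instead of local) and applies it to an invented piece of HTML.

```perl
use strict;
use warnings;

sub escape
{
    my ($string) = @_;
    # Replace every character that is not a word character or
    # whitespace with its numeric character reference.
    $string =~ s/([^\w\s])/sprintf ("&#%d;", ord ($1))/ge;
    return ($string);
}

print escape ('<A HREF="x">R&D</A>'), "\n";
```

The output is &#60;A HREF&#61;&#34;x&#34;&#62;R&#38;D&#60;&#47;A&#62;, which a browser shows as the original text instead of interpreting it as a link.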
The
check_nntp subroutine reads the output from the NNTP
server until the return status is either a success (200 or 201)
or a failure (4xx or 5xx). You might have noticed that these status
codes are very similar to HTTP status codes.
In fact, many text-based Internet protocols, such as SMTP and FTP,
use three-digit status codes of this kind.
sub check_nntp
{
while (<NNTP>) {
if (/^(200|201)/) {
last;
} elsif (/^[45]\d\d/) {
&return_error (500, $error, "The NNTP server returned an error.");
}
}
}
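When testing for the status, the whole alternative has to be anchored to the start of the line. In a pattern such as /^4|5\d+/ the ^ applies only to the first alternative, so a 5 followed by digits anywhere in the line would count as a failure. The sketch below classifies a few sample lines; status_class is a hypothetical helper, and the text after each status code is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: classifies a server line by its leading
# three-digit status code.
sub status_class
{
    my ($line) = @_;
    return ("success") if ($line =~ /^20[01]\b/);
    return ("failure") if ($line =~ /^[45]\d\d\b/);
    return ("other");
}

foreach my $line ("200 server ready, posting allowed",
                  "201 server ready, no posting allowed",
                  "400 service discontinued",
                  "502 permission denied",
                  "100 help text follows, 512 lines") {
    printf ("%-8s %s\n", status_class ($line), $line);
}
```

The last line is reported as "other", although it contains "512" further along.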
The set_newsgroup subroutine returns the first and last article
numbers for the newsgroup.
sub set_newsgroup
{
local ($group) = @_;
local ($group_info, $status, $first_post, $last_post);
print NNTP "group ", $group, "\n";
The group
command is sent to the NNTP server. In response
to this, the server sets its current newsgroup to the one specified,
and outputs information in the following format:
group comp.infosystems.www.authoring.cgi
211 1289 4776 14059 comp.infosystems.www.authoring.cgi
The first column indicates the status of the operation (211 being
a success). The total number of articles, the first and last
articles, and the newsgroup name constitute the rest of the line,
respectively. As you can see, the number of articles is not equal
to the numerical difference of the first and last articles. This is
due to article expiration and deletion (as mentioned above).
$group_info = <NNTP>;
($status, $first_post, $last_post) = (split (/\s+/, $group_info))[0, 2, 3];
The server output is split on whitespace, and the first, third,
and fourth elements are stored in status,
first_post,
and last_post, respectively.
Remember, arrays are zero based; the first element has index zero,
not one.
if ($status != 211) {
&return_error (500, $error,
"Could not get group information for $group.");
} else {
return ($first_post, $last_post);
}
}
If the status is not 211, an error message is displayed. Otherwise,
the first and last article numbers are returned.
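The parsing step can be tried without a server by feeding it the sample response shown earlier. This is a sketch only: the string is typed in by hand, with the carriage return and line feed that a server would append.

```perl
use strict;
use warnings;

# The sample "group" response from the text, as it would arrive
# from the socket.
my $group_info = "211 1289 4776 14059 comp.infosystems.www.authoring.cgi\r\n";

# Fields 0, 2, and 3 are the status and the first and last article
# numbers; field 1 is the article count and field 4 the group name.
my ($status, $first_post, $last_post) = (split (/\s+/, $group_info))[0, 2, 3];

print "status = $status, first = $first_post, last = $last_post\n";
```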
In the show_article subroutine, the actual news article is
retrieved and printed.
sub show_article
{
local ($group, $number) = @_;
local ($useful_headers, $header_line);
$useful_headers = '(From:|Subject:|Date:|Organization:)';
print NNTP "head $number", "\n";
$header_line = <NNTP>;
The head command displays the headers
for the specified article. Here is the format of the NNTP
output:
221 14059 <[email protected]> head
Path: news.bu.edu!decwrl!nntp.test.net!usenet
From: joe@test.net (Joe Test)
Newsgroups: comp.infosystems.www.authoring.cgi
Subject: I can't get the <H1> headers to display correctly
Date: Thu, 05 Oct 1995 05:19:03 GMT
Organization: Joe's Test Net
Lines: 17
Message-ID: <[email protected]>
Reply-To: [email protected]
NNTP-Posting-Host: my.news.test.net
X-Newsreader: Joe Windows Reader v1.28
.
The first line contains the status, the article number, the
article identification, and the NNTP command,
respectively. The status of 221
indicates success. All of the other lines constitute the various
article headers, and are based on how and where the article was
posted. The list of headers ends with a line that contains only
the "." character.
if ($header_line =~ /^221/) {
&print_header ($group);
print "<PRE>", "\n";
If the server returns a success status of 221, the print_header
subroutine is called to display the MIME header,
followed by the usual HTML.
while (<NNTP>) {
if (/^$useful_headers/) {
$_ = &escape ($_);
print "<B>", $_, "</B>";
} elsif (/^\.\s*$/) {
last;
}
}
This loop iterates through the header body, and escapes and
displays the From, Subject, Date, and Organization headers.
print "\n";
print NNTP "body $number", "\n";
<NNTP>;
If everything is successful up to this point, the body
command is sent to the server. In response, the server outputs the
body of the article in the following format:
body 14059
222 14059 <[email protected]> body
I am trying to display headers using the <H1> tag, but it does not
seem to be working. What should I do? Please help.
Thanks in advance,
-Joe
.
There is no need to check the status of this command if the
head command executed successfully. The server
returns a status of 222
to indicate success.
while (<NNTP>) {
last if (/^\.\s*$/);
$_ = &escape ($_);
print;
}
The while loop iterates through the body, escapes all the
lines, and displays them. If the line starts with a period and contains
nothing else but whitespace, the loop terminates.
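NNTP sends any text line that begins with a period with that period doubled, so that it cannot be mistaken for the terminating "." line. The loop above prints such lines with the extra period intact. A minimal sketch of stripping it follows, with invented sample lines standing in for the socket.

```perl
use strict;
use warnings;

# Sample lines standing in for what <NNTP> would return (invented).
my @from_server = ("Thanks in advance,\r\n",
                   "..sig files may start with a period\r\n",
                   "-Joe\r\n",
                   ".\r\n",
                   "this line belongs to the next response\r\n");
my @body;

foreach my $line (@from_server) {
    last if ($line =~ /^\.\s*$/);    # a lone "." ends the article
    my $text = $line;
    $text =~ s/^\.\././;             # undo the doubled leading period
    push (@body, $text);
}

print @body;
```

In the gateway itself, the same substitution would go just before the call to escape.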
print "</PRE>", "\n";
&print_footer ();
} else {
&return_error (500, $error,
"Article number $number could not be retrieved.");
}
}
If the specified article is not found, an error message is
displayed.
The following subroutine reads all of the articles for a particular
group into memory, threads them--all replies to a specific article
are grouped together for reading convenience--and displays the article
numbers and subject lines.
sub show_all_articles
{
local ($id, $group, $first_article, $last_article) = @_;
local ($this_script, %all, $count, @numbers, $article,
$subject, @threads, $query);
$this_script = $ENV{'SCRIPT_NAME'};
$count = 0;
This is the most complicated (but the most interesting) part
of the program. Before your eyes, you will see a nice web interface
grow from some fairly primitive output from the NNTP
server.
print NNTP "xhdr subject $first_article-$last_article", "\n";
<NNTP>;
The xhdr subject command lists all the articles in the specified range
in the following format:
xhdr subject 4776-14059
221 subject fields follow
4776 Re: CGI Scripts (guestbook ie)
4831 Re: Access counter for CERN server
12769 Re: Problems using sendmail from Perl script
12770 File upload, Frames and BSCW
-
- (More Articles)
-
.
The first line contains the status. Again, there is no need
to check this, as we know the newsgroup exists. Each article is
listed with its number and subject.
&print_header ("Newsgroup: $group");
print "<UL>", "\n";
while (<NNTP>) {
last if (/^\.\s*$/);
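($article, $subject) = split (/\s+/, $_, 2);
$subject =~ s/^\s+//;
$subject =~ s/\s+$//;
$subject =~ s/^[Rr][Ee]:\s*//;
$subject = &escape ($subject);
The loop iterates through all of the subjects. The split
command separates each entry into the article number and subject.
Leading and trailing whitespace, as well as "Re:" at the beginning of
the line, are removed from the subject, so that replies end up under
the same key as the original article. The subject is escaped only
after this step, because escape would turn the colon in "Re:" into
&#58; and the pattern would then never match.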
if (defined ($all{$subject})) {
$all{$subject} = join ("-", $all{$subject}, $article);
} else {
$count++;
$all{$subject} = join ("\0", $count, $article);
}
}
This is responsible for threading the articles. Each new subject
is stored in an associative array, %all, keyed
by the subject itself. The $count variable
gives a unique number to start each value in the array. If the subject
already exists, the article number is simply appended to the end
of the element with the same subject. For example, if the subjects
look like this:
2020 What is CGI?
2026 How do you create counters?
2027 Please help with file locking!!!
2029 Re: What is CGI?
2030 Re: What is CGI?
2047 Re: How do you create counters?
.
.
.
Then this is how the associative array will look (character escapes
omitted for readability):
$all{'What is CGI?'} = join ("\0", 1, "2020-2029-2030");
$all{'How do you create counters?'} = join ("\0", 2, "2026-2047");
$all{'Please help with file locking!!!'} = join ("\0", 3, "2027");
Note that we assigned a $count of 1 to
the first thread we see ("What is CGI?"), 2 to the second thread,
and so on. Later we sort by these numbers, so the user will see
threads in the order that they came in to the newsgroup.
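The threading step can be run on its own against the sample subjects. The sketch below leaves out the HTML escaping and reads from a list instead of the socket, so it shows only how the values are built and in what order the threads come out.

```perl
use strict;
use warnings;

# Sample xhdr lines, taken from the example in the text.
my @lines = ("2020 What is CGI?",
             "2026 How do you create counters?",
             "2027 Please help with file locking!!!",
             "2029 Re: What is CGI?",
             "2030 Re: What is CGI?",
             "2047 Re: How do you create counters?");
my (%all, $count);
$count = 0;

foreach my $line (@lines) {
    my ($article, $subject) = split (/\s+/, $line, 2);
    $subject =~ s/\s+$//;
    $subject =~ s/^[Rr][Ee]:\s*//;

    if (defined ($all{$subject})) {
        # A reply: append its number to the existing thread.
        $all{$subject} = join ("-", $all{$subject}, $article);
    } else {
        # A new thread: record the order in which it was first seen.
        $count++;
        $all{$subject} = join ("\0", $count, $article);
    }
}

# Print the threads in the order in which they were first seen.
foreach my $subject (sort { (split (/\0/, $all{$a}))[0] <=>
                            (split (/\0/, $all{$b}))[0] } keys (%all)) {
    my ($order, $articles) = split (/\0/, $all{$subject});
    print "$order  $subject  $articles\n";
}
```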
@numbers = sort by_article_number keys (%all);
What you see here
is a common Perl technique for sorting. The sort command invokes
a subroutine repeatedly (in this case, one that I wrote called by_article_number).
Using a fast algorithm, it passes pairs of keys from the %all
array to the subroutine.
foreach $subject (@numbers) {
$article = (split("\0", $all{$subject}))[1];
The loop iterates through all of the subjects. The list of
article numbers for each subject is stored in $article.
Thus, the $article variable for "What is CGI?"
would be:
2020-2029-2030
Now, we work on the string of articles.
@threads = split (/-/, $article);
The string containing all of the articles for a particular
subject is split on the "-" delimiter and stored in the threads
array.
foreach (@threads) {
$query = join ("", $this_script, "?", "group=", $id,
"&", "article=", $_);
print qq|<LI><A HREF="$query">$subject</A>|, "\n";
}
}
print "</UL>", "\n";
&print_footer ();
}
The loop iterates through each article number (or thread),
and builds a hypertext link containing the newsgroup name and the
article number (see Figure 10.3).
The following is a simple subroutine that compares two values
of an associative array.
sub by_article_number
{
$all{$a} <=> $all{$b};
}
This statement is identical to the following:
if ($all{$a} < $all{$b}) {
return (-1);
} elsif ($all{$a} == $all{$b}) {
return (0);
} elsif ($all{$a} > $all{$b}) {
return (1);
}
The $a and $b variables hold
two keys of the associative array, which sort sets before each call.
Perl uses this logic to compare the values stored under those keys
and to put the keys in order.
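The numeric comparison works on values such as "1", a null character, and a list of articles because Perl uses only the leading digits when it converts a string to a number. A sketch that makes this explicit, and that shows why a string comparison would not do, is given below; by_thread_order is a hypothetical name, and the subjects are invented.

```perl
use strict;
use warnings;

my %all = ('Thread seen second' => join ("\0", 2,  "2026-2047"),
           'Thread seen tenth'  => join ("\0", 10, "2101"),
           'Thread seen first'  => join ("\0", 1,  "2020-2029-2030"));

# sort sets $a and $b to two keys of %all; the count stored in front
# of the null character decides their order.
sub by_thread_order
{
    my ($count_a) = split (/\0/, $all{$a});
    my ($count_b) = split (/\0/, $all{$b});
    return ($count_a <=> $count_b);
}

my @numeric = sort by_thread_order keys (%all);
my @string  = sort { $all{$a} cmp $all{$b} } keys (%all);

print "numeric: ", join (", ", @numeric), "\n";
print "string:  ", join (", ", @string), "\n";
```

The numeric sort gives first, second, tenth. The string sort puts the tenth thread before the second, because "10" sorts before "2" as text.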
The display_newsgroups subroutine creates a dynamic HTML
document that lists all the newsgroups contained in the groups
associative array.
sub display_newsgroups
{
local ($script_name, $keyword, $newsgroup, $query);
&print_header ("CGI NNTP Gateway");
$script_name = $ENV{'SCRIPT_NAME'};
print "<UL>", "\n";
foreach $keyword (keys %groups) {
$newsgroup = $groups{$keyword};
$query = join ("", $script_name, "?", "group=", $keyword);
print qq|<LI><A HREF="$query">$newsgroup</A>|, "\n";
}
print "</UL>";
&print_footer ();
}
Each newsgroup is listed as an item in an unordered list, with the query
consisting of the specific key from the associative array. Remember,
the qq|...| notation is exactly like the "..." notation, except
for the fact that "|" is the delimiter, instead of the double quotation
marks.