Section 12.4
Programming with Collection Classes
IN THIS SECTION, we finish the
chapter and the book by looking at a few programming examples
that use the collection and map classes which were discussed
in Sections 1, 2, and 3.
Symbol Tables
We begin with a straightforward but important application
of Maps. When a compiler reads the source code of a program,
it encounters definitions of variables, subroutines, and classes.
The names of these things can be used later in the program. The compiler has to
remember the definition of each name, so that it can recognize
the name and apply the definition when the name is encountered
later in the program. This is a natural application for a
Map. The name can be used as a key in the map. The
value associated to the key is the definition of the name, encoded
somehow as an Object. A Map that is used in this way is
called a symbol table.
In a compiler, the values in a symbol table can be quite
complicated, since the compiler has to deal with names for
various sorts of things, and it needs a different type of
information for each different type of name. We will keep things
simple by looking at a symbol table in another context.
Suppose that we want a program that can evaluate expressions
entered by the user, and suppose that the expressions can
contain variables, in addition to operators, numbers, and
parentheses. For this to make sense, we need some way of
assigning values to variables. When a variable is used in an
expression, we need to retrieve the variable's value. A
symbol table can be used to store the variables' values.
The keys for the symbol table are variable names. The value
associated with a key is the value of that variable.
To demonstrate the idea, we'll use a rather simple-minded
program in which the user types commands such as:
let x = 3 + 12
print 2 + 2
print 10*x +17
let rate = 0.06
print 1000*(1+rate)
The only two commands that the program understands are "print"
and "let". When a "print" command is executed, the computer evaluates
the expression and displays the value.
If the expression contains a variable, the computer
has to look up the value of that variable in the symbol table.
A "let" command is
used to give a value to a variable. The computer has to store
the value of the variable in the symbol table. (Note: The "variables"
I am talking about here are not variables in the Java program.
The Java program is executing a sort of program typed in by the
user. I am talking about variables in the user's program. The user
gets to make up variable names, so there is no way for the Java program
to know in advance what the variables will be.)
In Section 11.5, we saw how to write a program,
SimpleParser2.java, that
can evaluate expressions that do not contain variables. The new
program, SimpleParser5.java,
is based on that older program. I will only talk about the parts
that are relevant to the symbol table. Here is an applet
that simulates the program:
The program uses a HashMap as the symbol table. A TreeMap
could also be used, but since we don't have any reason to access the
variables in alphabetical order, we don't need to have the keys
stored in sorted order. The value of a variable is a double,
but Java's collection and map classes can only hold objects.
To get around this restriction, we have to use the wrapper class,
Double. The variable's value, which is of type double
is wrapped in an object of type Double. That
object is stored in the HashMap, using the variable's
name as the key.
Let symbolTable be the HashMap that is used
as the symbol table. At the beginning of the program, we
start with an empty map:
symbolTable = new HashMap();
To execute a "let" command, the program uses the put()
method to associate a value with the variable name.
Suppose that the name of the variable is given by
a String, varName. The command for setting the
value of the variable to val would be:
symbolTable.put(varName, new Double(val));
This associates the object new Double(val)
with the key, varName. In the program SimpleParser5.java,
you'll find this in the method named doLetCommand().
Just for fun, I decided to pre-define two variables named "pi" and
"e" whose values are the usual mathematical constants pi and e.
In Java, the values of these constants are given by Math.PI
and Math.E. To make these variables available to the
user of the program, they are added to the symbol table with the commands:
symbolTable.put("pi", new Double(Math.PI));
symbolTable.put("e", new double(Math.E));
When a variable is encountered in an expression,
the get() method is
used to retrieve its value. Since this method
returns a value of type Object, we have to type-cast
the return value to Double. If the variable has
never been given a value, then the get() method
returns null. We have to handle this in some way;
I will consider it to be an error:
Object symbolTableEntry = symbolTable.get(varName);
if (symbolTableEntry == null) {
... // Throw an exception: Undefined variable.
}
Double value = (Double)symbolTableEntry;
double val = value.doubleValue();
The last line gets the double that is the actual value
of the variable from the wrapper object. You will find this code,
more or less, in a method named primaryValue() in
the program SimpleParser5.java.
As you can see, aside from the necessity of using a wrapper class,
Maps are really quite easy to use.
Sets Inside a Map
The objects in a collection or map can be of any type. They can even be
collections. Here's an example where it's natural to store sets as
values in a map.
Consider the problem of making an index for a book. An index consists of
a list of terms that appear in the book. Next to each term is a list of
the pages on which that term appears. To represent an index in a program,
we need a data structure that can hold a list of terms, along
with a list of pages for each term. Adding new data should be easy
and efficient. When it's time to print the index, it should be easy to
access the terms in alphabetical order. There are many ways this could be
done, but I'd like to use Java's generic data structures and let them do
as much of the work as possible.
We can think of an index as a Map that associates a list of page references
to each term. The terms are keys, and the value associated with a given key is the
list of page references for that term. A Map can be either a TreeMap
or a HashMap, but only a TreeMap will make it easy to access the
terms in sorted order. The value associated with a term is a list of page references.
How can we represent such a value? If you think about it, you see that it's not
really a list in the sense of Java's generic classes. If you look in any index,
you'll see that a list of page references has no duplicates, so it's really a set
rather than a list. Furthermore, the page references for a given term are always
printed in increasing order, so we want a sorted set. This means that we should
use a TreeSet to represent a list of page references. The values that
we really want to put in this set are of type int, but once again we have to
deal with the fact that generic data structures can only hold objects. We have to
wrap each value of type int in an object belonging to the wrapper class,
Integer.
To summerize, an index will be represented by a TreeMap. The keys
for the map will be terms, which are of type String. The values in the
map will be TreeSets. The TreeSet corresponding to a term
contains Integers which give the page numbers of every page on which the
term appears.
To make an index, we need to start with an empty TreeMap, look through
the book, inserting every reference that we want to be in the index into the TreeMap,
and then print out the data from the TreeMap. Let's leave aside the
question of how we find the references to put in the index, and just look
at how the TreeMap is used. The TreeMap can be created
with the command:
TreeMap index = new TreeMap();
Now, suppose that we find a reference to some term on some page. We need
to insert this information into the index. To do this, we should look up
the term in the index, using index.get(term). The return value
is either null or is the set of page references that we already have
for the term. If the return value is null, then this is the first page
reference for the term, so we should add the term to the index, with a new
set that contains the page reference we've just found. If the return value
is non-null, we already have a set, and we should just add the new
page reference to the set. Here is a subroutine that does this:
void addReference(String term, int pageNum) {
// Add a page reference to the index.
TreeSet references; // The set of page references that we
// have so far for the term.
references = (TreeSet)index.get(term); // Type-cast!
if (references == null){
// This is the first reference that we have
// found for the term. Make a new set containing
// the page number and add it to the index, with
// the term as the key.
TreeSet firstRef = new TreeSet();
firstRef.add( new Integer(pageNum) );
index.put(term,firstRef);
}
else {
// references is the set of page references
// that we have found previously for the term.
// Add the new page number to that set.
references.add( new Integer(pageNum) );
}
}
The only other thing we need to do with the index is print it out.
We are going to have to print out sets of Integers. Let's
write a separate subroutine to do that:
void printIntegers( Set integers ) {
// Assume that all the objects in the set are of
// type Integer. Print the integer values on
// one line, separated by commas. The commas
// make this a little tricky.
if (integers.isEmpty()) {
// There is nothing to print if the set is empty.
return;
}
Integer integer; // One of the Integers in the set.
Iterator iter = integers.iterator(); // For traversing the set.
integer = (Integer)iter.next(); // First item in the set.
// We know the set is non-empty,
// so this is OK.
System.out.print(integer.intValue()); // Print the first item.
while (iter.hasNext()) {
// Print additional items, if any, with separating commas.
integer = (Integer)iter.next();
System.out.print(", " + integer.intValue());
}
}
Finally, we need a routine that can iterate through all the terms
in the map and print each term along with its list of page references.
There are no iterators for maps. To iterate through a map, we need
to use one of the Set views of the map. Since we want to
print both the keys and the values, it's most efficient to use the
entry set of the map. Here's the subroutine:
void printIndex() {
// Print each entry in the index.
Set entries = index.entrySet();
// The index viewed as a set of entries, where each
// entry has a key and a value. The objects in
// this set are of type Map.Entry.
Iterator iter = entries.iterator();
while (iter.hasNext()) {
// Get the next entry from the entry set and print
// the term and list of page references that
// it contains.
Map.Entry entry = (Map.Entry)iter.next();
String term = (String)entry.getKey();
Set pages = (Set)entry.getValue();
System.out.print(term + " ");
printIntegers(pages);
System.out.println();
}
}
This is not a lot of code, considering the complexity of the operations.
The only really tricky part is the constant need for type-casting and
the need to use wrapper objects for primitive types. Both of these
are necessary because of the nature of generic programming in Java,
that is, the fact that generic data structures hold values of type
Object.
I have not written a complete indexing program, but the subroutines
presented here are used in a related context in Exercise 12.5.
Using a Comparator
There is a potential problem with our solution to the indexing problem.
If the terms in the index can contain both upper case and lower case letters,
then the terms will not be in alphabetical order. The ordering on String
is not alphabetical. It is based on the Unicode codes of the characters in
the string. The codes for all the upper case letters are less than
the codes for the lower case letters. So, for example, terms beginning
with "Z" come before terms beginning with "a". If the terms are restricted
to use lower case letters only (or upper case only), then the ordering would
be alphabetical. But suppose that we allow both upper and lower case,
and that we insist on alphabetical order. In that case, our index can't
use the usual ordering for Strings. Fortunately, it's possible
to specify a different method to be used for comparing the keys of a map.
This is a typical use for a Comparator.
Recall that a Comparator is an object that implements the
Comparator interface and defines the method:
public int compare(Object obj1, Object obj2)
This method can be used to compare two objects. It should return
an integer that is positive, zero, or negative, depending on whether
obj1 is less than, equal to, or greater than obj2.
We need a Comparator that will compare two Strings
based on alphabetical order. The easiest (although not most efficient)
way to do this is to convert
the Strings to lower case and use the default comparison on
the lower case Strings. The following class defines such
a comparator:
class AlphabeticalOrder implements Comparator {
// Represents a Comparator that can be used for
// comparing Strings according to alphabetical
// order. It is an error to apply this
// Comparator to objects that are non-strings.
public int compare(Object obj1, Object obj2) {
String str1 = (String)obj1; // Type-cast objects to Strings.
String str2 = (String)obj2;
str1 = str1.toLowerCase(); // Convert to lower case.
str2 = str2.toLowerCase();
return str1.compareTo(str2); // Compare lower-case Strings.
}
}
To solve our indexing problem, we just need to tell our index to use
an object of type AlphabeticalOrder for comparing keys. This is
done by providing the Comparator object as a parameter to the
constructor. We just have to create the index with the command:
TreeMap index = new TreeMap( new AlphabeticalOrder() );
This does work, but I've concealed one technicality. Suppose, for example,
that the program calls addReference("aardvark",56) and that it later
calls addReference("Aardvark",102). The words "aardvark" and "Aardvark"
differ only in that one of them begins with an upper case letter. When we
insert them into the index, do they count as two different terms or as
one term? The answer depends on the way that a TreeMap tests objects
for equality. In fact, TreeMaps and TreeSets always use a Comparator
object or a compareTo method to test for equality. They do not
use the equals() method. The Comparator that is used for
the TreeMap in this example returns the value zero when it is used to compare
"aardvark" and "Aardvark", so the TreeMap considers them to be the same.
Page references to "aardvark" and "Aardvark" are combined into a single list.
This is probably the correct behavior in this example. If not, some other technique
must be used to sort the terms into alphabetical order.
Word Counting
The final example also deals with storing information about words.
In Section 10.3, we looked at
WordList.java, a program that makes
a list of all the words in a file. In Section 2,
we did the same thing using generic data structures. Now, we extend the
problem so that instead of just listing the words, the program also counts the
number of times each word occurs in the file. The output consists
of two lists, one sorted alphabetically and one sorted according to the
number of occurrences, with the most common words at the top and the
least common at the bottom. The same problem was assigned in
Exercise 10.1 and solved
without using generic data structures.
As the program reads an input file, it must keep track of how many
times it encounters each word. We could simply throw all the words,
with duplicates, into a list and count them later. But that would require
a lot of storage and would not be very efficient. A better method is to
keep a counter for each word. The first time the word is encountered,
the counter is initialized to 1. On subsequent encounters, the
counter is incremented. To keep track of the data for one word, the
program uses a simple class that holds a word and the counter for that
word. The class is a static nested class:
static class WordData {
// Represents the data we need about a word: the word and
// the number of times it has been encountered.
String word;
int count;
WordData(String w) {
// Constructor for creating a WordData object when
// we encounter a new word.
word = w;
count = 1; // The initial value of count is 1.
}
} // end class WordData
The program has to store all the WordData objects in some sort
of data structure. We want to be able to add new words efficiently.
Given a word, we need to check whether a WordData object already
exists for that word, and if it does, we need to find that object
so that we can increment its counter. A Map can be used to
implement these operations. Given a word, we want to look up a
WordData object in the Map. This means that the
word is the key, and the WordData object is the value.
(It might seem strange that the key is also one of the instance variables
in the value object, but in fact this is probably the most common situation:
The value object contains all the information about some entity, and the
key is one of those pieces of information.) After reading the file,
we want to output the words in alphabetical order, so we should use
a TreeMap rather than a HashMap. This program converts
all words to lower case so that the default ordering on Strings will
put the words in alphabetical order.
When the program reads a word from a file, it calls words.get(word)
to find out if that word is already in the map, where words is a
variable that refers to the TreeMap. If the return value is null,
then this is the first time the word has been encountered, so a new WordData
object is created and inserted into the map with the command
words.put(word, new WordData(word)). If words.get(word)
is not null, then it's the WordData object for this word,
and the program only has to increment the counter in that object. Here is the
subroutine that is used to read the words from the file:
static void readWords(TextReader inStream, Map words) {
// Read all words from inStream, and store data about them in words.
try {
while (true) {
while (! inStream.eof() && ! Character.isLetter(inStream.peek()))
inStream.getAnyChar(); // Skip past non-letters.
if (inStream.eof())
break; // Exit because there is no more data to read.
String word = inStream.getAlpha(); // Read one word from stream.
word = word.toLowerCase();
WordData data = (WordData)words.get(word);
// Check whether the word is already in the Map. If not,
// the value of data will be null. If it is not null, then
// it is a WordData object containing the word and the
// number of times we have encountered it so far.
if (data == null) {
// We have not encountered word before. Add it to
// the map. The initial frequency count is
// automatically set to 1 by the WordData constructor.
words.put(word, new WordData(word) );
}
else {
// The word has already been encountered, and data is
// the WordData object that holds data about the word.
// Add 1 to the frequency count in the WordData object.
data.count = data.count + 1;
}
}
}
catch (TextReader.Error e) {
System.out.println("An error occurred while reading the data.");
System.out.println(e.toString());
System.exit(1);
}
} // end readWords()
After reading the words and printing them out in alphabetical order, the program
has to sort the words by frequency count and print them again. To do the sorting
using a generic algorithm, I defined a Comparator class for comparing two
word objects according to their frequency counts:
static class CountCompare implements Comparator {
// A comparator for comparing objects of type WordData
// according to their counts. This is used for
// sorting the list of words by frequency.
public int compare(Object obj1, Object obj2) {
WordData data1 = (WordData)obj1;
WordData data2 = (WordData)obj2;
return data2.count - data1.count;
// The return value is positive if data2.count > data1.count.
// I.E., data1 comes after data2 in the ordering if there
// were more occurrences of data2.word than of data1.word.
// The words are sorted according to decreasing counts.
}
} // end class CountCompare
Given this class, we can sort the WordData objects according to frequency
by copying them into a list and using the generic method Collections.sort(List,Comparator).
The WordData objects are the values in the map, words,
and words.values() is a Collection that contains all these values.
The constructor for the ArrayList class lets you specify a collection to be
copied into the list when it is created. So, we can create a List containing the
word data and sort that list according to frequency count using the commands:
ArrayList wordsByCount = new ArrayList( words.values() );
Collections.sort( wordsByCount, new CountCompare() );
You should notice that these two lines replace a lot of code! It requires some
practice to think in terms of generic data structures and algorithms, but the payoff is
significant in terms of saved time and effort.
The only remaining problem is to print the data to the output file. We have to
print the data from all the WordData objects twice, first in alphabetical order
and then sorted according to frequency count. The data is in alphabetical order in
the TreeMap, or more precisely, in the values of the TreeMap.
We can use an Iterator for the value collection, words.values(),
to access the words in alphabetical order. Similarly, the words are in the ArrayList
wordsByCount in the correct order according to frequency count, so we could use
an Iterator for the ArrayList to access and print the data according
to frequency count. Since we have to do essentially the same thing twice, we might
as well write a subroutine to do it:
static void printWords(PrintWriter outStream, Collection wordData) {
// wordData must contain objects of type WordData. The words
// and counts in these objects are written to the output stream.
Iterator iter = wordData.iterator();
while (iter.hasNext()) {
WordData data = (WordData)iter.next();
outStream.println(" " + data.word + " (" + data.count + ")");
}
} // end printWords()
This is a generic subroutine. Since its second parameter is of type
Collection, it can be applied to the collection words.values() as
well as to the collection wordsByCount. This is the last piece needed for
the program. You can find the complete program
in the file WordCount.java.
With this section, we reach the end of Introduction to Programming Using Java.
It has been a long journey, but I hope a worthwhile one and one that has left you
prepared to move on to more advanced study of Java, programming, and computer science
in general. Good luck and have fun!
End of Chapter 12