We'll look at four examples of file processing. In all cases,
we'll read simple text files. We'll show some traditional kinds of file
processing programs and how those can be implemented using
Python.
Reading a Text File
The following program will examine a standard Unix password
file. We'll use the file object in a for statement
to show the processing line by line. We'll use the
split method of the input
string as an example of parsing a line of
input.
Example 19.1. readpswd.py
pswd = file( "/etc/passwd", "r" )
for aLine in pswd:
    fields= aLine.split( ":" )
    print fields[0], fields[1]
pswd.close()
This program creates a file
object, pswd, that represents the
/etc/passwd file, opened for
reading.
A file is a sequence of lines. We
can use a file in the
for statement, and the
file object will return each individual
line in response to the next
method.
The input string is split into
individual fields using ":" boundaries. Two
particular fields are printed. Field 0 is the username and
field 1 is the password.
Closing the file releases any resources used by the file
processing.
For non-Unix users, a password file looks like the
following:
This program shows us that a file is a sequence of individual
lines. Because it is an iterable object, the for
statement will provide the individual lines.
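As a minimal sketch of the same idea in Python 3 syntax (the chapter's own examples use Python 2), the loop below iterates over an in-memory stand-in for the file. The two sample lines are hypothetical, made up only to show the colon-separated layout, and the helper name user_fields is an assumption introduced for illustration.

```python
import io

# Hypothetical sample lines in the colon-separated password-file layout.
SAMPLE = (
    "root:x:0:0:root:/root:/bin/sh\n"
    "alice:x:1000:1000:Alice:/home/alice:/bin/sh\n"
)

def user_fields(lines):
    """Return (username, password) pairs from an iterable of lines."""
    pairs = []
    for a_line in lines:
        fields = a_line.rstrip("\n").split(":")
        pairs.append((fields[0], fields[1]))
    return pairs

# io.StringIO behaves like an open text file: iterating over it yields lines.
# The with statement closes it when the block ends.
with io.StringIO(SAMPLE) as pswd:
    result = user_fields(pswd)

print(result)
```

Passing any iterable of lines to the helper, rather than a file name, keeps the parsing separate from the opening and closing of the file.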
A popular stock quoting service on the Internet will provide CSV
files with current stock quotes. The files have comma-separated values
in the following format:
stock, lastPrice, date, time, change, openPrice, daysHi, daysLo, volume
The stock, date and time are typically quoted
strings. The other fields are numbers,
typically in dollars or percents with two digits of precision. We can
use the Python eval function on each column to
gracefully evaluate each value, which will eliminate the quotes, and
transform a string of digits into a floating-point price value. We'll
look at dates in Chapter 32, Dates and Times: the time and
datetime Modules.
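The column conversion described above can be sketched as follows, in Python 3 syntax. Because eval will execute any expression it is given, this sketch swaps in ast.literal_eval from the standard library, which accepts only literals such as quoted strings and numbers; that is a substitution, not the chapter's method. The helper name convert and the sample column values are hypothetical.

```python
import ast

def convert(column):
    """Turn one CSV column into a Python value, or leave it as text."""
    try:
        return ast.literal_eval(column)
    except (ValueError, SyntaxError):
        # Not a Python literal (for example N/A): keep the raw string.
        return column

quoted = convert('"XYZ"')    # a quoted string loses its quotes
price = convert('20.44')     # a string of digits becomes a float
missing = convert('N/A')     # anything else is returned unchanged

print(quoted, price, missing)
```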
The first line shows a quote for an index: the Dow Jones
Industrial Average. The trading volume doesn't apply to an index, so
it is "N/A". The second line shows a regular stock (Apple Computer)
that traded 8,122,800 shares on June 15, 2001. The third line shows a
mutual fund. The detailed opening price, day's high, day's low and
volume are not reported for mutual funds.
After looking at the results online, we clicked on the link to
save the results as a CSV file. We called it
quotes.csv. The following program will open and
read the quotes.csv file after we download it
from this service.
Example 19.2. readquotes.py
qFile= file( "quotes.csv", "r" )
for q in qFile:
    try:
        stock, price, date, time, change, opPrc, dHi, dLo, vol\
        = q.strip().split( "," )
        print eval(stock), float(price), date, time, change, vol
    except ValueError:
        pass
qFile.close()
We open our quotes file,
quotes.csv, for reading, creating an
object named qFile.
We use a for statement to iterate
through the sequence of lines in the file.
The quotes file typically has an empty line at the end,
which splits into a single empty field instead of the nine we
expect, so we surround this with a
try statement. Assigning that one-item list to nine
variables raises a ValueError exception, which is
caught in the except clause and
ignored.
Each stock quote, q, is a
string. By using the
strip method of the
string, we create a new string with
leading and trailing whitespace characters removed. We then apply
split(',') to that new
string
to separate the fields into a list. We
use multiple assignment to assign each field to a relevant
variable. Note that we split this line into nine fields,
leading to a long statement. We put a \ at the end of the
first line to continue the statement onto a second line.
The name of the stock is a string which includes quotes.
In order to gracefully remove the quotes, we use the
eval function. The price is a string. We
use the float function to convert this
string to a proper numeric value for further
processing.
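The interplay between multiple assignment and the ValueError can be sketched as follows, in Python 3 syntax. The sample line is hypothetical data in the nine-column layout, followed by an empty line; the function name read_quotes is an assumption, and str.strip('"') stands in for eval as a narrower way to drop the surrounding quotes.

```python
import io

# Hypothetical quotes data: one well-formed line, then an empty line.
SAMPLE = '"XYZ",20.44,"1/2/2001","4:00PM",+0.15,20.10,20.60,20.00,1000\n\n'

def read_quotes(lines):
    """Return (stock, price, volume) for each line with nine fields."""
    rows = []
    for q in lines:
        try:
            stock, price, date, time, change, op_prc, d_hi, d_lo, vol = \
                q.strip().split(",")
            rows.append((stock.strip('"'), float(price), vol))
        except ValueError:
            # Wrong number of fields, or a price that is not a number.
            continue
    return rows

rows = read_quotes(io.StringIO(SAMPLE))
print(rows)
```

Note that the same except clause also hides lines whose price is not numeric, since float raises ValueError too; a stricter program might want to report those separately.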
Read, Sort and Write
For COBOL expatriates, here's an example that shows a short way
to read a file into an in-memory sequence, sort that sequence and
print the results. This is a very common COBOL design pattern, and it
tends to be rather long and complex in COBOL.
This example looks forward to some slightly more advanced
techniques like list sorting. We'll delve into
sorting in Chapter 20, Advanced Sequences.
Example 19.3. sortquotes.py
data= []
qFile= file( "quotes.csv", "r" )
for q in qFile:
    fields= tuple( q.strip().split( "," ) )
    if len(fields) == 9: data.append( fields )
qFile.close()
def priceVolume(a,b):
    return cmp(a[1],b[1]) or cmp(a[8],b[8])
data.sort( priceVolume )
for stock, price, date, time, change, opPrc, dHi, dLo, vol in data:
    print stock, price, date, time, change, vol
We create an empty sequence, data, to
which we will append tuples created
from splitting each line into fields.
We create a file object that will read all the lines of
our CSV-format file.
This for loop will set
q to each line in the file.
The variable fields is created by
stripping whitespace from the line, q,
breaking it up on the "," boundaries into
separate fields, and making the resulting sequence of field
values into a tuple.
If the line has the expected nine fields, the
tuple of fields is appended to the
data sequence. Lines with the wrong number
of fields are typically the blank lines at the beginning or
end of the file.
To prepare for the sort, we define a comparison
function. This will compare fields 1 and 8, price and volume.
This relies on the behavior of the or
operator: if the comparison of field 1 is equal, the value of
cmp will be 0, which is equivalent to
False; so field 8 must be compared.
We can then sort the data sequence.
The sort method of the list will use our
priceVolume function to compare records.
This kind of sort is covered in depth in the section called “Advanced List Sorting”.
Once the sequence of data elements is sorted, we can
then print a report showing our stocks ranked by price, and
for stocks of the same price, ranked by volume. We could
expand on this by using the % operator to provide
a nicer-looking report format.
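The sort and the % formatted report can be sketched as follows, in Python 3 syntax. Python 3 has no cmp function and its sort no longer accepts a comparison function, so this sketch uses a key function that returns a (price, volume) tuple, which gives the same two-level ordering. It also converts the fields to numbers first, because comparing the raw strings would rank "9.50" after "10.00". The three-field records and the name price_volume are hypothetical simplifications of the nine-field tuples.

```python
# Hypothetical records: (stock, price, volume), still text as read from CSV.
data = [
    ("BBB", "10.00", "500"),
    ("AAA", "9.50", "300"),
    ("CCC", "10.00", "200"),
]

def price_volume(record):
    """Sort key: numeric price first, then numeric volume."""
    stock, price, volume = record
    return (float(price), int(volume))

ranked = sorted(data, key=price_volume)

# The % operator lines the columns up: left-justified name, then
# right-justified price and volume.
report = ["%-6s %8.2f %10d" % (s, float(p), int(v)) for s, p, v in ranked]
for line in report:
    print(line)
```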
Reading "Records"
In languages like C or COBOL, a "record" or "struct"
describes the contents of a file. The advantage of a record is that the
fields have names instead of numeric positions. In Python, we can
achieve the same level of clarity using a dict
for each line in the file.
For this, we'll download files from a web-based portfolio
manager. This portfolio manager gives us stock information in a file
called display.csv. Here is an example.
This file contains a header line that names the data columns,
making processing considerably more reliable. We can use the column
titles to create a dict for each line of data.
By using each data line along with the column titles, we can make our
program quite a bit more flexible. This shows a way of handling this
kind of well-structured information.
Example 19.4. readportfolio.py
quotes=open( "display.csv", "rU" )
titles= quotes.next().strip().split( ',' )
invest= 0
current= 0
for q in quotes:
    values= q.strip().split( ',' )
    data= dict( zip(titles,values) )
    print data
    invest += float(data["Purchase Price"])*float(data["# Shares"])
    current += float(data["Price"])*float(data["# Shares"])
print invest, current, (current-invest)/invest
We open our portfolio file,
display.csv, for reading, creating a file
object named quotes.
The first line of input, returned by
quotes.next(), is the
set of column titles. We strip any extraneous whitespace
characters from this line, creating a new
string. We perform a
split(',') to create a list of
individual column title strings. This
list is saved in the variable
titles.
We also initialize two counters,
invest and current to
zero. These will accumulate our initial investment and the
current value of this portfolio.
We use a for statement to iterate
through the remaining lines in the quotes file.
Each line is assigned to q.
Each stock quote, q, is a
string. We use the
strip method to remove leading and trailing
whitespace characters; we then apply
split(',') to the resulting string
to separate the fields into a list. We
assign this list to the variable
values.
We create a dict,
data; the column titles in the
titles list are the
keys. The data fields from the current record, in
values, are used to fill this
dict. The built-in
zip function is designed for precisely
this situation. This function interleaves values from each
list to create a new
list of tuples.
In this case, we will get a sequence of
tuples; each
tuple will be a value from
titles and the corresponding value from
values. This list of
2-tuples creates the
dict.
Now, we have access to each piece of data using its
proper column title. The number of shares is in the column
titled "# Shares". We can find this
information in data["# Shares"].
We perform some simple calculations on each
dict. In this case, we convert the
purchase price to a number, convert the number of shares to a
number and multiply to determine how much we spent on this
stock. We accumulate the sum of these products into
invest.
We also convert the current price to a number and
multiply this by the number of shares to get the current value
of this stock. We accumulate the sum of these products into
current.
When the loop has terminated, we can write out the two
numbers, and compute the relative change as a fraction of the
amount invested.
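The header-driven approach can be sketched as follows, in Python 3 syntax, where the built-in next(quotes) replaces quotes.next(). The file contents are hypothetical: only the column titles "Price", "# Shares" and "Purchase Price" come from the program above, while the "Symbol" column and every value are made up for illustration.

```python
import io

# Hypothetical portfolio data standing in for display.csv.
SAMPLE = (
    "Symbol,Price,# Shares,Purchase Price\n"
    "XYZ,12.00,10,10.00\n"
    "ABC,7.00,20,6.00\n"
)

quotes = io.StringIO(SAMPLE)
titles = next(quotes).strip().split(",")   # header line names the columns

invest = 0.0
current = 0.0
records = []
for q in quotes:
    values = q.strip().split(",")
    data = dict(zip(titles, values))       # pair each title with its value
    records.append(data)
    shares = float(data["# Shares"])
    invest += float(data["Purchase Price"]) * shares
    current += float(data["Price"]) * shares

change = (current - invest) / invest       # a fraction, not a percentage
print(invest, current, "%.1f%%" % (100 * change))
```

Because each field is looked up by title, the program keeps working if the service reorders the columns, as long as the titles themselves stay the same.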
Published under the terms of the Open Publication License