Chapter 34. File Formats: CSV, Tab, XML, Logs and Others
Comma-Separated Values: The csv
Module
Often, we have data that is in Comma-Separated Value (CSV) format.
This used by many spreadsheets and is a widely-used standard for data
files.
In the section called “Several Examples” we parsed CSV files using
simple string manipulations. The
csv module does a far better job at parsing and
creating CSV files than the programming we showed in those
examples.
About CSV Files
CSV files are text files organized around data that has rows and
columns. This format is used to exchange data between spread-sheet
programs or databases. A CSV file uses a number of punctuation rules
to encode the data.
Each row is delimited by a line-ending sequence of
characters. This is usually the ASCII sequence \r\n.
Since this may not be the default way to process text files on
your platform, you have to open files using the "rb" and "wb"
modes.
Within a row, columns are delimited by a ,. To
handle the situation where a column's data contains a
,, the column data may be quoted; surrounded by
"'s. If the column contains a ", there are two common
rules used. One CSV dialect uses an escape character, usually
\ for \". The other dialect uses double
""'s.
In the ideal case, a CSV file will have the same number of
columns in each row, and the first row will be column titles. Almost
as pleasant is a file without column titles, but with a known sequence
of columns. In the more complex cases, the number of columns per row
varies.
The CSV Module
The CSV module provides you with readers or writers; these are
objects which use an existing file object,
created with the file or
open function. A CSV reader will read a file,
parsing the commas and quotes, delivering you the data elements of
each row in a sequence or mapping. A CSV writer will create a file,
adding the necessary commas and quotes to create a valid CSV
file.
Module-Level Constructors. The following constructors within the
csv module are used to create a
reader, DictReader,
writer or
DictWriter.
csv.reader(csvfile)
→ reader
Creates a reader object which can
parse the given file, returning a sequence of values for each
line of the file. This can be used as follows: rdr=
csv.reader( open( "file.csv", "rb" ) ). The
csvfile can be any iterable object.
csv.writer(csvfile)
→ writer
Creates a writer object which can
format a sequence of values and write them to a line of the
file. This can be used as follows: wtr= csv.writer( open(
"file.csv", "wb" ) ). The csvfile can
be any object which supports write.
Creates a DictReader object which
can parse the given file, returning a dictionary of values for
each line of the file. The dictionary keys are typically the
first line of the file. You can, optionally, provide the field
names if they are not the first line of the file. The
csvfile can be any iterable object.
csv.DictWriter(csvfile, fieldnames)
→ DictWriter
Creates a DictWriter object which
can format a dictionary of values and write them to a line of
the file. You must provide a sequence of field names which is
used to format each individual dictionary entry. The
csvfile can be any object which supports
write.
Reader Functions. The following functions within a reader
(or DictReader) object will read and parse
the CSV file. In these function definitions cr is
a csv.reader or
csv.DictReader.
cr.(next) →
sequence
Reads the next line of the source file, parses it, and
returns a sequence (for reader) or
dictionary (for DictReader) of the
individual column values.
cr.line_num →
number
Returns the line number of the source file.
Writer Functions. The following functions with a writer
(or DictWriter) object will format and write
a CSV file. In these function definitions cw is a
csv.writer or
csv.DictWriter.
cw.(writerowrow)
Writes the next lines of the destination file from the
given sequence (for writer) or dictionary
(for DictWriter).
cw., (writerowsrowList)
Writes the next lines of the destination file with each
sequence (for writer) or dictionary (for
DictWriter) from the list,
rowList.
Basic CSV Reading
The basic CSV reader processing treats each line of the file as
data. This is typical for files which lack column titles, or files
which have such a complex format that special parsing and analysis is
required. In some cases, a file has a simple, regular format with a
single row of column titles, which can be processed by a special
reader we'll look at below.
We'll revise the readquotes.py program from
the section called “Reading a File as a Sequence of Strings”. This will properly handle all of
the quoting rules, eliminating a number of irritating problems with
the example in the previous chapter.
Example 34.1. readquotes2.py
import csv
qFile= file( "quotes.csv", "rb" )
csvReader= csv.reader( qFile )
for q in csvReader:
try:
stock, price, date, time, change, opPrc, dHi, dLo, vol = q
print stock, float(price), date, time, change, vol
except ValueError:
pass
qFile.close()
We open our quotes file,
quotes.csv, for reading, creating an
object named qFile.
We create a csv reader object
which will parse this file for us, transforming each line into
a sequence of individual column values.
We use a for statement to iterate
through the sequence of lines in the file.
In the unlikely event of an invalid number for the
price, we surround this with a try
statement. The invalid number line will raise a
ValueError exception, which is
caught in the except clause and quietly
ignored.
Each stock quote, q, is a sequence of
column values. We use multiple assignment to assign each field
to a relevant variable. We don't need to strip whitespace,
split the string, or handle quotes; the reader already did
this.
Since the price is a string, we use the
float function to convert this string to
a proper numeric value for further processing.
Consistent Columns as Dictionaries
In some cases, you have a simple, regular file with a single
line of column titles. In this case, you can transform each line of
the file into a dictionary. The key for each field is the column
title. This can lead to programs which are more clear, and more
flexible. The flexibility comes from not assuming a specific order to
the columns.
We'll revise the readportfolio.py program
from the section called “Reading "Records"”. This will properly handle
all of the quoting rules, eliminating a number of irritating problems
with the example in the previous chapter. It will make use of the
column titles in the file.
Example 34.2. readportfolio2.py
import csv
quotes=open( "display.csv", "rb" )
csvReader= csv.DictReader( quotes )
invest= 0
current= 0
for data in csvReader:
print data
invest += float(data["Purchase Price"])*float(data["# Shares"])
current += float(data["Price"])*float(data["# Shares"])
print invest, current, (current-invest)/invest
We open our portfolio file,
display.csv, for reading, creating a file
object named quotes.
We create a csvDictReader object from our
quotes file. This will read the first line
of the file to get the column titles; each subsequent line
will be parsed and transformed into a dictionary.
We initialize two counters, invest
and current to zero. These will accumulate
our initial investment and the current value of this
portfolio.
We use a for statement to iterate
through the lines in quotes file. Each line
is parsed, and the column titles are used to create a
dictionary, which is assigned to
data.
Each stock quote, q, is a
string. We use the
strip operation to remove excess
whitespace characters; the string which
is created then performs the split
(',' ) operation
to separate the fields into a list. We
assign this list to the variable
values.
We perform some simple calculations on each
dict. In this case, we convert the
purchase price to a number, convert the number of shares to a
number and multiply to determine how much we spent on this
stock. We accumulate the sum of these products into
invest.
We also convert the current price to a number and
multiply this by the number of shares to get the current value
of this stock. We accumulate the sum of these products into
current.
When the loop has terminated, we can write out the two
numbers, and compute the percent change.
Writing CSV Files
The most general case for writing CSV is shown in the following
example. Assume we've got a list of objects, named
someList. Further, let's assume that each object
has three attributes: this, that
and aKey.
In this case, we assemble the list of values that becomes a row
in the CSV file.
In some cases we can provide two methods to allow our classes to
participate in CSV writing. We can define a
csvRow method as well as a
csvHeading method. These methods will provide
the necessary tuples of heading or data to be written to the CSV
file.
For example, let's look at the following class definition for a
small database of sailboats. This class shows how the
csvRow and
csvHeading methods might look.
Including these methods in our class definitions simplifies the
loop that writes the objects to a CSV file. Instead of building each
row as a list, we can do the following: wtr.writerow(
someData.csvRow() ).
Here's an example that leverages each object's internal
dictionary (__dict__) to dump objects to a CSV
file.