Fixed Format Files, A COBOL Legacy: The codecs Module
Files that come from COBOL programs have three characteristic
features:
- The file layout is defined positionally. There are no delimiters or separators on which to base file parsing. The file may not even have \n characters at the end of each record.
- They're usually encoded in EBCDIC, not ASCII or Unicode.
- They may include packed decimal fields; these are numeric values represented with two decimal digits (or a decimal digit and a sign) in each byte of the field.
The first problem requires figuring out the starting position and size
of each field. In some cases, there are no gaps (or filler) between
fields; in that case, the field sizes are all we need, because each
field starts where the previous one ends. Once we have the position and
size, we can use a string slice operation to pick those characters out
of a record. The code is simply aLine[start:start+size].
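When there are no gaps, the starting positions follow from the sizes alone. The following is a minimal sketch of that calculation. The helper name offsets_from_sizes, the field names, and the sizes are hypothetical, chosen only for illustration.

```python
# Hypothetical helper: derive (name, start, size) from (name, size) pairs,
# assuming the fields are contiguous with no filler between them.
def offsets_from_sizes(fields):
    layout = []
    start = 0
    for name, size in fields:
        layout.append((name, start, size))
        start += size  # the next field begins where this one ends
    return layout

# Illustrative sizes only; a real layout comes from the COBOL record definition.
sizes = [('field1', 12), ('field2', 4), ('anotherField', 20), ('lastField', 8)]
layout = offsets_from_sizes(sizes)

# A made-up 44-character record, sliced with the computed position and size.
aLine = 'A' * 12 + 'B' * 4 + 'C' * 20 + 'D' * 8
name, start, size = layout[1]
print(name, aLine[start:start+size])
```

If the record does contain filler, one option is to list the filler as a field of its own so that the running total stays correct.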
We can tackle the second problem using the codecs module to decode the
EBCDIC characters. The result of codecs.getdecoder('cp037') is a
function that you can use as an EBCDIC decoder. That function returns a
tuple of the decoded text and the number of bytes it consumed.
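As a small illustration of the decoder, the sketch below decodes five bytes that spell a word in code page 037. The sample bytes are made up for the example, and cp037 is only one of several EBCDIC code pages, so whether it is the right one for a given file is an assumption to verify against the source system.

```python
import codecs

# cp037 is one EBCDIC code page among several; treat the choice as an assumption.
decode_ebcdic = codecs.getdecoder('cp037')

raw = b'\xc8\xc5\xd3\xd3\xd6'         # the letters H, E, L, L, O in cp037
text, consumed = decode_ebcdic(raw)   # the decoder returns (text, bytes consumed)
print(text, consumed)
```

Because the decoder returns a pair, code that only wants the text has to take the first element of the result.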
The third problem requires that our program know the data type as
well as the position and size of each field. If we know the data type,
then we can do EBCDIC conversion or packed decimal conversion as
appropriate. This is a more subtle algorithm, since we have two
strategies for converting the data fields. See the section called
“Strategy” for some reasons why we'd do it this way.
In order to mirror COBOL's largely decimal world-view, we will
need to use the decimal module for all numbers and arithmetic.
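A brief sketch of why decimal fits this kind of data: binary floating point cannot represent most decimal fractions exactly, while decimal.Decimal keeps the digits as written. The two implied decimal places in the last step are an assumed example, not something taken from a particular record layout.

```python
from decimal import Decimal

# Binary floating point rounds decimal fractions, so this comparison fails.
print(0.10 + 0.20 == 0.30)                                   # False

# Decimal arithmetic on the same values is exact.
print(Decimal('0.10') + Decimal('0.20') == Decimal('0.30'))  # True

# COBOL fields often store digits with an implied decimal point.
# scaleb adjusts the exponent; two implied places are assumed here.
amount = Decimal('12345').scaleb(-2)
print(amount)                                                # 123.45
```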
We note that the presence of packed decimal data changes the file
from text to binary. We'll begin with techniques for handling a text
file with a fixed layout. However, since this often slides over to
binary file processing, we'll move on to that topic, also.
Reading an All-Text File. If we ignore the EBCDIC and packed decimal problems, we can
easily process a fixed-layout file. The way to do this is to define a
handy structure that defines our record layout. We can use this
structure to parse each record, transforming the record from a string
into a dictionary that we can use for further processing.
In this example, we also use a generator function, yieldRecords, to
break the file into individual records. We separate this functionality
out so that our processing loop is a simple for statement, as it is with
other kinds of files. In principle, this generator function can also
check the length of recBytes before it yields it. If the block of data
isn't the expected size, the file was damaged and an exception should be
raised.
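layout = [
    ( 'field1', 0, 12 ),
    ( 'field2', 12, 4 ),
    ( 'anotherField', 16, 20 ),
    ( 'lastField', 36, 8 ),
]
reclen= 44

def yieldRecords( aFile, recSize ):
    # Read fixed-size blocks until read() returns an empty result at end of file.
    recBytes= aFile.read(recSize)
    while recBytes:
        yield recBytes
        recBytes= aFile.read(recSize)

# Open in binary mode so that no newline translation is applied.
cobolFile= open( 'my.cobol.file', 'rb' )
for recBytes in yieldRecords(cobolFile, reclen):
    record = dict()
    for name, start, size in layout:
        # Slice each field out of the record by position and size.
        record[name]= recBytes[start:start+size]
cobolFile.close()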
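The length check mentioned above could be sketched as follows. The names yieldCheckedRecords and RecordSizeError are hypothetical, and an in-memory io.BytesIO object stands in for a real file so the example can run on its own.

```python
import io

class RecordSizeError(ValueError):
    """Hypothetical exception for a block that is not a full record."""

def yieldCheckedRecords(aFile, recSize):
    recBytes = aFile.read(recSize)
    while recBytes:
        # A short block means the file did not end on a record boundary.
        if len(recBytes) != recSize:
            raise RecordSizeError(
                'expected %d bytes, got %d' % (recSize, len(recBytes)))
        yield recBytes
        recBytes = aFile.read(recSize)

# Stand-in for a damaged file: two full 4-byte records, then 2 stray bytes.
damaged = io.BytesIO(b'AAAABBBBCC')
good = []
try:
    for rec in yieldCheckedRecords(damaged, 4):
        good.append(rec)
except RecordSizeError as error:
    print('damaged file:', error)
print(good)
```

The processing loop stays a plain for statement; the only change for the caller is deciding what to do when the exception arrives.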
Reading Mixed Data Types. If we have to tackle the complete EBCDIC and packed decimal
problem, we have to use a slightly more sophisticated structure for
our file layout definition. First, we need some data conversion
functions, then we can use those functions as part of picking apart a
record.
We may need several conversion functions, depending on the kind of
data that's present in our file. Minimally, we'll need the following two
functions.
- display: This function is used to get character data. In COBOL, this is called display data. It will be in EBCDIC if our files originated on a mainframe.
- packed: This function is used to get packed decimal data. In COBOL, this is called "comp-3" data. In our example, we have not dealt with inserting the decimal point prior to the creation of a decimal.Decimal object.
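import codecs

_decodeEbcdic = codecs.getdecoder('cp037')

def display( data ):
    # The decoder returns (text, bytes consumed); keep only the text.
    return _decodeEbcdic( data )[0]

def packed( data ):
    # Returns a list: a sign character followed by the digit characters.
    n= [ '' ]
    data= bytearray( data )   # yields integers in Python 2 and Python 3
    for b in data[:-1]:
        # Each byte before the last holds two decimal digits.
        hi, lo = divmod( b, 16 )
        n.append( str(hi) )
        n.append( str(lo) )
    # The last byte holds the final digit and the sign.
    digit, sign = divmod( data[-1], 16 )
    n.append( str(digit) )
    if sign in (0x0b, 0x0d ):
        n[0]= '-'
    else:
        n[0]= '+'
    return n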
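As an aside, the missing decimal point step might look like the sketch below. The helper name packed_to_decimal and its scale parameter are hypothetical. The number of implied decimal places is not stored in the data, so it has to come from the COBOL record definition.

```python
from decimal import Decimal

def packed_to_decimal(data, scale=0):
    """Hypothetical helper: decode comp-3 bytes to a Decimal.

    scale is the number of implied decimal places (an assumption the
    caller supplies from the record definition).
    """
    nibbles = []
    for b in bytearray(data):          # bytearray yields integers 0..255
        nibbles.extend(divmod(b, 16))  # high digit, then low digit
    digits, sign = nibbles[:-1], nibbles[-1]
    if any(d > 9 for d in digits):
        raise ValueError('not a packed decimal field: %r' % (data,))
    # 0x0b and 0x0d mark negative values; anything else is treated as positive here.
    sign_bit = 1 if sign in (0x0b, 0x0d) else 0
    return Decimal((sign_bit, tuple(digits), -scale))

print(packed_to_decimal(b'\x12\x34\x5c', scale=2))   # 123.45
print(packed_to_decimal(b'\x42\x0d'))                # -420
```

Building the Decimal from a (sign, digits, exponent) tuple avoids assembling and re-parsing a string, though joining the digits into a string would work as well.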
Given these two functions, we can expand our handy record layout
structure.
layout = [
    ( 'field1', 0, 12, display ),
    ( 'field2', 12, 4, packed ),
    ( 'anotherField', 16, 20, display ),
    ( 'lastField', 36, 8, packed ),
]
reclen= 44
This changes our record decoding to the following.
cobolFile= open( 'my.cobol.file', 'rb' )
for recBytes in yieldRecords(cobolFile, reclen):
    record = dict()
    for name, start, size, convert in layout:
        # Slice the field, then apply the conversion named in the layout.
        record[name]= convert( recBytes[start:start+size] )
cobolFile.close()
This example underscores some of the key values of Python. Simple
things can be kept simple. The layout structure, which describes the
data, is both easy to read, and written in Python itself. The evolution
of this example shows how adding a sophisticated feature can be done
simply and cleanly.
At some point, our record layout will have to evolve from a simple
tuple to a proper class definition. We'll need to take this evolutionary
step when we want to convert packed decimal numbers into values that we
can use for further processing.
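One possible shape for that class-based layout is sketched here. The class names Field, DisplayField, and PackedField, the scale attribute, and the sample record bytes are all hypothetical; this illustrates the idea and is not the design the later material necessarily adopts.

```python
import codecs
from decimal import Decimal

class Field(object):
    """Hypothetical base class: one named slice of a fixed-layout record."""
    def __init__(self, name, start, size):
        self.name, self.start, self.size = name, start, size
    def raw(self, recBytes):
        return recBytes[self.start:self.start + self.size]
    def value(self, recBytes):
        raise NotImplementedError

class DisplayField(Field):
    """Character data, assumed here to be EBCDIC code page 037."""
    def value(self, recBytes):
        return codecs.decode(self.raw(recBytes), 'cp037')

class PackedField(Field):
    """Packed decimal data with an implied decimal point."""
    def __init__(self, name, start, size, scale=0):
        Field.__init__(self, name, start, size)
        self.scale = scale   # implied decimal places, supplied by the caller
    def value(self, recBytes):
        nibbles = []
        for b in bytearray(self.raw(recBytes)):
            nibbles.extend(divmod(b, 16))
        sign = 1 if nibbles[-1] in (0x0b, 0x0d) else 0
        return Decimal((sign, tuple(nibbles[:-1]), -self.scale))

# A made-up 8-byte record: five EBCDIC characters, then a 3-byte packed field.
layout = [DisplayField('name', 0, 5), PackedField('amount', 5, 3, scale=2)]
recBytes = b'\xc8\xc5\xd3\xd3\xd6' + b'\x12\x34\x5d'
record = dict((f.name, f.value(recBytes)) for f in layout)
print(record)
```

Each field now carries its own conversion and any extra details it needs, such as the scale, so the decoding loop no longer has to unpack a tuple of fixed length.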