The File Archive Modules: tarfile and zipfile
An archive file contains a complex, hierarchical file directory in
a single sequential file. The archive file includes the original
directory information as well as the contents of all of the files in
those directories. There are a number of archive file formats; Python
directly supports two: tar and zip archives.
The tar (Tape Archive) format is widely used in the GNU/Linux
world to distribute files. It is a POSIX standard, making it usable on a
wide variety of operating systems. A tar file can also be compressed,
often with the GZip utility, leading to .tgz or .tar.gz files, which
are compressed archives.
The Zip file format was invented by Phil Katz at PKWare as a way
to archive a complex, hierarchical file directory into a compact
sequential file. The Zip format is widely used but is not a POSIX
standard. Zip file processing includes a choice of compression
algorithms; the exact algorithm used is encoded in the header of the
file, not in the name of the file.
Creating a TarFile or a ZipFile. Since an archive file is still
essentially a file, it is opened with a variation on the open function.
Since an archive file contains directory and file contents, it has a
number of methods above and beyond what a simple file has.
-
tarfile.open(〈name〉, 〈mode〉, 〈fileobj〉, 〈bufsize〉) → TarFile
-
This module-level function opens the given tar file for
processing. The name is a file name string; it is optional because the
fileobj can be used instead. The mode is similar to the mode of the
built-in open (or file) function; it has additional characters to
specify the compression algorithms, if any. The fileobj is a
conventional file object, which can be used instead of the name; it can
be a standard file like sys.stdin. The bufsize is the block size used
for reading and writing, comparable to the buffering argument of the
built-in open function.
-
zipfile.ZipFile(name, mode, compression) → ZipFile
-
This class constructor opens the given zip file for
processing. The name is a file name string. The mode is similar to the
built-in open (or file) function. The compression is the compression
code. It can be zipfile.ZIP_STORED or zipfile.ZIP_DEFLATED. A
compression of ZIP_STORED uses no compression; a value of ZIP_DEFLATED
uses the Zlib compression algorithms.
The open function can be used to read or write the archive file.
It can be used to process a simple disk file, using the filename. Or,
more importantly, it can be used to process a non-disk file: this
includes tape devices and network sockets. In the non-disk case, a file
object is given to tarfile.open.
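As a minimal sketch of the non-disk case, the following builds a small tar archive in an in-memory buffer and then reads it back through the fileobj parameter. The io.BytesIO buffer merely stands in for a socket or tape device, the member name hello.txt is made up for illustration, and the code assumes Python 3.

```python
import io
import tarfile

# Build a tiny archive in memory; BytesIO stands in for a non-disk file.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tf:
    data = b"hello, archive\n"
    info = tarfile.TarInfo(name="hello.txt")  # hypothetical member name
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

# Rewind and read the same bytes back, again via fileobj instead of a name.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tf:
    names = tf.getnames()
    content = tf.extractfile("hello.txt").read()

print(names, content)
```

Because the archive was opened from a caller-supplied file object, closing the TarFile leaves the underlying buffer open, which is what allows the rewind and second open.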
For tar files, the mode information is rather complex because we
can do more than simply read, write and append. The mode string
addresses three issues: the kind of opening (reading, writing,
appending), the kind of access (block or stream) and the kind of
compression.
For zip files, however, the mode is simply the kind of opening
that is done.
Opening - Both zip and tar files. A zip or tar file can be opened in any of three modes.
-
r
-
Open the file for reading.
-
w
-
Open the file for writing.
-
a
-
Open the file for appending.
Access - tar files only. A tar file can have either of two fundamentally different kinds
of access. If a tar file is a disk file, which supports seek and tell
operations, then we access the tar file in block mode. If the tar
file is a stream, network connection or a pipeline, which does not
support seek or tell operations, then we must access the archive in
stream mode.
-
:
-
Block mode. The tar file is a disk file, and seek and tell
operations are supported. This is the assumed default, if neither
: nor | is specified.
-
|
-
Stream mode. The tar file is a stream, socket or pipeline,
and cannot respond to seek or tell operations. Note that you
cannot append to a stream, so the 'a|' combination is illegal.
This access distinction isn't meaningful for zip files.
Compression - tar files only. A tar file may be compressed with GZip or BZip2 algorithms, or
it may be uncompressed. Generally, you only need to select compression
when writing. It doesn't make sense to attempt to select compression
when appending to an existing file, or when reading a file.
-
(nothing)
-
The tar file will not be compressed.
-
gz
-
The tar file will be compressed with GZip.
-
bz2
-
The tar file will be compressed with BZip2.
This compression distinction isn't meaningful for zip files. Zip
file compression is specified in the zipfile.ZipFile constructor.
Tar File Examples. The most common block modes for tar files are r, a,
w:, w:gz, and w:bz2. Note that read and append modes cannot
meaningfully provide compression information, since it's obvious from
the file if it was compressed, and which algorithm was used.
For stream modes, however, the compression information must be
provided. The modes include all six combinations: r|, r|gz, r|bz2,
w|, w|gz, w|bz2.
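The mode strings above can be tried out with a short sketch like the following. It writes a GZip-compressed archive in block mode, reads it back in block mode, and then reads it again in stream mode. The file names notes.txt and notes.tar.gz are invented for illustration, and the code assumes Python 3.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "notes.txt")  # hypothetical sample file
with open(source, "wb") as f:
    f.write(b"some text\n")

archive_name = os.path.join(workdir, "notes.tar.gz")

# Block mode, write, GZip compression.
tf = tarfile.open(archive_name, "w:gz")
tf.add(source, arcname="notes.txt")
tf.close()

# Block mode, read; the compression is detected from the file itself.
tf = tarfile.open(archive_name, "r")
block_names = tf.getnames()
tf.close()

# Stream mode, read; the mode string names the compression.
with open(archive_name, "rb") as raw:
    tf = tarfile.open(fileobj=raw, mode="r|gz")
    stream_names = [member.name for member in tf]
    tf.close()

print(block_names, stream_names)
```

In stream mode the members are visited once, in order, which is why the sketch iterates over the archive rather than seeking around in it.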
Directory Information. Each individual file in a tar archive is
described with a TarInfo object. This has name, size, access
mode, ownership and other OS information on the file. A number of
methods will retrieve member information from an archive. In the
following summaries, tf is a tar file, created with tarfile.open.
-
tf.getmember(name) → TarInfo
-
Reads through the archive index looking for the given member
name. Returns a TarInfo object for the named member, or
raises a KeyError exception.
-
tf.getmembers() → list of TarInfo
-
Returns a list of TarInfo objects for all of the members in
the archive.
-
tf.next() → TarInfo
-
Returns a TarInfo object for the next member of the archive,
or None when no members remain.
-
tf.getnames() → list of strings
-
Returns a list of member names.
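A small sketch of these directory methods follows. It builds a two-member archive in memory so that there is something to inspect; the member names README and doc/index.html are made up, and the code assumes Python 3.

```python
import io
import tarfile

# Build a small in-memory archive to inspect; the member names are made up.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as out:
    for name, data in [("README", b"read me\n"),
                       ("doc/index.html", b"<html></html>\n")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        out.addfile(info, io.BytesIO(data))
buffer.seek(0)

tf = tarfile.open(fileobj=buffer, mode="r")
names = tf.getnames()                              # list of strings
sizes = {m.name: m.size for m in tf.getmembers()}  # from TarInfo objects
readme = tf.getmember("README")                    # a single TarInfo
try:
    tf.getmember("no-such-member")
    missing = False
except KeyError:
    # getmember raises KeyError for a name that is not in the archive.
    missing = True
tf.close()

print(names, sizes, readme.size, missing)
```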
Each individual file in a zip archive is described with a
ZipInfo object. This has name, size, access mode,
ownership and other OS information on the file. A number of methods will
retrieve member information from an archive. In the following summaries,
zf is a zip file, created with zipfile.ZipFile.
-
zf.getinfo(name) → ZipInfo
-
Locates information about the given member name. Returns a
ZipInfo object for the named member, or raises a KeyError
exception.
-
zf.infolist() → list of ZipInfo
-
Returns a list of ZipInfo objects for all of the members in
the archive.
-
zf.namelist() → list of strings
-
Returns a list of member names.
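The zip directory methods can be exercised with a sketch along these lines. The archive inspect.zip and its member names are invented for illustration, and the code assumes Python 3.

```python
import os
import tempfile
import zipfile

# Create a throwaway archive to inspect; the names are made up.
path = os.path.join(tempfile.mkdtemp(), "inspect.zip")
with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as out:
    out.writestr("chapter1.xml", "<chapter/>")
    out.writestr("chapter2.xml", "<chapter>text</chapter>")

zf = zipfile.ZipFile(path, "r")
names = zf.namelist()                                        # list of strings
sizes = [(zi.filename, zi.file_size) for zi in zf.infolist()]
info = zf.getinfo("chapter1.xml")                            # a single ZipInfo
try:
    zf.getinfo("missing.xml")
    missing = False
except KeyError:
    # getinfo raises KeyError for a name that is not in the archive.
    missing = True
zf.close()

print(names, sizes, info.file_size, missing)
```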
Extracting Files From an Archive. If a tar archive is opened with r,
then you can read the archive and extract files from it. The following
methods will extract member files from an archive. In these summaries,
tf is a tar file, created with tarfile.open.
-
tf.extract(member, 〈path〉)
-
The member can be either a string member name or a TarInfo
for a member. This will extract the file's contents and reconstruct
the original file. If path is given, this is the new location
for the file.
-
tf.extractfile(member) → file
-
The member can be either a string member name or a TarInfo
for a member. This will open a simple file for access to this
member's contents. The member access file has only read-oriented
methods, limited to read, readline, readlines, seek, tell.
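The difference between the two extraction methods can be seen in a sketch like this one: extractfile reads a member without touching the disk, while extract rebuilds the file under a target directory. The member name docs/README is made up, the target is a temporary directory, and the code assumes Python 3.

```python
import io
import os
import tarfile
import tempfile

# Build an in-memory archive with one member; the name is made up.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as out:
    data = b"line one\nline two\n"
    info = tarfile.TarInfo(name="docs/README")
    info.size = len(data)
    out.addfile(info, io.BytesIO(data))
buffer.seek(0)

target = tempfile.mkdtemp()
with tarfile.open(fileobj=buffer, mode="r") as tf:
    # extractfile gives a read-only file object; nothing is written to disk.
    member_file = tf.extractfile("docs/README")
    first_line = member_file.readline()

    # extract rebuilds the file (and its directory) under the given path.
    tf.extract("docs/README", path=target)

restored = os.path.join(target, "docs", "README")
with open(restored, "rb") as f:
    restored_bytes = f.read()

print(first_line, restored_bytes)
```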
If a zip archive is opened with r, then you can read the archive
and extract the contents of a file from it. In these summaries,
zf is a zip file, created with zipfile.ZipFile.
-
zf.read(member) → string
-
The member is a string member name. This will extract the
member's contents, decompress them if necessary, and return the
bytes that constitute the member.
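A minimal sketch of reading a zip member follows. The archive sample.zip and the member greeting.txt are invented for illustration. The code assumes Python 3, where the returned value is a bytes object that must be decoded to get text.

```python
import os
import tempfile
import zipfile

# Create a throwaway archive with one highly repetitive member.
path = os.path.join(tempfile.mkdtemp(), "sample.zip")
with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as out:
    out.writestr("greeting.txt", "hello " * 100)

with zipfile.ZipFile(path, "r") as zf:
    raw = zf.read("greeting.txt")      # decompressed bytes of the member
    info = zf.getinfo("greeting.txt")  # sizes before and after compression

text = raw.decode("ascii")
print(len(text), info.file_size, info.compress_size)
```

The caller never names the compression algorithm when reading; zf.read picks it up from the archive itself.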
Creating or Extending an Archive. If a tar archive is opened with w
or a, then you can add files to it. The following methods
will add member files to an archive. In the following summaries,
tf is a tar file, created with tarfile.open.
-
tf.add(name, 〈arcname〉, 〈recursive〉)
-
Adds the file with the given name to the current archive file. If
arcname is provided, this is the name the file will have in the archive;
this allows you to build an archive which doesn't reflect the
source structure. Generally, directories are expanded; using
recursive=False prevents expanding directories.
-
tf.addfile(tarinfo, fileobj)
-
Creates an entry in the archive. The description comes from
the tarinfo, an instance of TarInfo, created with the
gettarinfo method. The fileobj is an open file, from which the
content is read; it should be opened in binary mode. Note that the
TarInfo.size field determines how many bytes are read, so it can
override the actual size of the file. For a given filename, fn, this
might look like the following:
tf.addfile( tf.gettarinfo(fn), open(fn,"rb") ).
-
tf.close()
-
Closes the archive. For archives being written or appended,
this adds the block of zeroes that defines the end of the
file.
-
tf.gettarinfo(name, 〈arcname〉, 〈fileobj〉) → TarInfo
-
Creates a TarInfo object for a file based either on name, or the
fileobj. If a name is given, this is a local filename. The
arcname is the name that will be used in the archive, allowing you
to modify local filesystem names. If the fileobj is given, this file
is interrogated to gather required information.
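The two ways of adding a member can be compared in a sketch like the following: a single add call, and the two-step gettarinfo plus addfile form. The file report.txt, the archive reports.tar, and the member names are made up for illustration, and the code assumes Python 3.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
fn = os.path.join(workdir, "report.txt")  # hypothetical source file
with open(fn, "wb") as f:
    f.write(b"quarterly numbers\n")

archive_name = os.path.join(workdir, "reports.tar")
tf = tarfile.open(archive_name, "w")

# add: one call; arcname keeps the temporary directory out of the archive.
tf.add(fn, arcname="reports/report.txt")

# gettarinfo + addfile: two steps, so the TarInfo could be adjusted
# (for example its name or ownership fields) before it is written.
info = tf.gettarinfo(fn, arcname="reports/copy.txt")
with open(fn, "rb") as source:
    tf.addfile(info, source)

# close writes the end-of-archive blocks.
tf.close()

with tarfile.open(archive_name, "r") as check:
    names = check.getnames()
    copy_size = check.getmember("reports/copy.txt").size

print(names, copy_size)
```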
If a zip archive is opened with w or a, then you can add files to it.
The following methods will add member files to an archive. In the
following summaries, zf is a zip file, created with zipfile.ZipFile.
-
zf.write(filename, 〈arcname〉, 〈compress_type〉)
-
The filename is a string file name. This will read the file,
compress it, and write it to the archive. If the arcname is given,
this will be the name in the archive; otherwise it will use the
original filename. The compress_type parameter overrides the
default compression specified when the ZipFile was created. This
method does not return a value.
-
zf.writestr(arcname, bytes)
-
The arcname is a string file name or a ZipInfo object that will
be used to create a new member in the archive. This will write the
given bytes to the archive. The compression used is specified when
the ZipFile is created. This method does not return a value.
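Both zip writing methods appear in the sketch below, including a per-member override of the default compression. The file chapter.xml, the archive book.zip, and the member names are invented for illustration, and the code assumes Python 3.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
fn = os.path.join(workdir, "chapter.xml")  # hypothetical source file
with open(fn, "wb") as f:
    f.write(b"<chapter>text</chapter>\n")

archive_name = os.path.join(workdir, "book.zip")
zf = zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED)

# write: copy a disk file into the archive under a chosen member name.
zf.write(fn, "book/chapter.xml")

# The third argument overrides the archive's default compression.
zf.write(fn, "book/stored.xml", zipfile.ZIP_STORED)

# writestr: create a member directly from data held in memory.
zf.writestr("book/NOTES.txt", "generated member\n")
zf.close()

with zipfile.ZipFile(archive_name, "r") as check:
    names = check.namelist()
    stored = check.getinfo("book/stored.xml")
    notes = check.read("book/NOTES.txt")

print(names, stored.compress_type, notes)
```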
A tarfile Example. Here's an example of a program to examine a tarfile,
looking for documentation like .html files or README files. It will
provide a list of .html files, and actually show the contents of
the README files.
Example 33.2. readtar.py
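#!/usr/bin/env python
"""Scan a tarfile looking for *.html and a README."""
import tarfile
import fnmatch

# Mode "r" detects GZip or BZip2 compression from the file itself.
archive = tarfile.open("SQLAlchemy-0.3.5.tar.gz", "r")
for mem in archive.getmembers():
    if fnmatch.fnmatch(mem.name, "*.html"):
        # List the HTML documentation files by name only.
        print(mem.name)
    elif mem.isfile() and fnmatch.fnmatch(mem.name.upper(), "*README*"):
        # Show the name and the contents of each README file.
        print(mem.name)
        docFile = archive.extractfile(mem)
        print(docFile.read())
archive.close()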
A zipfile Example. Here's an example of a program to create a zipfile
based on the .xml files in a particular directory.
Example 33.3. writezip.py
import zipfile, os, fnmatch

# Create a new archive; members are compressed with Zlib by default.
bookDistro = zipfile.ZipFile('book.zip', 'w', zipfile.ZIP_DEFLATED)
for nm in os.listdir('..'):
    if fnmatch.fnmatch(nm, '*.xml'):
        # Add each .xml file from the parent directory.
        full = os.path.join('..', nm)
        bookDistro.write(full)
bookDistro.close()