The concept of backup and restoration of data is a vital one to any database administrator. No system is immune from
hard drive crashes, careless users, or any number of potential catastrophes that can endanger data stored within PostgreSQL.
This section covers two general methods for backing up your data. The first approach uses the
pg_dump application distributed with PostgreSQL to create a set of SQL instructions with which a
database can be fully restored. The second method consists of backing up the filesystem itself.
The pg_dump application (short for "PostgreSQL dump") is run from the command line, and
creates a list of SQL commands. These commands, executed in the order provided, re-create the database from
scratch.
Here is the syntax for pg_dump:
pg_dump [ options ] dbname
In this syntax, dbname is the name of the database that you want to "dump" SQL instructions for.
The available options are similar in format to those of the other database management utilities
included with PostgreSQL, such as createdb. The most common flag specified in the
options to pg_dump is the -f flag, which specifies the file in which
to store the dumped SQL statements.
Note: If the -f flag is not specified to pg_dump, the dumped SQL will
be written to stdout rather than stored in a file.
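For instance, the following two invocations are equivalent ways to store a dump of the booktown database (used throughout this chapter) in a file called booktown.sql; the second relies on shell redirection rather than the -f flag:
[jworsley@booktown ~]$ pg_dump -f booktown.sql booktown
[jworsley@booktown ~]$ pg_dump booktown > booktown.sql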
The complete list of pg_dump options follows:
-a, --data-only
Forces only COPY or INSERT SQL statements to be
dumped (depending on whether or not the -d flag is used). This results in a backup of the data only,
and not of the database objects (or schema). If the -d flag is
not passed along with this flag, the data is dumped as COPY commands reading from stdin
(i.e., the rows are stored literally within the dumped file, following COPY ... FROM stdin
statements). Otherwise, each row is represented as a separate INSERT statement.
-b, --blobs
Causes any large objects to be dumped as well as normal data. This option also requires that the
-F flag be provided with either the t or c format.
By default, large object data is not dumped.
-c, --clean
Specifies that SQL statements to drop all existing objects will precede the SQL statements to create those objects.
This option is useful in re-initializing an existing database, rather than dropping it and
recreating it from scratch.
-C, --create
Specifies that the SQL statement to create the database (CREATE DATABASE)
should be included in the dump.
-d, --inserts
Causes INSERT statements to be dumped for each row of data, rather than the default
COPY statements. This can be safer, as a single corrupted row will cause an entire
COPY statement to fail, whereas only the affected INSERT would fail; it is, however, a much slower process to add a single row at a time during restoration.
-D, --attribute-inserts
Like the -d flag, causes INSERT statements to be dumped;
however, with this flag, each INSERT statement is created with an explicit column
target list in parentheses, immediately preceding the VALUES clause.
-f FILENAME, --file=FILENAME
Directs the output of pg_dump to a file named FILENAME, rather
than to stdout. The user executing pg_dump must have system permissions
to write to this file.
-F { c | t | p }, --format { c | t | p }
Determines the file format of the output:
c (gzip compressed)
A format of c creates a gzip-compressed tar file
(i.e., a .tar.gz file).
t (tar)
A value of t creates a tar file (i.e., a .tar
file).
p (plain text)
The default value of p causes plain text output.
Note that pg_restore is typically used to handle files created with the
c or t (gzip-compressed or tar)
formats.
-h HOSTNAME, --host=HOSTNAME
Specifies that HOSTNAME should be connected to, rather than the localhost. Use this when the target
database is on another server.
-i, --ignore-version
Overrides the check between the version of pg_dump and the version of PostgreSQL
running. This option is not recommended unless there is no other alternative, as it most likely will
produce errors due to changes in the system catalogs between versions. Normally, you should use the version of
pg_dump matching the database that you are backing up.
-n, --no-quotes
Suppresses the double-quotes surrounding identifiers unless the identifier contains characters that
require quoting (e.g., spaces, or uppercase characters).
-N, --quotes
Explicitly indicates that double-quotes should be used around all identifiers. This has been the default behavior
of pg_dump since PostgreSQL 6.4.
-o, --oids
Causes OIDs (object identifiers) for each row of data to be dumped as well. This can be vital if any of your
applications based around the data in PostgreSQL use OIDs in any kind of meaningful or associative way.
-O, --no-owner
Causes ownership to not be taken into account in the dump. A restore with suppressed ownership will cause all
re-created objects to belong to the user performing the restore.
-p PORT, --port=PORT
Specifies that the database connection should be made on port PORT, rather than the
default port (usually 5432, though it may have been configured differently when PostgreSQL was
compiled, by the --with-pgport flag).
-R, --no-reconnect
Suppresses any \connect statements, which are usually used to re-establish the object ownerships
in effect at the time of the dump when a backup is restored. This is similar in practice to the -O flag,
but also precludes the ability to use the -C flag, as a reconnect is required after creation
of a new database.
-s, --schema-only
Causes only the schema-related (database objects such as tables, sequences, indices and views)
SQL statements to be dumped, ignoring re-creation of the data. This can be useful in moving a general database
structure from a development machine to a production machine.
-t TABLE, --table=TABLE
Causes only TABLE to be dumped from the specified database, rather than all tables.
-u, --password
Provides a prompt for a username and password. As of PostgreSQL 7.1.x, this is the only method to provide an
alternate username. If the user's password is unset (NULL), you may simply press enter
when prompted for it.
-v, --verbose
Causes verbose output from the pg_dump functions being performed to be displayed to stderr (not
stdout).
-x, --no-acl
Suppresses any GRANT or REVOKE statements, which
are usually used to preserve the rights set at the time of the dump. Use this flag if you do not wish to enforce
any existing rights or restrictions when re-creating a database from this dump.
-Z, --compress { 0 - 9 }
Sets the degree of compression (0 for the least compression, 9 for the most) when used with the -F c argument.
Any system user may run pg_dump by default, but the user with which you connect to PostgreSQL
must have SELECT rights for every object in the database being dumped.
Example 9-20 demonstrates the use of pg_dump on the
booktown database, with the manager PostgreSQL user. The
-C flag causes the CREATE DATABASE command to be
included in the SQL dump as well. This command is not always included, as you may prefer to create the database
beforehand with non-standard options.
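Such a dump connects as the manager user through the -u prompt and writes its output to a file named booktown.sql; as a rough sketch, the invocation resembles the following:
[jworsley@booktown ~]$ pg_dump -u -C -f booktown.sql booktown
Username: manager
Password: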
Note that since pg_dump provides the standard connection options for specifying a host
connection (e.g., the -h, -u and -p flags), it can be used
to perform remote backups from any host allowed to make a remote connection (according to the
pg_hba.conf file: see Chapter 8 for more on this subject). Example 9-21 shows a user on a remote server specifying a connection to the
booktown.commandprompt.com server, in order to create a backup file in compressed
format (with the -F c flag) called booktown.sql.tar.
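As a rough sketch, such a remote dump resembles the following invocation (the manager user is assumed here, supplied through the -u prompt):
[jworsley@cmd ~]$ pg_dump -u -h booktown.commandprompt.com -F c -f booktown.sql.tar booktown
Username: manager
Password: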
If you wish to include large objects in your dumped file, it is necessary that you use either the
tar (t) or gzip-compressed (c) format,
since the plain-text format cannot include large objects. Otherwise, normal plain-text formatted dumps are suitable for
most users.
Creating a dump in tar format can often result in a backup file more than twice the size of its
plain-text counterpart, even without large objects. This is because the tar format involves a
hierarchy of TOC (table of contents) .dat files, which describe how the archive is to be used by the
corresponding pg_restore command; thus, more disk space is taken up by the extra instructions. Since
tar is not a compression utility, the gzip-compressed (c)
format exists as well to compress the tarred file into gzipped format automatically.
PostgreSQL supplies a supplementary wrapper command to the pg_dump application called
pg_dumpall. The primary use of this application is to allow the entire cluster of PostgreSQL databases
on a system to be dumped at once, rather than having to perform a pg_dump for each database, one at a
time.
Here is the basic syntax for pg_dumpall:
pg_dumpall [ options ]
The pg_dumpall command accepts the same connection parameters available to
pg_dump. The following are the pg_dumpall-specific parameters:
-c
Specifies that SQL statements to drop existing global objects will precede the SQL statements to create those objects.
-h HOSTNAME, --host=HOSTNAME
Specifies that HOSTNAME should be connected to, rather than the localhost, or the host defined
by the PGHOST environment variable. Use this when the target database is on another
server.
-p PORT, --port=PORT
Specifies that the database connection should be made on port PORT, rather than the
default port (usually 5432).
-g, --globals-only
Specifies that only global objects will be dumped. This is primarily useful for recreating just users and groups, or
for duplicating them on another machine (by taking the dump to another machine, and executing it). The -g
flag implicitly adds statements to the dump that delete all existing users from the pg_shadow table prior to the
CREATE statements. Exercise caution with the output from this command!
Warning
Do not pass the -? flag to pg_dumpall, as it will result in passing that
flag to pg_dump for each database, which may result in quite a few more help screens than you would
expect. If you wish to view pg_dumpall's help, use the --help flag.
Note that as of PostgreSQL 7.1.x, the use of the pg_dumpall script has some practical
limitations compared to the normal use of pg_dump. For example, the -u flag
cannot be used to provide a different username and password, and the -F flag may not be used to
specify a format other than plain text (your backups will be stored as plain text, regardless of chosen format). This means
that the -b flag cannot be used with pg_dumpall either, as it requires a format
other than plain-text.
While the -f flag can be used to pass a filename through to pg_dump, doing
so does not create a complete dump, as pg_dumpall's global data is still sent to
stdout. To solve the problem of not having the -f flag available to
pg_dumpall, the shell re-direction operator (>) can be used to
redirect the output of pg_dumpall to a file.
A simple workaround to the lack of the -u flag is to set the
PGUSER environment variable. You can also set the
PGPASSWORD environment variable in this fashion if you are connecting to a system which
requires password authentication, and you do not wish to provide a password for each database that it
connects to and dumps.
Example 9-22 demonstrates a simple bash-shell shorthand for a temporary environment
variable (PGUSER) when invoking any PostgreSQL client. While not usually necessary, it can
be a handy trick to know for exceptions such as the pg_dumpall script. Example 9-22 uses this technique to create one dump file for all databases.
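As a rough sketch, such a statement resembles the following (the all.sql output filename is chosen purely for illustration):
[postgres@booktown ~]$ PGUSER=postgres pg_dumpall > all.sql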
The first part of the shell statement demonstrated in Example 9-22 sets a temporary
environment variable named PGUSER to the value of postgres. This
variable is set for the duration of the pg_dumpall command, and expires when the command has finished.
Note: The pg_dumpall command generally requires that the user executing the script be a
PostgreSQL superuser. This is because the pg_dumpall command requires access to the PostgreSQL system catalogs,
as it dumps global objects as well as database objects.
You can also run the pg_dumpall command remotely, though be sure to set any environment
variables as needed. These will depend greatly on the remote host configuration in the pg_hba.conf
file.
You should use the pg_dumpall command if it is especially
inconvenient to back up all of your existing databases individually, or if you have any kind of complex user and group system
in place. If you are inhibited by the limitations of pg_dumpall for data output (particularly if your
database makes use of large objects), the simplest solution is to use pg_dumpall with the
-g flag to keep a backup of all user and group data, and to subsequently use
pg_dump for each database which needs to be backed up, individually.
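A sketch of this combined approach might resemble the following pair of commands (the globals.sql and booktown.tar filenames are illustrative; the -b and -F t flags are included so that large objects are dumped as well):
[postgres@booktown ~]$ pg_dumpall -g > globals.sql
[postgres@booktown ~]$ pg_dump -b -F t -f booktown.tar booktown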
Data from a SQL dump can be restored to a database in one of two ways. If the dump created by
pg_dump is a simple, plain-text file, it may be passed through to psql directly
as an input file. Alternatively, if another output format was chosen (e.g., the tar or
compressed tar format), the pg_restore command must be used.
A database may either be restored from scratch, to an empty database, or to a non-existent database; how you go about
restoring a database depends largely on how it was dumped (e.g., if only data were dumped, or if the commands to create the
database were included).
A plain text output file from pg_dump may be passed through to psql as an
input file. This executes sequentially each of the dumped SQL instructions. Depending on how the dump was created,
and for what purpose, there are a variety of ways to invoke psql with the dump.
If your dump was created with the -C flag, the SQL statement to create the database is
included in the dump file itself. This means that, most likely, the database was either dropped, or has not
yet been created on the system on which it is being restored. If the database already exists, it may need to be dropped,
but only do this if you are sure the dump is up to date.
On the other hand, if the -C flag was not used, you may need to first
create the database before connecting to it and restoring its attributes and data. Remember also that you need to specify
the usual connection parameters to psql in order to operate as a user with the
rights to create a database.
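For instance, if the booktown database was dumped to a file called booktown.sql without the -C flag, a rough sketch of the restoration might look like the following, using the createdb utility to create the empty database first:
[jworsley@booktown ~]$ createdb -U manager booktown
[jworsley@booktown ~]$ psql -U manager -f booktown.sql booktown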
Example 9-23 demonstrates recreating the
booktown database with the booktown.sql file created in Example 9-20, in the Section called "Using pg_dump," earlier in this chapter. Since the
-C flag was used in this example, there is no need to create the database first; it can be
created by bootstrapping through the template1 database.
Example 9-23. Recreating the booktown database
[jworsley@booktown ~]$ psql -U manager -f booktown.sql template1
CREATE DATABASE
You are now connected to database booktown as user postgres.
COMMENT
CREATE
CREATE
CHANGE
[...]
As each dumped command in the booktown.sql file is processed by PostgreSQL, the resulting
server messages (e.g., CREATE, CHANGE) will be
displayed to stderr.
Note: Since psql can be used remotely, this same technique may be used across a network,
provided the correct connection parameters are used from an authorized host.
For files created by pg_dump with a file format other than plain text, the
pg_restore command exists to seamlessly restore the dumped database from the tar,
or compressed tar file.
Here is the syntax for the pg_restore command:
pg_restore [ options ] [ file ]
In this syntax, if file is not specified, pg_restore will wait for data
from stdin. This means that you may effectively use the < shell
redirection character with the same results. Notice especially in the options the
-d flag. If left unspecified, pg_restore will simply display the database
restoration statements to stdout (and thus, to the screen) rather than actually restoring a
database.
If you are using the -C flag to create a database from scratch, you must still supply the
-d flag (with the name of an existing database to connect to first, e.g.,
template1) from which to initialize a connection and create the new database. In
such a case, it is not important to which database you initially connect, as it is only a temporary connection until the
new database is created.
Many of the options for pg_restore directly mirror those available in
pg_dump. In some cases, the same option must be supplied in both pg_dump and
pg_restore in order for the desired functionality to be achieved. For example, this is the case with
the -C flag. If used with pg_dump, but not pg_restore,
the CREATE DATABASE command will be ignored by pg_restore, even
though the command is present in the dump file.
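As a sketch of this pairing (the archive filename is illustrative), the CREATE DATABASE statement is carried through both the dump and the restore only when -C appears in both commands:
pg_dump -C -F t -f booktown.sql.tar booktown
pg_restore -C -d template1 booktown.sql.tar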
The following are more detailed explanations of each option:
-a, --data-only
Causes any reference to creation of database schema objects to be ignored, restoring only data records (those with
COPY or INSERT statements).
-c, --clean
Causes any DROP SQL statements to be executed before creating database objects.
Without the -c flag, these statements are ignored, even if they are present in the dump file.
-C, --create
Causes the CREATE DATABASE SQL statement (if found in the dump file) to be
executed. Without the -C flag, the statement is ignored.
-d NAME, --dbname=NAME
Specifies the database called NAME to be connected to for the restoration. If the
-C flag is used to create a new database, the -d flag should
be pointed to template1. If this parameter is not specified, the commands to restore
the database will instead be displayed to stdout rather than sent to PostgreSQL.
-f FILENAME, --file=FILENAME
Indicates that FILENAME is the target for the database restoration SQL commands, rather than
the postmaster backend (with the -d flag), or stdout
(the default).
-F { c | t }, --format={ c | t }
Specifies the format of the input file, FILENAME. A value of c
indicates that it is a compressed and tarred dump, while t indicates that it is only
tarred. Note that this option is usually not necessary, as pg_restore can judge
what kind of file it is dealing with through its header data.
-h HOSTNAME, --host=HOSTNAME
Specifies that you should connect to HOSTNAME, rather than the localhost.
-i, --index
Specifies that only indices be recreated. Note that due to a bug, the -i
flag may not work, though the --index flag should (as of PostgreSQL 7.1.x).
-l, --list
Specifies that only the table of contents of database objects should be output, in a comma-delimited, PostgreSQL-specific
TOC (table of contents) format. This output can be redirected to a file (either with the shell redirection
operator >, or with the -f flag) and later used with the -L
flag to control which database objects are restored.
-L FILENAME, --use-list=FILENAME
Indicates use of the PostgreSQL TOC file FILENAME to determine which objects should be
restored by pg_restore. This file is generated using the -l
flag. After generating the file, delete lines for objects you do not wish to restore, or preface those lines with a semicolon (;).
When -L is used, pg_restore only restores those objects listed
in the specified TOC file (a sketch of this workflow appears after this list).
-N, --orig-order
Causes the restore to occur in the same order that pg_dump originally dumped the objects
in (through the use of the extra TOC information in a tar or
gzip-compressed format). This is not the same as the literal order in which the statements are
placed in the dump file itself, which is the default restore order. This option excludes the use of the
-o or -r options.
If, during a restoration, database objects are created in an incorrect order (e.g., an object which relies on
another existing object is created before the object it relies on), you can re-initialize a database and try this
flag to override the order which was originally chosen by pg_dump.
-o, --oid-order
Causes the restore to occur strictly in the order of OIDs, ascending; this option excludes the use of the
-N or -r options.
-O, --no-owner
Forces pg_restore to ignore any \connect statements which
would be used to enforce ownership.
-p PORT, --port=PORT
Specifies that the database connection should be made on port PORT, rather than the
default port (usually 5432, though it may have been configured differently when PostgreSQL was
compiled, by the --with-pgport flag).
-P, --function
Specifies that only functions are to be recreated. Like the -i
flag, due to a bug, the -P flag may not work, though the --function flag should,
as of PostgreSQL 7.1.x.
-r, --rearrange
Causes the restore to occur in the order chosen by pg_dump at the time of the dump file's
creation. Most of the objects are created in OID order with this option, though statements creating rules and
indices are placed towards the end of the file. This option is the default.
-R, --no-reconnect
Forces pg_restore to ignore all \connect
statements (not just those intended to enforce ownership). This cannot be used with the -C
flag, which requires at least one reconnection after the creation of a new database.
-s, --schema-only
Causes only the creation of database schema objects, such as tables, views, sequences, and indices. No rows will
be copied or inserted into the tables, and sequences will initialize to their default values. This can be used, for example,
to create an empty production database that matches the structure of a development database.
-S NAME, --superuser=NAME
Specifies the superuser with username NAME to be used in disabling triggers (if necessary, to
recreate a trigger), as well as to set ownership of schema elements.
-t NAME, --table[=NAME]
Causes only the table NAME to be restored, rather than all database objects.
Specifying just --table causes only tables to be restored.
-T NAME, --trigger[=NAME]
Causes only the trigger NAME to be restored, rather than all database objects.
Specifying just --trigger causes only triggers to be restored.
-u, --password
Causes pg_restore to provide a prompt for a username and password.
-v, --verbose
Causes verbose output of each action as it is performed. This output is sent to stderr,
rather than stdout.
-x, --no-acl
Suppresses any SQL GRANT or REVOKE statement
in the dump being restored.
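As referenced in the description of the -L flag, a rough sketch of the TOC-driven workflow follows (the booktown.toc filename is illustrative). After generating the list, edit it to delete or comment out the entries you do not wish to restore, and then supply it to the restoration:
[jworsley@cmd ~]$ pg_restore -l booktown.sql.tar > booktown.toc
[jworsley@cmd ~]$ pg_restore -L booktown.toc -d booktown booktown.sql.tar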
Example 9-24 demonstrates a restoration of the booktown
database on a separate machine from which the original was made. It uses the booktown.sql.tar file
created in Example 9-21, in the Section called "Using pg_dump," earlier in this chapter, as the source
for the restoration.
Example 9-24. Restore with pg_restore
[jworsley@cmd ~]$ pg_restore -v -C -O -d template1 booktown.sql.tar
Connecting to database for restore
Creating DATABASE booktown
Connecting to new DB 'booktown' as postgres
Connecting to booktown as postgres
Creating COMMENT DATABASE "booktown"
Creating TABLE inventory
[...]
You can see upon examining the pg_restore command in Example 9-24 that it
uses the -v flag for verbose output as it operates, the -C flag to create the
database (as this is a new database on this machine), and the -O flag to ignore ownership from the
original database (as the users on another machine are not guaranteed to exist locally). Notice also the
-d flag is used to connect to the template1 database before creating, and
connecting to, the booktown database.
Note that the use of the -O flag can be dangerous if ownership is an important part of the
recreation of a database. It can play a helpful role in moving from a development environment to a production
environment (e.g., if test or development account names were associated with various database objects). However, if a
database is being restored on an existing machine (e.g., from a nightly backup), it is not recommended that the
-O flag be used.
Warning
Remember that pg_restore exists only for files that are output in either
tar format (t), or compressed tar format
(c). Plain text SQL dumps may be processed with psql, as
documented in the Section called "Using psql for plain text dumps," earlier in this chapter.
An important consideration in the use of the pg_dump, pg_dumpall, and
pg_restore commands is when to use them, and when not to. Fortunately, with respect
to each of these procedures, PostgreSQL is quite accommodating.
With regard to backing up data with either pg_dump or pg_dumpall, there
are few restrictions on when they may be performed. PostgreSQL has supported hot backup
procedures since Version 6.5; these allow you to request data without blocking the normal activity of other concurrent
users. It is called a hot backup because it is performed while the system is running, uninterrupted.
Therefore, the only potential considerations in backing up PostgreSQL center around performance. An
exceptionally large database may take a while to dump all of its contents. The use of large objects may also be a factor
if you intend to back up large object data with pg_dump's -b flag (thus adding
to the amount of data needing to be output).
If you have a large database that takes a substantial amount of time to complete a dump, it is recommended that you
schedule the pg_dump execution for a time when the database is not heavily used. Even though a dump
does not block users from requesting and completing transactions, it can still slow down the general performance of such
a system during heavy, or even medium usage.
With respect to restoration there are several more considerations to be taken into account than when merely backing
up data. Specifically, these apply to how "deep" a restoration must go; restoring just the data is a very different
operation from totally recreating the database from scratch, and the restrictions involved scale with the depth of the
operation.
The least restrictive kind of restoration is one which restores data only. This can be
executed while users are actively connected to the database. It may even be executed while connected users are in the
middle of transactions. This is possible through PostgreSQL's Multiversion Concurrency Control (MVCC). Such a restore can be performed on
the fly, without having to restart the database system. Once modifications are synchronized with the database, the
changes are immediately available to connected users.
A restoration involving dropping and recreating database schema (e.g., tables, views, etc.) may also be performed
while the system is running. This method is not as seamless as a data-only restoration, because database objects will
briefly be removed from the system, which may cause temporary problems to applications relying on certain objects to
exist. The exact nature of such a restriction is dependent on the nature of the application accessing the
database.
The most restrictive kind of restoration is one which involves dropping the database itself. If such a restoration
is scheduled, it must be done at a time when no other user is connected to that database. The
DROP DATABASE command will fail if any user is actively connected at the time it is
executed.
It may in fact be necessary to shut down and restart PostgreSQL with TCP/IP connections disabled if a highly-used
database is intended to be dropped and recreated from scratch; this will prevent any external machine from connecting to
the database server until the work is completed.
While PostgreSQL abstracts the literal data files from its users, all of the data within PostgreSQL databases can be
found in normal system files. These files are constantly in a state of flux when the database is running, as not all
changes made in PostgreSQL are immediately written to disk. These files are stored within the PostgreSQL data directory (e.g.,
/usr/local/pgsql/data, or whatever path is specified by the
PGDATA environment variable).
Rather than creating a set of SQL instructions to re-create a database, you may instead opt to stop the PostgreSQL
server (in order to ensure that all changes are synchronized with the hard disk) and create a backup of that area of the
filesystem. Typically this is done with the tar utility, and optionally compressed with a compression
utility such as gzip.
Example 9-25 demonstrates the backing up of a filesystem in which PostgreSQL
keeps its data, as performed by the user who owns the data files (which is the same user that runs PostgreSQL's backend). In
this case, the path is /usr/local/pgsql/data, and the system user is
postgres.
Example 9-25. Backing up the PostgreSQL filesystem
[postgres@booktown ~]$ cd /usr/local/pgsql
[postgres@booktown pgsql]$ pg_ctl stop
Smart Shutdown request at Fri Sep 14 14:54:15 2001
DEBUG: shutting down
waiting for postmaster to shut down......DEBUG: database system is shut down
done
postmaster successfully shut down
[postgres@booktown pgsql]$ tar czf pgsql.bak.tar.gz data/
[postgres@booktown pgsql]$ ls -l *.tar.gz
-rw-rw-r--    1 postgres postgres 66124795 Sep 14 14:36 pgsql.bak.tar.gz
Notice that the pg_ctl command is called before the tar process, to stop
the PostgreSQL backend (if installed, the SysV script may be invoked with the service command to the
same end). As stated, this is to ensure that the most recent changes to the database have been synchronized with the hard
disk, as well as to ensure that none of the data files are modified during the backup.
The primary advantage to backing up your data in this fashion is that you have a literal backup of
PostgreSQL's data files. In order to restore a crashed database from this kind of file, the archive needs to be decompressed into the
appropriate location and the backend restarted. There is no need for negotiation of options, ownership, or
potential conflicts between the pg_dump output and restoring it to a live PostgreSQL server through
sequential SQL statements.
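As a minimal sketch, restoring from the archive created in Example 9-25 (assuming the same /usr/local/pgsql location, and that the PGDATA environment variable points to the restored data directory) might look like this:
[postgres@booktown pgsql]$ tar xzf pgsql.bak.tar.gz
[postgres@booktown pgsql]$ pg_ctl start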
However, while this method is easier to implement, it presents several limitations. First, the database must be shut
down completely to backup or restore the data, eliminating the primary advantage of a hot backup-capable DBMS, which is
limited downtime. Further, it is not possible to back up only specific databases or tables. The entire data directory must
be backed up for a complete restoration of the filesystem. This is because there are many files associated with a
particular database, and it is not obvious which files correlate to which databases.
Finally, because more than abstract information is represented on disk by a live database, a much greater amount of
disk space is required to back up even a compressed copy of the entire data directory.