|
|
|
|
As demonstrated by the use of the WHERE clause on two table sources in the Section called Selecting Sources with the FROM Clause
," you have the ability to retrieve data from different data sources by
combining their columns into joined rows. In SQL, this process is formally called a
join
.
The essential concept behind a join is that two or more data sets, when joined, have their columns combined into a
new
set of rows containing each of the columns requested from each of the data sets. The foundation
of all joins is the
Cartesian product
, which is the set of all possible combinations between two
data sets. That product may then be refined into a smaller subset by a set of criteria in the
JOIN syntax. These criteria describe a relationship between data sets, though such a definition
is not required.
There are three general types of joins:
-
Cross joins
-
Creates a Cartesian product (or cross product) between two sets of data. It is called a product
because it does not define a relationship between the sets; instead, it returns every possible combination of rows
between the joined sets, essentially multiplying the sources by one another.
-
Inner joins
-
Creates a subset of the Cartesian product between two sets of data, requiring a
conditional clause to specify criteria upon which to join records. The condition must
return a Boolean value to determine whether or not a row is included in the joined set.
-
Outer joins
-
Similar to an inner join, in that it accepts criteria which will match rows between two sets of
data, but returns at least one instance of each row from a specified set. This is either the left set (the
data source to the left of the JOIN keyword), the right set (the data source to
the right of the JOIN keyword), or both sets, depending on the variety of outer
join employed. The missing column values for the empty half of the row which does not meet the join condition are returned as
NULL values.
A cross join is functionally identical to listing comma-delimited sources. It therefore should almost always be
accompanied by a WHERE clause to qualify the relationship between the joined data sets.
Example 4-35 demonstrates the same functional query used in Example 4-27, substituting the comma for the formal JOIN
syntax.
Example 4-35. A simple CROSS JOIN
booktown=#
SELECT b.id, title, a.id, last_name
booktown-#
FROM books AS b CROSS JOIN authors AS a
booktown-#
WHERE b.author_id = a.id;
id | title | id | last_name
-------+-----------------------------+-------+--------------
190 | Little Women | 16 | Alcott
156 | The Tell-Tale Heart | 115 | Poe
41472 | Practical PostgreSQL | 1212 | Worsley
2038 | Dynamic Anatomy | 1644 | Hogarth
1608 | The Cat in the Hat | 1809 | Geisel
1590 | Bartholomew and the Oobleck | 1809 | Geisel
4513 | Dune | 1866 | Herbert
4267 | 2001: A Space Odyssey | 2001 | Clarke
1501 | Goodnight Moon | 2031 | Brown
7808 | The Shining | 4156 | King
41473 | Programming Python | 7805 | Lutz
41477 | Learning Python | 7805 | Lutz
41478 | Perl Cookbook | 7806 | Christiansen
25908 | Franklin in the Dark | 15990 | Bourgeois
1234 | The Velveteen Rabbit | 25041 | Bianco
(15 rows)
This syntax is merely a more formal way of stating the relationship between the two data sets. There is no functional
difference between the CROSS JOIN syntax and using a simple comma delimited list of columns.
More useful are the inner and outer joins, which
require
a qualification of the relationship
between joined data sets in the JOIN syntax itself. The following is the syntax for an
inner or outer join:
source1
[ NATURAL ]
join_type
source2
[ ON (
condition
[, ...] ) | USING (
column
[, ...] ) ]
-
source1
-
Identifies the first data set that is being joined (i.e., a table name or sub-query).
-
[ NATURAL ]
-
Implies that the two data sets should be joined on equivalent values between like-named columns
(e.g., if two tables have a column called id, it will join rows where the
id values are equivalent). The NATURAL clause
will respect column aliases, if applied. The use of the NATURAL clause makes it both
unnecessary and invalid to try to specify either of the ON or
USING clauses.
-
join_type
-
Specifies the type of JOIN intended. Valid values in this context
are [ INNER ] JOIN (specifying just JOIN implies
an INNER JOIN), LEFT [ OUTER] JOIN,
RIGHT [ OUTER ] JOIN, and FULL [ OUTER ] JOIN.
-
source2
-
Identifies the second data set that is being joined (i.e., a table name, or sub-query).
-
[ ON (
condition
[, ...] ) |
-
Identifies the second data set that is being joined (i.e., a table name, or sub-query).
Specifies the relationship between source1 and
source2. Any arbitrary criteria may be specified within the
ON clause, just as you would specify conditions following a
WHERE clause. Column and table aliases are allowed in this criteria.
-
USING (
column
[, ...] ) ]
-
Specifies like-named columns between source1 and
source2 with which to join rows by equivalent values. Similar to a
NATURAL JOIN, but allows you to indicate what specific columns to join on,
whereas NATURAL will join on
all
like-named columns.
Similar to NATURAL joins, column aliases are respected in the
USING clause's parameters.
The SQL92 INNER JOIN syntax is a tool that helps differentiate the conditions with
which you are joining data sources (the JOIN conditions) from the conditions with which
you are evaluating rows for inclusion in your data set (the WHERE conditions). For
example, consider the two SELECT statements in Example 4-36.
Example 4-36. Comparing INNER JOIN to WHERE
booktown=#
SELECT title, last_name, first_name
booktown-#
FROM books, authors
booktown-#
WHERE (books.author_id = authors.id)
booktown-#
AND last_name = 'Geisel';
title | last_name | first_name
-----------------------------+-----------+---------------
The Cat in the Hat | Geisel | Theodor Seuss
Bartholomew and the Oobleck | Geisel | Theodor Seuss
(2 rows)
booktown=#
SELECT title, last_name, first_name
booktown-#
FROM books AS b INNER JOIN authors AS a
booktown-#
ON (b.author_id = a.id)
booktown-#
WHERE last_name = 'Geisel';
title | last_name | first_name
-----------------------------+-----------+---------------
The Cat in the Hat | Geisel | Theodor Seuss
Bartholomew and the Oobleck | Geisel | Theodor Seuss
(2 rows)
The two forms of syntax in Example 4-36 are functionally identical, and return the same results.
The INNER JOIN syntax allows you to segregate the
relational criteria from your evaluation criteria by only defining the set relationships in the
ON clause. This can make involved queries much easier to read and maintain, as you do
not need to interpret what each condition described by the WHERE clause is conceptually
achieving.
Notice that the second query demonstrates the use of aliases b and
a in the ON clause for the
books and authors tables, respectively. The use of
these aliases in the ON clause is perfectly valid, and often preferable from a
perspective of improved readability.
In cases of simple equivalence joins, it may be more convenient for you to use either the
USING or NATURAL clauses instead of the
ON clause. These are only applicable on data sets with identically named columns. If you
have columns that define a relationship between two sets that are not identically named, you may still use the
USING or NATURAL clauses by employing column aliases,
as demonstrated in Example 4-37, to re-name one or both of the columns to a uniform
name.
Example 4-37. The NATURAL and USING clauses
booktown=#
SELECT title, last_name, first_name
booktown-#
FROM books INNER JOIN authors AS a (author_id)
booktown-#
USING (author_id)
booktown-#
WHERE last_name = 'Geisel';
title | last_name | first_name
-----------------------------+-----------+---------------
The Cat in the Hat | Geisel | Theodor Seuss
Bartholomew and the Oobleck | Geisel | Theodor Seuss
(2 rows)
booktown=#
SELECT title, last_name, first_name
booktown-#
FROM books NATURAL INNER JOIN authors AS a (author_id)
booktown-#
WHERE last_name = 'Geisel';
title | last_name | first_name
-----------------------------+-----------+---------------
The Cat in the Hat | Geisel | Theodor Seuss
Bartholomew and the Oobleck | Geisel | Theodor Seuss
(2 rows)
The first SELECT statement in Example 4-37 assigns
the alias of author_id to the first column in the authors
table (which is actually named id). By passing the author_id
identifier to the USING clause, PostgreSQL then searches for a column identifier in each data
set with that name to join rows on values found to be equivalent.
Inner joins are adequate for a wide variety of queries, but there are times when an outer join is required to get
all of the data you need. The key to understanding the difference between inner and outer joins is in knowing how each
type of join handles rows that do not meet their defined relationship.
In short, an inner join will discard any row for which it cannot find a corresponding value between the sets being
joined (as specified by either the ON or USING clause).
In contrast to inner joins, an outer join
can
retain rows where corresponding values between sets
are not found, populating the missing columns with
NULL values. Whether or not the outer join
does
retain that row
depends on which set is missing the value and the kind of outer join that is specified.
There are three forms of outer joins:
-
Left outer joins
-
Will always return at least one instance of each row in the set of rows to the
left of the JOIN keyword. Missing columns in the right set
are populated with NULL values.
-
Right outer joins
-
Will always return at least one instance of each row in the set of rows to the
right of the JOIN keyword. Missing columns in the left set
are populated with NULL values.
-
Full outer joins
-
Will always return at least one instance of each row in each joined set. Missing
columns on either side of the new set will be populated with NULL
values.
Consider again Book Town's books table, and another Book Town table called
editions. While the books table stores general information
on a given title, the editions table stores specific information pertaining to each edition,
such as an the book's ISBN, publisher, and publication date. The editions table has a
column called book_id which corresponds to the books table's
primary key column, id.
Suppose that you want to retrieve each of Book Town's titles, along with its isbn, if applicable. Performing a
query with an inner join between the books and editions
tables will correctly return a data set with title and isbn columns. However, as demonstrated in Example 4-38, if a book does not yet have a printed edition (or if that edition has not yet been
entered into Book Town's database), those titles will not be displayed.
In contrast, the statement immediately following the inner join in Example 4-38
employs an outer join, returning 20 rows. Three of the returned rows do not have ISBN numbers, but are not omitted due to
the definition of the join.
Example 4-38. Inner joins versus outer joins
booktown=#
SELECT title, isbn
booktown-#
FROM books INNER JOIN editions
booktown-#
ON (books.id = editions.book_id);
title | isbn
-----------------------------+------------
The Tell-Tale Heart | 1885418035
The Tell-Tale Heart | 0929605942
Little Women | 0760720002
The Velveteen Rabbit | 0679803335
Goodnight Moon | 0694003611
Bartholomew and the Oobleck | 0394800753
The Cat in the Hat | 039480001X
The Cat in the Hat | 0394900014
Dynamic Anatomy | 0823015505
2001: A Space Odyssey | 0451457994
2001: A Space Odyssey | 0451198492
Dune | 0441172717
Dune | 044100590X
The Shining | 0451160916
The Shining | 0385121679
Franklin in the Dark | 0590445065
Programming Python | 0596000855
(17 rows)
booktown=#
SELECT title, isbn
booktown-#
FROM books LEFT OUTER JOIN editions
booktown-#
ON (books.id = editions.book_id);
title | isbn
-----------------------------+------------
The Tell-Tale Heart | 1885418035
The Tell-Tale Heart | 0929605942
Little Women | 0760720002
The Velveteen Rabbit | 0679803335
Goodnight Moon | 0694003611
Bartholomew and the Oobleck | 0394800753
The Cat in the Hat | 039480001X
The Cat in the Hat | 0394900014
Dynamic Anatomy | 0823015505
2001: A Space Odyssey | 0451457994
2001: A Space Odyssey | 0451198492
Dune | 0441172717
Dune | 044100590X
The Shining | 0451160916
The Shining | 0385121679
Franklin in the Dark | 0590445065
Practical PostgreSQL |
Programming Python | 0596000855
Learning Python |
Perl Cookbook |
(20 rows)
The join specified by the second query in Example 4-38 uses the
LEFT OUTER JOIN clause to define its join type. This is because
the query focuses on titles from the books table that
have ISBN numbers, and not those editions having ISBN numbers that do not correspond to titles. As the
books table is to the left of the JOIN
keyword, it is defined as a left outer join to achieve this. If the focus of the query was to see both ISBN numbers
without titles as well as titles without ISBN numbers, the same query could instead be modified to be a full outer join
with the FULL OUTER JOIN clause.
The difference between inner and outer joins illustrated in Example 4-38 is a vital
concept to understand, as misuse of joins can lead to both omitted and unexpected rows.
Note: The actual OUTER keyword is an optional term in a PostgreSQL outer join. Specifying a
join as either a LEFT JOIN, RIGHT JOIN or
FULL JOIN implicitly defines it as an outer join.
It should be understood that while a single JOIN clause connects only two sets of
data, in practice, joins are not restricted to only two data sources. You may arbitrarily specify numerous
JOIN clauses following sets that are themselves constructed from joins, just as you may
specify numerous data sources separated by commas.
When connecting several joins together, it is a good practice to group each join and sub-join within parentheses.
Explicitly grouping joins in this fashion insures that there is no ambiguity, to either PostgreSQL or a developer, as to
which data sets are joined, and in what order.
Example 4-39. Joining many data sources
booktown=#
SELECT a.last_name, p.name AS publisher, e.isbn, s.subject
booktown-#
FROM ((((authors AS a INNER JOIN books AS b
booktown(#
ON (a.id = b.author_id))
booktown(#
INNER JOIN editions AS e ON (e.book_id = b.id))
booktown(#
INNER JOIN publishers AS p ON (p.id = e.publisher_id))
booktown(#
INNER JOIN subjects AS s ON (s.id = b.subject_id));
last_name | publisher | isbn | subject
-----------+-----------------------------+------------+------------------
Hogarth | Watson-Guptill Publications | 0823015505 | Arts
Brown | HarperCollins | 0694003611 | Children's Books
Geisel | Random House | 0394800753 | Children's Books
Geisel | Random House | 039480001X | Children's Books
Geisel | Random House | 0394900014 | Children's Books
Bourgeois | Kids Can Press | 0590445065 | Children's Books
Bianco | Penguin | 0679803335 | Classics
Lutz | O'Reilly & Associates | 0596000855 | Computers
Alcott | Henry Holt & Company, Inc. | 0760720002 | Drama
Poe | Mojo Press | 1885418035 | Horror
Poe | Books of Wonder | 0929605942 | Horror
King | Doubleday | 0451160916 | Horror
King | Doubleday | 0385121679 | Horror
Clarke | Roc | 0451457994 | Science Fiction
Clarke | Roc | 0451198492 | Science Fiction
Herbert | Ace Books | 0441172717 | Science Fiction
Herbert | Ace Books | 044100590X | Science Fiction
(17 rows)
An interesting observation to be made about Example 4-39 is that, while the
books table is itself deeply involved in the join, none of its columns are retrieved in
the final result set. The books table is included in the JOIN clauses
in order to provide criteria through which other tables are joined together. Each of the tables whose columns are retrieved in the query rely on
the books table in order to draw relationships with any other table through the id
column (with the exception of the publishers table, which relates to the publisher_id
column in the editions table).
|
|
|