Follow Techotopia on Twitter

On-line Guides
All Guides
eBook Store
iOS / Android
Linux for Beginners
Office Productivity
Linux Installation
Linux Security
Linux Utilities
Linux Virtualization
Linux Kernel
System/Network Admin
Programming
Scripting Languages
Development Tools
Web Development
GUI Toolkits/Desktop
Databases
Mail Systems
openSolaris
Eclipse Documentation
Techotopia.com
Virtuatopia.com
Answertopia.com

How To Guides
Virtualization
General System Admin
Linux Security
Linux Filesystems
Web Servers
Graphics & Desktop
PC Hardware
Windows
Problem Solutions
Privacy Policy

  




 

 

Databases - Practical PostgreSQL
Previous Page Home Next Page

Joining Data Sets with JOIN

As demonstrated by the use of the WHERE clause on two table sources in the Section called Selecting Sources with the FROM Clause ," you have the ability to retrieve data from different data sources by combining their columns into joined rows. In SQL, this process is formally called a join .

The essential concept behind a join is that two or more data sets, when joined, have their columns combined into a new set of rows containing each of the columns requested from each of the data sets. The foundation of all joins is the Cartesian product , which is the set of all possible combinations between two data sets. That product may then be refined into a smaller subset by a set of criteria in the JOIN syntax. These criteria describe a relationship between data sets, though such a definition is not required.

There are three general types of joins:

Cross joins

Creates a Cartesian product (or cross product) between two sets of data. It is called a product because it does not define a relationship between the sets; instead, it returns every possible combination of rows between the joined sets, essentially multiplying the sources by one another.

Inner joins

Creates a subset of the Cartesian product between two sets of data, requiring a conditional clause to specify criteria upon which to join records. The condition must return a Boolean value to determine whether or not a row is included in the joined set.

Outer joins

Similar to an inner join, in that it accepts criteria which will match rows between two sets of data, but returns at least one instance of each row from a specified set. This is either the left set (the data source to the left of the JOIN keyword), the right set (the data source to the right of the JOIN keyword), or both sets, depending on the variety of outer join employed. The missing column values for the empty half of the row which does not meet the join condition are returned as NULL values.

Cross joins

A cross join is functionally identical to listing comma-delimited sources. It therefore should almost always be accompanied by a WHERE clause to qualify the relationship between the joined data sets. Example 4-35 demonstrates the same functional query used in Example 4-27, substituting the comma for the formal JOIN syntax.

Example 4-35. A simple CROSS JOIN

booktown=# 
SELECT b.id, title, a.id, last_name

booktown-# 
       FROM books AS b CROSS JOIN authors AS a

booktown-# 
       WHERE b.author_id = a.id;

  id   |            title            |  id   |  last_name
-------+-----------------------------+-------+--------------
   190 | Little Women                |    16 | Alcott
   156 | The Tell-Tale Heart         |   115 | Poe
 41472 | Practical PostgreSQL        |  1212 | Worsley
  2038 | Dynamic Anatomy             |  1644 | Hogarth
  1608 | The Cat in the Hat          |  1809 | Geisel
  1590 | Bartholomew and the Oobleck |  1809 | Geisel
  4513 | Dune                        |  1866 | Herbert
  4267 | 2001: A Space Odyssey       |  2001 | Clarke
  1501 | Goodnight Moon              |  2031 | Brown
  7808 | The Shining                 |  4156 | King
 41473 | Programming Python          |  7805 | Lutz
 41477 | Learning Python             |  7805 | Lutz
 41478 | Perl Cookbook               |  7806 | Christiansen
 25908 | Franklin in the Dark        | 15990 | Bourgeois
  1234 | The Velveteen Rabbit        | 25041 | Bianco
(15 rows)

This syntax is merely a more formal way of stating the relationship between the two data sets. There is no functional difference between the CROSS JOIN syntax and using a simple comma delimited list of columns.

Inner and outer join syntax

More useful are the inner and outer joins, which require a qualification of the relationship between joined data sets in the JOIN syntax itself. The following is the syntax for an inner or outer join:

  
source1
 [ NATURAL ] 
join_type
 
source2

  [ ON ( 
condition
 [, ...] ) | USING ( 
column
 [, ...] ) ]

source1

Identifies the first data set that is being joined (i.e., a table name or sub-query).

[ NATURAL ]

Implies that the two data sets should be joined on equivalent values between like-named columns (e.g., if two tables have a column called id, it will join rows where the id values are equivalent). The NATURAL clause will respect column aliases, if applied. The use of the NATURAL clause makes it both unnecessary and invalid to try to specify either of the ON or USING clauses.

join_type

Specifies the type of JOIN intended. Valid values in this context are [ INNER ] JOIN (specifying just JOIN implies an INNER JOIN), LEFT [ OUTER] JOIN, RIGHT [ OUTER ] JOIN, and FULL [ OUTER ] JOIN.

source2

Identifies the second data set that is being joined (i.e., a table name, or sub-query).

[ ON ( condition [, ...] ) |

Identifies the second data set that is being joined (i.e., a table name, or sub-query).

Specifies the relationship between source1 and source2. Any arbitrary criteria may be specified within the ON clause, just as you would specify conditions following a WHERE clause. Column and table aliases are allowed in this criteria.

USING ( column [, ...] ) ]

Specifies like-named columns between source1 and source2 with which to join rows by equivalent values. Similar to a NATURAL JOIN, but allows you to indicate what specific columns to join on, whereas NATURAL will join on all like-named columns. Similar to NATURAL joins, column aliases are respected in the USING clause's parameters.

Inner joins

The SQL92 INNER JOIN syntax is a tool that helps differentiate the conditions with which you are joining data sources (the JOIN conditions) from the conditions with which you are evaluating rows for inclusion in your data set (the WHERE conditions). For example, consider the two SELECT statements in Example 4-36.

Example 4-36. Comparing INNER JOIN to WHERE

booktown=# 
SELECT title, last_name, first_name

booktown-# 
       FROM books, authors

booktown-# 
       WHERE (books.author_id = authors.id)

booktown-# 
       AND last_name = 'Geisel';

            title            | last_name |  first_name
-----------------------------+-----------+---------------
 The Cat in the Hat          | Geisel    | Theodor Seuss
 Bartholomew and the Oobleck | Geisel    | Theodor Seuss
(2 rows)

booktown=# 
SELECT title, last_name, first_name

booktown-# 
       FROM books AS b INNER JOIN authors AS a

booktown-# 
       ON (b.author_id = a.id)

booktown-# 
       WHERE last_name = 'Geisel';

            title            | last_name |  first_name
-----------------------------+-----------+---------------
 The Cat in the Hat          | Geisel    | Theodor Seuss
 Bartholomew and the Oobleck | Geisel    | Theodor Seuss
(2 rows)

The two forms of syntax in Example 4-36 are functionally identical, and return the same results. The INNER JOIN syntax allows you to segregate the relational criteria from your evaluation criteria by only defining the set relationships in the ON clause. This can make involved queries much easier to read and maintain, as you do not need to interpret what each condition described by the WHERE clause is conceptually achieving.

Notice that the second query demonstrates the use of aliases b and a in the ON clause for the books and authors tables, respectively. The use of these aliases in the ON clause is perfectly valid, and often preferable from a perspective of improved readability.

In cases of simple equivalence joins, it may be more convenient for you to use either the USING or NATURAL clauses instead of the ON clause. These are only applicable on data sets with identically named columns. If you have columns that define a relationship between two sets that are not identically named, you may still use the USING or NATURAL clauses by employing column aliases, as demonstrated in Example 4-37, to re-name one or both of the columns to a uniform name.

Example 4-37. The NATURAL and USING clauses

booktown=# 
SELECT title, last_name, first_name

booktown-# 
       FROM books INNER JOIN authors AS a (author_id)

booktown-# 
       USING (author_id)

booktown-# 
       WHERE last_name = 'Geisel';

            title            | last_name |  first_name
-----------------------------+-----------+---------------
 The Cat in the Hat          | Geisel    | Theodor Seuss
 Bartholomew and the Oobleck | Geisel    | Theodor Seuss
(2 rows)


booktown=# 
SELECT title, last_name, first_name

booktown-# 
       FROM books NATURAL INNER JOIN authors AS a (author_id)

booktown-# 
       WHERE last_name = 'Geisel';

            title            | last_name |  first_name
-----------------------------+-----------+---------------
 The Cat in the Hat          | Geisel    | Theodor Seuss
 Bartholomew and the Oobleck | Geisel    | Theodor Seuss
(2 rows)

The first SELECT statement in Example 4-37 assigns the alias of author_id to the first column in the authors table (which is actually named id). By passing the author_id identifier to the USING clause, PostgreSQL then searches for a column identifier in each data set with that name to join rows on values found to be equivalent.

Inner joins are adequate for a wide variety of queries, but there are times when an outer join is required to get all of the data you need. The key to understanding the difference between inner and outer joins is in knowing how each type of join handles rows that do not meet their defined relationship.

In short, an inner join will discard any row for which it cannot find a corresponding value between the sets being joined (as specified by either the ON or USING clause).

Outer joins

In contrast to inner joins, an outer join can retain rows where corresponding values between sets are not found, populating the missing columns with NULL values. Whether or not the outer join does retain that row depends on which set is missing the value and the kind of outer join that is specified.

There are three forms of outer joins:

Left outer joins

Will always return at least one instance of each row in the set of rows to the left of the JOIN keyword. Missing columns in the right set are populated with NULL values.

Right outer joins

Will always return at least one instance of each row in the set of rows to the right of the JOIN keyword. Missing columns in the left set are populated with NULL values.

Full outer joins

Will always return at least one instance of each row in each joined set. Missing columns on either side of the new set will be populated with NULL values.

Consider again Book Town's books table, and another Book Town table called editions. While the books table stores general information on a given title, the editions table stores specific information pertaining to each edition, such as an the book's ISBN, publisher, and publication date. The editions table has a column called book_id which corresponds to the books table's primary key column, id.

Suppose that you want to retrieve each of Book Town's titles, along with its isbn, if applicable. Performing a query with an inner join between the books and editions tables will correctly return a data set with title and isbn columns. However, as demonstrated in Example 4-38, if a book does not yet have a printed edition (or if that edition has not yet been entered into Book Town's database), those titles will not be displayed.

In contrast, the statement immediately following the inner join in Example 4-38 employs an outer join, returning 20 rows. Three of the returned rows do not have ISBN numbers, but are not omitted due to the definition of the join.

Example 4-38. Inner joins versus outer joins

booktown=# 
SELECT title, isbn

booktown-# 
       FROM books INNER JOIN editions

booktown-# 
       ON (books.id = editions.book_id);

            title            |    isbn
-----------------------------+------------
 The Tell-Tale Heart         | 1885418035
 The Tell-Tale Heart         | 0929605942
 Little Women                | 0760720002
 The Velveteen Rabbit        | 0679803335
 Goodnight Moon              | 0694003611
 Bartholomew and the Oobleck | 0394800753
 The Cat in the Hat          | 039480001X
 The Cat in the Hat          | 0394900014
 Dynamic Anatomy             | 0823015505
 2001: A Space Odyssey       | 0451457994
 2001: A Space Odyssey       | 0451198492
 Dune                        | 0441172717
 Dune                        | 044100590X
 The Shining                 | 0451160916
 The Shining                 | 0385121679
 Franklin in the Dark        | 0590445065
 Programming Python          | 0596000855
(17 rows)

booktown=# 
SELECT title, isbn

booktown-# 
       FROM books LEFT OUTER JOIN editions

booktown-# 
       ON (books.id = editions.book_id);

            title            |    isbn
-----------------------------+------------
 The Tell-Tale Heart         | 1885418035
 The Tell-Tale Heart         | 0929605942
 Little Women                | 0760720002
 The Velveteen Rabbit        | 0679803335
 Goodnight Moon              | 0694003611
 Bartholomew and the Oobleck | 0394800753
 The Cat in the Hat          | 039480001X
 The Cat in the Hat          | 0394900014
 Dynamic Anatomy             | 0823015505
 2001: A Space Odyssey       | 0451457994
 2001: A Space Odyssey       | 0451198492
 Dune                        | 0441172717
 Dune                        | 044100590X
 The Shining                 | 0451160916
 The Shining                 | 0385121679
 Franklin in the Dark        | 0590445065
 Practical PostgreSQL        |
 Programming Python          | 0596000855
 Learning Python             |
 Perl Cookbook               |
(20 rows)

The join specified by the second query in Example 4-38 uses the LEFT OUTER JOIN clause to define its join type. This is because the query focuses on titles from the books table that have ISBN numbers, and not those editions having ISBN numbers that do not correspond to titles. As the books table is to the left of the JOIN keyword, it is defined as a left outer join to achieve this. If the focus of the query was to see both ISBN numbers without titles as well as titles without ISBN numbers, the same query could instead be modified to be a full outer join with the FULL OUTER JOIN clause.

The difference between inner and outer joins illustrated in Example 4-38 is a vital concept to understand, as misuse of joins can lead to both omitted and unexpected rows.

Note: The actual OUTER keyword is an optional term in a PostgreSQL outer join. Specifying a join as either a LEFT JOIN, RIGHT JOIN or FULL JOIN implicitly defines it as an outer join.

Intricate joins

It should be understood that while a single JOIN clause connects only two sets of data, in practice, joins are not restricted to only two data sources. You may arbitrarily specify numerous JOIN clauses following sets that are themselves constructed from joins, just as you may specify numerous data sources separated by commas.

When connecting several joins together, it is a good practice to group each join and sub-join within parentheses. Explicitly grouping joins in this fashion insures that there is no ambiguity, to either PostgreSQL or a developer, as to which data sets are joined, and in what order.

Example 4-39. Joining many data sources

booktown=# 
SELECT a.last_name, p.name AS publisher, e.isbn, s.subject

booktown-# 
       FROM ((((authors AS a INNER JOIN books AS b

booktown(# 
                ON (a.id = b.author_id))

booktown(# 
       INNER JOIN editions AS e ON (e.book_id = b.id))

booktown(# 
       INNER JOIN publishers AS p ON (p.id = e.publisher_id))

booktown(# 
       INNER JOIN subjects AS s ON (s.id = b.subject_id));

 last_name |          publisher          |    isbn    |     subject
-----------+-----------------------------+------------+------------------
 Hogarth   | Watson-Guptill Publications | 0823015505 | Arts
 Brown     | HarperCollins               | 0694003611 | Children's Books
 Geisel    | Random House                | 0394800753 | Children's Books
 Geisel    | Random House                | 039480001X | Children's Books
 Geisel    | Random House                | 0394900014 | Children's Books
 Bourgeois | Kids Can Press              | 0590445065 | Children's Books
 Bianco    | Penguin                     | 0679803335 | Classics
 Lutz      | O'Reilly & Associates       | 0596000855 | Computers
 Alcott    | Henry Holt & Company, Inc.  | 0760720002 | Drama
 Poe       | Mojo Press                  | 1885418035 | Horror
 Poe       | Books of Wonder             | 0929605942 | Horror
 King      | Doubleday                   | 0451160916 | Horror
 King      | Doubleday                   | 0385121679 | Horror
 Clarke    | Roc                         | 0451457994 | Science Fiction
 Clarke    | Roc                         | 0451198492 | Science Fiction
 Herbert   | Ace Books                   | 0441172717 | Science Fiction
 Herbert   | Ace Books                   | 044100590X | Science Fiction
(17 rows)

An interesting observation to be made about Example 4-39 is that, while the books table is itself deeply involved in the join, none of its columns are retrieved in the final result set. The books table is included in the JOIN clauses in order to provide criteria through which other tables are joined together. Each of the tables whose columns are retrieved in the query rely on the books table in order to draw relationships with any other table through the id column (with the exception of the publishers table, which relates to the publisher_id column in the editions table).

Databases - Practical PostgreSQL
Previous Page Home Next Page

 
 
  Published under the terms of the Open Publication License Design by Interspire