The GROUP BY clause introduces a powerful SQL concept:
aggregation
. To aggregate means to gather into a sum, or whole. The practical effect of aggregating
in a SQL query is that any rows whose results from the GROUP BY expression match
identically are grouped together into a
single aggregate row
. The GROUP BY
expression may define a column, but it may also be any operation upon a column as well. If several columns or expressions
are specified (delimited by commas), the entire set of specified criteria must be identical for rows to be
grouped together.
To effectively use aggregation you must understand that any target columns requested by an aggregating query
which are
not
specified in the GROUP BY clause will be
inaccessible, unless selected through an
aggregate function
. An aggregate function accepts a column
name (or expression involving at least one column name) which can represent
several values
(i.e., from
several grouped rows), performs an operation on those values, and returns a single value.
Common aggregate functions include count(), which returns the number of rows in the
set, max(), which returns the maximum value in the column, and
min(), which returns the minimum value in the column. An aggregate function operates
only on rows in the query's result set, and is therefore executed
after
conditional joins and
WHERE conditions have been processed.
Imagine that you wanted to know how many books Book Town stores in its database for each known publisher. You
could perform a simple join between the editions and
publishers tables in order to associate each publisher name with a title that they publish.
It would be tedious to manually count how many titles each publisher maintained, and in cases of larger data sets,
it can become difficult to manage larger result sets.
Example 4-40 demonstrates a join between these two Book Town tables, but also introduces two new
elements: the count() function, and the GROUP BY clause.
Example 4-40. Using GROUP BY
booktown=#
SELECT count(e.isbn) AS "number of books",
booktown-#
p.name AS publisher
booktown-#
FROM editions AS e INNER JOIN publishers AS p
booktown-#
ON (e.publisher_id = p.id)
booktown-#
GROUP BY p.name;
number of books | publisher
-----------------+-----------------------------
2 | Ace Books
1 | Books of Wonder
2 | Doubleday
1 | HarperCollins
1 | Henry Holt & Company, Inc.
1 | Kids Can Press
1 | Mojo Press
1 | O'Reilly & Associates
1 | Penguin
3 | Random House
2 | Roc
1 | Watson-Guptill Publications
(12 rows)
The GROUP BY clause in Example 4-40 instructs PostgreSQL to group
the rows in the joined data set by p.name, which in this query is a reference to the
name column in the publishers table. Therefore, any
rows that have the same publisher name will be grouped together, or aggregated. The
count() function then counts the number of isbn values
from the editions table that are in each aggregated row, and returns a single numeric
value representing the number of rows that were aggregated for each unique publisher.
Note that in Example 4-40 the argument of the editions table's
isbn column is chosen simply to indicate the objective of the example
(to count how many books there are per publisher). Any column name will return the same number, as
count() will always return the number of rows grouped in the current aggregate row.
Something to watch out for when designing aggregate queries is that the WHERE clause
cannot accept criteria involving aggregate functions. Instead, use the HAVING clause. It functions identically to the WHERE clause, but its conditions must be on aggregate
functions rather than single-row conditions. Syntactically, the HAVING clause follows the
GROUP BY clause, as demonstrated in Example 4-41.
Example 4-41. Using the HAVING clause
booktown=#
SELECT count(e.isbn) AS "number of books",
booktown-#
p.name AS publisher
booktown-#
FROM editions AS e INNER JOIN publishers AS p
booktown-#
ON (e.publisher_id = p.id)
booktown-#
GROUP BY publisher
booktown-#
HAVING count(e.isbn) > 1;
number of books | publisher
-----------------+--------------
2 | Ace Books
2 | Doubleday
3 | Random House
2 | Roc
(4 rows)
Both Example 4-40 and Example 4-41 create a data set through an inner join
between the editions and publishers table. However,
Example 4-41 constrains the final result to publishers having more than a single book in the
Book Town database, as set by the HAVING clause.
Note: If a result set's column is aliased via an AS clause to a name that overlaps
with a real column in one of the source data sets, and used in the GROUP BY clause,
PostgreSQL will assume that you are referring to the input column, not the output alias.