Saturday, November 08, 2014

TSQL Tutorial : Joins

Joins
A query that can access all the data it needs in a single table is a pretty rare one. John Donne said "no
man is an island," and, in relational databases, no table is, either. Usually, a query will have to go to two
or more tables to find all the information it requires. This is the way of things with relational databases.
Data is intentionally spread out to keep it as modular as possible. There are lots of good reasons for this
modularization (formally known as normalization) that I won't go into here, but one of its downsides is
that what might be a single conceptual entity (an invoice, for example) is often split into multiple physical
entities when constructed in a relational database.
Dealing with this fragmentation is where joins come in. A join consolidates the data in two tables into a
Single result set. The tables aren't actually merged; they just appear to be in the rows returned by the
query. Multiple joins can consolidate multiple tables—it's quite common to see joins that are multiple
levels deep involving scads of tables.
A join between two tables is established by linking a column or columns in one table with those in another
(CROSS JOINs are an exception, but more on them later). The expression used to join the two tables
constitutes the join condition or join criterion. When the join is successful, data in the second table is
combined with the first to form a composite result set—a set of rows containing data from both tables. In
short, the two tables have a baby, albeit an evanescent one.
There are two basic types of joins, inner joins and outer joins. The key difference between them is that
Outer joins include rows in the result set even when the join condition isn't met, while an inner join
doesn't. How is this? What data ends up in the result set when the join condition fails? When the join
criteria in an outer join aren't met, columns in the first table are returned normally, but columns from the
second table are returned with no value—as NULLs. This is handy for finding missing values and broken
links between tables.
There are two families of syntax for constructing joins—legacy and ANSI/ISO SQL-92 compliant. The
Legacy syntax dates back to SQL Server's days as a joint venture between Sybase and Microsoft. It's
more succinct than the ANSI syntax and looks like this:
SELECT customers.CustomerNumber, orders.Amount
FROM customers, orders
WHERE customers.CustomerNumber=orders.CustomerNumber
CustomerNumber Amount
-------------- --------
1 123.45
2 678.90
3 86753.09
1 678.90
Note the use of the WHERE clause to join the customers and orders tables together. This is an inner join.
If an order doesn't exist for a given customer, that customer is omitted completely from the list. Here's
the ANSI version of the same query:
SELECT customers.CustomerNumber, orders.Amount
FROM customers JOIN orders ON (customers.CustomerNumber=orders.CustomerNumber)
This one's a bit loquacious, but the end result is the same: customers and orders are merged using their
respective CustomerNumber columns.
As I mentioned earlier, it's common for queries to construct multilevel joins. Here's an example of a
Multilevel join that uses the legacy syntax:
SELECT customers.CustomerNumber, orders.Amount, items.Description
FROM customers, orders, items
WHERE customers.CustomerNumber=orders.CustomerNumber
AND orders.ItemNumber=items.ItemNumber
CustomerNumber Amount Description
-------------- -------- -----------
1 123.45 WIDGET A
2 678.90 WIDGET B
3 86753.09 WIDGET C
1 678.90 WIDGET B
This query joins the composite of the customers table and the orders table with the items table. Note that
The exact ordering of the WHERE clause is unimportant. In order to allow servers to fully optimize queries,
SQL requires that the ordering of the predicates in a WHERE clause must not affect the result set. They
must be associative—the query must return the same result regardless of the order in which they're
processed.
As with the two-table join, the ANSI syntax for multitable inner joins is similar to the legacy syntax. Here's
The ANSI syntax for the multitable join above:
SELECT customers.CustomerNumber, orders.Amount, items.Description
FROM customers JOIN orders ON (customers.CustomerNumber=orders.CustomerNumber)
JOIN items ON (orders.ItemNumber=items.ItemNumber)
Again, it's a bit wordier, but it performs the same function.
Outer Joins
Thus far, there hasn't been a stark contrast between the ANSI and legacy join syntaxes. Though not
syntactically identical, they seem to be functionally equivalent.
This all changes with outer joins. The ANSI outer join syntax addresses ambiguities inherent in using the
WHERE clause—whose terms are by definition associative—to perform table joins. Here's an example of
The legacy syntax that contains such ambiguities:

-- Bad SQL - Don't run
SELECT customers.CustomerNumber, orders.Amount, items.Description
FROM customers, orders, items
WHERE customers.CustomerNumber*=orders.CustomerNumber
AND orders.ItemNumber*=items.ItemNumber

Don't bother trying to run this—SQL Server won't allow it. Why? Because WHERE clause terms are
Required to be associative, but these aren't. If customers and orders are joined first, those rows where a
Customer exists but has no orders will be impossible to join with the items table since their ItemNumber
column will be NULL. On the other hand, if orders and items are joined first, the result set will include
ITEM records it likely would have otherwise missed. So the order of the terms in the WHERE clause is
significant when constructing multilevel joins using the legacy syntax.
It's precisely because of this ambiguity—whether the ordering of WHERE clause predicates is significant—
that the SQL-92 standard moved join construction to the FROM clause. Here's the above query rewritten
using valid ANSI join syntax:
SELECT customers.CustomerNumber, orders.Amount, items.Description
FROM customers LEFT OUTER JOIN orders ON
(customers.CustomerNumber=orders.CustomerNumber)
LEFT OUTER JOIN items ON (orders.ItemNumber=items.ItemNumber)
CustomerNumber Amount Description
-------------- -------- -----------
1 123.45 WIDGET A
1 678.90 WIDGET B
2 678.90 WIDGET B
3 86753.09 WIDGET C
Here, the ambiguities are gone, and it's clear that the query is first supposed to join the customers and
orders tables, then join the result with the items table. (Note that the OUTER keyword is optional.)
To understand how this shortcoming in the legacy syntax can affect query results, consider the following
query. We'll set it up initially so that the outer join works as expected:

SELECT customers.CustomerNumber, orders.Amount
FROM customers, orders
WHERE customers.CustomerNumber*=orders.CustomerNumber
AND orders.Amount>600
CustomerNumber Amount
-------------- --------
1 678.90
2 678.90
3 86753.09
Since every row in customers finds a match in orders, the problem isn't obvious. Now let's change the
query so that there are a few mismatches between the tables, like so:

SELECT customers.CustomerNumber+2, orders.Amount
FROM customers, orders
WHERE customers.CustomerNumber+2*=orders.CustomerNumber
AND orders.Amount>600
This version simply adds 2 to CustomerNumber to ensure that at least a few of the joins will fail and the
columns in orders will be returned as NULLs. Here's the result set:
CustomerNumber Amount
-------------- --------
3 86753.09
4 NULL
5 NULL
See the problem? Those last two rows shouldn't be there. Amount is NULL in those rows (because there
are no orders for customers4 and5), and whether it exceeds $600 is unknown. The query is supposed to
return only those rows whose Amount column is known to exceed $600, but that's not the case. Here's
the ANSI version of the same query:

SELECT customers.CustomerNumber+2, orders.Amount
FROM customers LEFT OUTER JOIN orders ON
(customers.CustomerNumber+2=orders.CustomerNumber)
WHERE orders.Amount>600
CustomerNumber Amount
-------------- --------
3 86753.09
The SQL-92 syntax correctly omits the rows with a NULL Amount. The reason the legacy query fails here is
that the predicates in its WHERE clause are evaluated together. When Amount is checked against the >600 predicate, it has not yet been returned as NULL, so it's erroneously included in the result set.
By the time it's set to NULL, it's already in the result set, effectively negating the >600 predicate.

Though the inner join syntax you choose is largely a matter a preference, you should still use the SQL-92
syntax whenever possible. It's hard enough keeping up with a single way of joining tables, let alone two
different ways. And, as we've seen, there are some real problems with the legacy outer join syntax.
Moreover, Microsoft strongly recommends the use of the ANSI syntax and has publicly stated that the
legacy outer join syntax will be dropped in a future release of the product. Jumping on the ANSI/ISO
bandwagon also makes sense from another perspective: interoperability. Given the way in which the
DBMS world—like the real world—is shrinking, it's not unusual for an application to communicate with or
rely upon more than one vendor's DBMS. Heterogeneous joins, passthrough queries, and vendor-to
vendor replication are now commonplace. Knowing this, it makes sense to abandon proprietary syntax
elements in favor of those that play well with others.

Other Types of Joins
Thus far, we've explored only left joins—both inner and outer. There are a few others that are worth
mentioning as well. Transact-SQL also supports RIGHT OUTER JOINs, CROSS JOINs, and FULL OUTER
JOINs.
A RIGHT OUTER JOIN isn't really that different from a LEFT OUTER JOIN. In fact, it's really just a LEFT
OUTER JOIN with the tables reversed. It's very easy to restate a LEFT OUTER JOIN as a RIGHT OUTER
JOIN. Here's the earlier LEFT OUTER JOIN query restated:

SELECT customers.CustomerNumber+2, orders.Amount
FROM orders RIGHT OUTER JOIN customers ON
(customers.CustomerNumber+2=orders.CustomerNumber)
Amount
------ --------
3 86753.09
4 NULL
5 NULL
A RIGHT JOIN returns the columns in the first table as NULLs when the join condition fails. Since you
Decide which table is the first table and which one's the second, whether you use a LEFT JOIN or a RIGHT
JOIN is largely a matter a preference.
A CROSS JOIN, by contrast, is an intentional Cartesian product. The size of a Cartesian product is the
number of rows in one table multiplied by those in the other. So for two tables with three rows each, their
CROSS JOIN or Cartesian product would consist of nine rows. By definition, CROSS JOINs don't need or
support the use of the ON clause that other joins require. Here's a CROSS JOIN of the customers and
orders tables:

SELECT customers.CustomerNumber, orders.Amount
FROM orders CROSS JOIN customers
Guru’s Guide to Transact-SQL
14
CustomerNumber Amount
-------------- --------
1 123.45
1 678.90
1 86753.09
1 678.90
2 123.45
2 678.90
2 86753.09
2 678.90
3 123.45
3 678.90
3 86753.09
3 678.90
(12 row(s) affected)
A FULL OUTER JOIN returns rows from both tables regardless of whether the join condition succeeds.
When a join column in the first table fails to find a match in the second, the values from the second table
are returned as NULL, just as they are with a LEFT OUTER JOIN. When the join column in the second table
fails to find a matching value in the first table, columns in the first table are returned as NULL, as they are
in a RIGHT OUTER JOIN. You can think of a FULL OUTER JOIN as the combination of a LEFT JOIN and a
RIGHT JOIN.
Here's the earlier LEFT OUTER JOIN restated as a FULL OUTERJOIN:

SELECT customers.CustomerNumber+2, orders.Amount
FROM customers FULL OUTER JOIN orders ON
(customers.CustomerNumber+2=orders.CustomerNumber)
Amount
------ --------
3 86753.09
4 NULL
5 NULL
NULL 123.45
NULL 678.90

NULL 678.90

TSQL Tutorial..

No comments:

Post a Comment