Note: these pages are no longer maintained

Never the less, much of the information is still relevant.
Beware, however, that some of the command syntax is from older versions, and thus may no longer work as expected.
Also: external links, from external sources, inside these pages may no longer function.

	Recipe #2: Your first JOIN queries
2011 January 28

Previous Slide

Table of Contents

Next Slide

You already know the basic foundations about simple SQL queries.
Any previous example encountered since now simply queried a single table:
anyway SQL has no imposed limits, so you can query an arbitrary number of tables at the same time.
But in order to do this you must understand how to correctly handle a JOIN.

SELECT *
FROM counties, regions;

county_id	county_name	car_plate_code	region_id	region_id	region_name
1	TORINO	TO	1	1	PIEMONTE
1	TORINO	TO	1	2	VALLE D'AOSTA
1	TORINO	TO	1	3	LOMBARDIA
1	TORINO	TO	1	4	TRENTINO-ALTO ADIGE
1	TORINO	TO	1	5	VENETO
...	...	...	...	...	...

Apparently this query immediately works;
but once you get a quick glance at the result-set you'll immediately discover something really puzzling:

an unexpected huge number of rows has been returned.
and each single County seems to be related with any possible Region.

Every time SQL queries two different tables at the same time, the Cartesian Product of both datasets is calculated.
i.e. each row coming from the first dataset is JOINed with any possible row coming from the second dataset.
This one is a blind combinatorial process, so it very difficultly can produce useful results.
And this process can easily generate a really huge result-set: this must absolutely be avoided, because:

a very long (very, very long) time may be required to complete the operation.
you can easily exhaust operating system resources before completion.

All this said, it's quite obvious that some appropriate JOIN condition has to be set in order to maintain under control the Cartesian Product, so to actually return only meaningful rows.

SELECT *
FROM counties, regions
WHERE counties.region_id = regions.region_id;

This query is exactly the same of the previous one: but this time we introduced an appropriate JOIN condition.
Some points to be noted:

using two (or more) tables can easily lead to name ambiguity:
e.g. in this case we have two different columns named region_id, one in the counties table, the other in the regions table.
we must use fully qualified names to avoid any possible ambiguity:
e.g. counties.region_id identifies the region_id column belonging to the counties table, in an absolutely unambiguous way.
defining the WHERE counties.region_id = regions.region_id clause we impose an appropriate JOIN condition.
After this the Cartesian Product will be accordingly filtered, so to insert into the result-set only the rows actually satisfying the imposed condition, ignoring any other.

SELECT c.county_id AS county_id,
  c.county_name AS county_name,
  c.car_plate_code AS car_plate_code,
  r.region_id AS region_id,
  r.region_name AS region_name
FROM counties AS c,
  regions AS r
WHERE c.region_id = r.region_id;

county_id	county_name	car_plate_code	region_id	region_name
1	TORINO	TO	1	PIEMONTE
2	VERCELLI	VC	1	PIEMONTE
3	NOVARA	NO	1	PIEMONTE
4	CUNEO	CN	1	PIEMONTE
5	ASTI	AT	1	PIEMONTE
6	ALESSANDRIA	AL	1	PIEMONTE
...	...	...	...	...

And this one always is the same as above, simply written adopting a most polite syntax:

using extensively the AS clause so to define alias names for both columns and tables make JOIN queries to be much more concise and readable, and easiest to understand.

SELECT lc.lc_id AS lc_id,
  lc.lc_name AS lc_name,
  lc.population AS population,
  c.county_id AS county_id,
  c.county_name AS county_name,
  c.car_plate_code AS car_plate_code,
  r.region_id AS region_id,
  r.region_name AS region_name
FROM local_councils AS lc,
  counties AS c,
  regions AS r
WHERE lc.county_id = c.county_id
  AND c.region_id = r.region_id;

lc_id	lc_name	population	county_id	county_name	car_plate_code	region_id	region_name
1001	AGLIE'	2574	1	TORINO	TO	1	PIEMONTE
1002	AIRASCA	3554	1	TORINO	TO	1	PIEMONTE
1003	ALA DI STURA	479	1	TORINO	TO	1	PIEMONTE
...	...	...	...	...	...	...	...

Joining three (or even more) tables isn't much more difficult:
you simply have to apply any required JOIN condition as appropriate.

Performance considerations

Executing complex queries involving many different tables may easily run in a very slow and sluggish mode.
This will most easily noticed when such tables contain a huge number of rows.
Explaining all this isn't at all difficult: in order to calculate the Cartesian Product the SQL engine has to access many and many times each table involved in the query.

The basic behavior is the one to perform a full table scan each time: and obviously scanning a long table many and many times requires a long time.
So the main key-point in order optimize your queries is the one to avoid using full table scans as much as possible.
All this is fully supported, and it's easy to be implemented.

Each time the SQL-planner (an internal component of the SQL-engine) detects that an appropriate INDEX is available, there is no need at all to perform full table scans, because each single row can now be immediately accessed using this Index.
And this one will obviously be a much faster process.
Any column (or group of columns) frequently used in JOIN clauses is a good candidate for a corresponding INDEX.
Anyway, creating an Index implies several negative consequences:

the storage allocation required by the DB-file will increase (sometimes will dramatically increase).
performing INSERT, UPDATE and/or DELETE ops will require a longer time, because the Index has to be accordingly updated.
And this obviously imposes a further overhead.

So (not surprisingly) it's a trade-off process: you must evaluate carefully when an INDEX is absolutely required, and attempt to reach a well balanced mix.
i.e a compromise between contrasting requirements, under various conditions and in different users-cases.
In other words there is no absolute rule: you must find your optimal case-by-case solution performing several practical tests, until you get the optimal solution fulfilling your requirements.

Previous Slide

Table of Contents

Next Slide

	Author: Alessandro Furieri a.furieri@lqt.it
	This work is licensed under the Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

	Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.