SpatiaLite logo

Recipe #2:
Your first JOIN queries

2011 January 28

Previous Slide Table of Contents Next Slide

You already know the basic foundations about simple SQL queries.
Any previous example encountered since now simply queried a single table:
anyway SQL has no imposed limits, so you can query an arbitrary number of tables at the same time.
But in order to do this you must understand how to correctly handle a JOIN.


SELECT *
FROM counties, regions;

county_id county_name car_plate_code region_id region_id region_name
1 TORINO TO 1 1 PIEMONTE
1 TORINO TO 1 2 VALLE D'AOSTA
1 TORINO TO 1 3 LOMBARDIA
1 TORINO TO 1 4 TRENTINO-ALTO ADIGE
1 TORINO TO 1 5 VENETO
... ... ... ... ... ...
Apparently this query immediately works;
but once you get a quick glance at the result-set you'll immediately discover something really puzzling:
Every time SQL queries two different tables at the same time, the Cartesian Product of both datasets is calculated.
i.e. each row coming from the first dataset is JOINed with any possible row coming from the second dataset.
This one is a blind combinatorial process, so it very difficultly can produce useful results.
And this process can easily generate a really huge result-set: this must absolutely be avoided, because:
  • a very long (very, very long) time may be required to complete the operation.
  • you can easily exhaust operating system resources before completion.
All this said, it's quite obvious that some appropriate JOIN condition has to be set in order to maintain under control the Cartesian Product, so to actually return only meaningful rows.


SELECT *
FROM counties, regions
WHERE counties.region_id = regions.region_id;
This query is exactly the same of the previous one: but this time we introduced an appropriate JOIN condition.
Some points to be noted:

SELECT c.county_id AS county_id,
  c.county_name AS county_name,
  c.car_plate_code AS car_plate_code,
  r.region_id AS region_id,
  r.region_name AS region_name
FROM counties AS c,
  regions AS r
WHERE c.region_id = r.region_id;

county_id county_name car_plate_code region_id region_name
1 TORINO TO 1 PIEMONTE
2 VERCELLI VC 1 PIEMONTE
3 NOVARA NO 1 PIEMONTE
4 CUNEO CN 1 PIEMONTE
5 ASTI AT 1 PIEMONTE
6 ALESSANDRIA AL 1 PIEMONTE
... ... ... ... ...
And this one always is the same as above, simply written adopting a most polite syntax:

SELECT lc.lc_id AS lc_id,
  lc.lc_name AS lc_name,
  lc.population AS population,
  c.county_id AS county_id,
  c.county_name AS county_name,
  c.car_plate_code AS car_plate_code,
  r.region_id AS region_id,
  r.region_name AS region_name
FROM local_councils AS lc,
  counties AS c,
  regions AS r
WHERE lc.county_id = c.county_id
  AND c.region_id = r.region_id;

lc_id lc_name population county_id county_name car_plate_code region_id region_name
1001 AGLIE' 2574 1 TORINO TO 1 PIEMONTE
1002 AIRASCA 3554 1 TORINO TO 1 PIEMONTE
1003 ALA DI STURA 479 1 TORINO TO 1 PIEMONTE
... ... ... ... ... ... ... ...
Joining three (or even more) tables isn't much more difficult:
you simply have to apply any required JOIN condition as appropriate.

Performance considerations

Executing complex queries involving many different tables may easily run in a very slow and sluggish mode.
This will most easily noticed when such tables contain a huge number of rows.
Explaining all this isn't at all difficult: in order to calculate the Cartesian Product the SQL engine has to access many and many times each table involved in the query.

The basic behavior is the one to perform a full table scan each time: and obviously scanning a long table many and many times requires a long time.
So the main key-point in order optimize your queries is the one to avoid using full table scans as much as possible.
All this is fully supported, and it's easy to be implemented.

Each time the SQL-planner (an internal component of the SQL-engine) detects that an appropriate INDEX is available, there is no need at all to perform full table scans, because each single row can now be immediately accessed using this Index.
And this one will obviously be a much faster process.
Any column (or group of columns) frequently used in JOIN clauses is a good candidate for a corresponding INDEX.
Anyway, creating an Index implies several negative consequences:
  • the storage allocation required by the DB-file will increase (sometimes will dramatically increase).
  • performing INSERT, UPDATE and/or DELETE ops will require a longer time, because the Index has to be accordingly updated.
    And this obviously imposes a further overhead.
So (not surprisingly) it's a trade-off process: you must evaluate carefully when an INDEX is absolutely required, and attempt to reach a well balanced mix.
i.e a compromise between contrasting requirements, under various conditions and in different users-cases.
In other words there is no absolute rule: you must find your optimal case-by-case solution performing several practical tests, until you get the optimal solution fulfilling your requirements.

Previous Slide Table of Contents Next Slide

CC-BY-SA logo Author: Alessandro Furieri a.furieri@lqt.it
This work is licensed under the Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

GNU logo Permission is granted to copy, distribute and/or modify this document under the terms of the
GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.