R*Tree Comparative Benchmark

Author: Sandro Furieri 2023-10-31


Intro

During last week several interesting things happened, and the final result was a massive optimization of the R*Tree module.
This is a strategic component of SpatiaLite because it's the corner stone of the Spatial Index implementation, so any improvement in this critical area is surely welcome.
  • Pieter Roggemans started a discussion about the slowness of loading a huge Spatial Index.
    But even more important he suggested possible improvements based on most modern R*Tree algorithms published on academic literature.
  • Richard Hipp (the main developer of SQLite) quickly introduced several optimizations to the R*Tree implementation.
  • and finally Even Rouault (the main developer of GDAL/OGR) adapted to SQLite a third-party library supporting fast Bulk Load of RTrees.
Short conclusion: it's now time to practically test the actual outcome of all these recent optimizations and enhancements affecting the Spatial Index.

Methodology

  • the same tests were repeated twice on the same database.
    • the first time using the previous versions of SQLite (3.43.1).
    • the second time using the latest BETA release of SQLite (3.44.0) supporting all the new R*Tree optimizations.
      In this second pass also SpatiaLite was updated so to fully support the new R*Tree Bulk Load feature.
  • all tests were conducted on a Fedora Linux Virtual Machine.
  • during each test the database was fully loaded in memory so to ignore the effect of Disk I/O activities.
Tests based on Point Geometries

This dataset contains 1,522,752 features, actually corresponding to all House Numbers of Tuscany.

SQLOperationPrevious VersionLatest VersionComments
CREATE TABLE cp_1_points (
	id INTEGER PRIMARY KEY,
	comune TEXT,
	indirizzo TEXT,
	scritta TEXT);
SELECT AddGeometryColumn('cp_1_points', 'geom', 3003, 'POINT', 'XY');
SELECT CreateSpatialIndex('cp_1_points', 'geom');

INSERT INTO cp_1_points
SELECT id, comune, indirizzo, scritta, geom
FROM points;
				
  • Creating the destination table
  • Adding the Geometry column
  • Creating the Spatial Index
    before populating the table
  • Populating the destination table
36 secs 29 secs Time reduced by about 20%
thanks to the most recent optimizations
of the R*Tree module of SQLite.
CREATE TABLE cp_2_points (
id INTEGER PRIMARY KEY,
comune TEXT,
indirizzo TEXT,
scritta TEXT);
SELECT AddGeometryColumn('cp_2_points', 'geom', 3003, 'POINT', 'XY');

INSERT INTO cp_2_points
SELECT id, comune, indirizzo, scritta, geom
FROM points;

SELECT CreateSpatialIndex('cp_2_points', 'geom');
				
  • Creating the destination table
  • Adding the Geometry column
  • Populating the destination table
9 secs 8 secs Substantially unchanged
  • Creating the Spatial Index
    after populating the table

    Note: this corresponds to Bulk Load
14 secs 5 secs An astonishing time reduction of about 65%
the brand new Bulk Load support is really fast
DELETE FROM cp_2_points WHERE (id % 4) = 0;
				
  • Deleting 1 row every 4
    (and consequently updating the R*Tree)
6 secs 5 secs Substantially unchanged
SELECT s.sez2011, Count(*) AS cnt
FROM sezcen AS s
JOIN cp_1_points AS p ON (
  ST_Intersects(p.geom, s.geom) = 1 AND p.id IN 
  (SELECT rowid FROM SpatialIndex
      WHERE f_table_name = 'cp_1_points' 
           AND search_frame = s.geom))
GROUP BY s.sez2011
ORDER BY cnt DESC;
				
  • A Spatial Query
    heavily stressing the R*Tree
46 secs 43 secs Substantially unchanged

Tests based on Polygon Geometries

This dataset contains 2,053,985 features, actually corresponding to all Buildings of Tuscany.

SQLOperationPrevious VersionLatest VersionComments
CREATE TABLE cp_1_polygs (
	id INTEGER PRIMARY KEY,
	uv_id TEXT,
	edifc_duso TEXT);
SELECT AddGeometryColumn('cp_1_polygs', 'geom', 3003, 'POLYGON', 'XY');
SELECT CreateSpatialIndex('cp_1_polygs', 'geom');

INSERT INTO cp_1_polygs
SELECT id, uv_id, edifc_duso, geom
FROM polygs;
				
  • Creating the destination table
  • Adding the Geometry column
  • Creating the Spatial Index
    before populating the table
  • Populating the destination table
52 secs 45 secs Time reduced by about 15%
thanks to the most recent optimizations
of the R*Tree module of SQLite.
CREATE TABLE cp_2_polygs (
	id INTEGER PRIMARY KEY,
	uv_id TEXT,
	edifc_duso TEXT);
SELECT AddGeometryColumn('cp_2_polygs', 'geom', 3003, 'POLYGON', 'XY');

INSERT INTO cp_2_polygs
SELECT id, uv_id, edifc_duso, geom
FROM polygs;

SELECT CreateSpatialIndex('cp_2_polygs', 'geom');
				
  • Creating the destination table
  • Adding the Geometry column
  • Populating the destination table
13 secs 14 secs Substantially unchanged
  • Creating the Spatial Index
    after populating the table

    Note: this corresponds to Bulk Load
20 secs 8 secs An astonishing time reduction of about 60%
the brand new Bulk Load support is really fast
DELETE FROM cp_2_polygs WHERE (id % 4) = 0;
				
  • Deleting 1 row every 4
    (and consequently updating the R*Tree)
9 secs 11 secs Substantially unchanged
SELECT s.sez2011, Count(*) AS cnt
FROM sezcen AS s
JOIN cp_1_polygs AS p ON (
  ST_Intersects(p.geom, s.geom) = 1 AND p.id IN 
  (SELECT rowid FROM SpatialIndex
      WHERE f_table_name = 'cp_1_polygs' 
           AND search_frame = s.geom))
GROUP BY s.sez2011
ORDER BY cnt DESC;
				
  • A Spatial Query
    heavily stressing the R*Tree
74 secs 75 secs Substantially unchanged

Conclusions

  1. The effects of optimizations introduced by Richard Hipp in the R*Tree of SQLite are clearly visible:
    • on average inserting a row in the R*Tree it now takes 15% - 20% less time, that's a noticeable enhancement.
    • however, the improvement only concerns INSERT; DELETE or SELECT seem to be completely unaffected.
  2. The new RTree Bulk Load support is awesome, cutting times by about 2/3
  3. The real surprise of this bencmark was discovering how slow is populating a table after creating its Spatial Index.
    If you notice, the total time required for populating both the table and the Spatial Index at the same time practically
    doubles the time required for two separate steps:
    • there is very specific reason that explains this apparent anomaly.
    • keeping in sync the table and its Spatial Index is a critical task based on specific Triggers.
    • Triggers require some time to be processed and have a rather sensible impact on general performance.
    Conclusion: populating a table after creating a Spatial Index is never a brilliant idea.
    Populating the table first and then creating the Spatial Index only in a second time is always faster.
    The brand new RTree Bulk Load feature makes this specific step even faster.