R*Tree Comparative Benchmark

Author: Sandro Furieri 2023-10-31

Intro

During last week several interesting things happened, and the final result was a massive optimization of the R*Tree module.
This is a strategic component of SpatiaLite because it's the corner stone of the Spatial Index implementation, so any improvement in this critical area is surely welcome.

Pieter Roggemans started a discussion about the slowness of loading a huge Spatial Index.
But even more important he suggested possible improvements based on most modern R*Tree algorithms published on academic literature.
Richard Hipp (the main developer of SQLite) quickly introduced several optimizations to the R*Tree implementation.
and finally Even Rouault (the main developer of GDAL/OGR) adapted to SQLite a third-party library supporting fast Bulk Load of RTrees.

Short conclusion: it's now time to practically test the actual outcome of all these recent optimizations and enhancements affecting the Spatial Index.

Methodology

the same tests were repeated twice on the same database.
- the first time using the previous versions of SQLite (3.43.1).
- the second time using the latest BETA release of SQLite (3.44.0) supporting all the new R*Tree optimizations.
  In this second pass also SpatiaLite was updated so to fully support the new R*Tree Bulk Load feature.
all tests were conducted on a Fedora Linux Virtual Machine.
during each test the database was fully loaded in memory so to ignore the effect of Disk I/O activities.

Tests based on Point Geometries
This dataset contains 1,522,752 features, actually corresponding to all House Numbers of Tuscany.
SQL	Operation	Previous Version	Latest Version	Comments
CREATE TABLE cp_1_points ( id INTEGER PRIMARY KEY, comune TEXT, indirizzo TEXT, scritta TEXT); SELECT AddGeometryColumn('cp_1_points', 'geom', 3003, 'POINT', 'XY'); SELECT CreateSpatialIndex('cp_1_points', 'geom'); INSERT INTO cp_1_points SELECT id, comune, indirizzo, scritta, geom FROM points;	Creating the destination table Adding the Geometry column Creating the Spatial Index before populating the table Populating the destination table	36 secs	29 secs	Time reduced by about 20% thanks to the most recent optimizations of the *RTree** module of SQLite.
CREATE TABLE cp_2_points ( id INTEGER PRIMARY KEY, comune TEXT, indirizzo TEXT, scritta TEXT); SELECT AddGeometryColumn('cp_2_points', 'geom', 3003, 'POINT', 'XY'); INSERT INTO cp_2_points SELECT id, comune, indirizzo, scritta, geom FROM points; SELECT CreateSpatialIndex('cp_2_points', 'geom');	Creating the destination table Adding the Geometry column Populating the destination table	9 secs	8 secs	Substantially unchanged
	Creating the Spatial Index after populating the table Note: this corresponds to Bulk Load	14 secs	5 secs	An astonishing time reduction of about 65% the brand new Bulk Load support is really fast
DELETE FROM cp_2_points WHERE (id % 4) = 0;	Deleting 1 row every 4 (and consequently updating the R*Tree)	6 secs	5 secs	Substantially unchanged
SELECT s.sez2011, Count(*) AS cnt FROM sezcen AS s JOIN cp_1_points AS p ON ( ST_Intersects(p.geom, s.geom) = 1 AND p.id IN (SELECT rowid FROM SpatialIndex WHERE f_table_name = 'cp_1_points' AND search_frame = s.geom)) GROUP BY s.sez2011 ORDER BY cnt DESC;	A Spatial Query heavily stressing the R*Tree	46 secs	43 secs	Substantially unchanged

Tests based on Polygon Geometries
This dataset contains 2,053,985 features, actually corresponding to all Buildings of Tuscany.
SQL	Operation	Previous Version	Latest Version	Comments
CREATE TABLE cp_1_polygs ( id INTEGER PRIMARY KEY, uv_id TEXT, edifc_duso TEXT); SELECT AddGeometryColumn('cp_1_polygs', 'geom', 3003, 'POLYGON', 'XY'); SELECT CreateSpatialIndex('cp_1_polygs', 'geom'); INSERT INTO cp_1_polygs SELECT id, uv_id, edifc_duso, geom FROM polygs;	Creating the destination table Adding the Geometry column Creating the Spatial Index before populating the table Populating the destination table	52 secs	45 secs	Time reduced by about 15% thanks to the most recent optimizations of the *RTree** module of SQLite.
CREATE TABLE cp_2_polygs ( id INTEGER PRIMARY KEY, uv_id TEXT, edifc_duso TEXT); SELECT AddGeometryColumn('cp_2_polygs', 'geom', 3003, 'POLYGON', 'XY'); INSERT INTO cp_2_polygs SELECT id, uv_id, edifc_duso, geom FROM polygs; SELECT CreateSpatialIndex('cp_2_polygs', 'geom');	Creating the destination table Adding the Geometry column Populating the destination table	13 secs	14 secs	Substantially unchanged
	Creating the Spatial Index after populating the table Note: this corresponds to Bulk Load	20 secs	8 secs	An astonishing time reduction of about 60% the brand new Bulk Load support is really fast
DELETE FROM cp_2_polygs WHERE (id % 4) = 0;	Deleting 1 row every 4 (and consequently updating the R*Tree)	9 secs	11 secs	Substantially unchanged
SELECT s.sez2011, Count(*) AS cnt FROM sezcen AS s JOIN cp_1_polygs AS p ON ( ST_Intersects(p.geom, s.geom) = 1 AND p.id IN (SELECT rowid FROM SpatialIndex WHERE f_table_name = 'cp_1_polygs' AND search_frame = s.geom)) GROUP BY s.sez2011 ORDER BY cnt DESC;	A Spatial Query heavily stressing the R*Tree	74 secs	75 secs	Substantially unchanged

Conclusions

The effects of optimizations introduced by Richard Hipp in the R*Tree of SQLite are clearly visible:
- on average inserting a row in the R*Tree it now takes 15% - 20% less time, that's a noticeable enhancement.
- however, the improvement only concerns INSERT; DELETE or SELECT seem to be completely unaffected.
The new RTree Bulk Load support is awesome, cutting times by about 2/3
The real surprise of this bencmark was discovering how slow is populating a table after creating its Spatial Index.
If you notice, the total time required for populating both the table and the Spatial Index at the same time practically
doubles the time required for two separate steps:
- there is very specific reason that explains this apparent anomaly.
- keeping in sync the table and its Spatial Index is a critical task based on specific Triggers.
- Triggers require some time to be processed and have a rather sensible impact on general performance.
Conclusion: populating a table after creating a Spatial Index is never a brilliant idea.
Populating the table first and then creating the Spatial Index only in a second time is always faster.
The brand new RTree Bulk Load feature makes this specific step even faster.

R*Tree Comparative Benchmark

Author: Sandro Furieri 2023-10-31

Intro

Methodology

This dataset contains 1,522,752 features, actually corresponding to all House Numbers of Tuscany.

This dataset contains 2,053,985 features, actually corresponding to all Buildings of Tuscany.

Conclusions