KNN
Not logged in

VirtualKNN: a quick intro

back to index

Credits
Development of VirtualKNN has been entirely funded by Tuscany Region - Territorial and Environmental Information System
Regione Toscana - Settore Sistema Informativo Territoriale ed Ambientale.



What is the KNN (K-Nearest Neighbors) problem ?
Imagine a set of arbitrary geometries; which may well be, a very huge dataset containing some millions of features.
Now imagine that for whatever reason you are interested in quickly identifying all features within a close spatial proximity to an arbitrary location.
This is the typical KNN problem.
Just few practical real world examples:
  • Suppose while using a smartphone app: your current geographic location is well known by the integrated GPS sensor, so you are simply expecting that the app will immediately tell where you are (and the exact location of) the nearest fuel stations (hotels, restaurants etc).
  • Suppose a zoological/ecological study: you know where are located the nesting areas of some rare water bird species, also the location of all rivers, lakes, ponds and marshlands in the area you are interested in.
    Now you are attempting to get a reasonable association between nesting areas and feeding zones based on minimal distance criteria.
Whereby, in both cases you can't safely identify a reasonable minimal radius so to usefully restrict your search by, since you can't make any valid assumption about the geographic dispersion of your targets.
The distances may noticeably vary from case to case, and you haven't any idea about this, since it strictly depends on the actual location of the search origin and on the statistical dispersion of the targets to be searched.



You can download a sample db-file exactly corresponding to the examples we'll use in this first tutorial.

This dataset corresponds to a collection of about 30.000 airports scattered all over the world; the original input shapefile was downloaded from here
The reference system is 4326 WGS84 (based on longitude and latitude angles).



The map on the right shows the geographic context of this first KNN example.
Tyrrhenian Sea

A first, simple and very inefficient, approach

SELECT geonameid, name, country, 
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports;
--------------------------------
6299623	| Marina Di Campo | IT | 33.043320

SELECT geonameid, name, country, 
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623);
---------------------------------
6299392 | Bastia-Poretta | FR | 65.226573

SELECT geonameid, name, country,
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623, 6299392);
---------------------------------
6299628 | Pisa / S. Giusto | IT	| 82.387014

SELECT geonameid, name, country, 
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623, 6299392, 6299628);
---------------------------------
6299630 | Grosseto Airport | IT | 91.549773

SELECT geonameid, name, country, 
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623, 6299392, 6299628, 6299630);
---------------------------------
6694495 | Corte | FR | 102.819778
We could effectively execute a first SQL query using the Min() aggregate function in order to identify the closest airport to an arbitrary position (latitude=43.0, longitude=10.0) located in the middle of the Tyrrhenian Sea.
Then we could continually repeat the identical query, each time excluding all airports we've already identified in the previous processed steps.
The final result will be a full list of the first five airports closest to the reference location ordered by increasing distance.

rankgeonameidnamecountrydist_km
16299623Marina Di CampoIT33.043320
26299392Bastia-PorettaFR65.226573
36299628Pisa / S. GiustoIT82.387014
46299630Grosseto AirportIT91.549773
56694495CorteFR102.819778

Such a simple approach is highly impractical, because it strictly requires to hand-write several ad hoc SQL queries one by one.
Writing a fully automated SQL script isn't at all a simple task, and some kind of higher level scripting (e.g. Python) would be easily required in any realistic scenario.
The automation of all required SQL queries isn't the most critical issue we have to face; the real, critical, bottleneck with this simple approach, is that each of these queries will cause an awful full table scan
This will be the cause of an intolerably sluggish performance, and the overall performance will quickly become worse as the dataset progressively increases in size.

Conclusion:
Some easier and smarter approach is required: possibly one taking full profit from an R*TRee Spatial Index supporting the Geometries to be searched.


VirtualKNN

Starting with version 4.4.0 SpatiaLite supports a VirtualKNN Virtual Table specifically intended as a complete and highly efficient solution to the KNN problem.
CREATE VIRTUAL TABLE knn USING VirtualKNN();
Every new db-file being created with 4.4.0 will always automatically define a KNN virtual table.
On any earlier version of a SpatiaLite db-file, you can add the KNN support by manually executing the above SQL statement.
Note: VirtualKNN necessarily requires 4.4.0 binary support, so any attempt to open a db-file including a VirtualKNN table by using some previous version (<= 4.3.0a) will surely raise an error condition.

A first basically simple KNN query

SELECT * FROM knn
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(10, 43);
f_table_namef_geometry_columnref_geometrymax_itemsposfiddistance
airportsgeomBLOB sz=60 GEOMETRY31629962333043.319520
airportsgeomBLOB sz=60 GEOMETRY32629939265226.573149
airportsgeomBLOB sz=60 GEOMETRY33629962882387.014028

  • a VirtualKNN query closely resembles a VirtualSpatialIndex query. This is not surprising, since both of them are intended to be wrapped around an underlying R*Tree Spatial Index.
  • Any valid VirtualKNN query must use a form such as: WHERE knn-column = value AND knn-column = value ...
  • the input columns (the ones you can reference into a WHERE clause) are:
    • f_table_name (mandatory)
      name of the GeoTable containing the Geometries to be searched.
    • f_geometry_column (optional)
      name of the column of the above table containing the Geometries to be searched.
      1. If the table identified by f_table_name just contains a single Geometry column you can safely omit the f_geometry_column argument (it will be automatically set).
      2. If, however, the table identified by f_table_name contains two (or more) Geometry columns explicitly specifying the f_geometry_column argument is strictly required, to avoid an ambiguous definition of the search context.
      3. In both cases f_table_name and f_geometry_column must exactly match a properly defined Geometry column supported by a corresponding Spatial Index.
    • ref_geometry (mandatory)
      any arbitrary Geometry (POINT, LINESTRING, POLYGON or whatever else) intended to represent the origin of the KNN search.
      This Geometry must contain the same SRID as the target Geometries to be searched.
    • max_items (optional)
      maximum number of rows to be returned into the resultset.
      The valid range is from 1 to 1024 (higher values will require a longer time to be processed).
      By default only the first 3 results will be returned.
  • the output columns (the ones containing values retrieved by the KNN search) are:
    • pos (INTEGER)
      relative rank (sorted by distance): the closest item will be #1, the second closest item will be #2 and so on.
    • fid (INTEGER)
      the unique ROWID value of the corresponding feature into the searched GeoTable.
    • distance (DOUBLE)
      the minimum distance between the corresponding feature into the searched GeoTable and the origin defined by ref_geometry.
      1. When the defined SRID uses a planar (aka projected) reference system, all distances will be returned in the units defined by the projection (meters, feet, chains etc.).
      2. When using a geographic type (longitude and latitude degrees), all distances will always be measured in meters, with the most precise geodetic formulas being automatically applied.


A second sample of a more sophisticated KNN query

SELECT a.pos AS rank, b.geonameid, b.name, b.country, a.distance / 1000.0 AS dist_km
FROM knn AS a
JOIN airports AS b ON (b.geonameid = a.fid)
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(10, 43) AND max_items = 5;
rankgeonameidnamecountrydist_km
16299623Marina Di CampoIT33.043320
26299392Bastia-PorettaFR65.226573
36299628Pisa / S. GiustoIT82.387014
46299630Grosseto AirportIT91.549773
56694495CorteFR102.819778

This second KNN query will return the same identical results we already received before by using the simple approach.
The striking difference is that this KNN query successfully completes in just a fraction of a second, while executing the first inefficient queries required many seconds (and that with a relatively small dataset).

Summary: KNN queries are really efficient and fast, because they directly interact with the lowermost levels of the R*Tree Spatial Index implementation.


A third sample of KNN query

SELECT a.pos AS rank, b.geonameid, b.name, b.country, a.distance / 1000.0 AS dist_km
FROM knn AS a
JOIN airports AS b ON (b.geonameid = a.fid)
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(-17.3, -44) AND max_items = 5;
rankgeonameidnamecountrydist_km
17668433Cabo Frio AirportBR3256.589964
23359001YoungsfieldZA3262.271027
37730153Umberto Modiano AirportBR3262.693032
43362355Robben IslandZA3265.302568
56300622S. P. Aldeia AerodromeBR3267.234142

This last KNN query is very similar to the previous one.
This time we've placed the origin location somewhere in the blue deep waters of the South Atlantic (more or less midway between Brazil and South Africa) and consequently the nearest airports are not really so near, because they are located many thousands of Kilometers away. As you can easily check by yourself a KNN query will efficiently perform even under such unusual conditions.

South Atlantic

Summary: KNN queries never assume any predefined distribution of the searched items, and will extract the correct results (sorted by distance), despite the very irregular sample distribution.


Advanced tutorial

You can download the sample db-file used in the samples of this advanced tutorial.

This is a much more demanding dataset with a collection of about 2.4 million house numbers.
The original input shapefile was downloaded from Tuscany Region (Grafo Stradale civici.shp) and was subsequently rearranged into a more convenient form.
The dataset is released under the CC-BY-SA 4.0 license terms.
The used reference system is 3003 Monte Mario / Italy zone 1; this is a planar (projected) reference system measured in meters.
SELECT a.pos AS rank, a.fid AS fid, a.distance AS dist_m, d.prov AS province,
       d.nome AS municipality, c.toponimo AS street_name, b.civico AS house_num,
       b.geom AS geom
FROM knn AS a
JOIN civici AS b ON (b.rowid = a.fid)
JOIN toponimi AS c ON (c.fid = b.id_toponimo)
JOIN comuni AS d ON (d.fid = c.id_comune)
WHERE a.f_table_name = 'civici' AND a.ref_geometry = MakePoint(1733003, 4816332, 3003) AND a.max_items = 10;

SELECT a.pos AS rank, a.fid AS fid, a.distance AS dist_m, b.prov AS province,
       b.comune AS municipality, b.toponimo AS street_name, b.civico AS house_num, 
       b.geom AS geom
FROM knn AS a
JOIN vw_civici AS b ON (b.rowid = a.fid)
WHERE a.f_table_name = 'vw_civici' AND a.ref_geometry = MakePoint(1733003, 4816332, 3003) AND a.max_items = 10;
The following map corresponds to the above queries: the origin point is positioned on a road junction in the central area of a densely populated town (Arezzo).
Not surprisingly the ten nearest house numbers have been located within a radius of about 10m.

House Numbers #1

Both queries will return identical results.
A closer look will show you that the second query is based on a Spatial View

Summary: a VirtualKNN query can indifferently target either a GeoTable or a properly registered Spatial View.

SELECT a.pos AS rank, a.fid AS fid, a.distance AS dist_m, b.prov AS province,
       b.comune AS municipality, b.toponimo AS street_name, b.civico AS house_num, 
       b.geom AS geom
FROM knn AS a
JOIN vw_civici AS b ON (b.rowid = a.fid) 
WHERE a.f_table_name = 'civici' AND a.ref_geometry = MakePoint(1595625, 4767420, 3003) AND max_items = 20;
The following map corresponds to the above query. Now the origin point is positioned in the Tyrrhenian Sea, about 26 km from the Tuscany mainland and the Islands of Elba and Capraia.
Here too VirtualKNN is capable of identifying the nearest house numbers in just a fraction of a second.

House Numbers #2

Summary: VirtualKNN is highly efficient and very fast. Even when querying such a huge dataset, while exploring the most contrasting (very high / low density) data distributions .

back to index