Artifact [b2b9aae4a1]
Not logged in

Artifact b2b9aae4a113080437528c66cc9e59140ec4fa8d:

Wiki page [KNN] by sandro 2015-12-08 23:36:09.
D 2015-12-08T23:36:09.130
L KNN
P eb42d6f232949205c79055e3cf5260a7c9275862
U sandro
W 19650
<table cellspacing="12" width="100%">
<tr><td colspan="2">
<table width="100%" bgcolor="#f0f0f8">
<tr><td align="center">
<h1>VirtualKNN: a quick intro</h1>
</td></tr></table>
<table width="100%"><tr>
<td width="33%" align="left"></td>
<td align="center"><a href="https://www.gaia-gis.it/fossil/libspatialite/wiki?name=misc-docs">back to index</a></td>
<td width="33%" align="right"></td>
</tr></table><br>
<table bgcolor="#d0e0ff" width="100%">
<tr><td align="center">
<b>Credits</b><br>
Development of <b><i>VirtualKNN</i></b> has been entierely funded by
<a href="http://en.wikipedia.org/wiki/Tuscany">Tuscany Region</a> - Territorial and Environmental Information System<br>
<a href="http://www.regione.toscana.it/ambienteeterritorio/geografiageologia/index.html">Regione Toscana</a> - Settore Sistema Informativo Territoriale ed Ambientale.
</td></tr></table><br>
<hr><br>
<table width="100%" bgcolor="#ffffdf" cellspacing="4" cellpadding="8">
<tr><td><b>What is the KNN (<i>K-Nearest Neighbors</i>) problem ?</b></td></tr>
<tr><td>Imagine a set of arbitrary geometries; may well be, a really huge dataset containing some million features.<br>
Now imagine that for any good reason you are interested in quickly identifying all features laying in close spatial proximity to an arbitrary location.<br>
This is the typical case of a <b>KNN</b> problem.</td></tr>
<tr><td>Just few practical real world examples:
<ul>
<li>Suppose a smartphone app: your current geographics location is well known by the integrated GPS sensor, so you are simply expecting that the app will immediately tell you which are (and where they are exactely located) the nearest fuel stations (or maybe hotels, restaurants or whatever else).</li>
<li>Suppose a zoological/echological study: you know where are located the nesting areas of some rare water bird species, and you know where are located all rivers, lakes, ponds and marshlands in the region of your intereset.<br>
Now you are attempting to get a reasonable association between nesting areas and feeding zones based on minimal distance criteria.</li>
</ul>
</td></tr>
<tr><td><b>Note</b>: in both cases you can't safely identify a reasonable <i><u>minimal radius</u></i> so to usefully restrict your search, simply because you can't make any valid assumption about the geographic distribution of your targets.<br>
The distances may noticeably vary from case to case, and you haven't any idea about this because it strictly depends on the actual location of the search origin and on the statistical dispersion of the targets to be searched.</td></tr>
</table><br>
<hr><br>
<table width="100%">
<tr><td valign="middle">
You can eventually <a href="https://www.gaia-gis.it/gaia-sins/knn/airports.7z">download</a> a sample db-file exactly corresponding to the practical examples we'll use in this first tutorial.<br><br>
This dataset corresponds to a collection of about <b>30.000 airports</b> scattered all over the world; the original input shapefile was downloaded from <a href="https://www.sharegeo.ac.uk/handle/10672/303">here</a><br>
The reference system is <b>4326 WGS84</b> (based on <i>longitued</i> and <i>latitude</i> angles).
<br><br><br><br>
The map on the right shows the geographic context of this first KNN example.</td>
<td><img src="https://www.gaia-gis.it/gaia-sins/knn/tyrrhenian-sea.png" alt="Tyrrhenian Sea"></td>
</table><br>
<h2>A first naive (and very inefficient) approach</h2>
<verbatim>
SELECT geonameid, name, country, 
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports;
--------------------------------
6299623	| Marina Di Campo | IT | 33.043320

SELECT geonameid, name, country, 
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623);
---------------------------------
6299392 | Bastia-Poretta | FR | 65.226573

SELECT geonameid, name, country,
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623, 6299392);
---------------------------------
6299628 | Pisa / S. Giusto | IT	| 82.387014

SELECT geonameid, name, country, 
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623, 6299392, 6299628);
---------------------------------
6299630 | Grosseto Airport | IT | 91.549773

SELECT geonameid, name, country, 
    Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623, 6299392, 6299628, 6299630);
---------------------------------
6694495 | Corte | FR | 102.819778
</verbatim>
We could effectively execute a first SQL query using the <b>Min()</b> aggregate function in order to identify the closest airport to an arbitrary position (<i>latitude</i>=<b>43.0</b>, <i>longitude</i>=<b>10.0</b>) located into the middle of the Tyrrhenian Sea.<br>
Then we could eventually repeat more times the same identical query, each time excluding all airports we've already identified in the previous processing steps.<br>
The final result will be the full list of the first five airports closest to the reference location ordered by increasing distance.
<br><br>
<table bgcolor="#d0ffb0" cellspacing="4" cellpadding="4" border="1">
<tr><td bgcolor="#c0ddao"><b>rank</b></td><td bgcolor="#c0ddao"><b>geonameid</b></td><td bgcolor="#c0ddao"><b>name</b></td><td bgcolor="#c0ddao"><b>country</b></td><td bgcolor="#c0ddao"><b>dist_km</b></td></tr>
<tr><td align="right">1</td><td align="right">6299623</td><td>Marina Di Campo</td><td>IT</td><td align="right">33.043320</td></tr>
<tr><td align="right">2</td><td align="right">6299392</td><td>Bastia-Poretta</td><td>FR</td><td align="right">65.226573</td></tr>
<tr><td align="right">3</td><td align="right">6299628</td><td>Pisa / S. Giusto</td><td>IT</td><td align="right">82.387014</td></tr>
<tr><td align="right">4</td><td align="right">6299630</td><td>Grosseto Airport</td><td>IT</td><td align="right">91.549773</td></tr>
<tr><td align="right">5</td><td align="right">6694495</td><td>Corte</td><td>FR</td><td align="right">102.819778</td></tr>
</table><br>
Such a naive approach will surely be highly impractical, because it strictly requires to hand-write several <i>ad hoc</i> SQL queries one by one.<br>
Writing a fully automated SQL script isn't at all a simple task, and some kind of higher level scripting (e.g. Python) would be easily required in any realistic scenario.<br>
Anyway automating someway all required SQL queries isn't the most critical issue we have to face; the real critical bottle neck in this too much naive approach is that each single query will necessarily perform an awfull <a href="https://en.wikipedia.org/wiki/Full_table_scan">full table scan</a><br>
This will directly cause barely tolerable sluggish performances, and the overall performance will quickly degradate as the dataset size will progressively increase.<br><br>
<b>Conclusion</b>:<br>
Some simpler and smarter approach is surely required: possibly one taking full profit from an <b>R*TRee Spatial Index</b> supporting the Geometries to be searched.<br><br>
<hr>
<h2>VirtualKNN</h2>
Starting since version <b>4.4.0</b> SpatiaLite supports a <b>VirtualKNN</b> <a href="https://www.sqlite.org/vtab.html">Virtual Table</a> specifically intended as a complete and highly efficient solution to the <b>KNN</b> problem.
<verbatim>
CREATE VIRTUAL TABLE knn USING VirtualKNN();
</verbatim>
Every new db-file being created by <b>4.4.0</b> itself will always automatically define a <b>KNN</b> virtual table.<br>
You could eventually add full KNN support to any db-file created by any earlier version of SpatiaLite by manually executing the above SQL statement.<br>
<b>Note</b>: VirtualKNN necessarily requires 4.4.0 binary support, so any attempt to open a db-file including a VirtualKNN table by using some previoys version (<= <b>4.3.0a</b>) will surely raise an error condition.
<h3>A first basically simple KNN query</h3>
<verbatim>
SELECT * FROM knn
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(10, 43);
</verbatim>
<table bgcolor="#d0ffb0" cellspacing="4" cellpadding="4" border="1">
<tr><td bgcolor="#c0ddao"><b>f_table_name</b></td><td bgcolor="#c0ddao"><b>f_geometry_column</b></td><td bgcolor="#c0ddao"><b>ref_geometry</b></td><td bgcolor="#c0ddao"><b>max_items</b></td><td bgcolor="#c0ddao"><b>pos</b></td><td bgcolor="#c0ddao"><b>fid</b></td><td bgcolor="#c0ddao"><b>distance</b></td></tr>
<tr><td>airports</td><td>geom</td><td>BLOB sz=60 GEOMETRY</td><td align="right">3</td><td align="right">1</td><td align="right">6299623</td><td align="right">33043.319520</td></tr>
<tr><td>airports</td><td>geom</td><td>BLOB sz=60 GEOMETRY</td><td align="right">3</td><td align="right">2</td><td align="right">6299392</td><td align="right">65226.573149</td></tr>
<tr><td>airports</td><td>geom</td><td>BLOB sz=60 GEOMETRY</td><td align="right">3</td><td align="right">3</td><td align="right">6299628</td><td align="right">82387.014028</td></tr>
</table><br>
<ul>
<li>a VirtualKNN query closely resembles a VirtualSpatialIndex query.
This should not surprise too much, because both them are intended to be wrapped around an underlying <b>R*Tree Spatial Index</b>.</li> 
<li>Any valid VirtualKNN query should necessarily have a form like:
<b>WHERE <i>knn-column</i> = <i>value</i> AND <i>knn-column</i> = <i>value</i> ...</b></li>
<li>the input columns (the ones you can reference into a <b>WHERE</b> clause) are:
<ul>
<li><b>f_table_name</b> (<i>mandatory</i>)<br>
name of the GeoTable containing the Geometries to be searched.</li>
<li><b>f_geometry_column</b> (<i>optional</i>)<br>
name of the column of the above table containing the Geometries to be searched.
<ol>
<li>If the table identified by <b>f_table_name</b> just contains a single Geometry column you can safely omit to specify the <b>f_geometry_column</b> argument (it will be automatically set).</li>
<li>If instead the table identified by <b>f_table_name</b> contains two (or more) Geometry columns explicitly specifying the <b>f_geometry_column</b> argument is strictly required, so to get an unambiguos definition of the intended search context.</li>
<li>In any case <b>f_table_name</b> and <b>f_geometry_column</b> must exectly match a properly defined Geometry column supported by a corresponding Spatial Index.</li>
</ol></li>
<li><b>ref_geometry</b> (<i>mandatory</i>)<br>
any arbitrary Geometry (<b>POINT</b>, <b>LINESTRING</b>, <b>POLYGON</b> or whatever else) intended to represent the origin of the KNN search.<br>
Should necessarily be in the same SRID of the target Geometries to be searched.</li>
<li><b>max_items</b> (<i>optional</i>)<br>
maximum number of rows to be returned into the resultset.<br>
The valid range is from <b>1</b> to <b>1024</b> (higher values will require a longer time to be processed).<br>
By default only the first <b>3</b> nearest Geometries will be identified.</li>
</ul></li>
<li>the output columns (the ones containing values retrieved by the KNN search) are:
<ul>
<li><b>pos</b> (<i>INTEGER</i>)<br>
relative rank: the closest item will be <b>#1</b>, the second closest item will be <b>#2</b> and so on.</li>
<li><b>fid</b> (<i>INTEGER</i>)<br>
the unique <b>ROWID</b> value of the corresponding feature into the searched GeoTable.</li>
<li><b>distance</b> (<i>DOUBLE</i>)<br>
the minimum distance intercurring between the corresponding feature into the searched GeoTable and the origin defined by <b>ref_geometry</b>.
<ol>
<li>The unit measure for all distances will always be the one declared by the corresponding SRID definition, if this one corresponds to some <b>planar</b> (aka <b>projected</b>) reference system.</li>
<li>If instead the corresponding SRID definition is of the <b>geographic</b> type (<i>longitude</i> and <i>latitude</i> angles) all distances will always be measured in <b>metres</b>, and the most precise geodetic formulas will be automatically applied.</li>
</ol></li>
</ul></li>
</ul><br>
<hr>
<h3>A second example of a more sophisticard KNN query</h3>
<verbatim>
SELECT a.pos, b.name, b.country, a.distance / 1000.0 AS dist_km
FROM knn AS a
JOIN airports AS b ON (b.geonameid = a.fid)
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(10, 43) AND max_items = 5;
</verbatim>
<table bgcolor="#d0ffb0" cellspacing="4" cellpadding="4" border="1">
<tr><td bgcolor="#c0ddao"><b>rank</b></td><td bgcolor="#c0ddao"><b>geonameid</b></td><td bgcolor="#c0ddao"><b>name</b></td><td bgcolor="#c0ddao"><b>country</b></td><td bgcolor="#c0ddao"><b>dist_km</b></td></tr>
<tr><td align="right">1</td><td align="right">6299623</td><td>Marina Di Campo</td><td>IT</td><td align="right">33.043320</td></tr>
<tr><td align="right">2</td><td align="right">6299392</td><td>Bastia-Poretta</td><td>FR</td><td align="right">65.226573</td></tr>
<tr><td align="right">3</td><td align="right">6299628</td><td>Pisa / S. Giusto</td><td>IT</td><td align="right">82.387014</td></tr>
<tr><td align="right">4</td><td align="right">6299630</td><td>Grosseto Airport</td><td>IT</td><td align="right">91.549773</td></tr>
<tr><td align="right">5</td><td align="right">6694495</td><td>Corte</td><td>FR</td><td align="right">102.819778</td></tr>
</table><br>
This second KNN query will return the same identical results we already got before by using the silly naive approach.<br>
The striking difference is that this KNN query succesfully completed in just a fraction of a second, wihlst execyting the naive queries complexively required many seconds (and please consider that this one is a relatively small dataset).<br><br>
<b>Lesson to learn</b>: KNN queries are really efficient and fast, because they directly interact with the lowermost levels of the R*Tree Spatial Index implementation.<br><br>
<hr>
<h3>A third example of KNN query</h3>
<verbatim>
SELECT a.pos AS rank, b.geonameid, b.name, b.country, a.distance / 1000.0 AS dist_km
FROM knn AS a
JOIN airports AS b ON (b.geonameid = a.fid)
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(-17.3, -44) AND max_items = 5;
</verbatim>
<table bgcolor="#d0ffb0" cellspacing="4" cellpadding="4" border="1">
<tr><td bgcolor="#c0ddao"><b>rank</b></td><td bgcolor="#c0ddao"><b>geonameid</b></td><td bgcolor="#c0ddao"><b>name</b></td><td bgcolor="#c0ddao"><b>country</b></td><td bgcolor="#c0ddao"><b>dist_km</b></td></tr>
<tr><td align="right">1</td><td align="right">7668433</td><td>Cabo Frio Airport</td><td>BR</td><td align="right">3256.589964</td></tr>
<tr><td align="right">2</td><td align="right">3359001</td><td>Youngsfield</td><td>ZA</td><td align="right">3262.271027</td></tr>
<tr><td align="right">3</td><td align="right">7730153</td><td>Umberto Modiano Airport</td><td>BR</td><td align="right">3262.693032</td></tr>
<tr><td align="right">4</td><td align="right">3362355</td><td>Robben Island</td><td>ZA</td><td align="right">3265.302568</td></tr>
<tr><td align="right">5</td><td align="right">6300622</td><td>S. P. Aldeia Aerodrome</td><td>BR</td><td align="right">3267.234142</td></tr>
</table><br>
This last KNN query is very similar to the previous one.<br>
This time we've placed the origin location somewhere into the South Atlantic blue deep waters (more or less midway from both Brazil and South Africa) and consequntly the nearest airports are not really so near, because they are located many thousands Km away.
As you can easily check by yourself a KNN query will brilliantly perform even under such unusual conditions.<br><br>
<img src="https://www.gaia-gis.it/gaia-sins/knn/south-atlantic.png" alt="South Atlantic"><br><br>
<b>Lesson to learn</b>: KNN queries never assume any predefined distribution of the searched items, and can automatically adapt in a very efficient way to the most irregular sample distributions.<br><br>
<hr>
<h2>Advanced tutorial</h2>
You can eventually <a href="https://www.gaia-gis.it/gaia-sins/knn/tuscany_housenumbers.7z">download</a> a sample db-file exactly corresponding to the practical examples we'll use in this advanced tutorial.<br><br>
This is a much more demanding dataset corresponding to a collection of about <b>2.4 million house numbers</b>.<br>
The original input shapefile was downloaded from <a href="http://www502.regione.toscana.it/geoscopio/cartoteca.html">Tuscany Region</a> (<b><i>Grafo Stradale civici.shp</i></b>) and was subsequently rearrangend into a more convenient form.<br>
The dataset is released under the <b>CC-BY-SA 4.0</b> license terms.<br>
The reference system is <b>3003 Monte Mario / Italy zone 1</b>; this is a <b>planar</b> (<i>projected</i>) reference system measured in <b>metres</b>.
<br><br>
<verbatim>
SELECT a.pos AS rank, a.fid AS fid, a.distance AS dist_m, d.prov AS province,
       d.nome AS municipality, c.toponimo AS street_name, b.civico AS house_num,
       b.geom AS geom
FROM knn AS a
JOIN civici AS b ON (b.rowid = a.fid)
JOIN toponimi AS c ON (c.fid = b.id_toponimo)
JOIN comuni AS d ON (d.fid = c.id_comune)
WHERE a.f_table_name = 'civici' AND a.ref_geometry = MakePoint(1733003, 4816332, 3003) AND a.max_items = 10;

SELECT a.pos AS rank, a.fid AS fid, a.distance AS dist_m, b.prov AS province,
       b.comune AS municipality, b.toponimo AS street_name, b.civico AS house_num, 
       b.geom AS geom
FROM knn AS a
JOIN vw_civici AS b ON (b.rowid = a.fid)
WHERE a.f_table_name = 'vw_civici' AND a.ref_geometry = MakePoint(1733003, 4816332, 3003) AND a.max_items = 10;
</verbatim>
The following map corresponds to the above queries: the origin point is positioned on a road juction in the central area of a densely populated town (Arezzo).<br>
Not surprisingly the ten nearest house numbers have been located within a radius of about 10m.<br><br>
<img src="https://www.gaia-gis.it/gaia-sins/knn/housenum1.png" alt="House Numbers #1">
<br><br>
Both queries will return the same identical results (you can easily check by yourself).<br>
Anyway if you pay close attention you'll easily discover that the second query is completely based on a <b>Spatial View</b><br><br>
<b>Lesson to learn</b>: a VirtualKNN query can indiferently target either a GeoTable or a properly registered Spatial View.<br><br>
<verbatim>
SELECT a.pos AS rank, a.fid AS fid, a.distance AS dist_m, b.prov AS province,
       b.comune AS municipality, b.toponimo AS street_name, b.civico AS house_num, 
       b.geom AS geom
FROM knn AS a
JOIN vw_civici AS b ON (b.rowid = a.fid)WHERE a.f_table_name = 'civici' AND a.ref_geometry = MakePoint(1595625, 4767420, 3003) AND max_items = 20;
</verbatim>
The following map corresponds to the above query. Now the origin point is positioned direclty into the Tyrrhenian Sea; the Tuscany mainland, the Island of Elba and the Island of Capraia are as far as about <b>26 Km</b>.<br>
Anyway in this case too VirtualKNN confirms to be able to identify the nearest house number in just a fraction of a second.<br><br>
<img src="https://www.gaia-gis.it/gaia-sins/knn/housenum2.png" alt="House Numbers #2"><br><br>
<b>Lesson to learn</b>: VirtualKNN is always highly efficient and very fast. Even when querying an huge dataset, and when exploring the most contrasting data distributions (very high density / very low density).<br><br>
<table width="100%"><tr>
<td width="33%" align="left"></td>
<td align="center"><a href="https://www.gaia-gis.it/fossil/libspatialite/wiki?name=misc-docs">back to index</a></td>
<td width="33%" align="right"></td>
</tr></table>
Z 5f5f7fd88cea1e75a59602d76b7e688c