Wiki page
[KNN] by
sandro
2015-12-08 12:45:09.
D 2015-12-08T12:45:09.047
L KNN
U sandro
W 6682
<table cellspacing="12" width="100%">
<tr><td colspan="2">
<table width="100%" bgcolor="#f0f0f8">
<tr><td align="center">
<h1>VirtualKNN: a quick intro</h1>
</td></tr></table>
<table width="100%"><tr>
<td width="33%" align="left"></td>
<td align="center"><a href="https://www.gaia-gis.it/fossil/libspatialite/wiki?name=misc-docs">back to index</a></td>
<td width="33%" align="right"></td>
</tr></table><br>
<table bgcolor="#d0e0ff" width="100%">
<tr><td align="center">
<b>Credits</b><br>
Development of <b><i>VirtualKNN</i></b> has been entierely funded by
<a href="http://en.wikipedia.org/wiki/Tuscany">Tuscany Region</a> - Territorial and Environmental Information System<br>
<a href="http://www.regione.toscana.it/ambienteeterritorio/geografiageologia/index.html">Regione Toscana</a> - Settore Sistema Informativo Territoriale ed Ambientale.
</td></tr></table>
<hr><br>
<table width="100%" bgcolor="#ffffdf" cellspacing="4" cellpadding="8">
<tr><td><b>What is the KNN (<i>K-Nearest Neighbors</i>) problem ?</b></td></tr>
<tr><td>Imagine a set of arbitrary geometries; may well be, a really huge dataset containing some million features.<br>
Now imagine that for any good reason you are interested in quickly identifying all features laying in close spatial proximity to an arbitrary location.<br>
This is the typical case of a <b>KNN</b> problem.</td></tr>
<tr><td>Just few practical real world examples:
<ul>
<li>Suppose a smartphone app: your current geographics location is well known by the integrated GPS sensor, so you are simply expecting that the app will immediately tell you which are (and where they are exactely located) the nearest fuel stations (or maybe hotels, restaurants or whatever else).</li>
<li>Suppose a zoological/echological study: you know where are located the nesting areas of some rare water bird species, and you know where are located all rivers, lakes, ponds and marshlands in the region of your intereset.<br>
Now you are attempting to get a reasonable association between nesting areas and feeding zones based on minimal distance criteria.</li>
</ul>
</td></tr>
<tr><td><b>Note</b>: in both cases you can't safely identify a reasonable <i><u>minimal radius</u></i> so to usefully restrict your search, simply because you can't make any assumption about the geographic dispension of your targets.<br>
The distances may noticeably vary from case to case, and you haven't any idea about this because it strictly depends on the actual location of the search origin and on the statistical dispersion of the targets to be searched.</td></tr>
<tr><td><hr></td></tr>
<tr><td>
<b>A first naive (and very inefficient) approach</b><br>
<verbatim>
SELECT geonameid, name, country,
Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports;
--------------------------------
6299623 | Marina Di Campo | IT | 33.043320
SELECT geonameid, name, country,
Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623);
---------------------------------
6299392 | Bastia-Poretta | FR | 65.226573
SELECT geonameid, name, country,
Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623, 6299392);
---------------------------------
6299628 | Pisa / S. Giusto | IT | 82.387014
SELECT geonameid, name, country,
Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623, 6299392, 6299628);
---------------------------------
6299630 | Grosseto Airport | IT | 91.549773
SELECT geonameid, name, country,
Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km
FROM airports
WHERE geonameid NOT IN (6299623, 6299392, 6299628, 6299630);
---------------------------------
6694495 | Corte | FR | 102.819778
</verbatim>
We could effectively execute a first SQL query using the <b>Min()</b> aggregate function in order to identify the closest airport to an arbitrary position (<i>latitude</i>=<b>43.0</b>, <i>longitude</i>=<b>10.0</b>) located into the middle of the Tyrrhenian Sea.<br>
Then we could eventually repeat more times the same identical query, each time excluding all airports we've already identified in the previous processing steps.<br>
The final result will be the full list of the first five airports closest to the reference location ordered by increasing distance.
<br><br>
<table bgcolor="#d0ffb0" cellspacing="4" cellpadding="4" border="1">
<tr><td bgcolor="#c0ddao"><b>rank</b></td><td bgcolor="#c0ddao"><b>geonameid</b></td><td bgcolor="#c0ddao"><b>name</b></td><td bgcolor="#c0ddao"><b>country</b></td><td bgcolor="#c0ddao"><b>dist_km</b></td></tr>
<tr><td align="right">1</td><td align="right">6299623</td><td>Marina Di Campo</td><td>IT</td><td align="right">33.043320</td></tr>
<tr><td align="right">2</td><td align="right">6299392</td><td>Bastia-Poretta</td><td>FR</td><td align="right">65.226573</td></tr>
<tr><td align="right">3</td><td align="right">6299628</td><td>Pisa / S. Giusto</td><td>IT</td><td align="right">82.387014</td></tr>
<tr><td align="right">4</td><td align="right">6299630</td><td>Grosseto Airport</td><td>IT</td><td align="right">91.549773</td></tr>
<tr><td align="right">5</td><td align="right">6694495</td><td>Corte</td><td>FR</td><td align="right">102.819778</td></tr>
</table><br>
Such a naive approach will surely be highly impractical, because it strictly requires to hand-write several <i>ad hoc</i> SQL queries one by one.<br>
Writing a fully automated SQL script isn't at all a simple task, and some kind of higher level scripting (e.g. Python) would be easily required in any realistic scenario.<br>
Anyway automating someway all required SQL queries isn't the most critical issue we have to face; the real critical bottle neck in this too much naive approach is that each single query will necessarily perform an awfull <a href="https://en.wikipedia.org/wiki/Full_table_scan">full table scan</a><br>
This will directly cause barely tolerable sluggish performances, and the overall performance will quickly degradate as the dataset size will progressively increase.<br><br>
<b>Conclusion</b>:<br>
Some simpler and smarter approach is surely required: possibly one taking full profit from an <b>R*TRee Spatial Index</b> supporting the Geometries to be searched.
</td></tr>
</table>
<table width="100%"><tr>
<td width="33%" align="left"></td>
<td align="center"><a href="https://www.gaia-gis.it/fossil/libspatialite/wiki?name=misc-docs">back to index</a></td>
<td width="33%" align="right"></td>
</tr></table>
Z dd98cc34264c35e3c0dceb1764068a19