D 2015-12-08T12:45:09.047 L KNN U sandro W 6682

VirtualKNN: a quick intro

back to index

Credits
Development of VirtualKNN has been entierely funded by Tuscany Region - Territorial and Environmental Information System
Regione Toscana - Settore Sistema Informativo Territoriale ed Ambientale.


What is the KNN (K-Nearest Neighbors) problem ?
Imagine a set of arbitrary geometries; may well be, a really huge dataset containing some million features.
Now imagine that for any good reason you are interested in quickly identifying all features laying in close spatial proximity to an arbitrary location.
This is the typical case of a KNN problem.
Just few practical real world examples:
  • Suppose a smartphone app: your current geographics location is well known by the integrated GPS sensor, so you are simply expecting that the app will immediately tell you which are (and where they are exactely located) the nearest fuel stations (or maybe hotels, restaurants or whatever else).
  • Suppose a zoological/echological study: you know where are located the nesting areas of some rare water bird species, and you know where are located all rivers, lakes, ponds and marshlands in the region of your intereset.
    Now you are attempting to get a reasonable association between nesting areas and feeding zones based on minimal distance criteria.
Note: in both cases you can't safely identify a reasonable minimal radius so to usefully restrict your search, simply because you can't make any assumption about the geographic dispension of your targets.
The distances may noticeably vary from case to case, and you haven't any idea about this because it strictly depends on the actual location of the search origin and on the statistical dispersion of the targets to be searched.

A first naive (and very inefficient) approach
SELECT geonameid, name, country, Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km FROM airports; -------------------------------- 6299623 | Marina Di Campo | IT | 33.043320 SELECT geonameid, name, country, Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km FROM airports WHERE geonameid NOT IN (6299623); --------------------------------- 6299392 | Bastia-Poretta | FR | 65.226573 SELECT geonameid, name, country, Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km FROM airports WHERE geonameid NOT IN (6299623, 6299392); --------------------------------- 6299628 | Pisa / S. Giusto | IT | 82.387014 SELECT geonameid, name, country, Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km FROM airports WHERE geonameid NOT IN (6299623, 6299392, 6299628); --------------------------------- 6299630 | Grosseto Airport | IT | 91.549773 SELECT geonameid, name, country, Min(ST_Distance(MakePoint(10, 43), geom, 1)) / 1000.0 AS dist_km FROM airports WHERE geonameid NOT IN (6299623, 6299392, 6299628, 6299630); --------------------------------- 6694495 | Corte | FR | 102.819778 We could effectively execute a first SQL query using the Min() aggregate function in order to identify the closest airport to an arbitrary position (latitude=43.0, longitude=10.0) located into the middle of the Tyrrhenian Sea.
Then we could eventually repeat more times the same identical query, each time excluding all airports we've already identified in the previous processing steps.
The final result will be the full list of the first five airports closest to the reference location ordered by increasing distance.

rankgeonameidnamecountrydist_km
16299623Marina Di CampoIT33.043320
26299392Bastia-PorettaFR65.226573
36299628Pisa / S. GiustoIT82.387014
46299630Grosseto AirportIT91.549773
56694495CorteFR102.819778

Such a naive approach will surely be highly impractical, because it strictly requires to hand-write several ad hoc SQL queries one by one.
Writing a fully automated SQL script isn't at all a simple task, and some kind of higher level scripting (e.g. Python) would be easily required in any realistic scenario.
Anyway automating someway all required SQL queries isn't the most critical issue we have to face; the real critical bottle neck in this too much naive approach is that each single query will necessarily perform an awfull full table scan
This will directly cause barely tolerable sluggish performances, and the overall performance will quickly degradate as the dataset size will progressively increase.

Conclusion:
Some simpler and smarter approach is surely required: possibly one taking full profit from an R*TRee Spatial Index supporting the Geometries to be searched.
back to index
Z dd98cc34264c35e3c0dceb1764068a19