SpatiaLite: metacatalog

Tutorial: how to build and update a MetaCatalog

Downloading the sample dataset

In this tutorial we'll use the standard geonames dataset; more precisely for practical reasons we'll use the shortened version you can download from here: cities1000.zip

Loading the sample dataset

The GeoNames dataset simply is a text file; so the best way to start the import process is the one to create a VirtualText table directly based on the external file.
Please use the following settings:

First line contains data, there are no Titles/Headers
There are no separator characters
Columns are delimited by a TAB character
the charset-encoding is UTF-8

Creating the final destination table

Now you'll be ready to create the final destination table.

CREATE TABLE geonames (
    geonameid INTEGER NOT NULL PRIMARY KEY,
    name TEXT NOT NULL,
    asciiname TEXT,
    alternatenames TEXT,
    latitude DOUBLE NOT NULL,
    longitude DOUBLE NOT NULL,
    feature_class TEXT NOT NULL,
    feature_code TEXT NOT NULL,
    country_code TEXT NOT NULL,
    cc2 TEXT,
    admin1_code TEXT,
    admin2_code TEXT, 
    admin3_code TEXT,
    admin4_code TEXT,
    population INTEGER, 
    elevation INTEGER,
    dem TEXT,
    timezone TEXT,
    modification_date TEXT);

Then you'll simply populate the destination table by executing this SQL statement:

INSERT INTO geonames
SELECT col001, col002, col003, col004, col005, col006, col007, 
       col008, col009, col010, col011, col012, col013, col014,
       col015, col016, col017, col018, col019
FROM cities1000;

You can now drop the VirtualText table; it's no longer required.

DROP TABLE cities1000;

Just a last final operation; you'll now add and populate a Geometry column corresponding to the Point geographic position.

SELECT AddGeometryColumn('geonames', 'geom', 4326, 'POINT', 'XY');

UPDATE geonames SET geom = MakePoint(longitude, latitude, 4326);

All done; now you are ready to begin this tutorial.

Creating the MetaCatalog

SELECT CreateMetaCatalogTables(1);
-------------
1

You already know this; you simply have to invoke the CreateMetaCatalogTables() function.
This will create both splite_metacatalog and splite_metacatalog_statistics tables; but only splite_metacatalog will be immediately populated.

Populating (or updating) splite_metacatalog_statistics will require some more careful planning:

you are probably not interested in covering any possible column; just few selected columns could probably contains useful analytic informations.
any column containing highly dispersed value distributions surely is a very poor candidate, and should be discarded:
- e.g. the geonameid surely is a very poor candidate: it's a Primary Key, so for sure it will always contain unique values.
- the name, asciiname and alternatenames surely are of very little interest: you can easily imagine to encounter highly dispersed values.
- more or less the same considerations could be applied to the latitude, longitude, population and elevation columns.
the real object of interest is in identifying the (few) columns containing codes: here you can expect to encounter strongly concentrated values:
- e.g. the country_code column seems to be a really interesting column.
- exactly as the feature_class, feature_code, admin1_code, admin2_code, admin3_code and admin4_code

All right, now you've identified all the columns to be monitored; you could eventually invoke many times the UpdateMetaDataCatalogStatistics() function, one for each single column. It will certainly work, but it will be surely unpractical; even worst, if you'll eventually had to update yet again the MetaCatalog in the next days after applying some editing, you'll then be necessarily forced to repeat the whole sequence.
Happily there is a better alternative.

Creating and populating an helper master_table

CREATE TABLE geonames_helper (
    tblname TEXT,
    colname TEXT);

INSERT INTO geonames_helper (tblname, colname) VALUES ('geonames', 'feature_class');
INSERT INTO geonames_helper (tblname, colname) VALUES ('geonames', 'feature_code');
INSERT INTO geonames_helper (tblname, colname) VALUES ('geonames', 'country_code');
INSERT INTO geonames_helper (tblname, colname) VALUES ('geonames', 'admin1_code');
INSERT INTO geonames_helper (tblname, colname) VALUES ('geonames', 'admin2_code');
INSERT INTO geonames_helper (tblname, colname) VALUES ('geonames', 'admin3_code');
INSERT INTO geonames_helper (tblname, colname) VALUES ('geonames', 'admin4_code');

SELECT * FROM geonames_helper;
------------------------------
tblname         colname
------------------------------
geonames	feature_class
geonames	feature_code
geonames	country_code
geonames	admin1_code
geonames	admin2_code
geonames	admin3_code
geonames	admin4_code

All right; now you've permanently registered all columns intended to be monitored in the MetaCatalog.

Populating / Updating the MetaCatalog

SELECT UpdateMetaCatalogStatistics(1, 'geonames_helper', 'tblname', 'colname');

You can now invoke the UpdateMetaCatalogStatistics() function in a more convenient form:

the first argument 1 will simply request to start an implicit Transaction, so to speed up write operations.
the second argument geonames_helper is the name of the helper/master table to be queried.
the third and fourth arguments respectively corresponds to:
- the name of the master table column containing table's names (i.e. tblname).
- the name of the master table column containing column's names (i.e. colname).
UpdateMetaCatalogStatistics() will now be able to automatically identify all table/columns requiring to be updated.
Please note: this tutorial is based on an elementary simple example. But nothing forbids to extend this schema so to cover many tenths (or even hundredths) tables and columns. A nicely flexible and freely extensible mechanism.
Please note too: you can eventually progressively refine your own optimal configuration by trial and error:
- you are absolutely free to delete any item from the master table when you discover that it's of little interest.
- exactly as you are free to insert any missing item in a second time.

Conclusions

If you are specifically interested in creating or validating datasets intended for public dissemination using the MetaCatalog could be an interesting option.
It supports in a very quick and simple way many analytics processes, it's very flexible and highly configurable, and it supports a persistent configuration.
Statistics data collected in the MetaCatalog can substantially help many complex validation processes, and could eventually make simpler, faster and safer many activities widely based on manual input or in integrating/merging data coming from disparate sources.

back