Note: these pages are no longer maintained

Nevertheless, much of the information is still relevant.
Beware, however, that some of the command syntax is from older versions, and thus may no longer work as expected.
Also: links to external sources inside these pages may no longer work.


About Charset encodings

2011 January 28


What's a Charset encoding? (and why the hell do I have to take care of such nasty things?)

Put very simply: any computer really is a stupid machine based on a messy bunch of crappy silicon. It has some intrinsic capability to understand simple arithmetic and boolean algebra, but it's absolutely unable to understand text.
You may well have stored somewhere in your brain the misleading notion that a computer can actually handle text, but that's not exactly the truth.
To be more precise, it's rather a piece of hocus-pocus meant to fool you and your limited senses, dumb human being: any computer simply handles numbers, but peripheral devices (screen, keyboard, printer ...) are purposely designed to give you the (illusory) impression that your PC actually understands text.

All this is a purely conventional process: you and your PC must agree on some correspondence table to be used in order to translate obscure sequences of numbers into readable words. In technical terms, such a conventional correspondence table is known as a Charset Encoding.
How many different alphabets are used on Earth? Lots and lots ... Latin, Greek, Cyrillic, Hebrew, Arabic, Chinese, Japanese and many others ... and accordingly, lots and lots of different Charset Encodings have been defined over the years (you know: the electronics / computer industry is fond of the uncontrolled proliferation of conflicting standards).
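
To make the "correspondence table" idea concrete, here is a minimal Python sketch (Python is my own choice here, it isn't part of the original pages). It reads the very same three bytes through three different tables; the byte values and the charsets picked are merely illustrative.

```python
# Three bytes, i.e. three small numbers: the computer knows nothing more.
raw = bytes([0xC0, 0xE1, 0xE2])

# Three correspondence tables (charsets), all standard Python codecs.
charsets = ("latin-1", "cp1251", "iso8859_7")

# Decode the very same bytes with each table.
readings = {name: raw.decode(name) for name in charsets}

for name, text in readings.items():
    print(f"{list(raw)} read as {name}: {text}")
```

Same numbers, three different words: a Western European, a Cyrillic and a Greek one. Nothing in the bytes themselves says which reading is the right one.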

SQLite/SpatiaLite internally always uses the UTF-8 encoding, which is universal (i.e. you can safely store any known alphabet within the same DB at the same time). Unfortunately, Shapefiles (and many other datasets, such as CSV/TXT files) usually aren't UTF-8 based (they use some national encoding instead), so you are forced to explicitly select the Charset encoding to be used each time you have to import (or export) any data. I'm really sorry for this, but that's reality.
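
The import/export round trip can be sketched with nothing more than Python's standard `sqlite3` module. This is only an illustration under stated assumptions, not the way the SpatiaLite tools do it: the sample bytes, the `towns` table and the helper functions `import_names` / `export_names` are all hypothetical names made up for this example.

```python
import sqlite3

# Hypothetical sample: one line of a legacy TXT file saved as Windows-1252.
legacy_bytes = "Forlì\n".encode("cp1252")

def import_names(raw: bytes, charset: str) -> list[str]:
    """Decode raw file bytes using the charset chosen by the user."""
    return [line for line in raw.decode(charset).splitlines() if line]

def export_names(names: list[str], charset: str) -> bytes:
    """Encode text back into the charset expected by the target file."""
    return "".join(name + "\n" for name in names).encode(charset)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE towns (name TEXT)")
conn.executemany(
    "INSERT INTO towns (name) VALUES (?)",
    [(name,) for name in import_names(legacy_bytes, "cp1252")],
)

# CAST(... AS BLOB) exposes the bytes SQLite actually stores for the text.
name, stored_bytes = conn.execute(
    "SELECT name, CAST(name AS BLOB) FROM towns"
).fetchone()

print(name, stored_bytes)
print(export_names([name], "cp1252") == legacy_bytes)
```

Declaring the wrong charset on import (say `import_names(legacy_bytes, "utf-8")`) raises a `UnicodeDecodeError` for this sample. Other wrong guesses may not fail at all and silently give you garbage, which is exactly why the charset has to be selected explicitly.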

Anyway, consider all this not as a complication, but as a great resource: this way you'll be able to correctly import/export any arbitrary dataset coming from exotic non-Latin countries such as Israel, Japan, Vietnam, Greece or Russia.
And after all, being able to display multi-alphabet text strings (such as the following one) can make your friends green with envy: Roma,Ρώμη,Рим,로마
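
As a small Python sketch of why UTF-8 is the universal choice, the snippet below encodes the multi-alphabet string above as UTF-8 and then tries a few national charsets on it. The helper `fits` is a hypothetical name introduced only for this example.

```python
text = "Roma,Ρώμη,Рим,로마"

# UTF-8 can represent the whole string, and decoding gives it back intact.
utf8_bytes = text.encode("utf-8")

# Bytes needed by each word: different alphabets take different amounts.
widths = {word: len(word.encode("utf-8")) for word in text.split(",")}

for word, size in widths.items():
    print(f"{word}: {len(word)} characters, {size} bytes in UTF-8")

def fits(charset: str) -> bool:
    """Return True if the whole string can be encoded in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# A single national charset cannot hold all four alphabets at once.
for charset in ("utf-8", "latin-1", "iso8859_7", "cp1251"):
    print(charset, "can hold the whole string:", fits(charset))
```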


Author: Alessandro Furieri
This work is licensed under the Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

Permission is granted to copy, distribute and/or modify this document under the terms of the
GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.