| Author |
Message |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 15/06/2010 22:30:30
|
Big Jon
Joined: 15/06/2010 21:57:05
Messages: 3
Offline
|
I think I've run into a data consistency issue.
From the README, the alternateNames.txt table should contain a (proper) superset of the values in the "allCountries.txt" alternatenames column. However, it does not.
A quick and dirty (very naive) check shows just shy of 900,000 values that are in the alternatenames column that do not show up in the alterNames.txt table.
An example:
Serbia and Montenegro
Thoughts or comments?
|
|
|
 |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 16/06/2010 06:58:50
|
marc
Joined: 08/12/2005 07:39:47
Messages: 4501
Offline
|
I don't understand what you mean. In any case they are not meant to be exactly the same.
Marc
|
 |
|
|
 |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 16/06/2010 19:17:01
|
Big Jon
Joined: 15/06/2010 21:57:05
Messages: 3
Offline
|
I know they are not mean to be the same.
The readme.txt file says:
Remark : the field 'alternatenames' in the table 'geoname' is a short version of the 'alternatenames' table. You probably don't need both.
If you don't need to know the language of a name variant, the field 'alternatenames' will be sufficient. If you need to know the language
of a name variant, then you will need to load the table 'alternatenames' and you can drop the column in the geoname table.
What that statement tells me is that the alternatenames field is a subset of the data contained in the alternatenames table.
However, that doesn't appear to be the case.
Using the postgresql functions string_to_array and unnest, I believe the following query returns the alternatenames that are present in an import of the 'allcountries' table, but which are not present in the 'alternatenames' table (using a case-insensitive comparison):
Code:
SELECT FOO.* FROM
(SELECT DISTINCT
unnest(string_to_array(alternatenames, ',')) AS name
FROM allcountries) FOO
WHERE NOT EXISTS
(SELECT 1
FROM alternatenames
WHERE lower(alternatenames.name) = lower(FOO.name));
That query results in some 777315 values from the 'alternatenames' column that are not present in the 'alternatenames' table.
With respect to the example 'Serbia and Montenegro', that appears to have been an error on my part, however this example:
Residence du Ruanda
is present in allcountries.txt but is not present in alternatenames.txt (geonameid 4951 .
|
|
|
 |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 17/06/2010 07:53:34
|
marc
Joined: 08/12/2005 07:39:47
Messages: 4501
Offline
|
"Residence du Ruanda" is the ascii transliteration for "Résidence du Rwanda". I have updated the documentation.
Best
Marc
|
 |
|
|
 |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 17/06/2010 23:29:32
|
Big Jon
Joined: 15/06/2010 21:57:05
Messages: 3
Offline
|
The documentation says that the column (alternatenames) is an ASCII transliteration. OK, but, as an example:
781358 Tropojë Tropoje Tropoja,Tropoje-Fshat,Tropojë-Fshat,Тропоя 42.40417 20.16667 A ADM4 AL AL 47 781360 0 471 Europe/Tirane 2010-06-16
The alternatenames column clearly contain non-ASCII characters.
From the readme:
Remark : the field 'alternatenames' in the table 'geoname' is a short version of the 'alternatenames' table without links and postal codes but with ascii transliterations. You probably don't need both.
Just to clarify -- the alternatenames column is intended to be an ASCII transliterated variation of all of the alternate names for a given entity, except for those which are links or postal codes, right?
How is the ascii transliteration being performed?
|
|
|
 |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 18/06/2010 06:55:54
|
marc
Joined: 08/12/2005 07:39:47
Messages: 4501
Offline
|
It field is not the ascii transliteration. It includes ascii transliteration together with the non-ascii text.
Marc
|
 |
|
|
 |
|
|