| Author |
Message |
11/07/2008 13:56:47
|
DarkAngel
Joined: 11/07/2008 13:47:49
Messages: 4
Offline
|
Hi
I have noticed what looks like a possible problem, or at least a minor inconsistency, with the alternate names. I have only been looking at the data for less than a day, so apologies up front if I'm wrong.
I was trying to do some basic conversion of the data in order to analyse it, working in part from the allcountries table export. Yesterday I spotted that the comma-delimited alternate names column has commas inside some values. That meant changing my tokenisation rules to treat a comma as a separator only when the following character is not a space, since a comma followed by a space seemed to be the convention for a comma within a single value. This caused a problem, however, because one value (ID:352303) has a comma at the end of the alternate names column.
I went over to the alternate names table export, looked it up there, and found that the value in that table actually has a comma at the end. I then checked for other rows containing a comma without whitespace after it and found a total of seven. Their IDs are:
244459, 2310084, 2319287, 2320240, 2321302, 2314197, 2318030
Could someone with more experience of the data confirm whether this needs cleaning up, to make the data more consistent for processing, particularly when working with the allcountries export?
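For readers following along, the tokenisation rule described above could be sketched roughly as follows. This is only an illustration of the heuristic in this post (a comma followed by a space belongs to the value, any other comma is a separator); the function names are hypothetical and it is not an official GeoNames parsing rule.

```python
import re

# Heuristic from the post: a comma NOT followed by a space is a separator;
# ", " (comma + space) is assumed to be part of a single value.
SEPARATOR = re.compile(r",(?! )")

def split_alternate_names(column: str) -> list[str]:
    """Split the alternate names column using the comma-without-space heuristic."""
    if not column:
        return []
    return SEPARATOR.split(column)

def has_ambiguous_comma(name: str) -> bool:
    """True if a single name contains a comma the heuristic would read as a separator."""
    return SEPARATOR.search(name) is not None
```

A value that itself ends in a comma shows up as an empty token after splitting, which is the kind of breakage reported here; `has_ambiguous_comma` can be run over the alternate names table to list such rows.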
Thanks
|
|
|
 |
11/07/2008 13:58:22
|
DarkAngel
Joined: 11/07/2008 13:47:49
Messages: 4
Offline
|
Also, could someone confirm that the alternate names column in the allcountries table contains both the values from the alternate names table and, where necessary, an ASCII equivalent? I just noticed that there are more values in the allcountries column than in the alternate names table.
Thanks
|
|
|
 |
12/07/2008 22:38:12
|
marc
Joined: 08/12/2005 07:39:47
Messages: 4501
Offline
|
The alternatename column in the allcountries table is a convenience column; it is redundant with the alternate names table. Instead of parsing it, you might consider using the alternatename table information directly.
Marc
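Reading the alternate names table directly, as suggested above, might look something like this minimal sketch. It assumes a tab-delimited dump whose first four columns are alternateNameId, geonameid, isolanguage and the alternate name; check that layout against the readme for your export. The function name and the file name in the comment are illustrative only.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List

def load_alternate_names(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Group alternate names by geonameid.

    Assumed column order (verify against the export readme):
    alternateNameId, geonameid, isolanguage, alternate name, ...
    """
    names: Dict[int, List[str]] = defaultdict(list)
    # QUOTE_NONE: treat quote characters as ordinary text in a tab-delimited dump.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip malformed or empty rows
        names[int(row[1])].append(row[3])
    return names

# Possible usage (file name is an assumption):
# with open("alternateNames.txt", encoding="utf-8", newline="") as f:
#     names_by_id = load_alternate_names(f)
```

Because each name sits in its own tab-separated field, commas inside a name need no special handling.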
|
 |
|
|
 |
14/07/2008 14:42:22
|
DarkAngel
Joined: 11/07/2008 13:47:49
Messages: 4
Offline
|
According to the readme file in the export area, if you don't require the language information for the alternate names, you probably don't need the alternate names table. I was working on that assumption for my needs. As I mentioned in my second post, the column in allcountries also contains an ASCII form of the names, produced by some transliteration process I assume, which is ideal for full-text searching. So taking the data from the alternate names table would require that additional conversion, as well as deduplicating the values.
For example, take London (ID:2643743). There are 49 values for alternate names in the allcountries table and 97 in the alternate names table. The alternate names table has duplicate string values, because multiple languages share the same character sequence; removing those leaves 45 values. The four remaining differences are:
Ljondan
Londan
Luan GJon
Lundunir
All ASCII forms of names in other languages.
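The comparison described above (deduplicate the table values, then see which column values are missing from the table) can be sketched like this. The helper names are hypothetical, and any sample data you feed it is your own; the counts quoted in this post came from the real exports, not from this code.

```python
from typing import Iterable, List

def dedupe(values: Iterable[str]) -> List[str]:
    """Drop repeated strings while keeping first-seen order."""
    return list(dict.fromkeys(values))

def only_in_column(column_values: Iterable[str], table_values: Iterable[str]) -> List[str]:
    """Values present in the allcountries column but absent from the table."""
    table_set = set(table_values)
    return [v for v in dedupe(column_values) if v not in table_set]
```

Comparison here is by exact string equality, so names that differ only in case or Unicode normalisation form would be reported as different.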
So I can probably work from the alternate names table; I just need the conversion process as well. Could you tell me what library was used for this? ICU, perhaps?
Thanks
Tony
|
|
|
 |
16/07/2008 22:12:30
|
marc
Joined: 08/12/2005 07:39:47
Messages: 4501
Offline
|
The list of alternate names is not meant to be parsed, and I don't want to sign a contract, so to speak, that it will be done this or that way. It is a comma-separated list that can be used for search, nothing more.
If you want any form of contract, then please use the alternate names dump.
Best
Marc
|
 |
|
|
 |
18/07/2008 11:21:23
|
DarkAngel
Joined: 11/07/2008 13:47:49
Messages: 4
Offline
|
Sure, I can appreciate that, but even calling it a comma-delimited list implies certain things, such as how commas inside the individual values should be handled. If you want to do correct phrase matching, for example, you don't want your phrases crossing between individual values in the list, because that is not a valid search hit. So being able to parse it correctly can still be important for search. There were only seven occurrences that didn't conform to the pattern used extensively everywhere else, in which a comma inside an actual alternate name value is followed by a space.
Like I said, though, I'm happy to go with the separate table. I just need to work on the language conversion piece to handle it correctly if I want ASCII values where possible.
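To illustrate the phrase-matching concern, here is a small sketch that only accepts a hit when the whole phrase falls inside one individual value. It uses a plain case-insensitive substring check; the function name and the sample values in the comment are made up for illustration.

```python
from typing import List

def phrase_matches(phrase: str, values: List[str]) -> bool:
    """True if the phrase occurs inside at least one individual value."""
    needle = phrase.casefold()
    return any(needle in value.casefold() for value in values)

# With values ["New Amsterdam", "York"], the phrase "amsterdam york" is
# rejected here, although it would match if the list were flattened into
# one string with the separators replaced by spaces.
```

A real search engine would normally tokenise on word boundaries as well; this sketch only shows why the value boundaries have to be known first.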
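As a starting point for the ASCII conversion, one possible approach is to strip diacritics using Unicode decomposition from Python's standard library. This is an assumption about what might be good enough, not the process used to build the allcountries column, which the thread does not confirm; the function name is hypothetical.

```python
import unicodedata
from typing import Optional

def ascii_fold(name: str) -> Optional[str]:
    """Return an ASCII form by stripping combining marks, or None if that is not enough."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Non-Latin scripts and letters with no decomposition remain non-ASCII.
    return stripped if stripped.isascii() else None
```

Names in non-Latin scripts, or letters that have no canonical decomposition, come back as `None` and would need a proper transliteration library such as ICU.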
|
|
|
 |
|
|