GeoNames Forum
ASCII Names  XML
RobIII


[Avatar]
Joined: 25/06/2014 03:05:47
Messages: 27
Location: Netherlands
Offline

Am I correct in understanding that the asciiname fields in admin1CodesASCII.txt, admin2Codes.txt and the GeoName records (from <countrycode>.zip, allCountries.zip, cities1000.zip etc.) are supposed to contain only ASCII values? And are we talking ASCII 0x00 to 0x7F here, or 0x00 to 0xFF?

With my NGeoNames library I have found many entries that do not meet this criterion, so either a) I'm misunderstanding what's supposed to be in these fields, or b) I should create a list so you can correct these entries.

RobIII

OK, since I unfortunately haven't gotten an answer yet, I will post my findings here. This post contains 2 text files from a scan of allCountries.txt that I did:

allcountries_nonascii.txt
Contains all entries that contain "NameASCII" fields containing characters outside the 'normal ASCII range' 0x00 - 0x7F

allcountries_nonextascii.txt
Contains all entries that contain "NameASCII" fields containing characters outside the 'extended ASCII range' 0x00 - 0xFF

I will post findings for admin1Codes and admin2Codes in a few minutes.
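
A scan like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NGeoNames code behind the attached files: it assumes a tab-separated dump with the ID in the 1st field and the asciiname in the 3rd, and the name scan_asciiname is made up for this example.

```python
def scan_asciiname(lines, limit=0x7F):
    """Yield (id, asciiname) for records whose asciiname has a char above `limit`.

    `lines` is any iterable of tab-separated dump lines (assumed layout:
    field 1 = geoname id, field 3 = asciiname). Use limit=0xFF to test
    against the 'extended' range instead of plain 7-bit ASCII.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not have an asciiname field at all
        geoname_id, asciiname = fields[0], fields[2]
        if any(ord(ch) > limit for ch in asciiname):
            yield geoname_id, asciiname


def scan_file(path, limit=0x7F):
    """Convenience wrapper: scan a dump file that is encoded as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        yield from scan_asciiname(handle, limit)
```

Calling scan_file("allCountries.txt") would list the offending records for the 0x00 - 0x7F range, and scan_file("allCountries.txt", limit=0xFF) those for the 0x00 - 0xFF range.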
Attachments: 2 files (9 Kbytes and 14 Kbytes)



RobIII

For Admin2Codes the following entries contain NON-(EXT)ASCII:


7669203 AZ.09.7669203 Xətai Rayonu
8531817 HR.21.8531817 Gornji Grad – Medvescak
8531823 HR.21.8531823 Pescenica – Zitnjak
8531824 HR.21.8531824 Novi Zagreb – istok
8531825 HR.21.8531825 Novi Zagreb – zapad
8531826 HR.21.8531826 Tresnjevka – sjever
8531827 HR.21.8531827 Tresnjevka – jug
8531832 HR.21.8531832 Podsused – Vrapce
8986343 SK.03.806 Okres Kosice–okolie

Admin1Codes currently seems to be okay; I do recall (and I'm pretty sure) finding non-ascii entries about a week or two ago.

RobIII

I wish someone would answer this question. Also, I have included in this post an updated file containing my findings (\t-separated: filename,id,value) for all files containing "ASCII-only fields".
Attachment: List of files/records containing non-ASCII data in the ASCIIName field(s) (38 Kbytes)


RobIII

Because I keep reading "if there's an error in the data, correct it yourself" in the forums, I've started to correct entries to the best of my abilities. I'll try to do several dozen a day so it should be done sometime next week. Some entries are simple (replacing Unicode dashes like – with a simple 'minus' -); others take a bit more consideration, for example where ß becomes ss or ö becomes oe.

Meanwhile: would you please consider adding a check in the 'core API' or somewhere where user-data goes into the system so that non-ASCII containing entries will get/be denied? A very simple regex (check for matches on [^\x00-\x7f], if so: throw error/exception/whatever) will do.

RobIII

RobIII wrote:

Meanwhile: would you please consider adding a check in the 'core API' or somewhere where user-data goes into the system so that non-ASCII containing entries will get/be denied? A very simple regex (check for matches on [^\x00-\x7f], if so: throw error/exception/whatever) will do. 


I'd like to stress the question quoted above. Here I am, correcting dozens of entries by hand, while somebody else is entering new ones. For this to work, this needs to be addressed. And it can't be that hard to fix.


RobIII

RobIII wrote:
A very simple regex (check for matches on [^\x00-\x7f], if so: throw error/exception/whatever) will do. 


Also: the quoted regex is maybe a bit too lenient still; it should be more like:
[^\x20-\x7f]
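
As a rough sketch of such a check, here are both regexes in Python. The function name check_asciiname and the choice of ValueError are assumptions for illustration only; nothing here reflects how the GeoNames backend actually validates input. Note that 0x7F itself is the DEL control character, so a stricter variant would end the range at \x7e.

```python
import re

# Anything outside 7-bit ASCII (control characters are still allowed).
NON_ASCII = re.compile(r"[^\x00-\x7f]")
# The stricter variant: also rejects control characters below 0x20.
NON_PRINTABLE = re.compile(r"[^\x20-\x7f]")


def check_asciiname(value, pattern=NON_PRINTABLE):
    """Return `value` unchanged, or raise ValueError on the first offending char."""
    match = pattern.search(value)
    if match:
        raise ValueError(
            f"character {match.group()!r} at position {match.start()} "
            f"is not allowed in {value!r}"
        )
    return value
```

With this, check_asciiname("Novi Zagreb - istok") passes, while the same name written with an en dash (–) is rejected.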

RobIII

Can someone PLEASE contact this "jsvk"?
http://www.geonames.org/recent-changes/user/jsvk/

I was JUST about done and then (s)he came along. His/her intentions are surely fine (as are mine) but I feel like mopping with the tap open...
marc



Joined: 08/12/2005 07:39:47
Messages: 4241
Offline

Hi Rob

There are two name fields in the extracted files. One is called 'name' and the other is called 'asciiname':

name : name of geographical point (utf8) varchar(200)
asciiname : name of geographical point in plain ascii characters, varchar(200)

Why don't you simply use the asciiname field if you don't like accents or other occasional non-ascii characters? It does not seem very useful to me to change the definition of the 'name' field and restrict it to ascii only. This would only destroy information: most users are interested in the accents, and those who only want the ascii chars can use the respective field.

Regards

Marc


RobIII

However, if you look at the current dump(s), there are not many ASCIIName fields left that contain non-ASCII data, only a few. I changed/fixed the rest (to the best of my abilities). You can diff, say, last week's dump against the current one and see which records have been changed.
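
Such a comparison could look roughly like this in Python. It is a sketch only: the function names are invented for this example, the dumps are assumed to be tab-separated with the ID in the 1st field and the asciiname in the 3rd, and both dumps are held in memory, which may be heavy for allCountries.txt.

```python
def asciinames_by_id(lines):
    """Map geoname id -> asciiname for an iterable of tab-separated dump lines."""
    result = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            result[fields[0]] = fields[2]
    return result


def changed_asciinames(old_lines, new_lines):
    """Return {id: (old_asciiname, new_asciiname)} for records present in both
    dumps whose asciiname differs. Added or deleted records are ignored."""
    old = asciinames_by_id(old_lines)
    new = asciinames_by_id(new_lines)
    return {
        geoname_id: (old[geoname_id], new[geoname_id])
        for geoname_id in old.keys() & new.keys()
        if old[geoname_id] != new[geoname_id]
    }
```

Both arguments can be open file objects (opened with encoding="utf-8"), one for the older dump and one for the newer.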

RobIII

marc wrote:
Hi Rob

There are two name fields in the extracted files. One is called 'name' and the other is called 'asciiname':

name : name of geographical point (utf8) varchar(200)
asciiname : name of geographical point in plain ascii characters, varchar(200)

Why don't you simply use the asciiname field if you don't like accents or other occasional non-ascii characters? It does not seem very useful to me to change the definition of the 'name' field and restrict it to ascii only. This would only destroy information: most users are interested in the accents, and those who only want the ascii chars can use the respective field.

Regards

Marc 


Marc,

It is the ASCIIName field(s) that contain non-ascii data.

RobIII

Today I made the final changes; I will post a 'diff' of the Sep 24 and Sep 25 dumps as soon as the Sep 25 dump is available.

To demonstrate the problem, here are a few sample records:

pt.txt:
Code:
 8299447	Rua de Entrecampos N.º 14	Rua de Entrecampos N.º 14		38.74435	-9.14494	R	ST	PT		14	1106	110643		0		83	Europe/Lisbon	2012-05-23
 8299448	Rua Dona Estefânia N.º 48	Rua Dona Estefania N.º 48		38.73138	-9.14084	R	RD	PT		14	1106	110644		0		77	Europe/Lisbon	2012-05-23
 


es.txt:
Code:
 9170527	Oferta Fortuna (1ª Linea Miami Playa)	Oferta Fortuna (1ª Linea Miami Playa)		41.00795	0.93729	S	HTL	ES		56	T	43092		0		1	Europe/Madrid	2014-06-24
 9409995	M. D'Or Multiservicios 2ªlinea	M. D'Or Multiservicios 2ªlinea		40.10786	0.15107	S	HTL	ES		60	CS	12085		0		-9999	Europe/Madrid	2014-09-02
 

In both files the ASCIIName field contains chars like ª and º, but in previous "cleanup rounds" I also fixed chars like ã, á, é, ç, ´, ʿ, ·, ® (in Disneyland® Paris) etc. Where possible I did simple replacements:
– => -
´ => '
ʿ => '
In other cases I replaced, for example:
Ö => oe
ß => ss
etc. etc.
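
Replacements like these can be written down as a small lookup table. The sketch below is in Python and covers only the characters mentioned in this thread; the names REPLACEMENTS and asciify are invented for the example, and mapping a capital Ö to "Oe" (keeping the capital) is an assumption on top of the list above. The actual edits were made by hand through the website.

```python
# Characters from this thread and their plain-ASCII stand-ins.
REPLACEMENTS = str.maketrans({
    "\u2013": "-",   # en dash
    "\u00b4": "'",   # acute accent used as an apostrophe
    "\u02bf": "'",   # modifier letter left half ring
    "\u00f6": "oe",  # o with diaeresis
    "\u00d6": "Oe",  # capital O with diaeresis (assumption: keep the capital)
    "\u00df": "ss",  # sharp s
})


def asciify(name):
    """Apply the lookup table; characters not in the table are left untouched."""
    return name.translate(REPLACEMENTS)
```

Anything not listed in the table (ª, º, ® and so on) passes through unchanged, so the result still needs a final ASCII check.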

Where possible I added an alternative name with the original value and 'ascii-fied' the "english name", as in, for example, this entry or this entry (http://www.geonames.org/8468995/mira-sintra-melecas%E2%80%8E-station.html).

Less obvious, but still non-ascii:

pl.txt:
Code:
 8426116	Śródmieście – Północ	Srodmiescie – Polnoc		50.04056	22.006	P	PPL	PL		80	1863	186301		0		202	Europe/Warsaw	2012-11-22
 8426117	Śródmieście – Południe	Srodmiescie – Poludnie		50.03527	22.00429	P	PPL	PL		80	1863	186301		0		205	Europe/Warsaw	2012-11-22
 

In pl.txt the problem is in the dash in the ASCIIName field.

All entries I changed definitely contained non-ASCII data in the ASCIIName field, as you can see in the above examples and/or the diff I will post later today.

RobIII

As promised, here is a "diff" of the changes between the 24th and 25th of September. The original dump files for both dates are included so you can compare for yourself.

Download file here (16MB)

Below is a screenshot (from regexhero) of the diff.txt with all non-ascii chars highlighted:



Today I cleaned up the final two records that I found (and somehow must have missed), so those should be gone from tonight's export/dump too. Until some new entries are created, that is... So could someone please have a look at how these entries get into the database in the first place?

RobIII

After the dumps had been clean for some days, I cleaned up another 11 entries today:



This is getting annoying, and I'm also getting annoyed at not getting any response about all this (nor any hope of a solution). Could someone please respond?

marc

Hi Rob

You could speed things up a lot with more specific error reports. You are mixing several different things, which does not make it easier to answer quickly.
It is important to know that the main name is in no way meant to be ascii. Many of your changes will therefore require additional fixing, as many other users will complain that important info was lost when the main name was made purely ascii.

Basically I see two groups of where you have found room for improvement:

a) the automatic ascii conversion (basically the icu4j library) needs some tweaking
b) some import scripts need to remove additional white space (00A0, 009A) from the main name.

Regards

Marc

djluker


[Avatar]

Joined: 22/06/2013 02:21:02
Messages: 3
Offline

Rob,

I've noticed the same, here and in other fields; I'm almost convinced that it's a 'feature' to get one to spring for the premium "consistency checked" version. It's a simple consistency check to implement, but apparently, no one is interested in implementing it.

To get an error-free import into my database using the allCountries data set, I also had to change the admin2 code to unicode (?!) and increase the CC2 size to 100. I think the easiest thing to do in your case is to write a script that goes through the database and replaces the diacritics.
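
A minimal sketch of such a script's core, in Python, using Unicode decomposition from the standard library. This is a swapped-in technique for illustration, not the icu4j-based conversion GeoNames runs, and the name strip_diacritics is made up here.

```python
import unicodedata


def strip_diacritics(name):
    """Decompose each character (NFKD) and drop the combining marks.

    Works for letters that decompose into a base letter plus accents.
    Characters without a decomposition (for example ł, ß or the en dash)
    are returned unchanged, so the result is not guaranteed to be ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Because some characters survive this step, it would still need to be combined with an explicit replacement table or a final ASCII check.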

The readme should be updated to reflect the (currently) correct column types:

ID (bigint)
NAME (nvarchar(200))
ASCIINAME (nvarchar(200))
ALTERNATENAMES (ntext)
LATITUDE (float)
LONGITUDE (float)
FEAT_CLASS (char(1))
FEAT_CODE (varchar(10))
COUNTRY_CODE (varchar(2))
CC2 (varchar(100))
ADMIN1 (varchar(20))
ADMIN2 (nvarchar(80))
ADMIN3 (varchar(20))
ADMIN4 (varchar(20))
POPULATION (bigint)
ELEVATION (int)
DEM (int)
TIMEZONE (varchar(40))
MOD_DATE (date)

There are many other problems I've encountered; currently, I'm going through top level countries and admin1 regions with Wikipedia as my guide to correct some of the errors. When I finish, if anyone would be interested in my "data massaging" SQL scripts to tidy things up, I'd be happy to share.

RobIII

marc wrote:
Hi Rob
You could speed up things a lot with more specific error reports. You are mixing several different things which does not make it easier to answer quickly. 


I'm not mixing anything; I specifically check the fields in the dumps that claim to contain ASCII only and have, on many occasions, posted IDs, text files etc. Hell, I even posted diffs in this thread.

Here's one I did today (file attached):



marc wrote:

It is important to know that the main name is in no way meant to be ascii. Many of your changes will therefore require additional fixing, as many other users will complain that important info was lost when the main name was made purely ascii. 

Again; I'm only reporting data that CLAIMS to be ASCII-only; specifically in the files:

admin1CodesASCII.txt
admin2Codes.txt

and the "'geoname' table" files ("allCountries", "cities1000", "cities5000", "cities15000", "null" and "<CC>"), which all have an ASCIIName field (3rd field in the file).

marc wrote:

Basically I see two groups of where you have found room for improvement:

a) the automatic ascii conversion (basically the icu4j library) needs some tweaking
b) some import scripts need to remove additional white space (00A0, 009A) from the main name.

Regards

Marc 


a) I guess so
b) All "non-space" whitespace (i.e. all non-0x20 spaces) like these should be removed or replaced with a "normal" space.
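
For point b), a cleanup step could be sketched like this in Python. The function name normalize_spaces is invented for the example, and collapsing runs of spaces is an extra assumption. U+009A is a control character rather than a space, so the sketch also checks for the Unicode category Cc.

```python
import unicodedata


def normalize_spaces(name):
    """Replace every whitespace or control character with a plain 0x20 space,
    then collapse runs of spaces and trim both ends."""
    cleaned = []
    for ch in name:
        if ch.isspace() or unicodedata.category(ch) == "Cc":
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    # split() without arguments drops empty pieces, which collapses the runs
    return " ".join("".join(cleaned).split())
```

Visible characters, including accented letters, are left alone; only the spacing changes.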

I have no other way to 'correct' these things than to simply search for the 'incorrect' string at http://www.geonames.org/ like Wadi Al-‘Asakirah, then click the entry (1), then edit the name (2).



If there's a better way or any other suggestions then please let me know.
Also, in cases where I "destroyed" data (like "Gradska četvrt Peščenica – Žitnjak" to "Gradska cetvrt Pescenica - Zitnjak") I always added the "correct (non-ascii) writing" as an alternate (and preferred if none present) name.





I could be going about it all wrong, and if I am I apologize. Just understand that all I'm trying to do is help out...
Attachment (2 Kbytes)


RobIII

A little over a month later:



The filename, the entry ID and actual text (with 'errors' highlighted).

The same entries are, logically, also all in allcountries.txt.

I've corrected each entry as I did before, but I would like some feedback. Every time I post it takes ages for someone to reply, and somehow nobody seems to care. The entries I fixed today (check the diff!!) are not encoding issues on my side! These entries were entered into the database with the wrong encoding. Aside from the wrong encoding, again, these entries contain non-ASCII data in fields that claim to have ASCII-only data.

I have posted several times and I keep updating this topic, but I wish someone from geonames.org would go over my posts, respond to all the questions I asked, provide me with (useful!) feedback and take this issue seriously. I don't care if it takes 2 years before there's time to fix the issue, but at least let me know it's on the agenda and being looked into.

RobIII

I guess what I'm trying to say is: I'm putting real effort into this; it would be nice if that effort were 'rewarded' with at least some acknowledgement of the issue at hand.

RobIII

And here we go again.

I'm going to build a tool specifically to dump these erroneous entries so you can see for yourself.