GeoNames Forum

Hi!

The problem I'm trying to solve is as follows: I need to match street addresses from different databases that are spelling variations of each other. By spelling variations I mean things like abbreviations, spelling mistakes, information that one record spells out and another leaves implied, different conventions for writing postal routing information, etc.

I need to find some representation that ideally yields all and only the right matches, but that, where it is wrong, errs on the side of matching too much, and that a database server can index efficiently.

The obvious choice would be to geocode each address and truncate the resulting lat/long to an integer.
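To make that concrete, here is a rough sketch of what I mean by truncating a lat/long to an int. The function name `grid_key` and the `cells_per_degree` parameter are just made up for illustration, and the grid resolution is an assumption that would need tuning:

```python
import math

def grid_key(lat: float, lon: float, cells_per_degree: int = 1000) -> int:
    """Pack a quantized lat/long pair into a single integer key.

    Hypothetical helper: cells_per_degree=1000 means a cell is
    0.001 degrees on each side (an assumed resolution, not a tested one).
    """
    # Shift both coordinates into non-negative ranges, then floor to a cell.
    row = math.floor((lat + 90.0) * cells_per_degree)
    col = math.floor((lon + 180.0) * cells_per_degree)
    # Number of columns per row (+1 so that lon == 180 still fits).
    width = 360 * cells_per_degree + 1
    return row * width + col

# Two geocoding results a few metres apart end up with the same key,
# so an ordinary integer index can be used to find candidate matches.
a = grid_key(52.52005, 13.40495)
b = grid_key(52.52006, 13.40496)
c = grid_key(48.13745, 11.57545)
print(a == b, a == c)
```

One catch with this: two points right next to each other can still fall on opposite sides of a cell boundary, so to err on the side of matching too much I would probably have to look at the neighbouring cells as well.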

But the thing is: locating the address in physical space is not even really part of the problem I need to solve. Is there a way of doing the matching that involves less complexity?

The number of addresses in the database is in the 7-figure range, and the processes that operate on it should be repeatable on a single machine within a few days, at least for reasonably chunky subsets of that data.

At the moment I'm considering Bing Maps, which apparently lets you do 5 million geocoding operations per day using their bulk interface if you pay them. The other alternative, I guess, would be to go for a local instance of Gisgraphy.
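As a back-of-the-envelope check of whether that quota fits my time budget, assuming the 5 million per day figure is right (I have not verified it) and ignoring retries and failed lookups:

```python
def days_needed(n_addresses: int, ops_per_day: int = 5_000_000) -> int:
    """Whole days needed to geocode n_addresses at a fixed daily quota.

    ops_per_day defaults to the quota quoted above, which is an
    unverified assumption.
    """
    # Integer ceiling division: a partly used day still counts as a day.
    return -(-n_addresses // ops_per_day)

# The two ends of the "7-figure range".
print(days_needed(1_000_000), days_needed(9_999_999))
```

So even the largest 7-figure database would take about two days per full pass, which is inside the "few days" limit but leaves little room for repeated runs.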

Any thoughts on the matter? Are other people doing similar stuff, or is this just total overkill? Has anybody found a simpler solution?