GeoNames Forum

Synchronizing Data

geekeasy

My site (Photo The Planet) is starting to take off. The number of users is growing quickly and we've gone from 80 photos to 160 in just the last 2 days.

Check out our progress in Google Earth: Photo The Planet - All Photos

People seem to love the site, but the one complaint that I'm getting is about missing places. People want to be able to add places. All of the important tourist sites in London seem to be missing. People are dying to add them.

So I'm going to have to figure out a way to allow my users to add new places to your db, and a way to synchronize our data.

I'll be honest. When I first grabbed your db, I didn't realize that you had a webservice that I could access for searches. If I switched over to using it now, that would solve my synchronization problem. But I know nothing about your infrastructure. I'm expecting/hoping that my site will take off. How well set up are you to handle tens of thousands of queries per day? What about hundreds of thousands?

My gut feeling is that I'd rather not load down your servers, and that keeping things local lets me better guarantee performance for my users. So I could switch over to using the web service, but I don't think that is the best option for either of us.

So... if I do not switch over to using the webservice, I really need to figure out a way to synchronize the data between our sites, and to allow users on my site to add places. I thought about linking users to your site, giving them instructions on how to sign in and add places. Then I'd grab the additions by constantly scraping your rss feed looking for changes, plus getting a full copy of the DB every month or so to catch any changes that I missed. But that seems pretty far from ideal.

I could grab the entire DB more frequently, but that seems like a massive waste of your bandwidth and mine.

So... what do you think about publishing a daily list of changes to the db? That way, I could grab only the changes, and only once a day.

And... what do you think about allowing certain trusted hosts to add places automatically, without going through your graphical front end? I imagine a simple php or perl script that would perform some validations, insert the new place, and then return the new GeonamesId as a result.
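
To make this concrete, here's a rough sketch of the kind of call I'm imagining. To be clear: the endpoint URL, the parameters and the response format below are all made up for illustration; nothing like this exists yet.

```python
# Hypothetical client for a "trusted host" insert service.
# The URL, parameters and JSON response are invented for illustration.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    "name": "Tower of London",  # an example place a user might add
    "lat": 51.5081,
    "lng": -0.0759,
    "featureClass": "S",        # spot/building, per the geonames feature codes
    "token": "shared-secret",   # how a trusted host might authenticate
})
with urlopen("http://example.org/geonames-insert?" + params) as resp:
    result = json.load(resp)

print("new geonameId:", result["geonameId"])
```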

There is one small problem with this setup: the same place could be added twice. But that would only happen if someone adds a place to your site, and then less than 24 hours later someone adds the same place through my site too. That seems pretty unlikely right now.

I would certainly be willing to help out with any coding. And I think you'll start to get lots of places added by my users.

What do you think?

marc

Hello

I think you should use our search web service. You would need a lot of time to implement a similar full-text search. Geonames is currently serving up to 800'000 web service queries per day, and that number is growing rapidly.
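
For example, a search is a single HTTP call. Here is a minimal sketch against the JSON search service; the endpoint and the q/maxRows/username parameters follow the public API docs, but adjust them to whatever version you end up using:

```python
# Minimal sketch of a GeoNames search web service query.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({"q": "london", "maxRows": 5, "username": "demo"})
with urlopen("http://api.geonames.org/searchJSON?" + params) as resp:
    data = json.load(resp)

for place in data.get("geonames", []):
    print(place["geonameId"], place["name"], place["countryName"])
```
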
As for downloading the dump, you don't have to worry about bandwidth; we use only a fraction of the bandwidth included with our hosting plan.

We are already testing a web service to update/insert records. Drop me an email if you are interested in a closed beta test.

Cheers

Marc

geekeasy

> I think you should use our search web service.

Okay... I'll experiment with it and see how fast search results are returned through the web service vs. the local db.

But I think that having the queries run locally will be much, much faster.

> You would need a lot of time to implement a similar full text search.

Creating a table full of words that points back to the place table doesn't sound so difficult. Though I don't know if you're doing anything more advanced than that.

> As for downloading the dump, you don't have to worry about bandwidth; we use only a fraction of the bandwidth included with our hosting plan.

Okay, same here. But I'm a bit worried about throughput. Will my users be negatively affected while I'm downloading this 135 MB file?

Again, I'd have to experiment to find out.

If I decide to stick with my local db, would you consider a daily dump of changes? I have to ask.

> We are already testing a web service to update/insert records. drop me an email if you are interested in a closed beta test.

Absolutely interested. I'll send you an email.

Cheers and thanks!
-Adam

marc

> Creating a table full of words that points back to the place table doesn't
> sound so difficult. Though, I don't know if you're doing anything more
> advanced than that.
The trick in implementing a search engine is the sorting of the results. A search for 'New York', for example, returns nearly 40'000 rows. In addition to a table full of words you will need at least the term frequency (tf) and the inverse document frequency (idf). And you will have to manage tf and idf for deletes, updates and inserts.
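
To make that concrete, here is a toy sketch of tf-idf ranking over a handful of names. It is in-memory only; the hard part in production is keeping these counts current through inserts, updates and deletes.

```python
# Toy tf-idf ranking over a tiny set of place names.
import math
from collections import Counter, defaultdict

docs = {1: "new york", 2: "new york mills", 3: "york", 4: "newton"}

index = defaultdict(set)  # term -> ids of the documents containing it
tf = {}                   # (doc id, term) -> normalized term frequency
for doc_id, text in docs.items():
    words = text.split()
    for term, count in Counter(words).items():
        index[term].add(doc_id)
        tf[(doc_id, term)] = count / len(words)

def search(query):
    scores = defaultdict(float)
    for term in query.split():
        if term not in index:
            continue
        idf = math.log(len(docs) / len(index[term]))  # rarer term -> higher weight
        for doc_id in index[term]:
            scores[doc_id] += tf[(doc_id, term)] * idf
    return sorted(scores.items(), key=lambda item: -item[1])

print(search("new york"))  # doc 1 ("new york") ranks first
```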

> Will my users be negatively affected while I'm downloading this 135 MB file?
The main problem will be updating your database with the download.
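
One way to handle that is to load the dump into a staging table and swap it in atomically, so readers never see a half-loaded table. A sketch with mysql-connector-python; the table names and credentials are placeholders:

```python
# Sketch: reload the dump into a staging table, then swap it in.
# Assumes a geoname_new table with the same schema as geoname, and
# that the server allows LOAD DATA LOCAL INFILE.
import mysql.connector

conn = mysql.connector.connect(
    user="app", password="secret", database="geo",
    allow_local_infile=True,
)
cur = conn.cursor()

cur.execute("TRUNCATE TABLE geoname_new")
cur.execute(
    "LOAD DATA LOCAL INFILE 'allCountries.txt' INTO TABLE geoname_new "
    "FIELDS TERMINATED BY '\\t'"
)
# RENAME TABLE is atomic: queries hit either the old or the new table.
cur.execute("RENAME TABLE geoname TO geoname_old, geoname_new TO geoname")
cur.execute("DROP TABLE geoname_old")
conn.commit()
```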

>If I decide to stick with my local db, would you consider a daily dump of changes?
No problem. How many days do you think should go into this 'most recent' dump? I don't think one day is enough, because with only one day you run into problems if you miss a day.


Marc

Anonymous

(btw, this is barryhunter from nearby.org.uk - it seems I can't register; being colour blind, I can't read the damn captchas!)

Following on from a comment about searching for things like 'United Kingdom', I think this is where you could really use the web service, as I think geonames does a good job of identifying place names.

Of course you should still maintain a copy for distance searches and also general browsing and stuff.

However, in a case like this you could first check the country column and return any results in that country, rather than just finding a place with that name.

geekeasy

This doesn't really have anything to do with geonames, so I'll take it to private email, Barry.

Oh... and Marc, I've been busy, but I'll get back to you soon.

Cheers,
-Adam

geekeasy

Marc:
> The trick in implementing a search engine is the sorting of the results. A search for 'New York', for example, returns nearly 40'000 rows. In addition to a table full of words you will need at least the term frequency (tf) and the inverse document frequency (idf). And you will have to manage tf and idf for deletes, updates and inserts.

Actually, after a bit of research I found out that MySQL now takes care of all of this for you. You can create FULLTEXT indexes for fast full-text searching, and it will automatically rank the results.

However, I'm currently using the more restrictive boolean mode, which does not rank results but only returns rows where both "new" and "york" exist in the name or alternate names.

The only thing I believe I'm still missing is accent-insensitive searches: "Resume" will unfortunately not find "Resumé".

... perhaps in the next version of MySQL.
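
For reference, this is roughly what the boolean-mode query looks like on my end. The table and column names follow the dump layout; the connector details are incidental:

```python
# Boolean-mode FULLTEXT query against a local copy of the dump.
# One-time setup (FULLTEXT requires a MyISAM table in older MySQL):
#   ALTER TABLE geoname ADD FULLTEXT INDEX ft_names (name, alternatenames);
import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="geo")
cur = conn.cursor()

# '+new +york' requires both words; boolean mode skips relevance ranking.
cur.execute(
    "SELECT geonameid, name FROM geoname "
    "WHERE MATCH(name, alternatenames) AGAINST(%s IN BOOLEAN MODE)",
    ("+new +york",),
)
for geonameid, name in cur:
    print(geonameid, name)
```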

> No problem. How many days do you think should go into this 'most recent' dump? I don't think one day is enough, because with only one day you run into problems if you miss a day.

So, let's go ahead with this daily dump. (And I really do appreciate it!)

I think that if I missed a day, I could just check the cron logs and grab that day manually. But I'll take your advice here. What would you recommend?

marc

Hi Adam

We plan to solve the synchronization problem between geonames and mirror sites with a message queue. Every change to the geonames database will result in a message in the queue. Mirror sites can register for messages and update their database with the information they receive for every change.

With the 'daily modification dump' I see the following problems:

1. It will add another file to the download section and confuse users
2. It is redundant to the existing dump. It is already very simple to extract the modifications for a particular day from the dump.
3. It does not solve the problem of immediate synchronization. If your users change records or add new records to the geonames database, it will frustrate them if their updates are not immediately visible on your site.

Marc

geekeasy

Marc:
> We plan to solve the synchronization problem between geonames and mirror sites with a message queue. Every change to the geonames database will result in a message in the queue. Mirror sites can register for messages and update their database with the information they receive for every change.

Sounds great to me.

For obvious reasons, I did not want to download and import the entire database every day. The daily changes dump seemed "good enough", and also very easy to implement. Really, the motivation for that approach was to create no more extra work for you than absolutely necessary.

But if you want to implement a more advanced synchronization technique, I'm all for it.

Cheers,
-Adam

marc

Hi Adam

A daily change dump is now available in the download section. It is called modifications-<date>.txt:

http://download.geonames.org/export/dump/
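
A nightly cron job on the mirror side could then look something like the sketch below. It assumes the <date> part is an ISO date and that the columns match the main dump, with geonameid first; the upsert itself is left out:

```python
# Sketch of a daily sync job: fetch yesterday's modifications file
# and upsert each row into the local copy, keyed on geonameid.
import datetime
import urllib.request

BASE = "http://download.geonames.org/export/dump/"

day = datetime.date.today() - datetime.timedelta(days=1)
url = f"{BASE}modifications-{day.isoformat()}.txt"

with urllib.request.urlopen(url) as resp:
    for raw_line in resp.read().decode("utf-8").splitlines():
        if not raw_line:
            continue
        fields = raw_line.split("\t")
        geonameid, name = fields[0], fields[1]
        # ... INSERT ... ON DUPLICATE KEY UPDATE into the local table
        print("update", geonameid, name)
```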


As for the message queue, I think the Apache message queue implementation looks promising:

http://incubator.apache.org/activemq/home.html

It has support for clients in a couple of languages:
http://incubator.apache.org/activemq/cross-language-clients.html
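
Via STOMP, which is how those cross-language clients connect, a mirror written in Python could subscribe roughly as below. The broker address, topic name and message format are not decided yet, so treat them as placeholders:

```python
# Sketch of a mirror subscribing to a geonames modification topic
# over STOMP (stomp.py; listener API as in recent versions).
import time
import stomp

class ModificationListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # frame.body would carry one modified record; format to be defined
        print("received:", frame.body)

conn = stomp.Connection([("mq.example.org", 61613)])
conn.set_listener("", ModificationListener())
conn.connect("mirror-user", "secret", wait=True)
conn.subscribe(destination="/topic/geonames.modifications", id=1, ack="auto")

while True:       # keep the process alive to receive messages
    time.sleep(1)
```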

Would you be interested in a prototype version of a geonames modification message queue?

Marc

Anonymous

Hi,

Generally, I believe the messaging approach may be a little clumsy in certain cases. Instead, I would suggest creating a new service that takes a single date parameter in a conventional date-time format and returns the records that have been changed or added since that date. This of course assumes that the date of changes/additions is recorded; if not, why not start doing so? I think this information is useful to have.
What do you think about it?

Cheers,
Petr Krebs

marc

Hi Petr

The dates for updates and creations are stored in the database and are available as the modification date in the database dump. It should already be fairly easy to download the dump and extract all rows with a modification date later than date xy.
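
For example, since the modification date is the last tab-separated field of each row (yyyy-MM-dd, per the dump readme), the extraction is a few lines:

```python
# Extract all rows modified after a given date from the full dump.
since = "2006-10-01"

with open("allCountries.txt", encoding="utf-8") as dump:
    for line in dump:
        fields = line.rstrip("\n").split("\t")
        if fields[-1] > since:  # ISO dates compare correctly as strings
            print(fields[0], fields[1])  # geonameid, name
```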

A message queue is required for real-time updates, if you want to be sure a geonames mirror always has the latest modifications with only a few milliseconds of delay. This is also needed to cluster the geonames server and allow for hundreds of thousands of requests per day.

Cheers

Marc

Anonymous

Yeah, but the trouble lies in downloading 130+ MB of data even when only a few bytes or kilobytes of it are actually relevant to the update.
An alternative solution could be to keep the updates gathered over the last week/month or so in a single file for download. Then I would only download the complete data if my copy were older than that period. This would greatly reduce the need for complete downloads.

Thanks a lot for your replies to my questions; I can hardly express how much I appreciate such a great approach!

Cheers,
Petr Krebs

marc

Are you saying you are not downloading the database to a server? From server to server it should not take more than a few seconds to download 130 MB.

We also have to take into consideration that the geonames dataset is still fundamentally changing. A few weeks ago all 6 million records received a timezone field; other fields to be added are admin2 and a better satellite-based elevation model.

Regards

Marc
