| Author |
Message |
12/05/2008 21:30:21
|
Liam
Joined: 12/05/2008 04:07:58
Messages: 3
Offline
|
Hi --
I've been experimenting with the Wikipedia web service, and have some questions.
1. The summary returned starts at a few words into the actual summary, e.g. the 'London Aquarium' summary looks like this:
Code:
<title>London Aquarium</title>
<summary>
on the South Bank of the River Thames in central London, near the London Eye. It first opened in 1997. The aquarium claims that a million visitors a year view its displays. It is a collection of water tanks showing around 350 species of fish. The aquarium includes three floors and 14 different zones (freshwater stream, Atlantic upper, rivers and ponds, Pacific upper, Indian Ocean, Atlantic (...)
</summary>
What happened to the first few words?
And is there any way to get the full summary?
2. Is there any way to get a particular entry, perhaps using the geonamesID field? We get back a list of entries and would like to select a particular one, for example to extract other XML sections. Is that data available via GeoNames? If not, maybe I can get it using a service like http://wikixmldb.dyndns.org/
Has anyone else done this?
3. I don't understand the query logic, especially as it pertains to ranking. For example, if I search for "Golden Gate Bridge" I would really like the entry with the title "Golden Gate Bridge" to show up at the top of the list.
Adding the title parameter helps, although it also selects an article titled 'Golden Gate National Recreation Area', which doesn't even have 'bridge' in the title.
Searching for title with just 'golden gate' also turns up 'golden temple' and 'golden, colorado'. That doesn't make sense!
4. What I really want to achieve is a combination of a fuzzy text search and a spatial search. I would search on the title and text, give extra relevance weighting to terms that are in the title, and then add further relevance weighting based on distance from a particular point.
Or, I might do a spatial search first, and then do the text search within its result set (like a subquery).
If anyone can steer me in the right direction, I would really appreciate it!
Best,
Liam
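For the second approach in point 4 (spatial search first, then a text search within that result set), here is a minimal Python sketch. It assumes the entries have already been fetched and are held as dicts with `title`, `summary`, `lat` and `lng` keys. Those names, the bounding-box filter and the weight of 3.0 for title hits are illustrative assumptions, not anything the GeoNames service does, and plain term matching stands in for a real fuzzy search.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def in_box(entry, south, west, north, east):
    """True if the entry lies inside the bounding box.

    Simplification: does not handle boxes that cross the antimeridian.
    """
    return south <= entry["lat"] <= north and west <= entry["lng"] <= east


def text_score(query, entry, title_weight=3.0):
    """Score 1.0 per query term found in the summary and
    title_weight per query term found in the title."""
    title_words = tokens(entry["title"])
    summary_words = tokens(entry["summary"])
    score = 0.0
    for term in tokens(query):
        if term in title_words:
            score += title_weight
        if term in summary_words:
            score += 1.0
    return score


def search_within_box(query, entries, box):
    """Spatial filter first, then rank the survivors by text score."""
    nearby = [e for e in entries if in_box(e, *box)]
    scored = [(text_score(query, e), e) for e in nearby]
    hits = [(s, e) for s, e in scored if s > 0]
    # Highest score first; entries with equal scores keep their input order.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in hits]
```

With a box around San Francisco, a query for "golden gate bridge" would rank an entry titled 'Golden Gate Bridge' above 'Golden Gate National Recreation Area', and 'Golden, Colorado' would be dropped by the spatial filter before any text matching happens.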
18/05/2008 11:36:03
|
marc
Joined: 08/12/2005 07:39:47
Messages: 4501
Offline
|
Hi Liam
1. The summary is extracted automatically from the markup of the full Wikipedia article. It can only be as good as the data: Wikipedia markup does not follow any rules consistently, and there are a lot of errors in the way people use it.
2. We could expose the Wikipedia IDs. The Wikipedia articles don't have any GeoNames IDs, and not all of them are associated with a GeoNames record. For those that are mapped to a GeoNames record, we could implement a service that returns the corresponding Wikipedia article for a given geonamesId.
3. That was a bug with compound search terms for the title field. Thanks for finding it. I have fixed it.
4. It is probably easiest to retrieve the full-text search results and then sort them by distance.
Marc
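The approach in point 4 (take the full-text results, then sort by distance) could look roughly like the Python sketch below. It assumes each result is a dict with `lat` and `lng` keys in decimal degrees. The function names are made up for illustration, and the haversine formula with a mean Earth radius of 6371 km is one common choice for the distance, not necessarily what anyone in this thread used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def sort_by_distance(results, lat, lng):
    """Return a new list of results ordered nearest-first from (lat, lng)."""
    return sorted(
        results,
        key=lambda r: haversine_km(lat, lng, r["lat"], r["lng"]),
    )
```

The text search decides which entries are returned and the distance only decides their order, so a distant entry with a strong text match still appears, just lower in the list.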
20/05/2008 21:53:32
|
Liam
Joined: 12/05/2008 04:07:58
Messages: 3
Offline
|
Marc,
1. Exposing the wikipediaIds would be useful.
I send the user a list of articles, and the user chooses one, so we need to be able to retrieve that specific entry.
Currently, I'm calling wikipedia_find_nearby with the article's exact lat/lng and setting maxRows=1 -- not exactly an elegant way to do it!
If you implement it, we would appreciate it.
(I'm using the Geo::GeoNames Perl module, so I will probably need to update that...)
2. Thanks for fixing the bug!
3. I implemented a sort based on distance. Good enough for now.
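For anyone wanting to reproduce the maxRows=1 workaround from point 1 outside the Perl module, a rough Python sketch of building the request URL follows. The host, the `findNearbyWikipedia` endpoint name and the `username` parameter are assumptions based on the public GeoNames web service rather than anything stated in this thread, so check them against the current GeoNames documentation. The function name is made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint; the thread itself only mentions lat/lng and maxRows=1.
BASE_URL = "http://api.geonames.org/findNearbyWikipedia"


def nearby_wikipedia_url(lat, lng, username, max_rows=1):
    """Build a request URL asking for the Wikipedia entries nearest to
    (lat, lng), limited to max_rows results."""
    params = {
        "lat": lat,
        "lng": lng,
        "maxRows": max_rows,
        "username": username,  # assumed to be required by the service
    }
    return BASE_URL + "?" + urlencode(params)
```

This only builds the URL and does not send the request. It inherits the weakness of the workaround: if two articles share almost the same coordinates, the nearest entry returned may not be the one the user picked.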