Juan Wajnerman

Juan is the one everybody turns to when there is a question that no one else can answer: whether the Internet is down or the architecture for a massive system needs to be designed, Juan always has the correct solution. Wise and humble, he applies his vast experience with patience and pragmatism. Outside the techie world, he enjoys tennis and playing with his two kittens.

Location Extraction

In this post I’ll describe what I have been doing for Instedd during the last couple of weeks. In one of our projects we need to classify a series of articles by the geographical location they talk about. This process is known as geotagging, and it is really important in biosurveillance.

Geotagging items is not a new thing, and many websites already support adding geographic information to the objects they handle. For example, Flickr allows you to set the coordinates where a picture was taken. Wikipedia also has structured information containing the latitude and longitude for articles about places in the world. In addition, specs like GeoRSS can be used to augment the information given by a feed. However, even though all these geo-related features are being widely adopted, there is still a lot of information out there that needs a human to read the text to understand which places it mentions.

So we decided to automate this process as much as possible, extracting the information from the text itself. This is known as “location extraction”, and it is actually a branch of a more general task named “entity extraction”.
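
To give a rough idea of what location extraction means in its simplest form, here is a minimal sketch that looks up known place names in a piece of text. This is only an illustration and not necessarily the approach used in the project: the `GAZETTEER` dictionary, its approximate coordinates, and the `extract_locations` function are all hypothetical names made up for this example.

```python
import re

# Hypothetical toy gazetteer: place name -> approximate (latitude, longitude).
# A real system would load these from a geographic database.
GAZETTEER = {
    "buenos aires": (-34.61, -58.38),
    "nairobi": (-1.29, 36.82),
    "phnom penh": (11.56, 104.92),
}


def extract_locations(text):
    """Return (name, (lat, lon)) pairs for gazetteer names found in text."""
    lowered = text.lower()
    found = []
    for name, coords in GAZETTEER.items():
        # Match whole words only, so a name is not found inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.append((name, coords))
    return found


article = "Health officials in Nairobi reported new cases; samples were sent to Phnom Penh."
print(extract_locations(article))
```

A plain lookup like this cannot tell apart places that share a name, and it misses anything that is not in the list, which is part of why extracting locations from free text is harder than it first looks.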
