I've had a query via private email whose answer I think others might be interested in.

> Hi, I just read the post (and not very carefully cause it's late) but
> I'm wondering if you're writing a front end to feed a lot of attributes
> for each token to Weka, something like this guy (student of
> Kushmerick),
>
> http://www.aidanf.net/taxonomy/term/11/9

I've come across this project before, and taken a look at ELIE[1]. He's using a different approach, one better suited to kinds of document other than those I intend to work with. ELIE seems to be designed for extracting information from free text such as seminar announcements, news articles announcing company acquisitions and other documents of that sort.

The approach I'm implementing relies on finding some sort of regular structure in the documents. It should be very effective for extracting information from semi-structured documents, that is, web pages such as price listings, auctions and search result pages, but not for retrieving information from free text such as news articles. ELIE seems to try to learn, at a higher level, common features shared by all lecture announcements (for instance), based on the forms they tend to take.

I may not be explaining this very clearly, but what I'm trying to say is that there seems to be a distinct divide between systems like ELIE, RAPIER (which I'm more familiar with) and Kushmerick's BWI on one side, and systems such as STALKER (which I take most of my inspiration from) or SoftMealy on the other. A good wide-ranging survey of approaches in the field is Information Extraction from World Wide Web - A Survey (1999) by Line Eikvil[2].

> or your'e building a sequence matching HTML parser like python
> webstemmer
>
> http://www.unixuser.org/~euske/python/webstemmer/howitworks.html
>
> Maybe it'll be clear to me when i reread this tomorrow.

I don't do any HTML parsing.
I've checked out webstemmer before too; it's really quite different to what I'm aiming for. It uses heuristics to compare documents from the same site and work out which HTML blocks represent the main content. It's not designed to learn how to extract specific information, just to employ heuristics that happen to work when separating content from advertisements and navigation. I like the idea of layout patterns as a sort of fingerprint for similar documents. I think the explanation of how webstemmer works is *excellent*, especially the use of diagrams. I hope to produce some similarly effective document(s) for the end of SoC.

The key parts of my approach are describing the information to be extracted in terms of a tree, and building rules that consume all tokens until a match is found. For example, either the rule skip_to "<b>" or the rule skip_to :html_tag (a wildcard) would skip to the first <b> in the document "This is an <b>example</b>. Be <b>bold</b>." Rules can be composed of multiple skip_to statements (consume until you find this, then consume more tokens until you find that), and a skip_to statement may take multiple parameters to match a particular sequence of tokens.

I have decided not to explain these ideas in depth at this stage; Section 3 of the STALKER paper referenced in my previous post should help you out if I'm not making sense.

If you have any further questions or need clarification, don't hesitate to ask. I'll be away and probably without internet access from 7th-14th, so don't be offended if I don't reply in that period.

Alex

1. http://www.aidanf.net/software/elie-an-adaptive-information-extraction-system
2. http://sherry.ifi.unizh.ch/eikvil99information.html
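
P.S. In case the skip_to description is easier to follow as code, here is a rough, self-contained Ruby sketch of the idea. It is an illustration only, not the project's actual code: the method names (tokenize, wildcard_pattern, skip_to, apply_rule), the tokenizer regular expression and the convention that a skip_to leaves you just *after* the matched landmark are all assumptions made for this example.

```ruby
# Illustrative sketch only: every name here is hypothetical.

# Split text into HTML tags, words and single punctuation characters.
def tokenize(text)
  text.scan(/<[^>]+>|\w+|[^\s\w]/)
end

# Wildcards are symbols that stand for a class of tokens.
def wildcard_pattern(name)
  { html_tag: /\A<[^>]+>\z/ }.fetch(name)
end

# A landmark element matches literally (String) or by wildcard (Symbol).
def element_matches?(element, token)
  if element.is_a?(Symbol)
    wildcard_pattern(element).match?(token)
  else
    element == token
  end
end

# Consume tokens from index `start` until the landmark sequence matches.
# Returns the index just past the landmark, or nil if there is no match.
def skip_to(tokens, start, *landmark)
  (start..tokens.length - landmark.length).each do |i|
    window = tokens[i, landmark.length]
    matched = landmark.each_with_index.all? do |element, j|
      element_matches?(element, window[j])
    end
    return i + landmark.length if matched
  end
  nil
end

# A rule is a list of landmarks; each skip_to starts where the last ended.
def apply_rule(tokens, rule)
  rule.reduce(0) do |position, landmark|
    return nil if position.nil?
    skip_to(tokens, position, *landmark)
  end
end

tokens = tokenize("This is an <b>example</b>. Be <b>bold</b>.")
puts tokens[skip_to(tokens, 0, "<b>")]              # example
puts tokens[skip_to(tokens, 0, :html_tag)]          # example
puts tokens[apply_rule(tokens, [["<b>"], ["<b>"]])] # bold
puts tokens[skip_to(tokens, 0, "Be", :html_tag)]    # bold
```

The last two calls show the two ways of reaching the second <b>: composing two skip_to statements, or giving one skip_to a multi-token landmark.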