I've had a query via private email that I think others might be interested in 
the answer to.

> Hi,  I just read the post (and not very carefully cause it's late) but
> I'm wondering if you're writing a front end to feed a lot of attributes
> for each token to Weka, something like this guy (student of
> Kushmerick),
>
> http://www.aidanf.net/taxonomy/term/11/9

I've come across this project before and taken a look at ELIE[1]. He's using 
a different approach, one better suited to different types of document than 
those I intend to work with. ELIE seems to be designed for extracting 
information from documents such as seminar announcements, news articles 
announcing company acquisitions and other documents of that sort. The 
approach I'm implementing relies on finding some sort of regular structure in 
the documents. It should be very effective for extracting information from 
semi-structured documents, that is, documents such as web pages (price 
listings, auctions, search result pages), but not for retrieving information 
from free text such as news articles. ELIE, by contrast, seems to try to 
learn, at a higher level, common features shared by all seminar announcements 
(for instance), based on the forms they tend to take.

I may not be explaining this very clearly, but what I'm trying to say is that 
there is a distinct divide between systems like ELIE, RAPIER (which I'm more 
familiar with) and Kushmerick's BWI, and systems such as STALKER (which I 
take most of my inspiration from) or SoftMealy. A good source for a 
wide-ranging survey of approaches in the field is "Information Extraction 
from World Wide Web: A Survey" (1999) by Line Eikvil[2].

> or you're building a sequence matching HTML parser like python
> webstemmer
>
> http://www.unixuser.org/~euske/python/webstemmer/howitworks.html
>
> Maybe it'll be clear to me when i reread this tomorrow.

I don't do any HTML parsing. I've checked out webstemmer before too; it's 
really quite different from what I'm aiming for. It uses heuristics to 
compare documents from the same site and work out which HTML blocks represent 
the main content. Unlike my project, it's not designed to learn how to 
extract specific pieces of information; it just employs heuristics that 
happen to work well for separating content from advertisements and 
navigation. I do like the idea of layout patterns as a sort of fingerprint 
for similar documents. I think the explanation of how webstemmer works is 
*excellent*, especially the use of diagrams. I hope to produce some similarly 
effective documentation by the end of SoC.

The key parts of my approach are describing the information to be extracted in 
terms of a tree, and building rules that consume tokens until a match is 
found.

    e.g. the rule skip_to "<b>", or the rule skip_to :html_tag (a wildcard),
    would skip to the first <b> in the document "This is an <b>example</b>.
    Be <b>bold</b>."

Rules can be composed of multiple skip_to statements (consume tokens until you 
find this, then consume more until you find that), and a skip_to statement may 
take multiple parameters to match a particular sequence of tokens; a rough 
sketch of both ideas follows.
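
Continuing from the token array in the sketch above, this is roughly how 
composed rules behave. Again, apply_rule and skip_to_sequence are names I've 
made up just for this illustration.

    # Illustrative only: a rule is a list of steps, and each step is a
    # sequence of landmarks that must match consecutive tokens.
    def apply_rule(tokens, steps)
      position = 0
      steps.each do |landmarks|
        position = skip_to_sequence(tokens, position, landmarks)
        return nil unless position  # the rule fails if a landmark never appears
      end
      position
    end

    def skip_to_sequence(tokens, start, landmarks)
      (start..tokens.length - landmarks.length).each do |i|
        window = tokens[i, landmarks.length]
        return i + landmarks.length if landmarks.zip(window).all? { |l, t| l === t }
      end
      nil
    end

    # Skip to "Be", then skip further to the token sequence <b> "bold".
    apply_rule(tokens, [["Be"], ["<b>", "bold"]])  # => 10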

I have decided not to explain these ideas in depth at this stage; Section 3 of 
the STALKER paper referenced in my previous post should help you out if I'm 
not making sense. If you have any further questions or need clarification, 
don't hesitate to ask. I'll be away, and probably without internet access, 
from the 7th to the 14th, so don't be offended if I don't reply in that 
period.

Alex


1. http://www.aidanf.net/software/elie-an-adaptive-information-extraction-system
2. http://sherry.ifi.unizh.ch/eikvil99information.html