On 7/5/06, Justin Bailey <jgbailey / gmail.com> wrote:
> I like the interface, and the "humane" access it gives to the structure of
> the page. It appears to handle single items and lists well.

A minor clarification here, because a lot of the examples for this stuff
refer to the "page": strictly speaking, it's "the 'humane' access it gives
to the structure of the document". Logically, this could be used to go
through any semi-structured document (YAML, OOo files, etc.), although some
formats (e.g., OOo) may require additional work to get clean markup.

> Will I be able to point Ariel at a set of documents, and have it spit out a
> reusable class which I can include in another program? For example, I have a
> Bible reference parser (i.e. things like Gen 1:1, etc.) that scrapes web
> pages to get the actual verses. Right now I use hand-built regular
> expressions and some patterns to get the right page for a given book,
> chapter and verse. Could I use Ariel to generate the "lookup" code instead?

That is my understanding of Alex's project goal. Remember that there's a
training phase involved.

> 2. How should a document be labeled?
> > In order to feed the learner, you must save a copy of the type of document
> > you want to extract information from, and then mark up the information you
> > want extracted. What markers would be appropriate?
> > Something such as <l:comment_list>....</l:comment_list> is a possibility.
>
> Have you heard of microformats? Essentially, it's a way to mark up existing
> HTML pages with added attributes to indicate structure. It's less
> intrusive than adding new tags, etc. You can read about them here:
>
> http://microformats.org/about/

Microformats are interesting, but not 100% applicable. One of the reasons I
pushed as hard as I did to make sure that Alex's project was included in
Ruby Central's project list was that I saw it as more than just web scraping.
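
To make the labeling idea quoted above a bit more concrete, here is a rough sketch of what reading back `<l:...>` markers from a saved document might look like. Only the `l:` prefix comes from the quoted proposal; the `extract_labeled` helper, the `comment` label, and the sample string are assumptions made up for illustration, and none of this is Ariel's actual interface.

```ruby
# Hypothetical helper (not part of Ariel): collect the text wrapped in
# <l:NAME>...</l:NAME> markers, grouped by label name.
# Assumes labels are flat; nested labels are not handled by this sketch.
def extract_labeled(document)
  labels = Hash.new { |hash, key| hash[key] = [] }
  # The \1 back-reference forces the closing marker to match the opening
  # label name; the /m flag lets a labeled span run across line breaks.
  document.scan(%r{<l:(\w+)>(.*?)</l:\1>}m) do |name, body|
    labels[name] << body.strip
  end
  labels
end

# Made-up sample: an HTML fragment marked up by hand for the training phase.
sample = "<ul><li><l:comment>First!</l:comment></li>" \
         "<li><l:comment>Nice post.</l:comment></li></ul>"

p extract_labeled(sample)
```

The point of the sketch is only that the markers sit alongside whatever markup the document already has, which is why the same approach is not tied to HTML pages.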
-austin
--
Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
               * austin / halostatue.ca * http://www.halostatue.ca/feed/
               * austin / zieglers.ca