On Wednesday 05 July 2006 16:50, Justin Bailey wrote:
> Thanks for the update - sounds pretty interesting. Some thoughts below:
>

> Will I be able to point Ariel at a set of documents, and have it spit out a
> reusable class which I can include in another program? For example, I have
> a Bible reference parser (i.e. things like Gen 1:1, etc.)  that scrapes web
> pages to get the actual verses. Right now I use hand-built regular
> expressions and some patterns to get the right page for a given book,
> chapter and verse. Could I use Ariel to generate the "lookup" code instead?

I think I understand what you're asking here. Given a page such as 
http://www.blueletterbible.org/Gen/Gen001.html - you might define the 
structure such as:
  doc_tree = Ariel::StructureNode.new do |r|
    r.verse_list do |v|
      v.reference
      v.text
    end
  end

After training, given a chapter from that site, the generated rules could be 
applied to produce a list of reference and verse pairs. So you could then 
just search that list for the member of verse_list with reference=="Gen 1:1". 
Is that basically what you're thinking? If so, Ariel should be very suited to 
that task. That actually looks like quite a nice page to use in testing...
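
To make the lookup step concrete, here is a rough sketch in plain Ruby. It 
does not use Ariel at all: the Verse struct, the find_verse helper and the 
placeholder verse text are stand-ins made up for illustration, since the 
shape of the extracted structure is still undecided.

```ruby
# Hypothetical stand-in for one extracted member of verse_list.
# This is NOT Ariel's API, just a plain Struct for illustration.
Verse = Struct.new(:reference, :text)

# Placeholder data standing in for the result of applying learnt rules
# to one chapter page.
verse_list = [
  Verse.new("Gen 1:1", "text of verse 1"),
  Verse.new("Gen 1:2", "text of verse 2"),
  Verse.new("Gen 1:3", "text of verse 3"),
]

# Return the first verse whose reference matches, or nil if none does.
def find_verse(verse_list, reference)
  verse_list.find { |v| v.reference == reference }
end

verse = find_verse(verse_list, "Gen 1:1")
puts verse.text if verse
```

If lookups by reference are frequent, building a Hash keyed on the reference 
once per chapter would avoid scanning the list each time.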

> Another very small program I wrote scrapes calculator results from Google (
> e.g. enter "2 + 2" into the google search box and get back "4"). Here again
> I used StringScanner and some regular expressions to get the result and
> transform it slightly. Would Ariel be able to help me with that?

Certainly.

> == Questions for the Ruby community ==
>
> > 1. What form would you like extracted data to take?
> > YAML and XML output shouldn't be a problem, but I'm thinking about the
> > outputted Ruby data structure. Supposing that the doc_tree defined above
> > were applied to a document, the extracted structure could be queried
> > like:
> >   p root.title.extracted_text
> >   p root.date.year.extracted_text
> >   p root.comment_list[3].author.extracted_text
> > root.children would produce an array of the title object, author, and so
> > on. root.comment_list.children[3] == root.comment_list[3]. Any ideas?
>
> You could take a page from the Rails "pluralize" methods and also offer:
>
>   root.comments[3] == root.comment_list.children[3] == root.comment_list[3]
>
>   'root.comments_list' would also work.

It works ok for Rails, but I'm not sure that automatic pluralization is worth 
the hassle. It probably makes more sense to let people state the list 
explicitly, to the effect of Ariel::StructureNode.new {|r| r.comments :list}.


> Of the choices you show, though, I like the less verbose
> "root.comment_list[3]" - though if comment_list is more than just a simple
> array the interface could get tricky.

In the case of lists, I'm thinking that comment_list[3] is just a more 
convenient form of comment_list.children[3], as each extracted comment is a 
child of the extracted list.
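
As a minimal sketch of that idea (the ExtractedNode class and its method 
names are hypothetical, chosen only for illustration, and are not Ariel's 
actual classes), the index operator can simply delegate to the children 
array:

```ruby
# Hypothetical sketch of an extracted node; names are illustrative only
# and do not reflect Ariel's real implementation.
class ExtractedNode
  attr_reader :name, :extracted_text, :children

  def initialize(name, extracted_text = nil)
    @name = name
    @extracted_text = extracted_text
    @children = []
  end

  # Append a child node and return self so calls can be chained.
  def add_child(node)
    @children << node
    self
  end

  # comment_list[3] is shorthand for comment_list.children[3].
  def [](index)
    @children[index]
  end
end

comment_list = ExtractedNode.new(:comment_list)
4.times do |i|
  comment_list.add_child(ExtractedNode.new(:comment, "comment #{i}"))
end

puts comment_list[3].extracted_text
```

Because [] only forwards to the array, the list node can grow other 
behaviour later without changing how callers index into it.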

> 2. How should a document be labeled?
>
> > In order to feed the learner, you must save a copy of the type of
> > document you want to extract information from, and then mark up the
> > information you want extracted. What markers would be appropriate?
> > Something such as <l:comment_list>....</l:comment_list> is a possibility.
>
> Have you heard of microformats? Essentially, it's a way to mark up existing
> HTML pages with added attributes to indicate structure. It's less
> intrusive than adding new tags, etc. You can read about them here
>
>   http://microformats.org/about/
>
> And see what Yahoo is doing with them on their local results pages:
>
> http://ylocalblog.com/blog/2006/06/21/we-now-support-microformats/

Yes, I'm familiar with microformats. As Austin says, perhaps they're not right 
for my project as it takes an approach that should be capable of dealing with 
a wide variety of semi-structured documents. Is being intrusive a problem? 
Perhaps using non-XML-like labels has advantages: they'd stick out from the 
other tags. Perhaps something such as <<comment_list>>....<</comment_list>> 
would stick out more, and I suppose people could specify their own 
separators: ~~comment_list~~....~~/comment_list~~. I guess I just need a 
sensible convention that people should be able to modify if they need to (if, 
for instance, parts of the document look like valid labeled examples).
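
A rough sketch of how user-configurable separators might work, assuming 
labels of the form OPEN name CLOSE ... OPEN /name CLOSE. The label_regex and 
labeled_spans helpers are hypothetical names, not part of Ariel, and this 
simple version does not handle a label nested inside another label of the 
same name:

```ruby
# Hypothetical helper: build a regex for labels written as
#   OPEN name CLOSE ... OPEN /name CLOSE
# The delimiters are escaped so any literal string can be used.
def label_regex(open_delim, close_delim)
  o = Regexp.escape(open_delim)
  c = Regexp.escape(close_delim)
  # \1 forces the closing label to carry the same name as the opening one;
  # the m flag lets the labeled content span several lines.
  /#{o}(\w+)#{c}(.*?)#{o}\/\1#{c}/m
end

# Return [label_name, labeled_text] pairs found in text.
def labeled_spans(text, open_delim = "<<", close_delim = ">>")
  text.scan(label_regex(open_delim, close_delim))
end

p labeled_spans("<p><<title>>Hello<</title>></p>")
p labeled_spans("~~author~~Alex~~/author~~", "~~", "~~")
```

Someone whose documents already contain text resembling the default labels 
would only need to pass a different pair of delimiters.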

> 3. Which is better?
>
> > (a). doc_tree = Ariel::StructureNode.new {|r| r.comment_list}
> > (b). doc_tree = Ariel::StructureNode.new {|r| r.comments :list}
> > (c) doc_tree = Ariel::StructureNode.new {|r| r.list :comments}
> > It's certainly possible for (a) and (b) to both be allowed.
>
> I prefer (a), though (c) seems intriguing. It sort of implies the interface
> for Ariel objects is always the same (get a single item, get a list, etc)
> and you just pass different symbols in. That could make it harder to
> introspect against, though. For any of these interfaces you should strive
> to make sure actual methods are implemented and it's not just a lot of
> method_missing tricks. Actual implementations are a lot easier to deal with
> when meta-programming than hacking around method_missing logic.

I'm also intrigued by (c) (Austin suggested it before I posted this report to 
the list). If I stick with form (a), then I think I prefer allowing (c) as an 
alternative form rather than (b). r.comment_list and r.list :comments seem 
more closely linked.

> Thanks again for the update, hope this feedback helps!
>
> Justin

It certainly does, thank you.

Alex