Very impressive library! I remember when you posted about this at the
beginning of the summer. I took the library and pointed it at the
USCCB's online version of the Bible and got some very impressive
results! It was able to identify book, chapter, and verse with only a
few examples.

With only three sample pages and the following structure, I was able to
get very reliable results:

structure = Ariel::Node::Structure.new do |r|
  r.item :book do |b|
    b.item :title
    b.item :chapter do |c|
      c.item :title
      c.list :verses do |v|
        v.list_item :verse
      end
    end
  end
end

I was particularly impressed that it understood how I re-used the
title tag in different contexts (i.e. for both the book title and the
chapter title).

If you'd like me to email you my structure files, my examples, and the
tests I use them in I'd be glad to. It's three small files but
probably too much for the list.

Comments and questions I jotted down while playing with this:

* Most of the chapter pages have footnotes interspersed throughout the
text. These are hyperlinks to anchors below the main body of the
chapter. Can Ariel correctly identify footnotes and pull in the text
for them?

* Ariel gets confused if the example document contains tags that look
like Ariel tags but are not in the structure. Example: I had a
<l:verses> tag which contained <l:verse> tags. I realized 'verses'
was not needed, so I removed the item definition from the structure
but not from the example file. Ariel was not able to find the verse
items until I removed the <l:verses> tag.

* Typing "extracted_text" to get the text of each node is cumbersome.
If it's not already overloaded, maybe overload to_s on nodes to return
the text?
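Something like this, if it helps; a minimal sketch of the idea, where Node is a stand-in class for illustration, not Ariel's actual node class:

```ruby
# Sketch only: assuming each extracted node exposes its text via an
# extracted_text accessor, to_s could simply delegate to it so nodes
# print naturally with puts and string interpolation.
class Node
  attr_accessor :extracted_text

  def initialize(text = "")
    @extracted_text = text
  end

  # Delegating to_s means puts and "#{node}" show the text directly.
  def to_s
    extracted_text
  end
end

verse = Node.new("In the beginning...")
puts verse              # no need to type verse.extracted_text
puts "First verse: #{verse}"
```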

* Dealing with items is a little cumbersome. To get the number of
verses in a chapter, I have to type
e[:book][:chapter][:verses].children.length. Since I am already
treating the nodes like arrays, having a 'length' method would be
nice: e[:book][:chapter][:verses].length
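One cheap way to get that (again just a sketch with a stand-in Node class, not Ariel's) would be to forward Array-style calls to the children collection with Forwardable:

```ruby
require 'forwardable'

# Sketch only: forwarding Array-style calls to @children gives the
# shorter e[:book][:chapter][:verses].length form suggested above.
class Node
  extend Forwardable
  attr_reader :children

  # length, size, each, and [] all go straight to @children
  def_delegators :@children, :length, :size, :each, :[]

  def initialize
    @children = []
  end
end

verses = Node.new
3.times { verses.children << Node.new }
puts verses.length  # same as verses.children.length
```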

* Better progress indication during the learning phase would help.
It's hard to tell whether the program is hung or actually getting
somewhere. The CPU is pegged, but there's no feedback on what progress
is being made.

* More info about the search/at methods and the expressions they can
take would be welcome. The RDoc and the tutorial only hint at what you
can do.

* Ariel falls apart if the tags entered are not well formed, and gives
little indication why. For example, I had missed an end tag on a
list_item. The program didn't use all the examples provided (it said
"learning node X with 2 examples" when I had 3) and then quit with the
error "No examples are suitable for exhaustive rule learning".