On Wednesday 23 August 2006 00:25, Justin Bailey wrote:
> Very impressive library! I remember when you posted about this at the
> beginning of the summer. I took the library and pointed it at the
> USCCBs online version of the Bible and got some very impressive
> results! It was able to identify book, chapter and verse with only a
> few examples.

Thank you so much for taking the time to write such a detailed email.

> With only 3 sample pages, and the following structure I was able to
> get very reliable results
>
> structure = Ariel::Node::Structure.new do |r|
>   r.item :book do |b|
>     b.item :title
>     b.item :chapter do |c|
>       c.item :title
>       c.list :verses do |v|
>         v.list_item :verse
>       end
>     end
>   end
> end
>
> I was particularly impressed that it understood how I re-used the
> title tag in different contexts (i.e. for the book and chapter title).

This works because of the way nesting is checked when extracting label tags,
which is also why it gets confused when you have an extra tag. Take:
<l:item><l:title>....</l:title>...</l:item><l:title>...</l:title>
When Ariel encounters the first <l:item> it increments the nesting level by
one, and again when it encounters the first <l:title>. The two closing tags
decrement it, so when the second <l:title> (the one we're searching for in
this example) is encountered and the nesting level is 0, we know it's the
right one.
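
To make the counting concrete, here is a rough sketch in plain Ruby. It is
not Ariel's actual code: find_top_level and the LABEL_TAG regexp are names
made up for this illustration, and it assumes the label tags are well formed.

```ruby
# Illustration only: locate <l:NAME> tags that sit at nesting level 0.
# find_top_level and LABEL_TAG are hypothetical, not part of Ariel.
LABEL_TAG = %r{<(/?)l:(\w+)>}

def find_top_level(text, name)
  depth = 0
  offsets = []
  text.scan(LABEL_TAG) do |closing, tag|
    offset = Regexp.last_match.begin(0)
    if closing == "/"
      depth -= 1                      # a closing label tag: one level up
    else
      # Only an opening tag seen at level 0 is the one we are after.
      offsets << offset if depth.zero? && tag == name
      depth += 1                      # an opening label tag: one level down
    end
  end
  offsets
end

doc = "<l:item><l:title>Inner</l:title>rest</l:item><l:title>Outer</l:title>"
hits = find_top_level(doc, "title")
hits.each { |o| puts doc[o..-1] }
```

Only the second title is reported, because the first one is seen at nesting
level 1. With this kind of counting, any unexpected wrapping label tag pushes
the tags inside it down a level, so they are no longer found at level 0.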

> If you'd like me to email you my structure files, my examples, and the
> tests I use them in I'd be glad to. It's three small files but
> probably too much for the list.

Yes, please do email them to me.

> Comments and questions I jotted down while playing with this:
>
> * Most of the chapter pages have footnotes interspersed throughout the
> text. These are hyperlinks to anchors below the main body of the
> chapter. Can Ariel correctly identify footnotes and pull in the text
> for them?

I'll have to take a look at your examples, but if I understand correctly, not
really. Perhaps you could extract the footnote references (as in #footnote34)
from the relevant page section, and separately extract all the footnotes
(each with a footnote.reference). Then you can match them up. Is this what
you're trying to do? I haven't thought about having linked items like that
before... interesting.
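
The matching step itself would be ordinary Ruby once both extractions are
done. A minimal sketch, assuming the two extractions give you something like
the arrays below (the data and key names are invented for the example, they
are not what Ariel returns):

```ruby
# Hypothetical results of two separate extractions.
references = ["#footnote1", "#footnote3"]   # anchors found in the chapter text
footnotes = [
  { reference: "footnote1", text: "first note" },
  { reference: "footnote2", text: "second note" },
  { reference: "footnote3", text: "third note" },
]

# Index the footnotes by the anchor that would point at them.
by_anchor = footnotes.each_with_object({}) do |note, index|
  index["#" + note[:reference]] = note[:text]
end

# Pair every reference with its footnote text (nil if none was extracted).
matched = references.map { |ref| [ref, by_anchor[ref]] }
matched.each { |ref, text| puts "#{ref}: #{text}" }
```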

> * Ariel gets confused if you have tags in the example document that
> are not in the structure, but look like Ariel tags. Example: I had a
> <l:verses> tag which contained <l:verse> tags. I realized 'verses'
> was not needed so removed the item definition from the structure but
> not the example file. Ariel was not able to find the verse items until
> I removed the <l:verses> tag.

The checking and error reporting when parsing labeled documents isn't that
great at the moment. I'll have to rework it a bit to make it easier to work
out where the errors are, if there are any. I'm not sure I follow here,
though: you have :verses in your example above. Putting list items in a
container is the recommended way of doing things:

<ul>
<l:verses><li><l:verse>Verse 1</l:verse></li>
<li><l:verse>Verse 2</l:verse></li>
<li><l:verse>Verse 3</l:verse></li></l:verses>
</ul>

You could put the <l:verses> right next to the first <l:verse>, and likewise
the </l:verses> right next to the last </l:verse>, if you wanted.

When defining structure, a list_item should really be the only child of a
list. (Internally a list is just an item: if you think about it, extracting
the whole list above is the same as extracting any other piece of text that
occurs once.) If you have multiple list_items at the same level I think you'd
get a lot of things going wrong. I'll add a check for this: a list_item
should have no siblings.
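
Roughly, the check I have in mind would look something like this. It's only
a sketch over a stand-in tree of Hashes (with made-up :name, :type and
:children keys), not the real structure classes:

```ruby
# Sketch only: each node is a Hash with :name, :type and optional :children.
# list_item_errors is a hypothetical helper, not an Ariel method.
def list_item_errors(node, errors = [])
  children = node.fetch(:children, [])
  has_list_item = children.any? { |c| c[:type] == :list_item }
  if has_list_item && children.size > 1
    errors << "#{node[:name]}: a list_item must have no siblings"
  end
  children.each { |c| list_item_errors(c, errors) }
  errors
end

good = { name: :verses, type: :list,
         children: [{ name: :verse, type: :list_item }] }
bad  = { name: :verses, type: :list,
         children: [{ name: :verse, type: :list_item },
                    { name: :note,  type: :list_item }] }

puts list_item_errors(good).inspect
puts list_item_errors(bad).inspect
```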

> * Typing "extracted_text" to get the  text of each node is cumbersome.
> If its not already, maybe overload to_s on nodes to display the text?

Will do this.

> * Dealing with items is a little cumbersome. To get the number of
> verses in a chapter, I have to type
> e[:book][:chapter][:verses].children.length. Since I am already
> treating the nodes like arrays, having a 'length' method would be
> nice:  e[:book][:chapter][:verses].length

This is a case where I'd like you to use #search. It's easier for you too:
what if no value for chapter was extracted, for whatever reason? You'd get an
error with the code above (because you'd be using [] on the nil value
returned by e[:book][:chapter]), but e.search('book/chapter/verses/*').length
would just return 0. (e/'book/chapter/verses/*').length is equivalent. I
guess I haven't defined #size/#length because it makes sense when you're
talking about a list, but means little when you're talking about something
like e.chapter.size. Still, #size = number of children seems reasonable
enough, so I'll add that.
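
Here's the difference in miniature, using a toy tree of nested Hashes and a
toy search method. Both are stand-ins I've written for this email, so don't
read them as how Ariel's nodes or #search are implemented:

```ruby
# Toy path search over nested Hashes/Arrays (not Ariel's implementation).
# A missing step simply yields no nodes instead of raising.
def search(tree, path)
  path.split("/").reduce([tree]) do |nodes, part|
    nodes.flat_map do |n|
      case n
      when Hash  then part == "*" ? n.values : [n[part.to_sym]].compact
      when Array then part == "*" ? n : []
      else []
      end
    end
  end
end

extracted = { book: { title: "Some book" } }   # no :chapter was extracted

begin
  extracted[:book][:chapter][:verses]          # [] called on nil
rescue NoMethodError => e
  puts "chained []: #{e.class}"
end

puts search(extracted, "book/chapter/verses/*").length
```

The chained version blows up as soon as one step is missing, while the
path-based version just runs out of nodes and returns an empty array.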

> * Better progress indication during learning phase. Hard to tell if
> program is hung or if it is managing to do something. The CPU is
> pegged but its hard to tell what progress is being made.

Are you at least seeing messages like this?
info: Learning rules for node version_history with 2 examples
info: Learnt start rules [#<Ariel::Rule:0xb79d7c64 @exhaustive=false, 
@direction=:forward, @landmarks=[["<td>"], ["Versions"], ["<td>"]]>]

You can fill your screen with status updates by using the -D switch if using
the command line script, by setting $DEBUG, or by calling
Ariel::Log.set_level :debug.

It's hard to know what status information to output. Other than printing the
name of the item we're learning rules for and the rules as they're learnt,
I'm not sure what else would mean something to a user who isn't familiar
with Ariel internals without being excessively verbose. If you just want to
know that something's going on behind the scenes, then try one of the
switches above.

> * More info about the search/at methods and expressions they can take.
> RDOC and the tutorial only hint at what you can do.

They're very limited at the moment. There's nothing more to them than listing
node names separated by /, with * supported much like directory globbing.
There's no way to add conditions (for example, to select only verses lists
with more than 5 children), but then Ruby has powerful array operations like
#select and #reject for this sort of querying. I made this interface as
basic as possible, not being sure what people would need/use. What sort of
queries would you like to perform? I was planning on adding range selection,
so you could do e.search 'book/chapter/verses/[0..5]/whatever'. Clearly in
your structure it would be just as easy to slice the result array.
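
For example, with plain arrays standing in for search results (the data
below is invented, one array of verses per chapter), the filtering and
slicing are just standard Ruby:

```ruby
# Hypothetical search results: one array of verse strings per chapter.
verses_lists = [
  %w[v1 v2 v3],
  %w[v1 v2 v3 v4 v5 v6 v7],
  %w[v1 v2 v3 v4 v5 v6],
]

# "Only verses lists with more than 5 children".
long_lists = verses_lists.select { |verses| verses.length > 5 }

# The opposite filter, using #reject.
short_lists = verses_lists.reject { |verses| verses.length > 5 }

# Range selection is just slicing the result array.
first_six = verses_lists[1][0..5]

puts long_lists.length
puts short_lists.length
puts first_six.inspect
```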

This is where I could really use some practical examples to flesh out the
documentation. Maybe some of the functionality people might want is easily
provided by Ruby's standard library, but the documentation should give
pointers on where to look, and suggest useful techniques.

> * Falls apart if tags entered are not well formed and gives little
> indication why. For example, I had missed an end tag on a list_item.
> The program didnt use the examples provided (said "learning node X
> with 2 examples" when I had 3) and then would quit with the error "No
> examples are suitable for exhaustive rule learning"

I mentioned this problem with error reporting above. I've added it to the
issue tracker; this is definitely something that makes Ariel less user
friendly.

Can you recreate this with one of your labeled files? A message saying
"learning node x with 2 examples" when there are 3 seems a little odd. The
"No examples are suitable for exhaustive rule learning" error takes a bit of
explaining, which I probably don't have time to do properly. Basically,
taking the example I used above, I could have labeled it like this:

<ul>
<li><l:verses><l:verse>Verse 1</l:verse></li>
<li><l:verse>Verse 2</l:verse></li>
<li><l:verse>Verse 3</l:verse></l:verses></li>
</ul>

Remember how Ariel learns rules: it finds a rule that consumes all the tokens
up to the one that is labeled. Assume we're finding start rules (end rules
have the same issue). With the labeling above there are no tokens between
the beginning of the extracted verses list and the first verse label, so the
only possible rule is an empty rule with no landmarks, which of course can't
be applied exhaustively to iterate over the whole list. This is why this
example must be ignored in the current (somewhat naive) implementation. I
find it works pretty well, it just could be better. Looking into this is one
of my post-SoC aims. A problem with lists is that you don't want to make
users label every item, or even count them. The good thing is that lists are
generally very regular and have simple rules to split them.

Returning to the example, if we can't make a start rule that locates the
start of the first verse, then how do we extract it? The answer is the end
rule. Say we have an end rule that has </li> as a landmark: the lowest end
location will have a position less than the first start location, so Ariel
assumes that all tokens from the first token of the list up to the lowest
end location are a list item. Hope that makes a little sense. This isn't
something I've explained much (or at all) in the documentation, because it
requires quite a lot of understanding of how Ariel works, and I'm hoping to
look at ways to change the way this works.
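
If it helps, here is the idea boiled down to a few lines of Ruby. This is a
heavily simplified illustration rather than Ariel's implementation: tokens
are plain strings, each "rule" is a single landmark token, and the handling
of the last item is an assumption made just for this sketch.

```ruby
# The extracted verses list from the second labeling, as simplified tokens.
tokens = ["Verse 1", "</li>", "<li>", "Verse 2", "</li>", "<li>", "Verse 3"]

# Start locations: the token right after each "<li>" landmark.
starts = tokens.each_index.select { |i| tokens[i] == "<li>" }.map { |i| i + 1 }
# End locations: the position of each "</li>" landmark (exclusive end).
ends = tokens.each_index.select { |i| tokens[i] == "</li>" }

# The lowest end lies before the first start, so assume the very first
# token of the list begins an item.
if ends.first && (starts.empty? || ends.first < starts.first)
  starts.unshift(0)
end

# Assumption for this sketch: a start with no later end runs to the end
# of the list.
items = starts.map do |s|
  e = ends.find { |pos| pos > s } || tokens.length
  tokens[s...e].join(" ")
end

puts items.inspect
```

The start rule alone only finds the second and third verses here; the first
one is recovered purely because an end location turns up before any start
location.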

Thanks so much again for taking the time to look through my project and share 
your experiences, hope my response has been some help. Apologies if it's a 
little long.

Regards,

Alex