----- Original Message -----
From: "Dave Oshel" <dcoshel / vcmails.com>
To: "ruby-talk ML" <ruby-talk / ruby-lang.org>
Sent: Thursday, June 12, 2003 9:29 AM
Subject: Re: HTML -> list of sentences? (semi-impossible task)


> It depends on what you mean by "sentence", 'ey?  Do you mean natural
> language (English? Rumanian? Urdu? Hakka? Thai? Japanese?), or
> artificial formalisms like programming languages (Perl, Ruby, FORTH)?

In this case, English sentences. Not as in formal grammars, or as
in prison sentences. Not that those two are so different.

> But someone went to a lot of trouble to carve up their perceptions of
> reality (heh) into procrustean HTML, so you may as well begin there.
> Determine the major syntactical units  (TABLE, DIV, P, HR, PRE, TT, H1,
> etc.).  Recursing, determine what is a "sentence" on semantic,
> idiomatic (BR, B, U), or at least grammatical  (カ、ネー、ニ、ヘ、。。。),
> grounds.
>   Collect these purely formal "sentences" and send the list to
> post-processing (possibly human inspection) to be vetted and refined
> (e.g., does your system account for utterances which are meaningful but
> grammatically abbreviated, like "What up?" (MTV argot used by
> advertisers to slide nickels out of pockets) or "Annta desu" (kids
> choosing sides for oni in Osaka). )

I think even that is perhaps too much intelligence. I don't want to
build in knowledge about nouns and verbs.
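
To make that concrete, here is a rough sketch of a punctuation-only
approach: strip the tags, skip PRE blocks, and break only where a
word ends in ".", "?" or "!" and the next word starts with a capital.
It is not htmlsplit and not my actual hack; the names html_to_text,
split_sentences and ABBREVIATIONS are made up for illustration, and
the abbreviation list is an assumption that would need tuning.

```ruby
# Hypothetical helpers for illustration only; not part of htmlsplit
# or any other library.

# Assumed (and incomplete) list of abbreviations that never end a sentence.
ABBREVIATIONS = %w[e.g. i.e. etc. Mr. Mrs. Dr. vs.].freeze

# Step 1: convert HTML to plain text, dropping <pre> blocks so that
# code listings are ignored entirely.
def html_to_text(html)
  text = html.gsub(%r{<pre\b.*?</pre>}mi, " ") # ignore blocks of code
  text = text.gsub(/<[^>]*>/, " ")              # strip remaining tags
  text.gsub(/\s+/, " ").strip                   # collapse whitespace
end

# Step 2: split into sentences using punctuation and capitalization only.
# A word closes a sentence when it ends in . ? or !, is not a known
# abbreviation, and the next word (if any) starts with a capital letter.
# This keeps "eof? and split!" together because "and" is lowercase.
def split_sentences(text)
  words = text.split
  sentences = []
  current = []
  words.each_with_index do |word, i|
    current << word
    following = words[i + 1]
    ends_with_stop = word =~ /[.?!]["')]*\z/
    boundary = ends_with_stop &&
               !ABBREVIATIONS.include?(word) &&
               (following.nil? || following =~ /\A["'(]*[A-Z]/)
    if boundary
      sentences << current.join(" ")
      current = []
    end
  end
  # Whatever is left over is kept as a sentence fragment.
  sentences << current.join(" ") unless current.empty?
  sentences
end
```

Calling split_sentences(html_to_text(html)) does the two steps in
order. Known weaknesses: a method name like eof? followed by a
capitalized word is treated as a sentence end, and a heading in the
middle of the page runs into the text after it, since block-level
tags are not treated as boundaries here.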

> If you have access to a page's CSS, your hints about what the author(s)
> intended are much expanded.  Maybe not so impossible after all?  This
> does not seem like a difficult task to me, but maybe I haven't
> appreciated the context from which the question is posed?

My parents sometimes quote a comedian from before I was born: "Easy for
you, difficult for me."

>  Does the
> solution have to be extremely general, or is it a one-shot?

Ehh, somewhat general in the sense of several chapters. But very
one-shot in that I'm looking at one particular document, and it's
about Ruby. ;)

I think the replies I've got so far are fairly promising, along with
my own dirty hack from last night.

Cheers,
Hal


> David
>
>
> On Wednesday, June 11, 2003, at 09:38  PM, Hal E. Fulton wrote:
>
> > Hello, all.
> >
> > Here's an idea I'm toying with. Suggestions
> > are welcome.
> >
> > I want to take an HTML document (reasonably
> > well-formed, but not guaranteed) and remove
> > all the tags from it...
> >
> > ...and get a list of the *sentences* in the
> > document.
> >
> > There are, of course, several things that make
> > this difficult:
> >   - need to distinguish between end-of-sentence
> >     and embedded punctuation, including both
> >     abbreviations and textual references to
> >     Ruby methods such as eof? and split!
> >   - need to treat sentence fragments as sentences
> >   - need to ignore blocks of code
> >   - etc.
> >
> > My current approach is to start with htmlsplit
> > from the RAA. This is fairly simplistic, but
> > at least it doesn't have any dependencies.
> >
> > Not sure whether to do it in two steps or not:
> > 1. Convert to text
> > 2. Process
> >
> > Might be just as easy to do it in one step if
> > I knew what I was doing.
> >
> > Also not sure what is the best tool/library for
> > this job.
> >
> > Comments welcome.
> >
> > Hal
> >
> > --
> > Hal Fulton
> > hal9000 / hypermetrics.com
> >
> >
> >
> >
> --
> David C. Oshel               mailto:dcoshel / mac.com
> Cedar Rapids, Iowa       http://homepage.mac.com/dcoshel
> ``I think most pleasantly in metaphors, and smoking brings metaphors to
> mind." - Augustus Srb, in Alexei Panshin's  _Star Well_
>
>