----- Original Message -----
From: "Dave Oshel" <dcoshel / vcmails.com>
To: "ruby-talk ML" <ruby-talk / ruby-lang.org>
Sent: Thursday, June 12, 2003 9:29 AM
Subject: Re: HTML -> list of sentences? (semi-impossible task)

> It depends on what you mean by "sentence", 'ey? Do you mean natural
> language (English? Rumanian? Urdu? Hakka? Thai? Japanese?), or
> artificial formalisms like programming languages (Perl, Ruby, FORTH)?

In this case, English sentences. Not as in formal grammars,
or as in prison sentences. Not that those two are so different.

> But someone went to a lot of trouble to carve up their perceptions of
> reality (heh) into procrustean HTML, so you may as well begin there.
> Determine the major syntactical units (TABLE, DIV, P, HR, PRE, TT, H1,
> etc.). Recursing, determine what is a "sentence" on semantic,
> idiomatic (BR, B, U), or at least grammatical (カ、ネー、ニ、 ヘ、。。。), grounds.
> Collect these purely formal "sentences" and send the list to
> post-processing (possibly human inspection) to be vetted and refined
> (e.g., does your system account for utterances which are meaningful but
> grammatically abbreviated, like "What up?" (MTV argot used by
> advertisers to slide nickels out of pockets) or "Annta desu" (kids
> choosing sides for oni in Osaka). )

I think even that is perhaps too much intelligence. I don't
want to build in knowledge about nouns and verbs.

> If you have access to a page's CSS, your hints about what the author(s)
> intended are much expanded. Maybe not so impossible after all? This
> does not seem like a difficult task to me, but maybe I haven't
> appreciated the context from which the question is posed?

My parents sometimes quote a comedian from before I was
born: "Easy for you, difficult for me."

> Does the
> solution have to be extremely general, or is it a one-shot?

Ehh, somewhat general in the sense of several chapters.
But very one-shot in that I'm looking at one particular
document, and it's about Ruby.
;)

I think the replies I've got are fairly promising
along with my own dirty hack from last night.

Cheers,
Hal

> David
>
>
> On Wednesday, June 11, 2003, at 09:38 PM, Hal E. Fulton wrote:
>
> > Hello, all.
> >
> > Here's an idea I'm toying with. Suggestions
> > are welcome.
> >
> > I want to take an HTML document (reasonably
> > well-formed, but not guaranteed) and remove
> > all the tags from it...
> >
> > ...and get a list of the *sentences* in the
> > document.
> >
> > There are, of course, several things that make
> > this difficult:
> >  - need to distinguish between end-of-sentence
> >    and embedded punctuation, including both
> >    abbreviations and textual references to
> >    Ruby methods such as eof? and split!
> >  - need to treat sentence fragments as sentences
> >  - need to ignore blocks of code
> >  - etc.
> >
> > My current approach is to start with htmlsplit
> > from the RAA. This is fairly simplistic, but
> > at least it doesn't have any dependencies.
> >
> > Not sure whether to do it in two steps or not:
> >  1. Convert to text
> >  2. Process
> >
> > Might be just as easy to do it in one step if
> > I knew what I was doing.
> >
> > Also not sure what is the best tool/library for
> > this job.
> >
> > Comments welcome.
> >
> > Hal
> >
> > --
> > Hal Fulton
> > hal9000 / hypermetrics.com
> >
> >
> >
>
> --
> David C. Oshel           mailto:dcoshel / mac.com
> Cedar Rapids, Iowa       http://homepage.mac.com/dcoshel
> ``I think most pleasantly in metaphors, and smoking brings metaphors to
> mind." - Augustus Srb, in Alexei Panshin's _Star Well_
>
>
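
For illustration only, here is a rough sketch of the two-step idea from
the original post (1. convert to text, 2. process) in plain Ruby with no
dependencies. It is not htmlsplit and not the "dirty hack" mentioned
above. Every name in it (BLOCK_TAGS, ABBREVIATIONS, html_to_chunks,
split_sentences, html_to_sentences) is made up for this example, and the
rules are assumptions: PRE blocks are dropped as code, block-level tags
end a chunk so that fragments such as headings stand alone, and a
"." "?" or "!" only ends a sentence when the next word starts with a
capital letter and the word itself is not a known abbreviation.

```ruby
# Assumed lists; extend them for the document at hand.
BLOCK_TAGS    = "p|div|h[1-6]|li|ul|ol|table|tr|td|th|br|hr|blockquote"
ABBREVIATIONS = %w[e.g. i.e. etc. vs. Mr. Mrs. Dr.].freeze

# Step 1: HTML -> list of text chunks, one per block-level element.
def html_to_chunks(html)
  # Drop code blocks (and script/style), leaving a block boundary behind.
  text = html.gsub(%r{<(pre|script|style)\b[^>]*>.*?</\1\s*>}mi, "<br>")
  text = text.gsub(/\s+/, " ")   # line breaks in the source mean nothing
  text = text.gsub(%r{</?(?:#{BLOCK_TAGS})\b[^>]*>}i, "\n")
  text = text.gsub(/<[^>]*>/, "")   # inline tags (TT, B, U, ...) vanish
  text = text.gsub("&lt;", "<").gsub("&gt;", ">")
             .gsub("&quot;", '"').gsub("&amp;", "&")
  text.split("\n").map(&:strip).reject(&:empty?)
end

# Step 2: split one chunk into sentences, word by word.
def split_sentences(chunk)
  sentences = []
  current = []
  words = chunk.split(" ")
  words.each_with_index do |word, i|
    current << word
    nxt = words[i + 1]
    ends_here = word.match?(/[.?!]["')]*\z/) &&
                !ABBREVIATIONS.include?(word) &&
                (nxt.nil? || nxt.match?(/\A["'(]?[A-Z]/))
    next unless ends_here
    sentences << current.join(" ")
    current = []
  end
  # Whatever is left is a fragment; treat it as a sentence too.
  sentences << current.join(" ") unless current.empty?
  sentences
end

def html_to_sentences(html)
  html_to_chunks(html).flat_map { |chunk| split_sentences(chunk) }
end

sample = <<~HTML
  <h1>Reading files</h1>
  <p>Use <tt>eof?</tt> to detect the end of input, e.g. in a
  loop. Then call <tt>chomp!</tt> on each line.</p>
  <pre>until f.eof?
    puts f.gets
  end</pre>
  <p>That's all.</p>
HTML

p html_to_sentences(sample)
# => ["Reading files", "Use eof? to detect the end of input, e.g. in a loop.",
#     "Then call chomp! on each line.", "That's all."]
```

The capital-letter rule is what keeps eof? and chomp! from ending a
sentence mid-stream, but it will still guess wrong when a method name or
abbreviation is followed by a capitalized word, so the output probably
wants a quick human look afterwards.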