From: Paul Lutus [mailto:nospam / nosite.zzz]
Sent: Thursday, November 30, 2006 8:20 PM
>Victor "Zverok" Shepelev wrote:
>
>> From: Dmitry Borodaenko [mailto:angdraug / gmail.com]
>> Sent: Thursday, November 30, 2006 4:21 PM
>>>On 11/30/06, Victor Zverok Shepelev <vshepelev / imho.com.ua> wrote:
>>>> My task is: I have some HTML fragment; no limitations on its
>>>> correctness, except that a tag can't be cut in half:
>>>(...)
>>>> Can it be done with Hpricot? Or any other options?
>>>
>>>Tried HTMLTidy[0]?
>>
>> Not really tried, but had thought about.
>> The problem is I need something really "small, smart and simple", not
>> "huge and almighty" (as Tidy seems).
>
>Not "huge and almighty" but "small, smart and simple" ... I believe that's
>my cue.
>
>Have you considered writing your own miniature library? Maybe, a library
>consisting of 20 lines of Ruby instructions (regulars: note the absence of
>a certain trigger word)?
>
>Why not express the problem to be solved more explicitly and clearly?
>
>And ... were the HTML pages written by humans or a machine? I ask because
>machine-generated HTML tends to be more syntactically reliable.
>
>If I can have a sufficiently clear statement of the problem to be solved,
>I can suggest a solution -- or post one.
>
>On re-reading your first post in this thread, I venture to say that the
>pages are sufficiently disorganized that an ad hoc solution is the best
>approach overall, one in which various regular expression filters are used
>to extract essential page data, and the pages can then be reconstructed
>using stricter HTML or XHTML syntax.
>
>So, let's write some cod ... oops, I mean let's write a small library.

OK, here's a model of what I'm doing: a small app that interacts with
dictionaries like Wikipedia:
* the user inputs something like "w matz";
* the software downloads the beginning of http://en.wikipedia.org/wiki/Matz
  (the first one or two meaningful paragraphs) and displays it.

What to download and what to show is set by simple templates (regexps for
now, but maybe something XPath-like later).
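
To make the template idea concrete, here is a rough sketch of how such a
table could look. Everything in it is an assumption made for illustration:
the names TEMPLATES, resolve and extract, the "%s" URL pattern, and the
regexp that grabs <p> elements are hypothetical, not the real app's code.
Downloading is left out; extract works on HTML that is already in a string.

```ruby
# Hypothetical template table: command prefix => URL pattern and regexp.
TEMPLATES = {
  'w' => {
    url:     'http://en.wikipedia.org/wiki/%s',
    extract: %r{<p\b[^>]*>.*?</p>}mi   # one <p>...</p> element per match
  }
}.freeze

# Turns input such as "w matz" into [template, url_to_download].
# Assumes a single-word term; spaces are not escaped here.
def resolve(input)
  key, term = input.split(' ', 2)
  template = TEMPLATES.fetch(key)
  [template, format(template[:url], term.capitalize)]
end

# Returns the first `limit` fragments of `html` matched by the template.
def extract(html, template, limit = 2)
  html.scan(template[:extract]).first(limit).join("\n")
end
```

Adding another dictionary would then mean adding one more entry to the
table, with its own URL pattern and regexp.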

Now we have some part of the page. I need to delete all tables, images, and
so on, strip all "non-content" tags (everything but p, ul, ol, li, b, i...),
and end up with "consistent" HTML to show.
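
A minimal sketch of the stripping step, using plain regexps. The names
ALLOWED_TAGS, DROPPED_BLOCKS and strip_non_content are hypothetical, and
the choice of which elements are removed together with their content is
my assumption. It does not balance unclosed tags, so the "consistent HTML"
part is not covered, and the non-greedy match will misbehave on nested
tables.

```ruby
# Hypothetical whitelist, taken from the "p, ul, ol, li, b, i..." list.
ALLOWED_TAGS = %w[p ul ol li b i].freeze
# Elements removed together with their content (assumption).
DROPPED_BLOCKS = %w[table script style].freeze

def strip_non_content(html)
  out = html.dup
  # Remove whole blocks such as <table>...</table>, content included.
  DROPPED_BLOCKS.each do |tag|
    out.gsub!(%r{<#{tag}\b[^>]*>.*?</#{tag}\s*>}mi, '')
  end
  # Drop every remaining tag that is not whitelisted but keep its text.
  # Whitelisted tags are re-emitted without attributes.
  out.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>}) do
    close, name = $1, $2.downcase
    ALLOWED_TAGS.include?(name) ? "<#{close}#{name}>" : ''
  end
end
```

Images vanish in the second pass because img is simply not in the
whitelist; tables need the first pass because their text must go too.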

That is the task definition.

The task may vary between dictionaries. For example, with some
dictionaries tables must not be deleted but "normalized":
"<td>text1<td>text2" => "<table><tr><td>text1<td>text2</table>"
or even the XHTML-ish "<table><tr><td>text1</td><td>text2</td></tr></table>".
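
The XHTML-ish variant of that normalization could be sketched like this.
It assumes the fragment holds only loose <td> cells of a single row, with
no <tr> or <table> around them; normalize_cells is a hypothetical name.

```ruby
def normalize_cells(fragment)
  # Drop existing </td> so every cell gets closed exactly once below,
  # then split on opening <td> tags (limit -1 keeps trailing empty cells).
  parts = fragment.gsub(%r{</td\s*>}i, '').split(/<td\b[^>]*>/i, -1)
  lead = parts.shift            # text before the first <td>, usually empty
  return fragment if parts.empty?

  row = parts.map { |text| "<td>#{text.strip}</td>" }.join
  "#{lead}<table><tr>#{row}</tr></table>"
end

puts normalize_cells('<td>text1<td>text2')
# prints <table><tr><td>text1</td><td>text2</td></tr></table>
```

A fragment without any <td> comes back unchanged, so the function can be
applied blindly to whatever the template extracted.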


>--
>Paul Lutus
>http://www.arachnoid.com

V.