From: Paul Lutus [mailto:nospam / nosite.zzz]
Sent: Thursday, November 30, 2006 11:00 PM
>Victor "Zverok" Shepelev wrote:
>
>> It is a task definition.
>>
>> The task may vary for different dictionaries. For ex., with some
>> dictionaries tables must not be deleted, but "normalized":
>> "<td>text1<td>text2" => "<table><tr><td>text1<td>text2</table>"
>
>Both the before and after forms show big syntax errors. I hope you
>understand HTML syntax, if not, this may be more difficult than I thought.

I do understand HTML syntax, and I see no problem in the above.
The closing tags for <tr> and <td> are both optional in the W3C HTML 4.01 spec.
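
To make the "normalization" quoted above concrete, here is a rough sketch in plain Ruby. The name normalize_cells is made up for illustration and does not come from any existing library. It assumes the fragment is either a bare run of cells or already contains a <table>; anything else is passed through unchanged.

```ruby
# Hypothetical helper (not from any library): wraps a bare run of table
# cells in <table><tr>...</table>, relying on the optional closing tags.
def normalize_cells(fragment)
  # Already inside a table: leave it alone.
  return fragment if fragment =~ /<table\b/i
  # Does not start with a <td> or <th> cell: nothing to normalize.
  return fragment unless fragment =~ /\A\s*<t[dh]\b/i
  "<table><tr>#{fragment.strip}</table>"
end

puts normalize_cells("<td>text1<td>text2")
# prints <table><tr><td>text1<td>text2</table>
```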

>Perhaps you could post what you consider to be the desired end result for a
>particular entry from the "dictionary" site of your choice.

OK. Here it is:
Source page: http://en.wikipedia.org/wiki/Ukraine
Start pattern: <!-- start content -->
End pattern: <h2>
Elements to exclude: tables, images.

Desired output (with the text in the middle of the paragraph skipped):
---------------------------
<p><b>Ukraine</b> (<a href="/wiki/Ukrainian_language" title="Ukrainian language">Ukrainian</a>: <span lang="uk" xml:lang="uk">Україна</span>, <i>Ukraina</i>, <span title="Pronunciation in IPA" class="IPA">/ukrajina/</span>) is a <a href="/wiki/Country" title="Country">country</a> in <a href="/wiki/Eastern_Europe" title="Eastern Europe">Eastern Europe</a>.
....
It became independent again after the <a href="/wiki/History_of_the_Soviet_Union_%281985-1991%29" title="History of the Soviet Union (1985-1991)">Soviet Union's collapse</a> in 1991.</p>
---------------------------

That's all.
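
For illustration only, the task as defined above (start pattern, end pattern, tables and images excluded) might be sketched in plain Ruby like this. extract_entry is a hypothetical name, not an existing library call. The sketch assumes the page is already in a string, that both patterns are literal strings, and that tables are not nested; regexps are not a real HTML parser, so this is a starting point rather than a robust solution.

```ruby
# Hypothetical helper (not from any library). Returns the HTML that follows
# start_pat, up to the first end_pat, with tables and images removed.
# Returns nil when start_pat is not found.
def extract_entry(html, start_pat, end_pat)
  start_at = html.index(start_pat)
  return nil unless start_at
  rest = html[(start_at + start_pat.length)..-1]

  # Remove excluded elements first, so an end pattern that occurs
  # inside a table does not cut the entry short.
  rest = rest.gsub(/<table\b.*?<\/table\s*>/mi, '') # nested tables NOT handled
  rest = rest.gsub(/<img\b[^>]*>/i, '')

  end_at = rest.index(end_pat)
  rest = rest[0...end_at] if end_at
  rest.strip
end

# Made-up sample page, just to show the call.
page = "<div>nav</div><!-- start content -->\n" \
       "<table><tr><td>infobox<h2>x</h2></table>\n" \
       "<p><img src=\"flag.png\" alt=\"flag\"><b>Ukraine</b> is a country.</p>\n" \
       "<h2>History</h2><p>more</p>"
puts extract_entry(page, "<!-- start content -->", "<h2>")
# prints <p><b>Ukraine</b> is a country.</p>
```

Note that the non-greedy table regexp requires an explicit </table>, which HTML 4.01 does require even though </tr> and </td> are optional.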

>By the way (my boilerplate remark about page scraping), if this is for any
>purpose other than your own personal use, it represents a copyright
>problem.

My application would be a kind of browser (a nano-browser); I don't want to "grab" dictionaries.

>I want to emphasize this is not difficult at all, once there is a clear
>statement of purpose. It can be done in a few (maybe a few dozen) lines of
>Ruby code.

I know. I'm not a nuby (the poor language in my mails is due to natural-language difficulties, not a lack of knowledge).
I was only asking about existing libraries.

>
>--
>Paul Lutus
>http://www.arachnoid.com

V.