>>>>> "M" == Marko Schulz <in6x059 / public.uni-hamburg.de> writes:

>> * it try to separate comment from the rest of the document, i.e. the
>> biggest error is here
>> 
>> #########################################################
>> # first we'll shoot all the <!-- comments -->
>> #########################################################

M> What is so bad about that?

 See my regexp given in [ruby-talk:20477]. That regexp just tries to extract
 a block of text that begins with '<' (and is valid) and ends with '>',
 together with another block of text.

 If your HTML document is valid and not too complex, it's possible to do it
 like this.

 But if you want to extract components from the document (like comments,
 <a ...>, <table>, ...) you must first parse the document, and a regexp is
 not appropriate for this.
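
 To illustrate the point, here is a minimal sketch. It is not the regexp
 from [ruby-talk:20477]. The sample input and the strip_comments helper
 are hypothetical, made up for this example. It assumes a document where
 text that looks like a comment sits inside a quoted attribute value, so
 the naive substitution removes something that is not a comment.

```ruby
# Hypothetical input: the first "<!-- ... -->" is part of an attribute
# value, only the second one is a real comment.
html = %q{<a href="x" title="<!-- not a comment -->">link</a><!-- real -->}

# Naive approach: delete everything that looks like a comment.
naive = html.gsub(/<!--.*?-->/m, "")

# Tiny hand-written scanner (an illustration, not a full HTML parser):
# it knows when it is inside a tag and skips quoted attribute values,
# so only comments that appear outside of tags are dropped.
def strip_comments(html)
  out = +""
  i = 0
  while i < html.length
    if html[i, 4] == "<!--"
      close = html.index("-->", i + 4)
      break if close.nil?            # unterminated comment: drop the rest
      i = close + 3
    elsif html[i] == "<"
      j = i + 1
      quote = nil                    # current quote character, if any
      while j < html.length
        c = html[j]
        if quote
          quote = nil if c == quote  # closing quote
        elsif c == '"' || c == "'"
          quote = c                  # opening quote
        elsif c == ">"
          break                      # end of the tag
        end
        j += 1
      end
      out << html[i..j]              # copy the whole tag unchanged
      i = j + 1
    else
      out << html[i]                 # ordinary text
      i += 1
    end
  end
  out
end

puts naive                 # <a href="x" title="">link</a>
puts strip_comments(html)  # <a href="x" title="<!-- not a comment -->">link</a>
```

 The scanner still ignores things like <script> content, so it only shows
 why some knowledge of the document structure is needed before components
 can be extracted safely.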


Guy Decoux