Kyle Heck wrote:
> I'm writing a web crawler, and in that crawler I want to remove all
> scripts in the pages I crawl.
> 
> I should be able to do a simple gsub!(/<!--.*-->/,"") right?  Well, I do
> that and unfortunately it doesn't remove some scripts.  Take google for
> instance.  It removes the first script, but not the second.  I'm really
> confused.  Since google has two scripts, <!-- happens twice, so do -->
> so it's not like the full regexp should ever fail to be triggered.
> 
> Any insight on the issue would be GREAT?! :D
> 
> Thanks,
> Kyle Heck

I'm not sure what you are after actually, but apart from the <script> 
tags Rob mentioned, you might need to remove the onClick, onMouseOver 
and other event handlers. And since the handlers can appear within almost 
any tag, it would be very hard to find and remove them correctly with 
just a few regexps. You should use a real HTML parser (the preferred Ruby 
one seems to be called hpricot ... I guess the author wanted to be funny). 
If this is meant to make the display of the pages secure, you should also 
"keep only the tags and attributes that are safe" rather than "remove 
stuff that's not safe". With a blacklist you might easily overlook 
something.
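
To illustrate the "keep only what's safe" idea, here is a minimal sketch in plain Ruby. It assumes the parser has already handed you a tag name and a hash of its attributes; the ALLOWED table and the filter_tag helper are made-up names for this example, not part of hpricot or any other library, and the list of tags is only a placeholder.

```ruby
require "set"

# Hypothetical allowlist: tag name => attributes permitted on that tag.
# Anything not listed here is dropped, so a forgotten entry fails safe.
ALLOWED = {
  "a"  => Set["href", "title"],
  "p"  => Set[],
  "b"  => Set[],
  "em" => Set[]
}.freeze

# Returns the attributes to keep for a tag, or nil if the whole tag
# should be dropped. Names are compared case-insensitively, so
# onClick, ONCLICK and onclick are all treated the same.
def filter_tag(name, attrs)
  allowed = ALLOWED[name.downcase]
  return nil unless allowed
  attrs.select { |key, _value| allowed.include?(key.downcase) }
end
```

You would call filter_tag for every element the parser gives you and rebuild the output from what it returns. Note that this only filters names: attribute values such as an href starting with "javascript:" would still need their own check.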

If you happened to use the-language-that-mustn't-be-named, you'd just use 
HTML::TagFilter 
(http://search.cpan.org/~wross/HTML-TagFilter-1.03/TagFilter.pm). Good 
luck.

Jenda

-- 
Posted via http://www.ruby-forum.com/.