> however, the CSS and Javascript lines are not 
> removed.  So I think I can gsub the CSS and Javascript blocks with the 
> multiline regexp gsub.
> 
> I wonder though if there is a quick way, that will do what the lynx on 
> UNIX does... just print out a plain and readable text page.

I got it working up to this point:


require 'open-uri'
require 'hpricot'

c = open('http://www.google.com').read


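# Drop <style> and <script> blocks entirely (/m lets . span newlines).
c.gsub!(/<style.*?<\/style.*?>/mi, " ")
c.gsub!(/<script.*?<\/script.*?>/mi, " ")

# Inline-ish tags become a space, block-level tags a newline.
# \b keeps "p" from also matching <pre> or <param>.
# &nbsp; is an entity, not a tag, so it gets its own substitution.
c.gsub!(/<(span|tr|td)\b.*?>/mi, " ")
c.gsub!(/&nbsp;/, " ")
c.gsub!(/<(br|p|div|table)\b.*?>/mi, "\n")

d = Hpricot(c).inner_text
# Collapse spaces and tabs only; \s+ would also swallow the newlines added above.
d.gsub!(/[ \t]+/, " ")
d.gsub!(/\s*\n\s*/, "\n")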

print d


But the output is not very pretty, and it does not filter out the
non-printable characters either.
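
For the non-printable characters, something along these lines might work. It is only a sketch: strip_nonprintable and tidy_whitespace are names I made up for illustration, the sample string is invented, and it assumes the text is validly encoded. The POSIX class [[:print:]] does not include newline or tab, so those are kept explicitly.

```ruby
# Hypothetical helper (not part of the script above): drop every character
# that is not printable, but keep newlines and tabs so line breaks survive.
def strip_nonprintable(text)
  text.gsub(/[^[:print:]\n\t]/, "")
end

# Hypothetical helper: squeeze runs of spaces/tabs, then collapse blank
# lines and trim whitespace around each line break.
def tidy_whitespace(text)
  text.gsub(/[ \t]+/, " ").gsub(/\s*\n\s*/, "\n").strip
end

# Invented sample with NUL, BEL and ESC characters mixed in.
sample = "Web\x00 Images\x07\n\n\n  Groups\x1b  News"
puts tidy_whitespace(strip_nonprintable(sample))
```

In the script above this would be applied to d just before the print.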

-- 
Posted via http://www.ruby-forum.com/.