I like this code I found. I did not write it, and I wish I could give
credit to whoever created it, but I lost the website.

require 'cgi'

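# Convert an HTML fragment to plain text, replacing each link or image
# with a numbered marker and listing the URLs at the end.
def html2text(html)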
  text = html.
    gsub(/( |\n|\s)+/im, ' ').squeeze(' ').strip.
    gsub(/<([^\s]+)[^>]*(src|href)=\s*(.?)([^>\s]*)\3[^>]*>\4<\/\1>/i, '\4')

  links = []
  linkregex = /<[^>]*(src|href)=\s*(.?)([^>\s]*)\2[^>]*>\s*/i
  while linkregex.match(text)
    links << $~[3]
    text.sub!(linkregex, "[#{links.size}]")
  end

  text = CGI.unescapeHTML(
    text.
      gsub(/<(script|style)[^>]*>.*?<\/\1>/im, '').
      gsub(/<!--.*?-->/m, '').
      gsub(/<hr(| [^>]*)>/i, "___\n").
      gsub(/<li(| [^>]*)>/i, "\n* ").
      gsub(/<blockquote(| [^>]*)>/i, '> ').
      gsub(/<(br)(| [^>]*)>/i, "\n").
      gsub(/<(\/h[\d]+|p)(| [^>]*)>/i, "\n\n").
      gsub(/<[^>]*>/, '')
  ).lstrip.gsub(/\n[ ]+/, "\n") + "\n"

  # Append the collected URLs as numbered footnotes.
  links.each_with_index do |link, i|
    text += "\n  [#{i + 1}] <#{CGI.unescapeHTML(link)}>" unless link.nil?
  end
  text
end


input = " <h1>Title</h1> This is the body. Testing " \
        "<a href='http://www.google.com/'>link to Google</a>.<p /> " \
        "Testing image <img src='/noimage.png'>.<br /> The End."

print html2text(input)
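
To be fair to the point Eric makes in the quoted message below, here is a rough sketch of how the catch-all tag stripper at the end of html2text reacts to his example. The helper name strip_tags and the constant TAG_STRIPPER are made up for this illustration; only the regex itself comes from the code above.

```ruby
# The same "remove anything that looks like a tag" regex that html2text
# applies as its last gsub. It assumes a tag ends at the first ">".
TAG_STRIPPER = /<[^>]*>/

# Hypothetical helper, for illustration only.
def strip_tags(html)
  html.gsub(TAG_STRIPPER, '')
end

# Eric's example: a literal ">" inside a quoted attribute value.
unescaped = '<p>a <img src="greaterthan.gif" alt=">" /> b</p>'

# Austin's version, with the ">" written as an entity.
escaped = '<p>a <img src="greaterthan.gif" alt="&gt;" /> b</p>'

p strip_tags(unescaped)  # prints "a \" /> b"  (tail of the img tag leaks through)
p strip_tags(escaped)    # prints "a  b"
```

The regex treats the ">" inside alt as the end of the img tag, so the rest of the tag ends up in the text. Escaping it as &gt; avoids the problem, but that only helps when you control the HTML you are given.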




On 1/10/06, Eric Schwartz <emschwar / mail.ericschwartz.us> wrote:
>
> Austin Ziegler <halostatue / gmail.com> writes:
> > On 09/01/06, Eric Schwartz <emschwar / mail.ericschwartz.us> wrote:
> > > More like, "Just pray the HTML you are modifying doesn't happen to be
> > > completely valid, but not formed in exactly the way you are
> > > expecting."  For instance, the following HTML snippet is completely
> > > valid, but screws up the regex:
> > >
> > > <p>a <img src="greaterthan.gif" alt=">" /> b</p>
> >
> > Actually, that is *not* completely valid, at least not valid XHTML
> > (which is what I use these days).
>
> When wrapped with the appropriate tags, it validated HTML 4.01, which
> is what I recommend most people generate these days (because of some,
> but not all, of the reasons elucidated at
> http://codinginparadise.org/weblog/2005/08/xhtml-considered-harmful.html).
> So yes, it is valid HTML, which is all I claimed it to be.
>
> I specifically didn't mention XHTML, since the bits of the thread I
> saw referenced HTML, and they're enough different I figured XHTML
> would have been mentioned if that's what was wanted.  Of course with
> XHTML, you have CDATA sections, which can contain all sorts of
> nastiness that can trip you up just as badly.
>
> > You have to do that as:
> >   <p>a <img src="greaterthan.gif" alt="&gt;" /> b</p>
> >
> > But my regexp wasn't intended to be complete; there are full libraries
> > out there for that.
>
> Right; my point was that in my experience, regexes seem to work just
> fine, until suddenly they don't, and then you have to spend silly
> amounts of time compensating for them-- or you could just use a proper
> library in the first place, and not have to worry about it.
>
> -=Eric
>
>
