I'm relatively new to Ruby (and therefore Nokogiri) and am trying to
parse some HTML that will ultimately be written to a MySQL database.  In
the interim, I'm writing it to a text file for troubleshooting purposes.

Here's the relevant piece of the HTML I'd like to parse:

<!-- body="start" -->
<div class="mail">
<address class="headers">
<span id="from">
<dfn>From</dfn>: Paul David Mena &lt;<a
href="mailto:pauldavidmena_at_gmail.com?Subject=Re:%20twilight">pauldavidmena_at_gmail.com</a>&gt;
</span><br />
<span id="date"><dfn>Date</dfn>: Tue, 26 Mar 2013 18:13:21
-0400</span><br />
</address>
<p>
Line 1
<br />
Line 2
<br />
Line 3
<br />
<p><pre>
--
Paul David Mena
--------------------
pauldavidmena_at_gmail&#46;<!--nospam-->com
</pre>
<span id="received"><dfn>Received on</dfn> Tue Mar 26 2013 - 22:13:23
EDT</span>
</div>
<!-- body="end" -->

My goal is to strip out everything inside the "address" and "pre"
elements and to output only:

Line 1

Line 2

Line 3

My code, however, strips out only one or the other, depending on where
I place the method definitions.  Here is the code:

#!/usr/bin/env ruby

require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext
  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @pre = false
    @address = false
    @plaintext = ""
  end

  def start_element(name, attrs = [])
    if name == "address"
      @address = true
    end
  end

  def end_element(name, attrs = [])
    if name == "address"
      @address = false
    end
  end

  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  def end_element(name, attrs = [])
    if name == "pre"
      @pre = false
    end
  end

  # This method is called whenever a comment occurs, and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip       # strip leading and trailing whitespace
      when /^body="start"/     # match starting comment
        @interesting = true
      when /^body="end"/       # match closing comment
        @interesting = false
    end
  end

  # This callback method is called with any text that
  # appears between tags.
  def characters(string)
    if @interesting and not @pre
      if @interesting and not @address
        @plaintext << string
      end
    end
  end
end

fname = ARGV[0]
start_column = 4
end_column = 6

target_range = (start_column-1)..(end_column-1)
IO.foreach(fname) do |line|
  if line.match(/<dfn>Date<\/dfn>/)
    pieces = line.split(" ")

    @date_string = pieces[target_range].join("-")
#   puts @date_string
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

# puts pte.plaintext

begin
  file = File.open("snippet.txt", "w")
  file.write(@date_string)
  file.write(pte.plaintext)
rescue IOError => e
  # an error occurred (directory not writable, etc.)
ensure
  file.close unless file.nil?
end
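
A possible explanation for the "one or the other" behaviour, offered as a sketch rather than a confirmed diagnosis: Ruby keeps only the last definition of a method with a given name, so the second start_element/end_element pair in the class above replaces the first instead of adding to it. The example below folds both checks into a single pair of callbacks. MergedExtractor, SKIPPED and @skip_depth are hypothetical names, and the class is a plain-Ruby stand-in driven by hand-written events so that it runs without Nokogiri; in the real script it would presumably subclass Nokogiri::XML::SAX::Document as before.

```ruby
# Hypothetical stand-in for the SAX handler. It is NOT a Nokogiri class;
# it only mirrors the callback names used in the script above.
class MergedExtractor
  SKIPPED = %w[address pre].freeze   # elements whose text should be dropped

  attr_reader :plaintext

  def initialize
    @interesting = false   # true between the body="start"/"end" comments
    @skip_depth  = 0       # > 0 while inside any skipped element
    @plaintext   = String.new
  end

  # A single start_element covers every skipped tag. Defining
  # start_element a second time would silently replace this one.
  def start_element(name, attrs = [])
    @skip_depth += 1 if SKIPPED.include?(name)
  end

  def end_element(name)
    @skip_depth -= 1 if SKIPPED.include?(name) && @skip_depth > 0
  end

  def comment(string)
    case string.strip
    when /\Abody="start"/ then @interesting = true
    when /\Abody="end"/   then @interesting = false
    end
  end

  def characters(string)
    @plaintext << string if @interesting && @skip_depth.zero?
  end
end

# Hand-written event sequence that loosely mimics the sample HTML.
handler = MergedExtractor.new
handler.comment(' body="start" ')
handler.start_element("address")
handler.characters("From: someone")   # inside <address>, dropped
handler.end_element("address")
handler.characters("Line 1\n")        # kept
handler.start_element("pre")
handler.characters("signature")       # inside <pre>, dropped
handler.end_element("pre")
handler.comment(' body="end" ')
handler.characters("after the body")  # outside the body markers, dropped

puts handler.plaintext
```

With this event sequence only the text outside both skipped elements and inside the body markers is collected. Using a counter instead of two separate booleans is a design choice of the sketch, not something the original script requires.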

-- 
Posted via http://www.ruby-forum.com/.