Daz, thank you so much for taking the time to code that.  I was also
busy today, and got my code working with REXML.  Could you please take
a look at my code below and tell me whether you'd still switch to
htmltools?

The issue is that I'm creating a hundred different screen scrapers, one
for each frequent flyer program.  Any scraper is, of course, brittle,
but it seems to me that a DOM/XPath-based technique is both less likely
to break from small tweaks to the page and generally far more concise
to program.  The downside, and it may be too big, is that my code is
awfully inefficient and also requires that tidy be run on the HTML
before I start.

Also, since you're taking a look, could you please tell me if there's
any more concise way to initialize my arrays?  (Ruby generally seems to
figure out variables, but this would only run if I explicitly used
Array.new.)

require "rexml/document"
include REXML
string = <<EOF
        <html>
        <tr>
        <td class="t4" nowrap="nowrap">9-Jan-05</td>
        <td class="t4">OZ 0204 F Class
        <a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
        ICN</a> to
        <a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
        LAX</a></td>
        <td class="t4" nowrap="nowrap">5,968</td>
        <td class="t4" nowrap="nowrap">2,984</td>
        <td class="t4" nowrap="nowrap">8,952</td>
        </tr>
        <tr>
        <td class="t4" nowrap="nowrap">19-Jan-05</td>
        <td class="t4">MILEAGE PLUS UPGRADE AWARD
        15,000 MILES</td>
        <td class="t4" nowrap="nowrap">-15,000</td>
        <td>&nbsp;</td>
        <td class="t4" nowrap="nowrap">-15,000</td>
        </tr>
        </html>
EOF

def remove_tag( rexml_array,tag)
        # Removes tag but leaves the text inside the tag as text inside
        # the parent of the now removed tag.  Note that "//tag" searches
        # the whole document that rexml_array belongs to.
        while ( found = rexml_array.elements["//#{tag}"])
                # text is nil for an empty tag, hence the to_s
                found.replace_with( Text.new( found.text.to_s.strip))
        end
end

# gsub rather than gsub!, because gsub! returns nil when nothing matches
doc = Document.new( string.gsub(/\n|&nbsp;/," "), {
        :compress_whitespace => :all } )
tablearray = Array.new
XPath.each( doc,"//tr[count(td)=5]") { |row|
        rowarray = Array.new
        # re-parse the row so that "//td" only sees this row's cells
        rowdom = Document.new( row.to_s)
        XPath.each( rowdom,"//td") { |cell|
                remove_tag( cell,"a")
                # join the text nodes; on Ruby 1.9 and later Array#to_s
                # gives inspect output instead of concatenating
                rowarray << cell.texts.join
                }
        tablearray << rowarray
        }
tablearray.each {|el| print el.join(":"),"\n"}
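
On the array question, here is a rough sketch of one way the setup might
be shortened.  It is only an illustration, not a tested replacement for
the code above.  It assumes that XPath.match (which returns a plain
Array) can be combined with map, so that no Array.new is needed at all.
The sample rows are a cut-down, hypothetical version of the table
above, and flat_text is a made-up helper name, not part of REXML.

```ruby
require "rexml/document"

# Hypothetical, cut-down sample of the rows shown above.
html = <<EOF
<html>
<tr>
<td>9-Jan-05</td>
<td>OZ 0204 F Class <a href="/x">ICN</a> to <a href="/y">LAX</a></td>
<td>5,968</td>
<td>2,984</td>
<td>8,952</td>
</tr>
<tr>
<td>19-Jan-05</td>
<td>MILEAGE PLUS UPGRADE AWARD</td>
<td>-15,000</td>
<td> </td>
<td>-15,000</td>
</tr>
</html>
EOF

# Hypothetical helper: concatenates all text below a node in document
# order, so text inside nested tags such as <a> is kept without having
# to remove the tags first.
def flat_text(node)
  return node.value if node.is_a?(REXML::Text)
  return "" unless node.respond_to?(:children)
  node.children.map { |child| flat_text(child) }.join
end

doc = REXML::Document.new(html)

# map builds the outer and inner arrays as it goes; an empty array,
# where one is needed, can also be written as the literal [].
table = REXML::XPath.match(doc, "//tr[count(td)=5]").map do |row|
  REXML::XPath.match(row, "td").map { |cell| flat_text(cell).strip }
end

table.each { |row| puts row.join(":") }
```

Walking the children in Ruby avoids both the second parse of each row
and the repeated "//a" searches, at the cost of a few extra lines.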



Even better is some other scraping I do on the same page, where in each
case I only need a one-dimensional array:

# summaryarray and actsumarray were each created with Array.new
XPath.each( xml, "//td[@class='t3'][2]") { |cell|
        summaryarray << cell.texts.join }

XPath.each( xml,
        "//td[@colspan='4']/child::*") { |cell|
        actsumarray << cell.text.to_s }

Thanks again, Daz, for taking the time to look at my (first ever Ruby)
code.  Any other suggestions you could offer would be greatly
appreciated.

         - dan
--
Dan Kohn <mailto:dan / dankohn.com>
<http://www.dankohn.com/>  <tel:+1-415-233-1000>