Daz, thank you so much for taking the time to code that. I was also
busy today, and got my code working with REXML. Could you please take
a look at my code below and share your thoughts on whether you'd still
switch to htmltools.
The issue is that I'm creating a hundred different screen scrapers, one
for each frequent flyer program. Any scraper is, of course, brittle, but
it seemed to me that a DOM/XPath-based technique is both less likely to
break from small tweaks to the page and generally far more concise to
program. The downside, and it may be too big, is that my code is awfully
inefficient and also requires that tidy be run on the HTML before I
start.
Also, since you're taking a look, could you please tell me if there's
any more concise way to initialize my arrays? (Ruby generally seems to
figure out variables, but the code would only run if I explicitly used
Array.new.)
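To illustrate what I mean, here is a minimal sketch of the behaviour I ran into. The method and variable names are made up for this example and are not part of my scraper; I'm assuming the failure I saw was the usual NameError for a name that was never assigned.

```ruby
# Hypothetical helper names, for illustration only.

# Appending to a name that was never assigned: Ruby treats the bare
# name as a method call and raises NameError.
def append_without_init
  rowarray_missing << "9-Jan-05"
rescue NameError => e
  e.class
end

# The same append works once the array has been created explicitly.
def append_with_init
  rowarray = Array.new
  rowarray << "9-Jan-05"
  rowarray
end

p append_without_init
p append_with_init
```

So the explicit Array.new is what gives `<<` something to append to.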
require "rexml/document"
include REXML
string = <<EOF
<html>
<tr>
<td class="t4" nowrap="nowrap">9-Jan-05</td>
<td class="t4">OZ 0204 F
Class
<a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
ICN</a> to <a
href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
LAX</a></td>
<td class="t4" nowrap="nowrap">5,968</td>
<td class="t4" nowrap="nowrap">2,984</td>
<td class="t4" nowrap="nowrap">8,952</td>
</tr>
<tr>
<td class="t4" nowrap="nowrap">19-Jan-05</td>
<td class="t4">MILEAGE PLUS UPGRADE AWARD
15,000 MILES</td>
<td class="t4" nowrap="nowrap">-15,000</td>
<td> </td>
<td class="t4" nowrap="nowrap">-15,000</td>
</tr>
</html>
EOF
def remove_tag(rexml_array, tag)
  # Removes tag but leaves the text inside the tag as text inside
  # the parent of the now removed tag
  while rexml_array.elements["//#{tag}"]
    rexml_array.elements["//#{tag}"].replace_with(
      Text.new(rexml_array.elements["//#{tag}"].text.strip))
  end
end
# gsub rather than gsub!, because gsub! returns nil when nothing is replaced
doc = Document.new( string.gsub(/\n| /," "), {
  :compress_whitespace => :all } )
tablearray = Array.new
XPath.each( doc,"//tr[count(td)=5]") { |row|
  rowarray = Array.new
  rowdom = Document.new( row.to_s)
  XPath.each( rowdom,"//td") { |cell|
    remove_tag( cell,"a")
    # join the text nodes into one string
    rowarray << cell.texts.join
  }
  tablearray << rowarray
}
tablearray.each {|el| print el.join(":"),"\n"}
Even better is some other scraping I do on the same page, where in each
case I only need a one-dimensional array:
summaryarray = Array.new
XPath.each( xml, "//td[@class='t3'][2]") { |cell|
  summaryarray << cell.texts.join }
actsumarray = Array.new
XPath.each( xml, "//td[@colspan='4']/child::*") { |cell|
  actsumarray << cell.text.to_s }
Thanks again, Daz, for taking the time to look at my (first ever Ruby)
code. Any other suggestions you could offer would be greatly
appreciated.
- dan
--
Dan Kohn <mailto:dan / dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>