Dan Kohn wrote:
> [...]
> The issue is that I'm creating a hundred different screen scrapers for
> every frequent flyer program.  Any scraper is, of course, brittle, but
> it seemed to me like a DOM/XPath-based technique is both less likely to
> break from small tweaks to the page and is also generally far more
> concise to program.  The downside, and it may be too big, is that my
> code is awfully inefficient, and also requires that tidy be run on the
> HTML before I start.

Hi Dan,

Your code, IMHO, is inefficient because it uses 'industrial grade'
software for a lightweight task, not because of your coding.
I've run traces on REXML programs, and the amount of detailed work it
carries out is quite incredible (and necessary for its power).
Estimating conservatively, from timing and profiling of comparable
scripts, I'd say I could run 15 pages through 'tools' (htmltools) for
each one going through REXML ... probably as many as 30 ... and even
more while you're pre-processing with Tidy.

>
> Also, since you're taking a look, could you please tell me if there's
> any more concise way to initialize my arrays.  (Ruby generally seems to
> figure out variables, but this would only run if I explicitly used
> Array.new.)
>

That's not a factor in the speed :)
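
On conciseness: `[]` is the literal form of Array.new, and it's what the
script further down uses. Ruby still needs one or the other, because `<<`
is a method call and a local that has never been assigned isn't an Array
yet. A minimal sketch (the cell values are made up for illustration):

```ruby
rowarray = []              # literal form, same as Array.new
rowarray << 'a' << 'b'     # << needs an existing Array as its receiver

# Building with map avoids the explicit accumulator altogether.
cells = %w[x y z]          # hypothetical cell values
upper = cells.map { |c| c.upcase }

p rowarray
p upper
```

Whether map or an explicit accumulator reads better is a matter of taste;
neither is likely to change the running time noticeably.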

>
> Thanks again, Daz, for taking the time to look at my (first ever Ruby)
> code.  Any other suggestions you could offer would be greatly
> appreciated.
>
>          - dan

Glad to help.

Just one suggestion; your REXML experience won't be wasted --
don't hesitate to use REXML when it's needed (or at the weekends ;)
- it is /class/, as you know.
For this specific task, with speed being important, you need
a lighter package.  I've used only one for any length of time, so I
can't compare it with others.
Many folks would tackle this job with hand-parsing/regexps or with this:
http://raa.ruby-lang.org/project/htmltokenizer/  - which may offer
you even better performance.
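
For what it's worth, the hand-parsing/regexp route might look roughly like
the sketch below. It uses only core Ruby, and it assumes rows and cells are
properly closed and tables aren't nested. The markup and the name
scrape_rows are made up for illustration; only the cell values come from
the output shown further down.

```ruby
# Hypothetical markup; a real page will be messier than this.
html = <<HTML
<table>
<tr><td>9-Jan-05</td><td>OZ 0204 F Class&nbsp;ICN to LAX</td><td>5,968</td><td>2,984</td><td>8,952</td></tr>
<tr><td>19-Jan-05</td><td>MILEAGE PLUS UPGRADE AWARD 15,000 MILES</td><td>-15,000</td><td>&nbsp;</td><td>-15,000</td></tr>
</table>
HTML

# Returns an array of rows, each row an array of cell strings.
def scrape_rows(html)
  html.scan(/<tr\b[^>]*>(.*?)<\/tr>/mi).map do |(row)|
    row.scan(/<td\b[^>]*>(.*?)<\/td>/mi).map do |(cell)|
      # Drop any inner tags, then collapse whitespace and &nbsp; runs.
      cell.gsub(/<[^>]+>/, '').gsub(/(\s|&nbsp;)+/, ' ')
    end
  end
end

rows = scrape_rows(html)
rows.each { |r| puts r.join(":") }
```

This is fast but at least as brittle as any other scraper: an unclosed
<td> or a nested table will throw it off, which is where a tolerant
parser earns its keep.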


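# Script used for timing comparisons against your latest.
#--------------------------------------------------------
# Assumed require for htmltools; 'html/xpath' (see the BTW below) also
# loads the parser.
require 'html/tree'

# 'string' holds the page's HTML source.
exa = HTMLTree::Parser.new(verbose=true, ws=false)
exa.feed(string)  # replacing '.parse_file_named'

tablearray = []
# Walk each table row, then each cell within it.
exa.html.children.select {|e0| e0.tag == 'tr'}.each do |tr|
  rowarray = []
  tr.select {|e1| e1.tag == 'td'}.each do |td|
    data = ''
    # Collect only the text nodes of the cell.
    td.each do |item|
      data << item.to_s if item.data?
    end
    # Collapse runs of whitespace and &nbsp; into a single space.
    data.gsub!(/(\s|&nbsp;)+/, ' ')
    rowarray << data
  end
  tablearray << rowarray
end
tablearray.each {|el| puts el.join(":")}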
#-------------------------------------------------------------------
9-Jan-05:OZ 0204 F Class ICN to LAX:5,968:2,984:8,952
19-Jan-05:MILEAGE PLUS UPGRADE AWARD 15,000 MILES:-15,000: :-15,000
#-------------------------------------------------------------------


Cheers,

daz
-- 

BTW, 'tools' does a similar job to Tidy (outputting in REXML format!):

  require 'html/xpath'  # http://ruby-htmltools.rubyforge.org/
  exa = HTMLTree::Parser.new(verbose=false, strip_white=false)
  exa.feed(string)
  puts exa.tree.as_rexml_document