On 11/4/06, Jeff Wood <jeff / dark-light.com> wrote:
> Wow ... I am trying to use REXML to parse through an 8.8Mb xml file ...
> it currently eats almost 800Mb of ram before it seems to do anything ...
>
> Does anybody have any tips on getting REXML to run faster and/or smaller
> ???
>
> I know it's slow just because it's pure ruby ... and there's a lot going
> on ... but ... I can sit here for many minutes just waiting for ANY
> console output showing that it's actually gotten to the first
> root.elements.each( xpath_expr ) iteration ...
>
> Hints/Tips are/would be VERY much appreciated.

magic/xml has an extremely convenient stream parsing interface.
It's based on REXML, so it's pretty slow, but it handles XML files
hundreds of MBs in size using just a few MBs of memory.

The idea is simple - you give it a block, and the block
keeps getting called with incomplete subtrees. For each one,
the block can either complete the current subtree (read all of
its children into memory) or leave it incomplete and descend
into it.

It's something like:

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete! # Read all children of <page>...</page> node
  t = node[:@title] # :@title is a child
  i = node[:@id]    # :@id is another child
  print "#{i}: #{t}\n"
}

There is a short tutorial at http://zabor.org/taw/magic_xml/tutorial.html

I think subtree-based parsers are a great tradeoff between
the convenience of read-everything parsers and the low memory
use of stream-based parsers. Deciding inside a block seems
much more natural than predefining which tags to match (as
in Perl's XML::Twig).
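
For comparison, here is a rough sketch of the same tradeoff using
only REXML's own pull parser. This is not how magic/xml is
implemented; each_subtree is a hypothetical helper written for this
example, and it assumes the elements of interest have simple text
children (like <id> and <title> inside <page>). Only one small Hash
per matched element is built at a time, instead of a full document
tree.

```ruby
require 'rexml/parsers/pullparser'
require 'stringio'

# Hypothetical helper (not part of magic/xml or REXML): streams through
# the document and yields one Hash per <tag> element, mapping the names
# of its direct children to their text.
def each_subtree(io, tag)
  parser = REXML::Parsers::PullParser.new(io)
  record = nil  # Hash being filled while inside <tag>, nil otherwise
  path = []     # names of elements currently open inside <tag>
  while parser.has_next?
    event = parser.pull
    if event.start_element?
      if record
        path.push(event[0])          # event[0] is the element name
      elsif event[0] == tag
        record = {}                  # entering a subtree we care about
      end
    elsif event.end_element? && record
      if path.empty?
        yield record                 # </tag> reached: subtree complete
        record = nil
      else
        path.pop
      end
    elsif event.text? && record && path.size == 1
      # Text directly inside a direct child of <tag>; event[0] is the
      # raw text as it appears in the document.
      (record[path.first] ||= String.new) << event[0]
    end
  end
end

xml = <<XML
<pages>
  <page><id>1</id><title>Foo</title></page>
  <page><id>2</id><title>Bar</title></page>
</pages>
XML

each_subtree(StringIO.new(xml), 'page') do |page|
  puts "#{page['id']}: #{page['title']}"
end
```

Passing an open File instead of the StringIO should work the same
way. The cost compared to the twig interface is that the caller
has to fix the tag name up front rather than decide inside the
block.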

Enjoy :-)

-- 
Tomasz Wegrzanowski [ http://t-a-w.blogspot.com/ ]