On 11/4/06, Jeff Wood <jeff / dark-light.com> wrote:
> Wow ... I am trying to use REXML to parse through an 8.8Mb xml file ...
> it currently eats almost 800Mb of ram before it seems to do anything ...
>
> Does anybody have any tips on getting REXML to run faster and/or smaller
> ???
>
> I know it's slow just because it's pure ruby ... and there's a lot going
> on ... but ... I can sit here for many minutes just waiting for ANY
> console output showing that it's actually gotten to the first
> root.elements.each( xpath_expr ) iteration ...
>
> Hints/Tips are/would be VERY much appreciated.

magic/xml has an extremely convenient stream parsing interface. It's based
on REXML, so it's pretty slow, but it handles XML files hundreds of MB in
size using just a few MB of memory.

The idea is simple: you give it a block, and the block keeps receiving
incomplete subtrees. For each one, the block can either complete the
current subtree (reading all of its children into memory) or descend
into it.

It's something like:

    XML.parse_as_twigs(STDIN) {|node|
      next unless node.name == :page
      node.complete!     # Read all children of the <page>...</page> node
      t = node[:@title]  # :@title is a child
      i = node[:@id]     # :@id is another child
      print "#{i}: #{t}\n"
    }

There is a short tutorial at http://zabor.org/taw/magic_xml/tutorial.html

I think subtree-based parsers are a great tradeoff between the convenience
of read-everything parsers and the low memory use of stream-based parsers.
Deciding inside a block seems much more natural than predefining matched
tags (as in Perl's XML::Twig).

Enjoy :-)

-- 
Tomasz Wegrzanowski [ http://t-a-w.blogspot.com/ ]
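For comparison, here is a rough sketch of the same "one <page> at a time" idea using only REXML's own stream API (REXML::StreamListener and REXML::Document.parse_stream), with no magic/xml involved. This is not how magic/xml is implemented. The PageListener class, the each_page helper and the sample document are made up for illustration, and the sketch assumes that <id> and <title> are flat, direct children of <page>.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: collects the text of the child elements of each
# <page> and hands them to a block as soon as </page> is seen, so only
# one page is held in memory at a time.
class PageListener
  include REXML::StreamListener

  def initialize(&on_page)
    @on_page = on_page
    @page = nil   # fields of the <page> being read, nil outside a page
    @field = nil  # name of the child element currently open
  end

  def tag_start(name, _attrs)
    if name == "page"
      @page = {}
    elsif @page
      @field = name
      @page[name] = +""
    end
  end

  def text(data)
    # Text may arrive in several chunks, so append rather than assign.
    @page[@field] << data if @page && @field
  end

  def tag_end(name)
    if name == "page"
      @on_page.call(@page)
      @page = nil
    end
    @field = nil
  end
end

# Hypothetical helper: source may be a String or an IO such as STDIN.
def each_page(source, &block)
  REXML::Document.parse_stream(source, PageListener.new(&block))
end

sample = <<~XML
  <pages>
    <page><id>1</id><title>First</title></page>
    <page><id>2</id><title>Second</title></page>
  </pages>
XML

each_page(sample) { |page| print "#{page["id"]}: #{page["title"]}\n" }
```

The bookkeeping in tag_start/text/tag_end is exactly what a subtree-based interface saves you from writing by hand; passing STDIN instead of the sample string would process a large file without building the whole tree.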