[ugh, sent the last message before it was finished. sorry.]

Bryan Murphy <bryan / terralab.com> writes:
> In all honesty, I really don't think you should be processing 300,000 
> line XML files with any DOM-like XML interface.  There is a LOT of 
> overhead and wasted memory in the DOM tree, and for an XML file that big 
> it must be huge (not to even consider the time it takes to parse the XML 
> into the DOM tree).   Every time you access that information you're 
> scanning the contents of the ENTIRE XML document in memory.  With a 
> document that size this is bad bad bad!

The claim that I should have to give up a nice interface just because
the problem gets large seems fundamentally flawed to me.  Accepting
rules of thumb like that is surely one of the reasons why the XML
world sucks so much.

The files are only 13MB or so, which isn't even large by new PC
standards.

> What I think you really should be using is some sort of streaming parser 
> or pull based parser when you are dealing with documents of this 
> magnitude.  You can build parsers that are considerably faster (orders 
> of magnitude) and load data into much more compact (and applicable) data 
> structures in memory.  
> 
> Yes, I know that writing a state based streaming parser is a bit harder 
> than doing the same with REXML, but when you are dealing with this 
> magnitude of data the tradeoffs are worth it in the long run imho (and 
> building a good state based parser is a fun learning experience if 
> you've never done it before)!
> 
> Bryan

Easy now.  What I said was that I thought a REXML API backed by
libxml2 might be nice.  It's not as if my project is waiting for it,
or I would have written it by now.  Since the project couldn't wait, I
just implemented the parts where an XPath interface was convenient
using Perl and libxml2 instead.  I tried REXMLBuilder (uses
XMLParser), and the parsing is fast, but XPath is still too slow.
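
To make the tradeoff concrete, here is a rough sketch (not code from
my project) of the same question asked both ways in REXML: once with
XPath over a DOM tree, once with a stream listener.  The sample
document and the names SAMPLE_XML, names_via_xpath, NameCollector and
names_via_stream are made up for illustration; it says nothing about
speed, only about how much bookkeeping each style asks of you.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical sample data; the element and attribute names are invented.
SAMPLE_XML = <<~XML
  <inventory>
    <item id="a1"><name>widget</name></item>
    <item id="b2"><name>gadget</name></item>
  </inventory>
XML

# DOM + XPath: the query is one line, but the whole tree is built first.
def names_via_xpath(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//item/name').map(&:text)
end

# Streaming: no tree is built, but the listener tracks its own state.
class NameCollector
  include REXML::StreamListener
  attr_reader :names

  def initialize
    @names = []
    @path = [] # stack of currently open element names
  end

  def tag_start(name, _attrs)
    @path.push(name)
  end

  def tag_end(_name)
    @path.pop
  end

  # Keep text only when it sits directly inside <item><name>...</name></item>.
  def text(data)
    @names << data if @path.last(2) == %w[item name]
  end
end

def names_via_stream(xml)
  listener = NameCollector.new
  REXML::Document.parse_stream(xml, listener)
  listener.names
end
```

Both names_via_xpath(SAMPLE_XML) and names_via_stream(SAMPLE_XML)
should hand back the same list of names.  The second one is the
"state based parser" approach, and the hand-rolled path stack is
exactly the part I would rather not write for every query.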

To answer another post, I'll probably look at the Ruby libgdome
interface next.  As to whether I expect to be happy with it, I'll
just quote Sean Russell:

  "The extant XML APIs, in general, suck. They take a markup language
  which was specifically designed to be very simple, elegant, and
  powerful, and wrap an obnoxious, bloated, and large API around it."

Steve