
On 9/15/07, nutsmuggler <benini.davide / gmail.com> wrote:
>
> Hello folks.
> I managed to write an SGML parser with the Hpricot library. As I
> explained in a previous thread, I just need to compare source and
> target tags of translation memory files from IBM Translation manager.
> The script now runs effectively, but I realised that it cannot cope
> with large files; I tried to process a TM file larger than 1MB and the
> script took ages to generate the output. Should I switch to a compiled
> language for this specific task?
> At any rate, here is the script. It's very basic; please let me know
> if I did something wrong or if its slowness is a necessary drawback of
> Ruby being interpreted. Cheers,
> Davide



Davide,

This is not necessarily a result of Ruby being slow.  Hpricot is a DOM
parser, and also (by default) tries to fix up tags.  It reads the entire
file into memory and builds an internal tree structure out of it before
your search even starts.  The alternative is to use a SAX-based or
streaming parser, which hands you each tag as it is read.  This is what
happened in a part of Merb.  REXML's streaming parser was much faster than
Hpricot for the same job because it is, well, a streaming parser instead of
a DOM one.

It is my understanding that a streaming parser is best for large files, so
use one if you can.
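
As a rough sketch of the streaming approach, here is how it might look with
REXML's stream listener.  I am assuming the file is well-formed enough for an
XML parser and that each Source element is followed by a Target element
(your script uses "Source"; "Target", the SegmentListener class and the
sample data are made up for illustration).  A real SGML export may need
cleaning up first.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener: collects [source, target] pairs whose source text
# matches the pattern, without ever building a tree of the whole file.
class SegmentListener
  include REXML::StreamListener

  attr_reader :matches

  def initialize(pattern)
    @pattern = pattern
    @matches = []
    @current = nil   # name of the element we are inside, if we care about it
    @source  = nil   # text of the most recent Source element
    @buffer  = +''
  end

  def tag_start(name, _attrs)
    return unless %w[Source Target].include?(name)
    @current = name
    @buffer  = +''
  end

  def text(data)
    @buffer << data if @current
  end

  def tag_end(name)
    case name
    when 'Source'
      @source  = @buffer
      @current = nil
    when 'Target'
      @matches << [@source, @buffer] if @source && @source =~ @pattern
      @source  = nil
      @current = nil
    end
  end
end

# Assumed sample data; the element names other than Source are invented.
sample = <<~XML
  <tm>
    <Segment><Source>Restart the server</Source><Target>Riavviare il server</Target></Segment>
    <Segment><Source>Open the file</Source><Target>Aprire il file</Target></Segment>
  </tm>
XML

listener = SegmentListener.new(/server/)
REXML::Document.parse_stream(StringIO.new(sample), listener)
listener.matches.each { |src, tgt| puts "EN: #{src}\nIT: #{tgt}" }
```

For a real file you would pass File.open("yourfile") instead of the StringIO.
Only the current segment is held in memory, which is why this style tends to
cope better with large inputs.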

Cheers
Daniel





> #!/usr/local/bin/ruby
> require 'rubygems'
> require 'hpricot'
>
> $pattern = "server"
> result = File.new("result.html", "w")
> $stdout = result
> puts "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01//EN'
>         'http://www.w3.org/TR/html4/strict.dtd'>\n
> <head>\n
>         <meta http-equiv='content-type' content='text/html; charset=utf-8'>\n
> <title>Search for '#{$pattern}'</title>\n
> <style type='text/css'>
> body {
> }
> p {
>         margin: 0px;
> }
> p.source {
>         background: #FFFFCC;
>         padding: 10px 5px 10px 5px;
> }
> p.target {
>         background: #F8A271;
>         padding: 10px 5px 10px 5px;
> }
> span.pattern {
>         background: #B6B6B6;
> }
> </style>
> </head>\n
> <body>\n"
> # to read from stdin instead
> # doc = Hpricot.XML(STDIN)
>
>
> doc = Hpricot.XML(open("bch01aad006_MEMORIA.EXP"))
> doc.search("Source").each do |item|
>         if item.innerHTML =~ /#{$pattern}/
>                 highlightedSource = item.innerHTML.gsub(/#{$pattern}/, "<span class='pattern'>#{$pattern}</span>")
>                 puts "<p class='source'>EN: #{highlightedSource}</p>\n"
>                 puts "<p class='target'>IT: #{item.next_sibling.html}</p>\n
> <hr/>"
>         end
> end
> puts "</body>"