------ art_3324_14055986.1189862204126
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
On 9/15/07, nutsmuggler <benini.davide / gmail.com> wrote:
>
> Hello folks.
> I managed to write an SGML parser with the Hpricot library. As I
> explained in a previous thread, I just need to compare source and
> target tags of translation memory files from IBM Translation Manager.
> The script now works, but I realised that it cannot cope
> with large files; I tried to process a TM file larger than 1 MB and the
> script took ages to generate the output. Should I switch to a compiled
> language for this specific task?
> At any rate, here is the script; it's very basic. Please let me know
> if I did something wrong or if its slowness is a necessary drawback of
> Ruby being interpreted. Cheers,
> Davide
Davide,
This is not necessarily a result of Ruby being slow. Hpricot is a DOM
parser, and (by default) it also tries to fix up tags. It parses the
entire file into memory and builds an internal tree structure out of it. The
alternative is to use a SAX-based or streaming parser. This is what
happened in a part of Merb: the streaming parser, REXML, was much faster
than Hpricot for the same job because it is, well, a streaming parser instead
of a DOM one.
It is my understanding that a streaming parser is best for large files, so
use one of these if you can.
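
To make the streaming idea concrete, here is a rough sketch using REXML's stream listener. It is only an illustration, not a tested replacement for the script: the element name "Target" is an assumption (the script only names "Source" and takes its next sibling), the class and method names are made up for the example, and REXML expects well-formed XML, so a real SGML export may need cleaning up first.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects [source, target] text pairs whose source
# text matches the pattern, without building a tree of the whole file.
class SegmentListener
  include REXML::StreamListener

  attr_reader :matches

  def initialize(pattern)
    @pattern = pattern
    @matches = []
    @current = nil        # name of the element whose text is being collected
    @source  = nil        # text of the most recent <Source> element
    @buffer  = String.new
  end

  def tag_start(name, _attrs)
    return unless name == 'Source' || name == 'Target'
    @current = name
    @buffer  = String.new
  end

  def text(data)
    @buffer << data if @current
  end

  def tag_end(name)
    return unless name == @current
    if name == 'Source'
      @source = @buffer
    elsif @source
      # A <Target> closed: keep the pair only if its source matched.
      @matches << [@source, @buffer] if @source =~ @pattern
      @source = nil
    end
    @current = nil
  end
end

# Hypothetical helper: streams a file through the listener.
def search_file(path, pattern)
  listener = SegmentListener.new(pattern)
  File.open(path) { |io| REXML::Document.parse_stream(io, listener) }
  listener.matches
end
```

Something like search_file("bch01aad006_MEMORIA.EXP", /server/) would then return the matching pairs, which can be written out as HTML in the same way the script already does. Only one segment is held in memory at a time, apart from the matches themselves.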
Cheers
Daniel
> #!/usr/local/bin/ruby
> require 'rubygems'
> require 'hpricot'
>
> $pattern = "server"
> result = File.new("result.html", "w")
> $stdout = result
> puts "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01//EN'
> 'http://www.w3.org/TR/html4/strict.dtd'>\n
> <head>\n
> <meta http-equiv='content-type' content='text/html; charset=utf-8'>\n
> <title>Search for '#{$pattern}'</title>\n
> <style type='text/css'>
> body {
> }
> p {
> margin: 0px;
> }
> p.source {
> background: #FFFFCC;
> padding: 10px 5px 10px 5px;
> }
> p.target {
> background: #F8A271;
> padding: 10px 5px 10px 5px;
> }
> span.pattern {
> background: #B6B6B6;
> }
> </style>
> </head>\n
> <body>\n"
> # to read from stdin instead:
> # doc = Hpricot.XML(STDIN)
>
>
> doc = Hpricot.XML(open("bch01aad006_MEMORIA.EXP"))
> doc.search("Source").each do |item|
>   if item.innerHTML =~ /#{$pattern}/
>     highlightedSource = item.innerHTML.gsub(/#{$pattern}/, "<span class='pattern'>#{$pattern}</span>")
>     puts "<p class='source'>EN: #{highlightedSource}</p>\n"
>     puts "<p class='target'>IT: #{item.next_sibling.html}</p>\n
> <hr/>"
>   end
> end
> puts "</body>"
>
>
>
------ art_3324_14055986.1189862204126--