"Randy Kramer" <rhkramer / gmail.com> wrote in message
news:200503221157.21361.rhkramer / gmail.com...
> Robert,
>
> I want to thank you for all your help, it's like having a personal
> tutor!

You're welcome!  I'm glad I could help by sharing my experiences.

> Some feedback / observations below that don't really require any
> response.

Well, some comments below nevertheless...  :-)

> On Monday 21 March 2005 09:54 am, Robert Klemme wrote:
> > "Randy Kramer" <rhkramer / gmail.com> wrote in message
> > news:200503210919.00242.rhkramer / gmail.com...
> > > On Monday 21 March 2005 04:44 am, Robert Klemme wrote:
> > > > Some remarks:
> > > >  - The comparison between 5 and 6 does not seem fair, as you
> > > > iterate in 6 but not in 5.
>
> After some more testing, your remark seems more on target than I
> originally thought--I can account for almost all the 30x increase in
> required time (for 6) by the additional iterations (repeated invocations
> of the RE engine).  It's like the invocation is the expensive part, and
> whether it looks for a pattern at one point or scans the remainder of
> the(se short) strings is negligible.  (See the results of tests 6d and
> 6e below.)

Well, that clearly shows that simple scanning with a RE is superior to
iterating and then scanning.
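
To make the difference concrete, here is a minimal sketch of the two
approaches side by side.  The sample text and the simplified link pattern
are made up for illustration (they are not the test data or the RE from
test 6), and the timings will vary from machine to machine.

```ruby
require "benchmark"

# Hypothetical sample text and simplified patterns, for illustration only.
TEXT     = "some text [[LinkOne]] more text [[LinkTwo]] end " * 50
LINK     = /\[\[(\w+)\]\]/
ANCHORED = /\A\[\[(\w+)\]\]/

# Approach 1: visit every position, build a substring, invoke the RE there.
def links_by_iterating(text)
  found = []
  text.length.times do |i|
    next unless text[i, 1] == "["
    m = ANCHORED.match(text[i, text.length])
    found << m[1] if m
  end
  found
end

# Approach 2: a single call; the RE engine does the scanning itself.
def links_by_scanning(text)
  text.scan(LINK).flatten
end

t_iter = Benchmark.realtime { 20.times { links_by_iterating(TEXT) } }
t_scan = Benchmark.realtime { 20.times { links_by_scanning(TEXT) } }
puts format("iterating: %.4fs  scanning: %.4fs", t_iter, t_scan)
```

Both methods return the same list of link names; only the number of RE
invocations and temporary strings differs.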

> > A particular performance show stopper in test 6 is String#[] i.e. you
> > create a new String object for each test; object creation is
> > comparatively expensive even though Strings share their internal
> > buffer.  But the GC has to be informed etc. and this is quite some
> > overhead.  If you want fast code, create as few instances as possible.
> > The same holds for Java in 99% of all cases.
>
> I'm surprised that Ruby creates a new String object for each test--I
> would have hoped/thought that it was simply letting me "peek" at a
> portion of the existing string (especially since it's only a test).

The internal buffer (the characters) is shared but there is a new Ruby
instance each time you invoke String#[]:

>> 10.times { puts s1[2,4].id }
134979736
134979676
134979652
134979592
134979496
134979472
134979436
134979364
134979268
134979196
=> 10
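
On later Ruby versions (1.9 and up) Object#id is gone and object_id is
the equivalent.  A small sketch of the same check, with a made-up sample
string, that keeps all slices alive so their ids can be compared:

```ruby
s1 = "abcdefghij"   # hypothetical sample string

# Take the same slice ten times and keep every result alive.
slices = Array.new(10) { s1[2, 4] }

# Each call to String#[] produced a distinct object.
puts slices.map { |s| s.object_id }.uniq.length

a = s1[2, 4]
b = s1[2, 4]
puts a == b        # same characters
puts a.equal?(b)   # but not the same object
```

So the slices compare equal by content, yet each one is a separate
instance that has to be allocated and later collected.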

>  I presume that the
> StringScanner behaves more sanely in that respect, but I guess I'll find
> out. ;-)

Never used that myself but it's sure worth a try.

> Thanks for all of the following!  I did substitute them in test 6 to see
> what they would do.
>
> # old (6): ~18 seconds
> # with range (6a): didn't work, see below
> # with upto (6b): ~14 seconds
> # with times (6c): ~12 seconds
> #6d: ~0.75 seconds (This is the test that convinced me the iterations
> are the problem, I revised the (with times) program to only call the RE
> once, although it still scans only from the start of the string--I guess
> I should try test 6e with the RE not anchored.)
> #6e: ~0.6 seconds (Same as 6d, except I removed the \A anchor--and now
> I'm puzzled, how is this faster than the anchored version??  Anyway, at
> this time I don't care, I'll just "file it away" as a little anomaly to
> perhaps understand some day (and, as I haven't run the test multiple
> times or similar in an attempt to discount garbage collection, maybe
> that is the problem).)
>
> I did create new test programs (6a, 6b, 6c) but I haven't uploaded them
> to the TWiki--if you are really interested I can do that, but, as I say
> below, I'm not going to lose sleep over the problem with range.
>
> For some reason that I haven't figured out (yet?), the "with range"
> option didn't work.  I'm not going to lose sleep over it--I did try some
> troubleshooting, but it may be a rather subtle bug (or I have a very
> dense head).
>
> When I run it as part of a program (re_test_6a.rb), I get the following
> error messages:
>
> bash-2.05b$ re_test6a.rb
> /re_test6a.rb:40: Invalid char `\240' in expression
> /re_test6a.rb:41: Invalid char `\240' in expression
> ..
> /re_test6a.rb:66: Invalid char `\240' in expression
> bash-2.05b$

See comment below.

> When I simply copy the "individual loop" part of the code (i.e., the
> portion you show below under # with range) into IRB and run it (after
> defining the appropriate strings), I get the following (and get kicked
> out of IRB).  BTW, this is the result of attempting to paste the five
> lines into IRB as a group:
>
> irb(main):021:0>   (0...(s1.length-6)).each do |i|
> irb(main):022:1*    if s1[i] == ?[
> SyntaxError: compile error
> (irb):21: syntax error
>         from (irb):21
>         from (null):0
> bash-2.05b$

>> s1="a"*10
=> "aaaaaaaaaa"
>> (0...(s1.length-6)).each do |i|
?> if s1[i] == ?[
>> puts "yes"
>> end
>> end
=> 0...4

I guess this and the other syntax error above are caused by copying and
pasting some characters outside the ASCII range.  The `\240' in the first
error message is octal for byte 0xA0, which is a non-breaking space in
Latin-1.  I have experienced similar errors in the past.  Such characters
often look like ordinary whitespace, so you don't recognize them at first
sight.
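
One way to track such characters down is to list every line of the source
that contains a byte above 127.  This is only a sketch: the helper name
non_ascii_lines and the sample snippet are made up, and the bytes are
printed in octal to match the style of the error message.

```ruby
# Return [line_number, [octal escapes]] for each line with non-ASCII bytes.
def non_ascii_lines(source)
  hits = []
  source.each_line.with_index(1) do |line, lineno|
    bad = line.bytes.select { |b| b > 127 }
    hits << [lineno, bad.map { |b| format("\\%o", b) }] unless bad.empty?
  end
  hits
end

# Hypothetical pasted snippet: line 2 is indented with two 0xA0 bytes
# instead of ordinary spaces.
sample = "if s1[i] == ?[\n\xA0\xA0puts 'yes'\nend\n".b

non_ascii_lines(sample).each do |lineno, escapes|
  puts "line #{lineno}: #{escapes.join(' ')}"
end
```

For a real file you could pass in File.binread(path) with the path of the
script in question.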

> As I try to troubleshoot (by removing pieces from the loop), everything
> seems to work OK (and I'm learning what some of those pieces do ;-)
>
> Anyway, since I went this far, I have uploaded programs 6a thru 6e to
> the TWiki, but I am not requesting / suggesting that anyone try to spend
> time debugging 6a.
>
> http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests?
>
> > # old
> > i = 0
> > until i==s1.length-6 do
> >   if s1[i] == 91
> >     s1[i,s1.length] =~
> > /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
> >   end
> >   i += 1
> > end
> >
> > # with range
> > (0...(s1.length-6)).each do |i|
> >   if s1[i] == ?[
> >     s1[i,s1.length] =~
> > /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
> >   end
> > end
> >
> > # with upto
> > 0.upto(s1.length-7) do |i|
> >   if s1[i] == ?[
> >     s1[i,s1.length] =~
> > /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
> >   end
> > end
> >
> > # with times
> > (s1.length-6).times do |i|
> >   if s1[i] == ?[
> >     s1[i,s1.length] =~
> > /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
> >   end
> > end
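
As a sanity check, the four loop forms above should visit exactly the same
indexes.  A small sketch with a made-up string that records the indexes
instead of running the RE:

```ruby
s1 = "x" * 20            # hypothetical sample string
limit = s1.length - 6

# old: explicit counter
old = []
i = 0
until i == limit do
  old << i
  i += 1
end

# with range (exclusive upper bound)
with_range = (0...limit).to_a

# with upto (inclusive upper bound, hence s1.length-7)
with_upto = []
0.upto(s1.length - 7) { |j| with_upto << j }

# with times
with_times = []
limit.times { |j| with_times << j }

puts old == with_range && with_range == with_upto && with_upto == with_times
```

One difference worth noting: for a string shorter than 6 characters the
limit is negative, so the "old" form never reaches its exit condition,
while the other three simply do nothing.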
>
> For anyone following along ;-) my next efforts are going to be focused
> on StringScanner and then making the necessary substitutions.  In
> parallel I will probably try to refine the REs.

Please let me/us know how that works out.
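
For anyone following along, a minimal sketch of what the StringScanner
approach might look like, using strscan from the standard library.  The
sample text and the simplified link pattern are assumptions for
illustration, not the RE from the tests.

```ruby
require "strscan"

# Hypothetical sample text and simplified link pattern.
text = "see [[LinkOne]] and [[LinkTwo]] for details"
scanner = StringScanner.new(text)

links = []
# scan_until advances the scan position past the next match and returns
# nil once there are no more matches.
while scanner.scan_until(/\[\[(\w+)\]\]/)
  links << scanner[1]    # first capture group of the last match
end

p links
```

The scanner keeps a position into the one original string, so there is no
need to cut a new substring for every starting point.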

> The remainder of this looks useful as well!
>
> regards,
> Randy Kramer
>
> > Hm, if you know that the size of files is limited (i.e. something like
> > just a few KB) then it's usually worth slurping in the whole file with
> > something like this
> >
> > contents = File.open(f){|io| io.read}
> >
> > and then iterate through the whole thing with #scan.  You can still
> > use ^ to anchor at line beginnings.
> >
> > # get the initial sequence up to the first non-whitespace character
> > # just an example
> > contents.scan /^\s+\S/ do |m|
> >   p m[0]
> > end
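
A self-contained variant of the slurp-and-scan idea, as a sketch only: it
writes a small made-up file to a temporary location first, so the file
contents here are purely illustrative.

```ruby
require "tempfile"

leading = []
Tempfile.create("scan_demo") do |tmp|
  # Hypothetical file contents: two indented lines, two flush-left lines.
  tmp.write("top\n  two spaces\n\tone tab\nflush left\n")
  tmp.flush

  # Slurp the whole file, then let #scan walk through it in one go.
  contents = File.open(tmp.path) { |io| io.read }
  contents.scan(/^\s+\S/) { |m| leading << m }
end

p leading
```

Note that \s also matches newlines, so an empty line directly before an
indented one ends up inside the same match; [ \t]+ would restrict the
match to a single line if that matters.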

Kind regards

    robert