"Randy Kramer" <rhkramer / gmail.com> schrieb im Newsbeitrag news:200503221157.21361.rhkramer / gmail.com... > Robert, > > I want to thank you for all your help, it's like having a personal tutor! You're welcome! I'm glad I could help by sharing my experiences. > Some feedback / observations below that don't really require any response. Well, some comments below nevertheless... :-) > On Monday 21 March 2005 09:54 am, Robert Klemme wrote: > > "Randy Kramer" <rhkramer / gmail.com> schrieb im Newsbeitrag > > news:200503210919.00242.rhkramer / gmail.com... > > > On Monday 21 March 2005 04:44 am, Robert Klemme wrote: > > > > Some remarks: > > > > - The comparison between 5 and 6 does not seem fair, as you iterate > > in 6 > > > > but not in 5. > > After some more testing, your remark seems more on target than I originally > thought--I can account for almost all the 30x increase in required time (for > 6) by the additional iterations (repeated invocations of the RE engine). > It's like the invocation is the expensive part, and whether it looks for a > pattern at one point or scans the remainder of the(se short) strings is > negligible. (See the results of tests 6d and 6e below.) Well, that clearly shows that simple scanning with a RE is superior to iterating and then scanning. > > A particular performance show stopper in test 6 is String#[] i.e. you > > create a new String object for each test; object creation is comparatively > > expensive even though Strings share their internal buffer. But the GC has > > to be informed etc. and this is quite some overhead. If you want fast > > code, create as few instances as possible. The same holds for Java in 99% > > of all cases. > > I'm surprised that Ruby creates a new String object for each test--I would > have hoped/thought that it was simply letting me "peek" at a portion of the > existing string (especially since it's only a test). 

The internal buffer (the characters) is shared but there is a new Ruby
instance each time you invoke String#[]:

>> 10.times { puts s1[2,4].id }
134979736
134979676
134979652
134979592
134979496
134979472
134979436
134979364
134979268
134979196
=> 10

> I presume that the
> StringScanner behaves more sanely in that respect, but I guess I'll find
> out. ;-)

Never used that myself but it's sure worth a try.

> Thanks for all of the following! I did substitute them in test 6 to see what
> they would do.
>
> # old (6): ~18 seconds
> # with range (6a): didn't work, see below
> # with upto (6b): ~14 seconds
> # with times (6c): ~12 seconds
> #6d: ~0.75 seconds (This is the test that convinced me the iterations are the
> problem, I revised the (with times) program to only call the RE once,
> although it still scans only from the start of the string--I guess I should
> try test 6e with the RE not anchored.)
> #6e: ~0.6 seconds (Same as 6d, except I removed the \A anchor--and now I'm
> puzzled, how is this faster than the anchored version?? Anyway, at this time
> I don't care, I'll just "file it away" as a little anomaly to perhaps
> understand some day (and, as I haven't run the test multiple times or similar
> in an attempt to discount garbage collection, maybe that is the problem.)
>
> I did create new test programs (6a, 6b, 6c) but I haven't uploaded them to the
> TWiki--if you are really interested I can do that, but, as I say below, I'm
> not going to lose sleep over the problem with range.
>
> For some reason that I haven't figured out (yet?), the "with range" option
> didn't work. I'm not going to lose sleep over it--I did try some
> troubleshooting, but it may be a rather subtle bug (or I have a very dense
> head).
>
> When I run it as part of a program (re_test_6a.rb), I get the following error
> messages:
>
> bash-2.05b$ re_test6a.rb
> /re_test6a.rb:40: Invalid char `\240' in expression
> /re_test6a.rb:41: Invalid char `\240' in expression
> ..
> /re_test6a.rb:66: Invalid char `\240' in expression
> bash-2.05b$

See comment below.

> When I simply copy the "individual loop" part of the code (i.e., the portion
> you show below under # with range) into IRB and running it (after defining
> the appropriate strings), I get (and get kicked out of IRB) BTW, this is the
> result of attempting to paste the five lines into IRB as a group:
>
> irb(main):021:0> (0...(s1.length-6)).each do |i|
> irb(main):022:1* if s1[i] == ?[
> SyntaxError: compile error
> (irb):21: syntax error
> from (irb):21
> from (null):0
> bash-2.05b$

>> s1="a"*10
=> "aaaaaaaaaa"
>> (0...(s1.length-6)).each do |i|
?> if s1[i] == ?[
>> puts "yes"
>> end
>> end
=> 0...4

I guess this and the other syntax error above are caused by copying and
pasting some characters outside the ASCII range. I have experienced similar
errors in the past. Sometimes they look like whitespace characters so you
don't recognize them on first sight.

> As I try to troubleshoot (by removing pieces from the loop), everything seems
> to work OK (and I'm learning what some of those pieces do ;-)
>
> Anyway, since I went this far, I have uploaded programs 6a thru 6e to the
> TWiki, but I am not requesting / suggesting that anyone try to spend time
> debugging 6a.
>
> http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests?
>
> > # old
> > i = 0
> > until i==s1.length-6 do
> >   if s1[i] == 91
> >     s1[i,s1.length] =~
> > /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
> >   end
> >   i += 1
> > end
> >
> > # with range
> > (0...(s1.length-6)).each do |i|
> >   if s1[i] == ?[
> >     s1[i,s1.length] =~
> > /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
> >   end
> > end
> >
> > # with upto
> > 0.upto(s1.length-7) do |i|
> >   if s1[i] == ?[
> >     s1[i,s1.length] =~
> > /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
> >   end
> > end
> >
> > # with times
> > (s1.length-6).times do |i|
> >   if s1[i] == ?[
> >     s1[i,s1.length] =~
> > /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
> >   end
> > end
>
> For anyone following along ;-) my next efforts are going to be focused on
> StringScanner and then making the necessary substitutions. In parallel I
> will probably try to refine the REs.

Please let me/us know how that works out.

> The remainder of this looks useful as well!
>
> regards,
> Randy Kramer
>
> > Hm, if you know that the size of files is limited (i.e. something like
> > just a few KB) then it's usually worth slurping in the whole file with
> > something like this
> >
> > contents = File.open(f){|io| io.read}
> >
> > and then iterate through the whole thing with #scan. You can still use ^
> > to anchor at line beginnings.
> >
> > # get the initial sequence until the first non whitespace
> > # just an example
> > contents.scan(/^\s+\S/) do |m|
> >   p m[0]
> > end

Kind regards

    robert
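
The guess above about characters outside the ASCII range (the `\240' in the error messages is octal for byte 0xA0, which is a non-breaking space in Latin-1) can be checked mechanically. What follows is only a minimal sketch, assuming the suspect source is examined as raw bytes; the helper name non_ascii_bytes and the sample string are hypothetical and do not come from the thread.

```ruby
# Hypothetical helper (not from the thread): list every byte outside the
# ASCII range as [line number, byte offset within the line, byte value].
def non_ascii_bytes(source)
  hits = []
  source.each_line.with_index(1) do |line, lineno|
    line.each_byte.with_index(1) do |byte, col|
      hits << [lineno, col, byte] if byte > 127
    end
  end
  hits
end

# Made-up sample: two 0xA0 bytes posing as indentation on line 2.
# String#b gives a binary copy so the bytes are inspected as-is.
sample = "if s1[i] == ?[\n\xA0\xA0puts 'yes'\nend\n".b

non_ascii_bytes(sample).each do |lineno, col, byte|
  # Print the byte in octal, the notation the old parser used in its errors.
  printf("line %d, byte %d: \\%o\n", lineno, col, byte)
end
```

To check a real file instead of the inline sample, the contents could be read with File.binread(path) and passed to the same helper.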