"Randy Kramer" <rhkramer / gmail.com> schrieb im Newsbeitrag 
news:200503190900.45805.rhkramer / gmail.com...
> Thanks to all who replied so far.  I also want to look into the
> StringScanner approach (I'll reply separately with some questions about
> that), and I can't believe I couldn't find a way to delete the first
> character of a string.  Guess I am a newbie!
>
> On Saturday 19 March 2005 08:04 am, Robert Klemme wrote:
>> I'd bet that this approach is slower than a pure regexp based approach.
>
> So far, you're very right--my approach took about 30 times as long as the
> pure regexp approach, although my Ruby code might not be very efficient.
> (In case nobody noticed, I'm very much a newbie to Ruby.)
>
>> If you cannot stick all the exact regexps into one (see below) then
>> maybe some form of stripped regexps might help.  For example:
>
> This sounds like it's worth a try, but:
>   1) I haven't created all the necessary REs yet
>   2) Question below (for clarification)
>
>> rx1 = /ab+/
>> rx2 = /cd+/
>>
>> rx_all = /(ab+)|(cd+)/
>>
>> rx_stripped = /[ab](\w+)/
>
> Question: IIUC, the [ab] above should be [ac]?

Exactly.  My mistake.
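To make the two-stage idea concrete with the corrected [ac] class, here is a
rough sketch (the names RX_STRIPPED, TAILS and classify are just mine for
illustration, not an established API):

```ruby
# Sketch of the stripped-regexp idea: one cheap first pass finds the
# head character, then only the matching tail regexp runs on the rest.
RX_STRIPPED = /([ac])(\w+)/
TAILS = { "a" => /\Ab+/, "c" => /\Ad+/ }  # \A anchors at string start

def classify(str)
  m = RX_STRIPPED.match(str)
  return nil unless m
  head, rest = m[1], m[2]
  # Regexp#=~ returns a position (truthy) on success, nil on failure
  TAILS[head] =~ rest ? head : nil
end
```

As said, for these toy regexps rx_all is surely the faster choice; splitting
only pays off when the tail regexps are expensive.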

>> # then, use these on the second part
>> rx_stripped_1 = /^b+/
>> rx_stripped_2 = /^d+/
>>
>> This is just a simple example for demonstration.  For these simple
>> regexps rx_all is the most efficient one, I'm sure.
>
>> What does "fairly large" mean?  I would try to start with sticking *all*
>> these regexps into one - if the rx engine does not choke on that regexp
>> I'd assume that this is the most efficient way to do it, as then you have
>> the best ratio of machine code to Ruby interpretation.  Maybe you just
>> show us all these regexps so we can better understand the problem.
>
> It's hard even to guess; I intended to combine several REs into one anyway
> when they had a lot of commonality.  For example, the TWiki markup for
> headings (which I'm planning to use) is like this:
>
> ---* Level 1
> ---** Level 2
> ---*** Level 3
> ---**** Level 4
> ---***** Level 5
> ---****** Level 6
>
> I've planned to use one RE for all the above, then determine the level
> from the length of the match (like level = len - 3).

Sounds very reasonable.
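For what it's worth, that might look roughly like this - counting the stars
directly instead of subtracting 3 from the match length (heading_level is
just an illustrative name, and I haven't tested this against real TWiki
input):

```ruby
# One regexp for all six TWiki heading levels; the level is simply
# the number of stars after the "---" prefix.
HEADING = /\A---(\*{1,6})\s+(.*)/

def heading_level(line)
  m = HEADING.match(line)
  m && [m[1].length, m[2]]  # nil for non-heading lines
end
```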

> Likewise, "inline" markup is *for bold*, _for italic,_ __for bold
> italic__, and so forth.  I'd try to have one RE looking for words preceded
> by _, *, or __, and another with words ending with the same.  (And I might
> combine words marked with % for %TWikiVariables% as well.)

Seeing this lets me think of another approach: split the string into tokens
with a simple regexp and do the rest of the calculation on the tokens.  The
drawback, of course, is that this will not perform very well if the input
is large.  Hm...
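Roughly what I have in mind - a quick sketch only, with a deliberately
crude token regexp:

```ruby
# Split the input into alternating runs of markup characters (*, _, %)
# and plain text; classifying the marks then happens in plain Ruby.
MARKS = /[*_%]+|[^*_%]+/

def tokens(str)
  str.scan(MARKS)
end
```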

> With "optimizations" like this, I'd guess on the order of 15 or so
> regexps.
>
>> Now I'm getting really curious.  Care to post some more details?
>
> I presume you mean on the 1 to 10% savings?

Yeah, and on the rx's as well (you did show some of them already).

>  I planned to do that; I'll try to put something on WikiLearn this
> weekend and then post something here.

Some experiments:

>> str = "** caption \nfoo _italic_ asdasd __bold__ var=%var%"
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

>> str.scan /(^\s*\*+)|(([*_])\3*)(.*?)\2|(%\w+%)/m do |m| p m end
["**", nil, nil, nil, nil]
[nil, "_", "_", "italic", nil]
[nil, "__", "_", "bold", nil]
[nil, nil, nil, nil, "%var%"]
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

>> str.scan /(^\s*\*+)|(([*_])\3*)(.*?)\2|%(\w+)%/m do |m| p m end
["**", nil, nil, nil, nil]
[nil, "_", "_", "italic", nil]
[nil, "__", "_", "bold", nil]
[nil, nil, nil, nil, "var"]
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"
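A possible next step (my sketch, not part of the experiment above): decide
the token type by checking which capture group is non-nil.  I'm assuming
the TWiki mapping "*" = bold, "_" = italic, "__" = bold italic here:

```ruby
RX = /(^\s*\*+)|(([*_])\3*)(.*?)\2|%(\w+)%/m

# Map one scan result (array of capture groups) to a tagged token.
def tag(groups)
  stars, mark, _first, text, var = groups
  if    stars        then [:heading, stars.count("*")]
  elsif var          then [:variable, var]
  elsif mark == "__" then [:bold_italic, text]
  elsif mark == "*"  then [:bold, text]
  else                    [:italic, text]
  end
end

str = "** caption \nfoo _italic_ asdasd __bold__ var=%var%"
toks = str.scan(RX).map { |g| tag(g) }
p toks
# [[:heading, 2], [:italic, "italic"], [:bold_italic, "bold"], [:variable, "var"]]
```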

Kind regards

    robert