David A. Black wrote:

>Hi --
>
>On Sun, 19 Sep 2004, Charles Hixson wrote:
>
>  
>
>>I'm sure there must be a more idiomatic+efficient way to do this, but I 
>>can't figure it out.  Any suggestions?
>>Also, I'm not sure all of the tests are necessary.  Many of them were 
>>added to avoid "Nil class does not implement..." messages.  Is there a 
>>better approach?
>>
>># parse1 separates a chunk into the non-word stuff before it, the word
>>#    stuff, and the non-word stuff after it
>>#    word stuff is letters, digits, hyphens, and periods
>>    def    parse1(chunk)
>>        pb    =    /^([^-A-Za-z0-9]*)/
>>        pe    =    /([^-A-Za-z0-9]*)$/
>>        mtch    =    pb.match(chunk)
>>        a    =    mtch[0]
>>        mtch    =    pe.match(mtch.post_match)
>>        b    =    mtch.pre_match
>>        c    =    mtch[0]
>>        #print    " parse1:a  #{a.inspect} "            if    a    and    a.length > 0
>>        yield    a            if    a    and    a.length > 0
>>        #print    " parse1:b #{b.inspect} "            if    b    and    b.length > 0
>>        yield    b            if    b    and    b.length > 0
>>        #print    " parse1:c #{c.inspect} "            if    c    and    c.length > 0
>>        yield    c            if    c    and    c.length > 0
>>    end
>>    
>>
>
>The spacing got screwed up there, as you can see, but anyway --
>
>I believe that pre_match and post_match will always be empty strings,
>if there's no match, not nil.  So the "if a" test is not necessary (if
>I'm right).  However, calling #[] on the results of a match will raise
>an exception (trying to call #[] on nil) if there was no match, so you
>have to be careful with the "a = mtch[0]" line.
>
>I wonder also whether it's useful to yield only non-empty strings.
>The caller then has to test the strings to see which of the positions
>they're from.  It might be better to yield three things every time, so
>the caller knows what's being yielded.
>
>All of which leads me to this probably over-simplified code:
>
>  def parse1(chunk)
>    chunk.scan(/^(\W*)(\w+)(\W*)$/).flatten.each {|s| yield s}
>  end
>
>(I've used \W and \w where you'd need to use something more
>custom-made -- though I don't think your character classes do what you
>want, because they don't include periods.)
>David
>
That's a very interesting rewrite of the parse1 match patterns.
It's a bit more complicated than that, though.  E.g., at the match boundary 
apostrophes aren't part of the middle, but in the middle (of the 
middle) they are.  Think about 'don't'.  And periods aren't a legitimate 
part of most middle chunks, but sometimes they are.  This can't be 
resolved by simple matching, so I'm going to need to pre-process to 
identify known good values and replace them with something that will 
pass...and then convert them back afterwards.
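
A minimal sketch of that pre-process / convert-back idea, assuming a
hand-maintained list of known good tokens and a placeholder character that
never occurs in the input.  The names protect and restore, the token list,
and the placeholder are all made up for the illustration:

```ruby
# Replace the periods inside known good tokens with a placeholder so that a
# later matching step does not treat them as punctuation.
# The default token list and placeholder are hypothetical examples.
def protect(text, tokens = ["e.g.", "i.e.", "U.S."], placeholder = "\u0001")
  tokens.inject(text) do |acc, token|
    # A String pattern is matched literally by gsub, so "." is not a wildcard.
    acc.gsub(token) { token.tr(".", placeholder) }
  end
end

# Convert the placeholders back to periods once parsing is done.
def restore(text, placeholder = "\u0001")
  text.tr(placeholder, ".")
end

line = "See e.g. the U.S. case."
protected_line = protect(line)
puts restore(protected_line) == line
```

The word pattern would then have to count the placeholder as a word
character for the protected tokens to pass.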

If I could make the start and the end greedy, and the middle 
non-greedy (I need to check this!), I could drastically simplify 
parse1.  But my main question is really about the nested routines that 
yield values.  This looks messy, but it does provide reasonably easy 
extension.  (E.g., note the late addition of a routine to handle internal 
ellipses.  Again, I really should preprocess to convert an ellipsis into a 
single special character...and then return it to normal form later.)
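
A sketch of the greedy start / non-greedy middle / greedy end idea, keeping
the word characters from the original parse1 (letters, digits, hyphens).
It yields all three pieces every time, as David suggested, which is an
assumption about what the caller wants:

```ruby
def parse1(chunk)
  # Leading and trailing groups are greedy; the middle is lazy (.*?), so it
  # ends up holding whatever lies between the outer non-word runs.
  # Every part may be empty, so the pattern matches any string.
  parts = /\A([^-A-Za-z0-9]*)(.*?)([^-A-Za-z0-9]*)\z/m
  lead, word, trail = parts.match(chunk).captures
  yield lead, word, trail
end

parse1("'don't',") { |a, b, c| p [a, b, c] }
# prints ["'", "don't", "',"]
```

The apostrophes at the boundary land in the outer groups, while the one
inside the word stays in the middle.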

OTOH, if I could write the correct pattern, I could go back one step 
earlier, to where I originally break the chunks off the string with:
chunks  =  line.chomp.split
and replace it with something like
chunks   =   line.scan(/(\W*)(\w*)(\W*)\s+/).flatten

As you indicate, I'll need to use a much fancier pattern than the default 
\w and \W, if for no other reason than that \W includes spaces.  (Figuring 
out the proper pattern for the middle section will be QUITE an 
interesting endeavor!)
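
One possible shape for such a line-level pattern, as a sketch only.  It
keeps the word characters from parse1 (letters, digits, hyphens) and
excludes whitespace from every group so that scan can walk the whole line.
split_chunks is a made-up name, and the middle group is simply whatever
sits between the outer punctuation, not the fancier middle pattern:

```ruby
def split_chunks(line)
  # (?=\S)      start only where a chunk begins
  # group 1     greedy run of leading non-word, non-space characters
  # group 2     lazy run of non-space characters (the middle)
  # group 3     greedy run of trailing non-word, non-space characters
  # (?=\s|\z)   the chunk must end at whitespace or end of string
  pattern = /(?=\S)([^-A-Za-z0-9\s]*)(\S*?)([^-A-Za-z0-9\s]*)(?=\s|\z)/
  line.scan(pattern)
end

p split_chunks("(hello), 'don't' world!\n")
# prints [["(", "hello", "),"], ["'", "don't", "'"], ["", "world", "!"]]
```

Calling flatten on the result would give the flat list used above, at the
cost of losing the grouping into threes.
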
P.S.:  Does scan keep recycling its pattern, or would I need to replace 
it with (an elaboration of)
chunks   =   line.scan(/^((\W*)(\w*)(\W*)\s+)$/).flatten
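
A quick experiment that should settle this.  As far as I know, scan
reapplies the pattern starting from the end of the previous match, so no
outer repetition should be needed; with groups it returns one sub-array
per match.  Plain \w and \W are used here only for the demonstration:

```ruby
line = "a, b."
chunks = line.scan(/(\W*)(\w+)(\W*)/)

p chunks
# prints [["", "a", ", "], ["", "b", "."]]

p chunks.flatten
# prints ["", "a", ", ", "", "b", "."]
```

Note that the trailing \W* of the first match swallows the space as well
as the comma, which is the problem with the default classes.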