On Thu, Sep 15, 2011 at 10:32 PM, Ian Hobson <ian.hobson / ntlworld.com> wrote:
> On 15/09/2011 19:56, Pascua 9804 wrote:
>>
>> Thanks to all who tried to help.  Here's the final answer.
>>
>>
>> #!/usr/bin/env ruby
>> string="The quick brown fox jumped over the lazy dog"
>> def get_subsection(word, sentence)
>> sentence.scan(Regexp.new(/(?:\W{0,1}\w+\W){0,3}over(?:\W{1}\w+){0,3}/))
>> end
>>
>> puts get_subsection("quick", string)
>> puts get_subsection("lazy", string)
>> puts get_subsection("fox", string)
>> puts get_subsection("dog", string)
>>
>>
>> The regex in the middle of the syntax is where I struggled, but with a
>> little bit of help from the guru I was able to solve the problem.
>>
> I think that is a maintenance nightmare!
>
> As Jamie Zawinski said - Some people, when confronted with a problem, think
> "I know, I'll use regular expressions." Now they have two problems.

Nah.

> For large source texts it will be horribly slow, and memory hungry, and for
> large search lists it will slow down even more.  Huge, slow and hard to
> maintain = not good.

That entirely depends on the problem to solve and on the regexp
approach chosen.

> What the OP wanted was a sequence of 7 words, where the 4th is the word
> sought, and the string can be missing words "before" or "after" the source
> string.
>
> So you need two parallel lists of strings.  The first is a list of tokens
> from the source, where each token is separated from the next by white-space.
> The second is a list of words, created from the tokens by removing
> punctuation.

I'd work with a single list of alternating words and non-words.  That
should make generation of the combined matching sequence easier.
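
A minimal sketch of that single-list idea, assuming "word" means a run of
\w characters.  The helper name `tokens` and the sample text are made up
for illustration only.

```ruby
# Split text into alternating word / non-word tokens.
# `tokens` is a hypothetical helper name, not part of any library.
def tokens(text)
  text.scan(/\w+|\W+/)
end

toks = tokens("The quick, brown fox")
# toks is ["The", " ", "quick", ", ", "brown", " ", "fox"]

# Separators are kept in the list, so joining any slice restores the
# original fragment, punctuation included.
i = toks.index("brown")
fragment = toks[([i - 2, 0].max)..(i + 2)].join
# fragment is "quick, brown fox"
```

Because words and separators sit at alternating positions, one word of
context on each side is simply two list positions on each side.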

> Slide through the source, a token at a time, and if the fourth word of the
> word list is one of the ones you want, use the token list to reconstruct
> the fragment of the source (without newlines) and emit the result.
>
> In order to handle the start-up and close-down properly, I would consider
> preloading the token list with null strings, and arrange the "get next
> token" function to return three null strings after end of file, before
> signalling the end.
> However there are other methods.
>
> This is one pass, so you don't need the source all in memory. It runs in
> time proportional to the source size and in space proportional to the
> number of words sought.  Fast, compact and easy to alter the rule or
> length of the lists.

I find this simpler:

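# Yields each occurrence of any of the given words in s, together with
# up to three words of context on either side.
def word_scan(s, *words)
  return to_enum(:word_scan, s, *words) unless block_given?
  return if words.empty?

  s.scan(/\b#{Regexp.union words}\b/) do |wd|
    pre = $`    # text before the current match
    post = $'   # text after the current match
    yield pre[/(?:\w+\W+){0,3}\z/] + wd + post[/\A(?:\W+\w+){0,3}/]
  end
end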

s = "Robert likes green beans, girls with moustaches, and teddy bears. " \
    "John thinks Robert is strange"

puts 1
word_scan(s, "Robert") {|m| p m}

puts 2
word_scan(s, "green", "teddy") {|m| p m}
p word_scan(s, "green", "teddy").to_a

You can use it with or without a block: it follows the usual idiom of
returning an Enumerator if there is no block.  However, for really large
inputs your approach is likely better.
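
For reference, a rough sketch of the one-pass sliding-window approach
described in the quoted text.  The names `each_context`, `io`, `wanted`
and `radius` are invented here, and nil padding stands in for the "null
strings" at start-up and close-down.  It assumes tokens are separated by
whitespace and that punctuation is stripped before comparing.

```ruby
require "set"
require "stringio"

# Yields each wanted word with up to `radius` tokens on either side,
# reading the input line by line so it never has to be fully in memory.
def each_context(io, wanted, radius = 3)
  return to_enum(:each_context, io, wanted, radius) unless block_given?

  wanted = wanted.to_set
  size   = 2 * radius + 1
  window = Array.new(radius)              # nil padding before the first token

  feed = lambda do |tok|
    window << tok
    window.shift if window.size > size
    centre = window[radius]
    next unless window.size == size && centre
    # Match on the token with punctuation stripped, emit the original tokens.
    yield window.compact.join(" ") if wanted.include?(centre.gsub(/\W/, ""))
  end

  io.each_line { |line| line.split.each(&feed) }
  radius.times { feed.call(nil) }         # nil padding after the last token
end

text = StringIO.new("Robert likes green beans, girls with moustaches,\n" \
                    "and teddy bears. John thinks Robert is strange")
each_context(text, ["green", "teddy"]) { |fragment| p fragment }
```

The window never holds more than seven tokens, so memory use does not
grow with the input.  Fragments are rejoined with single spaces, so
newlines in the source are dropped.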

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/