On 6/10/07, Robert Klemme <shortcutter / googlemail.com> wrote:
> On 10.06.2007 18:25, Erwin Abbott wrote:
> > I ended up refactoring, but earlier I was parsing some text by
> > associating an array of attributes (like, [/a/, /b/, /c$/] that might
> > match the first 3 words) with a block that processed the matching
> > text, and then moved the position in the string forward by 3 words. I
> > wanted to be able to do this like:
>
> Did I understand that properly, you want to process three words at a
> time and then the next three words?  Then you could do...

I have already solved the problem, but maybe someone will find this
useful in the future. Basically we start with position=0 (position is
the index of the words array). Each "match" is tried until one
succeeds, and the position is incremented by the number of words it
operated on. So if I called match.call(/the/, /quick/, /brown/, /.*/,
/e$/), it would read 5 words starting at "position" and if all the
arguments matched the words, it would process the 5 words in some way
and then increment the position by 5.
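
As a rough sketch of the regex version (the helper name match_at and the
sample words are made up for illustration, this is not my actual code):

```ruby
# Hypothetical helper: returns the matched words if every pattern matches
# its respective word starting at `position`, otherwise nil.
def match_at(words, position, *patterns)
  slice = words[position, patterns.size]
  # nil when position is past the end, too short when we run out of words
  return nil if slice.nil? || slice.size < patterns.size
  slice.zip(patterns).all? { |word, rx| rx =~ word } ? slice : nil
end

words    = %w[the quick brown little mouse ran]
position = 0

if (matched = match_at(words, position, /the/, /quick/, /brown/, /.*/, /e$/))
  # "process" the five words in some way here, then step past them
  position += matched.size
end
```

After this runs, position points at "ran", the first word the rule did
not consume.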

In my application I'm not really using regexes, though: my words are
tokens with various tags, and I'm matching based on the tags. This is
all being used to parse date strings. For example, "Wed Aug 5th 2008"
might be matched by a rule like match.call(:weekday, :month, :ordinal,
:year). Then there might be another rule like match.call(:num, :num,
:year) that would match "05 05 2005" and would decide how to parse it.

> It's still unclear to me how exactly you want the matching to work.  Are
> all your "attributes" matched against all three words?  Do you want
> positional matches?  In the code all rx's are matched against words in
> the same position and if all match the block is invoked on the words.

Basically you have it right, the words have to match their
/respective/ attribute. But it's not a fixed number of words at a
time, because match.call(/the/) would only match one word (then
process it, then increment the position index by one).

Initially (it was late at night, mind you) I thought having a closure
would work nicely because I could access position, words, and some
other variables in the caller's scope and wouldn't have to pass those
along every time. But it was too tricky/messy because I also needed to
restart at the beginning of the loop after a success (to start trying
all the patterns again), and I needed to know if anything had matched
(so I could increment position by 1, else have an infinite loop).
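
To make the closure idea concrete, here is a minimal sketch of roughly
what I had in mind. The variable names, the log array and the two rules
are invented for the example; it only shows the control flow (first
success wins, then start over; skip one word if nothing matched):

```ruby
words    = %w[the quick brown fox]
position = 0
log      = []   # records what each rule "processed", for illustration only

# The lambda closes over words and position, so neither is passed in.
match = lambda do |*patterns, &process|
  slice = words[position, patterns.size]
  next false unless slice.size == patterns.size &&
                    slice.zip(patterns).all? { |w, rx| rx =~ w }
  process.call(*slice)
  position += slice.size
  true
end

while position < words.size
  # Rules are tried in order; the first success wins and the loop
  # starts over from the first rule at the new position.
  matched =
    match.call(/quick/, /brown/) { |a, b| log << "pair:#{a} #{b}" } ||
    match.call(/fox/)            { |a|    log << "single:#{a}" }

  # Nothing matched: skip one word, otherwise this would loop forever.
  position += 1 unless matched
end
```

It works, but the bookkeeping (the || chain, the "did anything match"
flag) is exactly the part that got messy once there were many rules.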

What I ended up doing was having a function to store the list of
attributes and the block that should be called to "process" the
matching words, and then another function that began scanning the word
list from position=0, testing all the attributes (like match.call
would've), and taking care of incrementing the position index the
right amount.  Here are parts of the code:

  def self.date_scanner *tags, &block
    @@date_scanners << [tags, block]
  end

  def self.setup_date_scanners
    @@date_scanners = []

    date_scanner(NLTime::Day, :time, :tz) do |d, t|
      # two timezones were given, like 12:30:00 -0400 (EDT); ignore
      # rightmost one
      d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
    end

    date_scanner(NLTime::Day, :time) do |d, t|
      d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
    end

    date_scanner(:time, NLTime::Day) do |t, d|
      d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
    end

    date_scanner(:month, :num, :time, :year) do |m, a, t, y|
      # May 05 12:00:00 -0000 2005
      day = NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
      day.time(t.get_tag(NLTime::Time))
    end

    date_scanner(:year, :num, :num) do |y, a, b|
      # 2005 05 05
      NLTime::Day.civil(b.word, a.word, y.get_tag(NLTime::Year))
    end

    date_scanner(:year, :month, :num) do |y, m, a|
      # 2005 May 05
      NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
    end

    date_scanner(:month, :num, :year) do |m, a, y|
      # May 05 2005
      NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
    end

    # ...
  end

  def self.scan_dates tokens, order=:dm
    # TODO:
    # order=:dm assume day/month like european format
    # order=:md assume month/day like american format

    # processed tokens
    ptokens = []; k = 0

    while k < tokens.size
      found = false

      @@date_scanners.each do |tags, block|
        if s = tokens[-tags.size-k..-1-k]
          # assume success until one of the tags doesn't match
          found = true

          # match tags to tokens
          s.zip(tags).each do |token, tag|
            unless token.has_tag? tag
              # not a match... next scanner, please
              found = false
              break
            end
          end

          if found
            # this scanner matches, have the tokens processed
            if date = block.call(*s)
              token = NLTime::Token.new(date.to_s, :entity, date)
              ptokens.unshift token

              # increment the position by number of tokens processed
              # by the block
              k += tags.size

              # don't try to match any more scanners
              break
            else
              # the block failed, try the next scanner
              found = false
            end

          end
        end
      end

      unless found
        # none of the scanners matched
        ptokens.unshift tokens[-1-k]
        k += 1
      end
    end

    ptokens
  end

The scan_dates method operates on an array of NLTime::Tokens, which have
various tags. The tags can be symbols, which basically categorize
words (like "Jan" would have :month tag), or they can be objects (like
we might have tagged 2005 with a NLTime::Year object representing the
year 2005).  This should "replace" sequences of tokens that were
matched by a scanner with a new token, tagged with an instance of
NLTime::Day or Time.
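
To illustrate the two kinds of tags, here is a toy stand-in. It is not
the real NLTime::Token: the Token and Year classes below, and the way
has_tag? and get_tag behave, are assumptions made just for this example.

```ruby
# Toy stand-in for a tagged token (hypothetical, for illustration only).
class Token
  attr_reader :word, :tags

  def initialize(word, *tags)
    @word = word
    @tags = tags
  end

  # A symbol tag matches by equality; a class tag matches any tag object
  # that is an instance of that class.
  def has_tag?(tag)
    tags.any? { |t| tag.is_a?(Class) ? t.is_a?(tag) : t == tag }
  end

  # Returns the first tag object of the given class, or nil.
  def get_tag(klass)
    tags.find { |t| t.is_a?(klass) }
  end
end

Year = Struct.new(:value)   # stand-in for a year object

tokens = [
  Token.new("May",  :month),
  Token.new("05",   :num),
  Token.new("2005", :year, Year.new(2005))
]

# Each token has to carry its respective tag, as in the scanner loop.
rule    = [:month, :num, :year]
matches = tokens.zip(rule).all? { |token, tag| token.has_tag?(tag) }
year    = tokens.last.get_tag(Year).value
```

Here the :year symbol only says "this word looks like a year", while the
Year object carries the parsed value the block can use.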

> I still think you're not yet there.

Well, my code does what I want it to do... so I'm not sure what you mean?

> Kind regards
>
> 	robert