On 6/10/07, Robert Klemme <shortcutter / googlemail.com> wrote:
> On 10.06.2007 18:25, Erwin Abbott wrote:
> > I ended up refactoring, but earlier I was parsing some text by
> > associating an array of attributes (like, [/a/, /b/, /c$/] that might
> > match the first 3 words) with a block that processed the matching
> > text, and then moved the position in the string forward by 3 words. I
> > tried wanted to be able to do this like:
>
> Did I understand that properly, you want to process three words at a
> time and then the next three words? Then you could do...

I have already solved the problem, but maybe someone will find this
useful in the future. Basically we start with position=0 (position is
the index into the words array). Each "match" is tried until one
succeeds, and the position is then incremented by the number of words it
operated on. So if I called match.call(/the/, /quick/, /brown/, /.*/,
/e$/), it would read 5 words starting at "position", and if every
argument matched its word, it would process the 5 words in some way and
then increment the position by 5.

In my application I'm not really using regexes though; my words are
tokens with various tags, and I'm matching based on the tags. This is
all being used to parse date strings: "Wed Aug 5th 2008" might be
matched by a rule like match.call(:weekday, :month, :ordinal, :year),
for example. Then there might be another rule like match.call(:num,
:num, :year) that would match "05 05 2005" and would decide how to parse
it.

> It's still unclear to me how exactly you want the matching to work. Are
> all your "attributes" matched against all three words? Do you
> positional matches? In the code all rx's are matched against words in
> the same position and if all match the block is invoked on the words.

Basically you have it right, the words have to match their /respective/
attribute.
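To make that positional matching concrete, here is a minimal sketch
using plain strings and regexes rather than tagged tokens. The names
RULES, rule, and scan are made up for this illustration and are not part
of the real code; it assumes a rule applies only when every pattern
matches the word at its own offset from the current position.

```ruby
# Hypothetical names: RULES, rule and scan exist only for this sketch.
RULES = []

# Store a list of patterns together with the block that processes
# the words they match.
def rule(*patterns, &block)
  RULES << [patterns, block]
end

rule(/the/, /quick/, /brown/) { |*ws| ws.join("-") }
rule(/^the$/)                  { |w| w.upcase }

def scan(words)
  out = []
  pos = 0
  while pos < words.size
    # Find the first rule whose patterns each match the word at the
    # same offset, starting at pos.
    matched = RULES.find do |patterns, _|
      slice = words[pos, patterns.size]
      slice.size == patterns.size &&
        slice.zip(patterns).all? { |word, rx| word =~ rx }
    end

    if matched
      patterns, block = matched
      out << block.call(*words[pos, patterns.size])
      pos += patterns.size   # skip the words this rule consumed
    else
      out << words[pos]      # nothing matched: pass the word through
      pos += 1               # and advance, so the loop always terminates
    end
  end
  out
end
```

With these two rules, scan(%w[the quick brown fox jumps over the lazy
dog]) consumes three words with the first rule, one word with the second
rule further along, and passes every unmatched word through unchanged.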
But it's not a fixed number of words at a time, because match.call(/the/)
would only match one word (then process it, then increment the position
index by one).

Initially (it was late at night, mind you) I thought having a closure
would work nicely because I could access position, words, and some other
variables in the caller's scope and wouldn't have to pass those along
every time. But it was too tricky/messy because I also needed to restart
at the beginning of the loop after a success (to start trying all the
patterns again), and I needed to know if anything had matched (so that
when nothing did, I could increment position by 1 instead of looping
forever).

What I ended up doing was having a function to store the list of
attributes and the block that should be called to "process" the matching
words, and then another function that began scanning the word list from
position=0, testing all the attributes (like match.call would've), and
taking care of incrementing the position index the right amount. Here's
parts of the code:

  def self.date_scanner *tags, &block
    @@date_scanners << [tags, block]
  end

  def self.setup_date_scanners
    @@date_scanners = []

    date_scanner(NLTime::Day, :time, :tz) do |d, t|
      # two timezones were given, like 12:30:00 -0400 (EDT); ignore rightmost one
      d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
    end

    date_scanner(NLTime::Day, :time) do |d, t|
      d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
    end

    date_scanner(:time, NLTime::Day) do |t, d|
      d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
    end

    date_scanner(:month, :num, :time, :year) do |m, a, t, y|
      # May 05 12:00:00 -0000 2005
      day = NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
      day.time(t.get_tag(NLTime::Time))
    end

    date_scanner(:year, :num, :num) do |y, a, b|
      # 2005 05 05
      NLTime::Day.civil(b.word, a.word, y.get_tag(NLTime::Year))
    end

    date_scanner(:year, :month, :num) do |y, m, a|
      # 2005 May 05
      NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
    end

    date_scanner(:month, :num, :year) do |m, a, y|
      # May 05 2005
      NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
    end

    # ...
  end

  def self.scan_dates tokens, order=:dm
    # TODO:
    #   order=:dm assume day/month like european format
    #   order=:md assume month/day like american format

    # processed tokens
    ptokens = []
    # k counts the tokens already consumed, measured from the end of
    # the list: the slices below walk the tokens right to left
    k = 0

    while k < tokens.size
      found = false

      @@date_scanners.each do |tags, block|
        # nil when fewer than tags.size tokens remain
        if s = tokens[-tags.size-k..-1-k]
          # assume success until one of the tags doesn't match
          found = true

          # match tags to tokens
          s.zip(tags).each do |token, tag|
            unless token.has_tag? tag
              # not a match... next scanner, please
              found = false
              break
            end
          end

          if found
            # this scanner matches, have the tokens processed
            if date = block.call(*s)
              token = NLTime::Token.new(date.to_s, :entity, date)
              ptokens.unshift token

              # increment the position by number of tokens processed by the block
              k += tags.size

              # don't try to match any more scanners
              break
            else
              # the block failed, try the next scanner
              found = false
            end
          end
        end
      end

      unless found
        # none of the scanners matched
        ptokens.unshift tokens[-1-k]
        k += 1
      end
    end

    ptokens
  end

scan_dates operates on an array of NLTime::Tokens, which have various
tags. The tags can be symbols, which basically categorize words (like
"Jan" would have the :month tag), or they can be objects (like we might
have tagged 2005 with an NLTime::Year object representing the year
2005). This should "replace" each sequence of tokens that was matched by
a scanner with a new token, tagged with an instance of NLTime::Day or
Time.

> I still think you're not yet there.

Well, my code does what I want it to do... so I'm not sure what you
mean?

> Kind regards
>
> robert