On 9/1/05, Morgan <taria / the-arc.net> wrote:
> Okay. The thing making this difficult is handling things that span across
> tags. Running a gsub that matches something entirely within a single
> tag won't produce problems, nor will reversing it, nor will anything else
> you do to it that I can think of. Matching a pattern across tags I'm
> pretty sure can be done, but it'll probably be a pain to do, and I'm starting
> to wonder if there's any point to it. Substitution across tags is probably
> doable if you can solve the pattern matching problem, but how do you
> decide sensibly what ends up in what tag?

A solution, similar to that employed by ncurses and many other UI
systems, is to use the concept of an extended character. Each
character in the string is flagged with applicable attributes.
Translating marked up ASCII to a list of extended characters is easy
enough: maintain a bitmask of attributes and turn them on/off as you
encounter tags; apply the current bitmask to each character
encountered.
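
A minimal sketch of that decoding step, to make it concrete. The names
ExtChar and TAG are made up for illustration, the flags are kept in an
array of symbols rather than a bitmask, and the only markup assumed is
the <C name> ... </C> form used in this thread (properly nested):

```ruby
# Hypothetical extended character: one plain character plus the
# attribute flags that were active when it was read.
ExtChar = Struct.new(:ascii_char, :flags)

# Matches an opening tag such as "<C red>", a closing "</C>", or any
# single character (tried in that order at each position).
TAG = /<C (\w+)>|(<\/C>)|(.)/m

def decode(markup)
  active = []   # flags currently switched on, innermost last
  chars  = []
  markup.scan(TAG) do |open_name, close, char|
    if open_name
      active.push(open_name.to_sym)   # tag opened: turn the flag on
    elsif close
      active.pop                      # tag closed: turn the innermost flag off
    else
      # ordinary character: stamp it with a copy of the active flags
      chars << ExtChar.new(char, active.dup)
    end
  end
  chars
end

decoded = decode("A <C red>red</C> bat")
puts decoded.map(&:ascii_char).join   # the text with all markup removed
```

Each character gets its own copy of the flag list, so later edits to one
character's flags cannot leak into its neighbours.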

Translating back from an extended character string to ASCII markup can
be accomplished with an algorithm like the following (I'm using an
array instead of a bitmask for readability):

def encode( extended_chars, start_flags=[], clean=false )
  current_flags = start_flags
  encoded_ascii = ''
  extended_chars.each do |char|
    # close tags for flags that are no longer active
    (current_flags - char.flags).each do |flag|
      encoded_ascii << flag.close_tag
    end
    # open tags for flags that have just become active
    (char.flags - current_flags).each do |flag|
      encoded_ascii << flag.open_tag
    end
    current_flags = char.flags
    encoded_ascii << char.ascii_char
  end
  # when clean is set, close whatever is still open so the output is balanced
  if clean
    current_flags.each do |flag|
      encoded_ascii << flag.close_tag
    end
    current_flags = []
  end
  # return the markup plus the flags left open, for chaining calls
  return encoded_ascii, current_flags.clone
end

Ideally, the list of extended characters would be encapsulated in an
object that acts like a string (implementing gsub, reverse, etc.). The
operations would rearrange/remove individual extended characters from
the object without changing any of the flags associated with any one
character.
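
A rough sketch of what such a wrapper might look like. Everything here
(ExtChar, ExtString, the ext helper) is a hypothetical name, and only
reverse and a first-match sub are shown; a real gsub would have to loop
over all matches and handle backreferences:

```ruby
# Hypothetical extended character: a plain character plus its flags.
ExtChar = Struct.new(:ascii_char, :flags)

class ExtString
  attr_reader :chars

  def initialize(chars)
    @chars = chars
  end

  # Plain text with the attributes ignored; this is what patterns see.
  def to_s
    @chars.map(&:ascii_char).join
  end

  # Reorders the characters; each one keeps its own flags.
  def reverse
    ExtString.new(@chars.reverse)
  end

  # Replaces the first match of pattern with another ExtString. The
  # match offsets found in the plain text index straight into @chars.
  def sub(pattern, replacement)
    m = pattern.match(to_s)
    return ExtString.new(@chars.dup) unless m
    before = @chars[0...m.begin(0)]
    after  = @chars[m.end(0)..-1]
    ExtString.new(before + replacement.chars + after)
  end
end

# Helper (hypothetical): extended characters that all share the given flags.
def ext(text, *flags)
  text.chars.map { |c| ExtChar.new(c, flags) }
end

source  = ExtString.new(ext("A ") + ext("red", :red) + ext(" and ") +
                        ext("blue", :blue) + ext(" bat."))
ostrich = ExtString.new(ext("os", :red) + ext("tri") + ext("ch", :green))
result  = source.sub(/d and bl/, ostrich)
puts result.to_s   # prints "A reostrichue bat."
```

Because the pattern is matched against the plain text and the splice is
done on the character array, a match can span any number of tags and the
surviving characters keep exactly the flags they had before.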

As an example application, your string would decode as follows:

something = decode("A <C red>red</C> and <C blue>blue</C> baseball bat.")
# => A, ' ', r|red, e|red, d|red, ' ', a, n, d, ' ', b|blue, l|blue,
#    u|blue, e|blue, ' ', b, a, s, ...

The regex /red and blue/ would match this substring:
# r|red, e|red, d|red, ' ', a, n, d, ' ', b|blue, l|blue, u|blue, e|blue

That substring is replaced with the replacement text, which carries no
flags (since it wasn't marked up):
# o, s, t, r, i, c, h

And the result is:
# A, ' ', o, s, t, r, i, c, h, ' ', b, a, s, ...

Obviously, no part is red or blue. Assume we'd actually marked up "ostrich" as
#  "<C red>os</C>tri<C green>ch</C>"
# => o|red, s|red, t, r, i, c|green, h|green

And matched against the shorter substring "d and bl"; then the result would be:
# A, ' ', r|red, e|red, o|red, s|red, t, r, i, c|green, h|green,
# u|blue, e|blue, ' ', b, a, s, ...
# => "A <C red>reos</C>tri<C green>ch</C><C blue>ue</C> baseball bat."

Jacob Fugal

DISCLAIMER: None of the above is intended to be complete, bug-free or
efficient. An actual implementation would need all of those. This is
just meant to be an example algorithm that would make the discussed
operations possible.