Hi,

Mikel Lindsaar wrote:
> I don't really want to set the regexp to UTF-8 or something and then
> transliterate the match strings as that just isn't going to scale I
> think when you are talking about emails which can have almost anything
> in them, and making a regexp for every encoding type also isn't the
> solution.

You should set the regexp's encoding to US-ASCII or ASCII-8BIT.

> The only other solution I can think of is going through TMail and
> making all encodings internal to TMail one type (say UTF-8) and then
> transliterating all input and output to match.  But I am not totally
> sure what I will run into on that, as while I understand some of the
> issues of encodings and charactersets, I am by no means an expert on
> the subject.

Yes, you should make all encodings internal to TMail one type: ASCII-8BIT.
When you do
  str.gsub(/\n|\r\n|\r/) { "\r\n" }
you may think you are working with a character string.
But that is wrong; you are working with a byte string.
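
To illustrate the point about the gsub above, here is a minimal sketch of
doing that substitution explicitly on bytes. The helper name
normalize_newlines and the Shift_JIS sample are my own assumptions for
illustration and are not part of TMail. It also assumes an encoding in which
the CR and LF bytes never occur inside a multibyte character (so not
UTF-16 or UTF-32).

```ruby
# Hypothetical helper (not a TMail API): normalize line endings on bytes.
def normalize_newlines(str)
  # Work on a copy that is labelled as raw bytes.
  bytes = str.dup.force_encoding(Encoding::ASCII_8BIT)
  # The regexp contains only ASCII, so it matches a binary string as-is.
  bytes = bytes.gsub(/\n|\r\n|\r/) { "\r\n" }
  # Put the original encoding label back on the result.
  bytes.force_encoding(str.encoding)
end

# Two Shift_JIS characters, one followed by LF and one by a bare CR.
sjis = "\x82\xA0\n\x82\xA2\r".dup.force_encoding(Encoding::Shift_JIS)
out = normalize_newlines(sjis)
```

The input string is left untouched because the helper works on a copy;
only the returned string has CRLF line endings.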

As you said before, you do not want to make a regexp for every encoding type.
Having to handle every encoding type means you are working at the level
below characters: bytes.

So set the encoding to ASCII-8BIT before working with bytes,
and set a suitable encoding before working with characters.

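require 'nkf'
str = NKF.nkf("-j", "\u{3042 3044 3046}")  # convert to JIS (ISO-2022-JP)
enc = str.encoding                          # remember the character encoding
str.force_encoding(Encoding::ASCII_8BIT)    # from here on, work with bytes
true if /\A\e/ =~ str                       # byte-level test for a leading ESC
str.force_encoding(enc)                     # back to working with characters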
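
The same save/switch/restore pattern could be wrapped in a small helper so
the original encoding is restored even if the byte-level code raises. This is
only a sketch: with_bytes is a hypothetical name, not something TMail or Ruby
provides, and the sample string is built by hand from the ISO-2022-JP bytes
so that it does not depend on NKF.

```ruby
# Hypothetical helper: yield the string viewed as raw bytes, then restore
# its original encoding label, even if the block raises.
def with_bytes(str)
  enc = str.encoding
  str.force_encoding(Encoding::ASCII_8BIT)
  yield str
ensure
  str.force_encoding(enc) if enc
end

# ESC $ B, three two-byte JIS X 0208 characters, ESC ( B
jis = [0x1B, 0x24, 0x42, 0x24, 0x22, 0x24, 0x24, 0x24, 0x26,
       0x1B, 0x28, 0x42].pack("C*")
jis.force_encoding(Encoding::ISO_2022_JP)

# Byte-level check inside the block; character encoding restored afterwards.
starts_with_escape = with_bytes(jis) { |bytes| !!(/\A\e/ =~ bytes) }
```

The helper changes the encoding label in place rather than copying, which
matches the snippet above; use str.dup first if the caller must not see the
temporary relabelling.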


-- 
NARUSE, Yui  <naruse / airemix.jp>