Quoting c1205 / er.uqam.ca, on Fri, May 28, 2004 at 10:01:43PM +0900:
> I was playing around with the RMail package and I was missing RFC-2047
> support. I found the "module Rfc2047" in
> <20031204151316.GC849@jupp%gmx.de>

Probably the one I wrote.

> but noticed the following:
>
> In the regex to discover encoded words:
>
> | WORD = %r{=\?([!#$%&'*+-/0-9A-Z\\^\`a-z{|}~]+)\?([BbQq])\?([!->@-~]+)\?=} # :nodoc:
>
> I had to change % to \% to run. Maybe it's just Cygwin.

Looks like you are using ruby1.8. There are lots of warnings, too. I'll fix
it sometime, or you can send me a patch? :-)

> The second thing is that the module doesn't correctly interpret the
> "encoded-word - linear white space - encoded word" sequence, where
> all the white space should be deleted.
>
> So I added a regex to delete this whitespace before further processing:
>
> > module Rfc2047
> >
> >   WORD = %r{=\?([!#$\%&'*+-/0-9A-Z\\^\`a-z{|}~]+)\?([BbQq])\?([!->@-~]+)\?=} # :nodoc:
> >|  WORDSEQ = %r{(=\?[!#$\%&'*+-/0-9A-Z\\^\`a-z{|}~]+\?[BbQq]\?[!->@-~]+\?=)\s*(=\?[!#$\%&'*+-/0-9A-Z\\^\`a-z{|}~]+\?[BbQq]\?[!->@-~]+\?=)}

Two comments:

1 - I don't think this will work. It will fix:

      encoded-word - linear white space - encoded word

    but not:

      encoded-word - linear white space - encoded word - linear white space - encoded word

    I.e., it only does pairs, so I don't think it does what you want.

2 - It will trash your input argument, which is fairly undesirable.

I think you could do the match with a regex by using some kind of regex
operator that matches a WORD but doesn't consume it (a lookahead). See
below. I don't have time to test it thoroughly, just one test case, but
maybe it will work for you. If it does, I think I could rewrite the regex
to do this in a single sweep, though I don't see efficiency as a concern:
we're talking mail headers here, and they aren't that big!

> I also observed that decoding of non-Western character sets (Win-1251
> to Big5) to UTF-8 didn't work.
> Does anybody already suspect why or do
> I have to track down the error further?

Which version of rfc2047.rb do you have? I'm at 1.4, and it has a fix for
this, I believe, see below.

Sam


# $Id: rfc2047.rb,v 1.4 2003/04/18 20:55:56 sam Exp $
#
# An implementation of RFC 2047 decoding.
#
# This module depends on the iconv library by Nobuyoshi Nakada, which I've
# heard may be distributed as a standard part of Ruby 1.8. Many thanks to him
# for helping with building and using iconv.
#
# Thanks to "Josef 'Jupp' Schugt" <jupp / gmx.de> for pointing out an error with
# stateful character sets.
#
# Copyright (c) Sam Roberts <sroberts / uniserve.com> 2004
#
# This file is distributed under the same terms as Ruby.

require 'iconv'

module Rfc2047

  WORD = %r{=\?([!#$%&'*+-/0-9A-Z\\^\`a-z{|}~]+)\?([BbQq])\?([!->@-~]+)\?=} # :nodoc:

  # Matches a WORD followed by whitespace and then another WORD. The second
  # WORD is only looked at, not consumed, so it can start the next match.
  WORDSEQ = %r{(#{WORD.source})\s+(?=#{WORD.source})}

  # Decodes a string, +from+, containing RFC 2047 encoded words into a target
  # character set, +target+. See iconv_open(3) for information on the
  # supported target encodings. If one of the encoded words cannot be
  # converted to the target encoding, it is left in its encoded form.
  def Rfc2047.decode_to(target, from)
    # Delete the whitespace between adjacent encoded words. gsub returns a
    # new string, so the caller's argument is not modified.
    from = from.gsub(WORDSEQ, '\1')
    out = from.gsub(WORD) do
      |word|
      charset, encoding, text = $1, $2, $3

      # B64 or QP decode, as necessary:
      case encoding
        when 'b', 'B'
          #puts text
          text = text.unpack('m*')[0]
          #puts text.dump

        when 'q', 'Q'
          # RFC 2047 has a variant of quoted printable where a ' ' character
          # can be represented as an '_', rather than =20, so convert
          # any of these that we find before doing the QP decoding.
          text = text.tr("_", " ")
          text = text.unpack('M*')[0]

        # Don't need an else, because no other values can be matched in a
        # WORD.
      end

      # Convert:
      #
      # Remember - Iconv.open(to, from)!
      begin
        text = Iconv.iconv(target, charset, text).join
        #puts text.dump
      rescue Errno::EINVAL, Iconv::IllegalSequence
        # Replace with the entire matched encoded word, a NOOP.
        text = word
      end
    end
  end
end
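
For anyone who wants to try the pairs-versus-lookahead point on a current Ruby, here is a minimal sketch. It is not rfc2047.rb itself, and it makes several assumptions. DEMO_WORD is a simplified stand-in for WORD, not the original character classes. All DEMO_* constants and the three helper methods are hypothetical names made up for this illustration. Because iconv is no longer bundled with Ruby 2.0 and later, String#force_encoding and String#encode are swapped in for Iconv.iconv.

```ruby
# Simplified stand-in for WORD (an assumption, not the original character
# classes): =?charset?encoding?text?=
DEMO_WORD = /=\?([^?\s]+)\?([BbQq])\?([^?\s]+)\?=/

# Pair-only variant, in the style of the quoted WORDSEQ: both words are
# consumed, so the second word cannot also begin the next match.
# Group 1 is the whole first word, group 5 the whole second word.
DEMO_PAIR = /(#{DEMO_WORD.source})\s*(#{DEMO_WORD.source})/

# Lookahead variant: the following word is checked but not consumed.
DEMO_SEQ = /(#{DEMO_WORD.source})\s+(?=#{DEMO_WORD.source})/

# Removes whitespace between encoded words, two words at a time.
def collapse_pairs(str)
  str.gsub(DEMO_PAIR, '\1\5')
end

# Removes whitespace between any run of adjacent encoded words.
# gsub returns a new string, so the argument is left untouched.
def collapse_sequence(str)
  str.gsub(DEMO_SEQ, '\1')
end

# Sketch of decode_to using String#encode in place of Iconv (assumption).
def demo_decode_to(target, from)
  collapse_sequence(from).gsub(DEMO_WORD) do |word|
    charset, encoding, text = $1, $2, $3
    raw = if encoding.downcase == 'b'
            text.unpack('m').first                 # base64
          else
            text.tr('_', ' ').unpack('M').first    # RFC 2047 "Q" encoding
          end
    begin
      raw.force_encoding(charset).encode(target)
    rescue ArgumentError, EncodingError
      # Unknown charset or unconvertible bytes: keep the encoded word as is.
      word
    end
  end
end

header = "=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=  =?ISO-8859-1?Q?c?= tail"
puts collapse_pairs(header)     # whitespace before the third word survives
puts collapse_sequence(header)  # all three words end up adjacent
puts demo_decode_to('UTF-8', "=?ISO-8859-1?Q?caf=E9?= =?ISO-8859-1?Q?_au_lait?=")
```

With three encoded words in a row, the pair regex leaves the whitespace in front of the third word, while the lookahead version removes it. The whitespace before ordinary text ("tail") is kept in both cases.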