On 12/4/10, Ammar Ali <ammarabuali / gmail.com> wrote:
> On Sat, Dec 4, 2010 at 9:38 AM, Rajarshi Chakravarty
> <raj_plays / yahoo.com> wrote:
>> Hi,
>> I read records from a text file and insert them in the DB.
>> Sometimes the data contains non ascii characters and I want to keep
>> these out of the DB.
>> How can I cleanse them and where?
>> I mean should it be done while reading data or has ActiveRecord got any
>> feature to do it?
>
> What exactly do you mean by "non ascii"? Do you mean extended ascii
> (aka high ascii), printable ascii, or unicode?
>
> Without knowing details, I would suggest a regular expression like:
>
>   text.gsub /[^[:ascii:]]/, ''
>
> Or if you're using a ruby older than 1.9 or want cross-version
> compatibility:
>
>   text.gsub /[^\x00-\x7F]/, ''
>
> Note that the class [:ascii:] and the range in the second regular
> expression include all valid ascii characters, which include the
> control characters and \r (0x0D), \n (0x0A), etc. If you only want the
> alphabet, newlines, and punctuation, then you need to exclude the
> control characters and try something like:
>
>   text.gsub /[^\x20-\x7E\x0D\x0A]/, ''

Hmm, actually it should be gsub! rather than gsub here, unless you
assign the result: gsub returns a new string and leaves text unchanged.
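
To make the gsub/gsub! difference concrete, here is a minimal sketch.
The constant and method names are made up for illustration. One thing
to watch: gsub! returns nil when nothing matched, so it's safer not to
use its return value.

```ruby
# Hypothetical names, for illustration only.
NON_ASCII = /[^\x00-\x7F]/

# Returns a cleaned copy; the argument is left untouched.
def without_non_ascii(text)
  text.gsub(NON_ASCII, '')
end

# Cleans the argument in place and returns it. gsub! itself returns nil
# when there was nothing to replace, so its return value is ignored here.
def strip_non_ascii!(text)
  text.gsub!(NON_ASCII, '')
  text
end
```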

Ammar's answer is a good first approximation and may be all you need.
However, it is not universally correct. It's better to find out what
the input's encoding is, and then:
  (in 1.8) transcode to utf8 or something similar before stripping out
the non-ascii chars
  (in 1.9) set the encoding of the input correctly to make Ammar's
first example work for you
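
For 1.9 and later, that could look roughly like the sketch below. The
helper name ascii_only and its parameters are made up for this example,
and it assumes you already know the file's encoding. It goes through
utf8 first, which is one way of doing it rather than the only one.

```ruby
# Hypothetical helper: raw is the text as read from the file, and
# source_encoding is whatever you have determined the file to be in.
def ascii_only(raw, source_encoding)
  # Tag the bytes with their real encoding (no bytes are changed here).
  text = raw.dup.force_encoding(source_encoding)
  # Transcode to utf8; anything that cannot be converted is dropped
  # during transcoding, since it would be stripped afterwards anyway.
  utf8 = text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '')
  # Now Ammar's first regex works on whole characters, not bytes.
  utf8.gsub(/[^[:ascii:]]/, '')
end
```

You would call it as ascii_only(line, Encoding::Shift_JIS), or with
whatever encoding your file really uses.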

This line:
  text.gsub! /[^\x00-\x7F]/, ''
will be just fine if the input is known to be utf8 or some other
well-behaved encoding. (The euc family of encodings, for example, is
also well-behaved.) But it will fail and leave some garbage in your
strings if the encoding is sjis or big5, because in those encodings the
second byte of a two-byte character can fall in the ascii range and so
survives the gsub. (Well-behaved in this case means that the encoding
is a superset of ascii and encodes non-ascii characters entirely with
bytes which are not allowed in 7-bit ascii text. That is, bytes >=0x80.)
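
Here is a small demonstration of the sjis problem, as a sketch. The
method and variable names are made up. It treats the input as raw
bytes, which is effectively what the byte-range regex does when the
string's encoding has not been set correctly.

```ruby
# Byte-level version of the gsub above: drops every byte >= 0x80.
def naive_strip(bytes)
  bytes.dup.force_encoding(Encoding::BINARY).gsub(/[^\x00-\x7F]/, '')
end

katakana_so = "\u30BD"                                # the katakana character "so"
utf8_input  = "a#{katakana_so}b"                      # bytes 61 E3 82 BD 62
sjis_input  = utf8_input.encode(Encoding::Shift_JIS)  # bytes 61 83 5C 62

naive_strip(utf8_input)  # "ab": all three bytes of the character are >= 0x80
naive_strip(sjis_input)  # "a\\b": the trail byte 0x5C (a backslash) is left behind
```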