On Mon, Dec 6, 2010 at 1:29 AM, Caleb Clausen <vikkous / gmail.com> wrote:
> On 12/4/10, Ammar Ali <ammarabuali / gmail.com> wrote:
>> On Sat, Dec 4, 2010 at 9:38 AM, Rajarshi Chakravarty
>> <raj_plays / yahoo.com> wrote:
>>> Hi,
>>> I read records from a text file and insert them in the DB.
>>> Sometimes the data contains non ascii characters and I want to keep
>>> these out of the DB.
>>> How can I cleanse them and where?
>>> I mean should it be done while reading data or has ActiveRecord got any
>>> feature to do it?
>>
>> What exactly do you mean by "non ascii"? Do you mean extended ascii
>> (aka high ascii), printable ascii, or unicode?
>>
>> Without knowing details, I would suggest a regular expression like:
>>
>>  text.gsub /[^[:ascii:]]/, ''
>>
>> Or if you're using a ruby older than 1.9 or want cross-version
>> compatibility:
>>
>>  text.gsub /[^\x00-\x7F]/, ''
>>
>> Note that the class [:ascii:] and the range in the second regular
>> expression include all valid ascii characters, which include the
>> control characters and \r (0x0D), \n (0x0A), etc. If you only want the
>> alphabet, newlines, and punctuation, then you need to exclude the
>> control characters and try something like:
>>
>>  text.gsub /[^\x20-\x7E\x0D\x0A]/, ''
>
> Hmm, actually it should be gsub! rather than gsub here, since plain gsub
> returns a new string and leaves text itself unchanged.
>
> Ammar's answer is a good first approximation and may be all you need,
> however, it is not universally correct. It's better to find out what
> the input's encoding is, and then:
> (in 1.8) transcode to utf8 or something before stripping out the
> non-ascii chars
> (in 1.9) set the encoding of the input correctly to make Ammar's
> first example work for you
>
> This line:
> text.gsub! /[^\x00-\x7F]/, ''
> will be just fine if the input is known to be utf8 or some other
> well-behaved encoding. (The euc family of encodings, for example, are
> also well-behaved.) But it will fail and leave some garbage in your
> strings if the encoding is sjis or big5. (Well-behaved in this case
> means that the encoding is a superset of ascii and encodes non-ascii
> characters entirely with bytes which are not allowed in ascii-7 text.
> That is, bytes >=0x80.)
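
To make the sjis case concrete, here is a small sketch of the failure Caleb
describes. The sample string is my own assumption, not data from the
original question: it holds one Shift_JIS character whose second byte
happens to fall in the ascii range.

```ruby
# "ソ" is encoded in Shift_JIS as the two bytes 0x83 0x5C; the second byte
# is also the ASCII code for a backslash.
# (String#b needs Ruby 2.0 or later.)
raw = "abc\x83\x5Cdef".b   # raw bytes, as if read from a hypothetical sjis file

# Byte-wise: drop every byte >= 0x80, which is what stripping amounts to
# when Ruby does not know the encoding. The stray 0x5C survives as garbage.
bytewise = raw.bytes.select { |b| b < 0x80 }.pack('C*')

# Character-wise: tell Ruby the real encoding first, then strip.
aware = raw.dup.force_encoding(Encoding::Shift_JIS).gsub(/[^[:ascii:]]/, '')

puts bytewise  # abc\def
puts aware     # abcdef
```

The byte-wise result keeps a backslash that was never in the text, while
the encoding-aware version removes the whole two-byte character.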

Thanks for the corrections, Caleb. I missed those possible side effects.
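
For the archives, a minimal sketch of the encoding-aware approach on 1.9 or
later might look like the following. The helper name strip_non_ascii and
its source_encoding argument are hypothetical (nothing like this ships with
Ruby or ActiveRecord), and it assumes you already know how the input file
was encoded. It would be called on each record while reading the file,
before the record is handed to the DB.

```ruby
# Hypothetical helper, not part of Ruby or ActiveRecord.
# source_encoding is an assumption: the caller must know (or find out)
# which encoding the input file was written in.
def strip_non_ascii(raw, source_encoding)
  # Label the bytes with their real encoding (dup so the caller's string
  # is left untouched).
  text = raw.dup.force_encoding(source_encoding)

  # Transcode to UTF-8, dropping byte sequences that are invalid in the
  # source encoding or have no UTF-8 equivalent.
  utf8 = text.encode(Encoding::UTF_8,
                     invalid: :replace, undef: :replace, replace: '')

  # Keep only printable ascii (0x20-0x7E) plus CR and LF.
  utf8.gsub(/[^\x20-\x7E\r\n]/, '')
end

puts strip_non_ascii("caf\xE9\n".b, 'ISO-8859-1')      # caf
puts strip_non_ascii("abc\x83\x5Cdef".b, 'Shift_JIS')  # abcdef
```

Transcoding first means the regular expression only ever sees utf8, so the
same pattern is safe whatever the original encoding was.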

Cheers,
Ammar