On May 7, 2007, at 9:43 PM, Nanyang Zhan wrote:

> John Joyce wrote:
>> On May 7, 2007, at 8:35 PM, akbarhome wrote:
>>
>>>>>    if x[0].to_i > 128 then
>>>> English that
>>>> Posted via http://www.ruby-forum.com/.
>>> => U+6469 <CJK Ideograph>
>>> irb(main):028:0> format "%X", ustr[0].to_i.to_s
>>> => "6469"
>>> irb(main):029:0>
>>>
>>>
>> You could identify the encoding, or just convert the text to Unicode,
>> and then check whether the characters fall into a given Unicode range;
>> that will identify them.
>> One shortcut is checking for leading zeros in the Unicode character's
>> code.
>
> John Joyce, thank you for your explanation.
> Now I get akbarhome's idea. So I need to download the Unicode lib here:
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
> Then convert the strings into Unicode, and then compare the characters
> with the CJK Unicode table from here:
> http://www.khngai.com/chinese/charmap/tbluni.php?page=5
> Yes, it must work!
>
> But look at this:
>>> str1 = "中文 English Words"
> => "中文 English Words"
>>> str1[0]
> => 228
>>> str2 = "テ婆ami: chi"
> => "テ婆ami: chi"
>>> str2[0]
> => 195
>>> str3 = "English Words"
> => "English Words"
>>> str3[0]
> => 69
>
> Maybe there are numbers that are right for Chinese.
> If only I knew at which numbers Chinese characters start and end, there
> would be a much simpler solution.
>
> -- 
> Posted via http://www.ruby-forum.com/.
>
Yes, that's pretty much how Unicode is supposed to work.
In theory you could even sample a range of characters to guess the
document's language.
The catch is that Unicode allows multilingual documents, which in
some cases are difficult to handle because of fonts and systems' implementations.
But yes, you're on the right track now (IMHO).
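
For what it's worth, here is a minimal sketch of that range check. It
assumes the strings are UTF-8, and it only looks at the basic "CJK Unified
Ideographs" block (U+4E00..U+9FFF); the extension blocks are left out. The
constant CJK_UNIFIED and the helper contains_cjk? are names I made up for
illustration. The 228 in your session is just the first byte of the UTF-8
encoding of the first character, which is why it doesn't match the table;
unpack("U*") gives you the code points instead.

```ruby
# encoding: utf-8

# Assumption: input is UTF-8. Only the basic CJK Unified Ideographs block
# is covered here; extension blocks are ignored in this sketch.
CJK_UNIFIED = (0x4E00..0x9FFF)

# Hypothetical helper: true if any code point of str lies in CJK_UNIFIED.
def contains_cjk?(str)
  # unpack("U*") decodes the UTF-8 bytes into an array of code points.
  str.unpack("U*").any? { |cp| CJK_UNIFIED.include?(cp) }
end

str1 = "中文 English Words"
p str1.unpack("C*").first        # => 228   (first byte, 0xE4)
p str1.unpack("U*").first        # => 20013 (code point U+4E2D)
p contains_cjk?(str1)            # => true
p contains_cjk?("English Words") # => false
```

Note that this block is shared by Chinese, Japanese and Korean text, so a
hit only tells you the string contains Han characters, not which language
it is.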

And yes, the overhead will be greater, but that's just a fact of
Unicode and large character sets like Chinese and Japanese.
You will also want to check which Chinese!
Chinese is split into two (politically safe) names: Traditional and
Simplified.
If you were doing Japanese text, separating English or other Western
languages wouldn't be so easy, since Japanese essentially includes a
number of other languages' character sets in its Unicode set and in
everyday usage.
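
To illustrate the mixed-script point, here is a rough sketch that tallies
code points per Unicode block. It assumes UTF-8 input; SCRIPT_BLOCKS and
script_counts are hypothetical names, and the block list is deliberately
short (printable ASCII, Hiragana, Katakana, basic Han). Treat it as a
starting point rather than a real language detector.

```ruby
# encoding: utf-8

# Assumption: a small, incomplete list of blocks chosen for illustration.
SCRIPT_BLOCKS = [
  [:ascii,    (0x0021..0x007E)],  # printable ASCII, space excluded
  [:hiragana, (0x3040..0x309F)],
  [:katakana, (0x30A0..0x30FF)],
  [:han,      (0x4E00..0x9FFF)]   # basic CJK Unified Ideographs
]

# Hypothetical helper: counts how many code points fall in each block.
# Anything outside the listed blocks is counted under :other.
def script_counts(str)
  counts = Hash.new(0)
  str.unpack("U*").each do |cp|
    name, _range = SCRIPT_BLOCKS.find { |_, range| range.include?(cp) }
    counts[name || :other] += 1
  end
  counts
end

counts = script_counts("日本語のテキスト and English")
p counts[:han]       # => 3
p counts[:hiragana]  # => 1
p counts[:katakana]  # => 4
p counts[:ascii]     # => 10
```

Kana showing up alongside Han characters is a reasonable hint that the text
is Japanese rather than Chinese, but a single sentence can legitimately mix
all four groups, so the counts are only a heuristic.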