On May 8, 2007, at 3:54 PM, Nanyang Zhan wrote:

> John Joyce wrote:
>
>> I don't know if the two main chinese sets are encoded as different
>> ranges or simply declared in some way.
>> In general in Unicode a character is the same character even when it
>> appears in a different language.
>
> Many characters of these two sets of Chinese (in fact, including Chinese
> characters in Japanese and Korean...) are the same. Aren't they encoded
> to the same codes when they are identical?
>

Yes. There is a lot of overlap, so there is not always a clean
dividing line. The Japanese and Korean phonetic characters (kana and
hangul), however, each sit in a range of their own. You might never
use all of the kanji/hanzi characters, and very few of them are
Japanese-only.
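
To make the "phonetic characters are in a range" point concrete, here
is a rough sketch. The block boundaries are the ones in the Unicode
code charts; the PHONETIC_BLOCKS table and the phonetic_block helper
are names I made up for illustration.

```ruby
# Block boundaries as listed in the Unicode code charts.
# PHONETIC_BLOCKS and phonetic_block are hypothetical names.
PHONETIC_BLOCKS = [
  [0x3040..0x309F, :hiragana],
  [0x30A0..0x30FF, :katakana],
  [0xAC00..0xD7AF, :hangul_syllables]
]

# Returns the block name for the first character of +char+,
# or nil when it is not in one of the phonetic blocks.
def phonetic_block(char)
  cp = char.unpack("U").first          # decode UTF-8 to a code point
  hit = PHONETIC_BLOCKS.find { |range, _name| range.include?(cp) }
  hit ? hit.last : nil
end

hiragana_a = [0x3042].pack("U")        # a hiragana letter
hangul_han = [0xD55C].pack("U")        # a hangul syllable
ideograph  = [0x4E2D].pack("U")        # a shared CJK ideograph

p phonetic_block(hiragana_a)
p phonetic_block(hangul_han)
p phonetic_block(ideograph)            # nil: ideographs live elsewhere
```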


> Gary Thomas wrote:
>> I believe the range is (in hex) 3400 to 97A5
> You must mean Unicode range.
> http://www.khngai.com/chinese/charmap/tbluni.php?page=0
>

Yes, that's exactly what he means.

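
A quick sketch of testing against that range, taking the bounds
exactly as quoted (hex 3400 to 97A5). For comparison, the Unicode code
charts put CJK Unified Ideographs Extension A at U+3400..U+4DBF and the
main CJK Unified Ideographs block at U+4E00..U+9FFF, so the quoted
upper bound stops short of the end of that block. QUOTED_RANGE and
in_quoted_range are names invented for this example.

```ruby
# Range exactly as quoted in the thread; not the full Unicode block.
QUOTED_RANGE = 0x3400..0x97A5

# Returns the code points of +text+ that fall inside QUOTED_RANGE.
# Assumes +text+ is UTF-8.
def in_quoted_range(text)
  text.unpack("U*").select { |cp| QUOTED_RANGE.include?(cp) }
end

# "Ruby" followed by two ideographs, built from code points so the
# example does not depend on the encoding of this message.
mixed = [0x52, 0x75, 0x62, 0x79, 0x4E2D, 0x6587].pack("U*")

p in_quoted_range(mixed).map { |cp| format("U+%04X", cp) }
# => ["U+4E2D", "U+6587"]
```
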
> John Joyce wrote:
>> You might want to check the RubyGems gem   unihan
> .... hmmmmm.. if only I could find out what it does...
> John Joyce wrote:
>

I took a look at it. It's the database of characters, sort of: one
big text file, and not really a proper gem at all. The same database
file can be downloaded separately from Unicode.org. It doesn't contain
the actual characters, just their codes plus some comments and
groupings.

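
For what it's worth, here is a rough sketch of reading that kind of
file. It assumes the layout used by the Unihan data files from
Unicode.org: "#" comment lines, then tab-separated lines of code point,
field name, and value. The two sample lines and the parse_unihan helper
are made up for illustration, not taken from the gem.

```ruby
# Sample text in the assumed Unihan layout (values are illustrative).
SAMPLE = "# sample comment line\n" \
         "U+4E2D\tkDefinition\tcentral; center, middle\n" \
         "U+6587\tkDefinition\tliterature, culture, writing\n"

# Parses Unihan-style text into { code_point => { field => value } }.
# parse_unihan is a hypothetical helper name.
def parse_unihan(text)
  entries = Hash.new { |hash, key| hash[key] = {} }
  text.each_line do |line|
    line = line.chomp
    next if line.empty? || line[0, 1] == "#"   # skip blanks and comments
    code, field, value = line.split("\t", 3)
    entries[code[2..-1].hex][field] = value    # "U+4E2D" -> 0x4E2D
  end
  entries
end

db = parse_unihan(SAMPLE)
puts db[0x4E2D]["kDefinition"]
```
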
>> http://www.alanwood.net/unicode/index.html
>
>> I've been interested in this subject myself, but it is a big one.
>
> Interesting subject indeed it is.
>
> Today I tried this(!!!!under RoR console!!!!):
>>> c=%w{ ɡ               
>>>                 
>>>      ݡ            
>>>             }
> => ["", "ɡ", "", "", "", "", "", "",  
> "", "", "", "", "",
> "", "", "", "", "", "", "", "", "",  
> "諒", "", "", "ꩥ", " 걥",
> "", "", "", "", "", "", "", "ݡ", "",  
> "", "", "", "", "",
> "", "", "", "", "", "", "", "", "",  
> "", "", "", "", "",
> "", "", ""]
>>> c.collect.map{|o| o[0]}
> => [226, 226, 239, 239, 239, 239, 239, 226, 239, 239, 239, 239, 239,
> 226, 239, 239, 239, 228, 228, 229, 229, 229, 229, 229, 229, 229, 229,
> 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231, 231,
> 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233, 233,
> 233, 233, 233]
>>> c.collect.map{|o| o[0]}.sort
> => [226, 226, 226, 226, 228, 228, 229, 229, 229, 229, 229, 229, 229,
> 229, 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231,
> 231, 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233,
> 233, 233, 233, 233, 239, 239, 239, 239, 239, 239, 239, 239, 239, 239,
> 239, 239, 239]
>>> c.collect.map{|o| o[0]}.sort.uniq
> => [226, 228, 229, 230, 231, 233, 239]
>
> These punctuation marks are the ones commonly used in China.
> These Chinese characters were picked at random from
> http://www.khngai.com/chinese/charmap/tbluni.php?page=0
> (from all six pages.)
>
> Maybe 226 to 239 is the range I need.
>
>
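
One caution about the console session above: in Ruby 1.8, o[0] returns
the first byte of the string, not the character's code point. With
UTF-8 input, 226 to 239 (hex E2 to EF) are lead bytes of three-byte
sequences, which together cover a large part of Unicode and not only
Chinese. A minimal sketch, assuming UTF-8 strings; the sample code
points are my own picks, not the ones from the session.

```ruby
# Code points chosen for illustration (not the original sample).
SAMPLES = {
  "LEFT DOUBLE QUOTATION MARK" => 0x201C,
  "CJK ideograph"              => 0x4E2D,
  "FULLWIDTH COMMA"            => 0xFF0C,
  "EURO SIGN"                  => 0x20AC
}

# First byte of the UTF-8 encoding of +cp+ (what o[0] gave in 1.8).
# lead_byte is a hypothetical helper name.
def lead_byte(cp)
  [cp].pack("U").unpack("C").first
end

SAMPLES.each do |name, cp|
  puts format("%-28s U+%04X  lead byte %d", name, cp, lead_byte(cp))
end
```

The euro sign shares lead byte 226 with the Chinese quotation mark, so
a first-byte test cannot separate them; comparing code points from
unpack("U*") against a range is the safer check.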

If you have access to a Macintosh, the Character Palette is pretty
helpful for exploring the CJK character ranges and the subgroupings
within them.