On May 8, 2007, at 3:54 PM, Nanyang Zhan wrote:

> John Joyce wrote:
>
>> I don't know if the two main chinese sets are encoded as different
>> ranges or simply declared in some way.
>> In general in Unicode a character is the same character even when it
>> appears in a different language.
>
> Many characters of these two sets of Chinese (in fact, including
> Chinese characters in Japanese and Korean...) are the same. Aren't
> they encoded to the same codes when they are identical?
>

Yes. There is a lot of overlap, so there is not always a clean dividing
line. The Japanese and Korean phonetic characters, however, each sit in
a range of their own. You will probably never use all of the kanji/hanzi
characters, and very few of them are Japanese-only.

> Gary Thomas wrote:
>> I believe the range is (in hex) 3400 to 97A5
> You must mean Unicode range.
> http://www.khngai.com/chinese/charmap/tbluni.php?page=0
>

Yes, that's exactly what he means.

> John Joyce wrote:
>> You might want to check the RubyGems gem unihan
> .... hmmmmm.. if only I could find out what it does...
> John Joyce wrote:
>

I took a look at it. It's the database of characters, sort of: one big
text file. It's not really a proper gem at all. The same database file
can be downloaded separately from Unicode.org. It doesn't contain the
actual characters, just their codes plus some comments and groupings.

>> http://www.alanwood.net/unicode/index.html
>
>> I've been interested in this subject myself, but it is a big one.
>
> Interesting subject indeed it is.
>
> Today I tried this (!!!!under RoR console!!!!):
> >> c=%w{¡È ¡É¡£ ¡¤ ¡ª ¡ã ¡Ð ¡¨ ¡Æ ¡ª ¡÷ ¡ô ¡ð ¡ó ¡Ä
> >> ¡ö ¡Ê ¡Ë °ì ù» ÌÞ ùØ Óü »Ñ ¼÷ ×Ý
> >> Èâ ÙÞ ²³ Ý¡ Ýù ࢠáß
> >> µª æË Âà }
> => ["¡È", "¡É¡£", "¡¤", "¡ª", "¡ã", "¡Ð", "¡¨", "¡Æ",
> "¡ª", "¡÷", "¡ô", "¡ð", "¡ó",
> "¡Ä", "¡ö", "¡Ê", "¡Ë", "°ì", "ù»", "", "", "ÌÞ",
> "é諒", "éó¥½", "éû¥½", "ê©¥½", " ê±¥½",
> "", "×Ý", "", "Èâ", "ÙÞ", "", "²³", "Ý¡", "Ýù",
> "", "", "à¢", "", "",
> "áß", "", "", "", "", "µª", "æË", "Âà", "",
> "", "", "", "", "",
> "", "", ""]
> >> c.collect.map{|o| o[0]}
> => [226, 226, 239, 239, 239, 239, 239, 226, 239, 239, 239, 239, 239,
> 226, 239, 239, 239, 228, 228, 229, 229, 229, 229, 229, 229, 229, 229,
> 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231, 231,
> 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233, 233,
> 233, 233, 233]
> >> c.collect.map{|o| o[0]}.sort
> => [226, 226, 226, 226, 228, 228, 229, 229, 229, 229, 229, 229, 229,
> 229, 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231,
> 231, 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233,
> 233, 233, 233, 233, 239, 239, 239, 239, 239, 239, 239, 239, 239, 239,
> 239, 239, 239]
> >> c.collect.map{|o| o[0]}.sort.uniq
> => [226, 228, 229, 230, 231, 233, 239]
>
> The punctuation marks are those commonly used in China.
> The Chinese characters were picked at random from
> http://www.khngai.com/chinese/charmap/tbluni.php?page=0
> (from all six pages).
>
> Maybe 226 to 239 is the range I need.
>

If you have access to a Macintosh, the Character Palette is pretty
helpful for exploring CJK character ranges as subgroupings within the
range.
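
One note on the console experiment quoted above: in Ruby 1.8, o[0] on a String returns the first byte, so 226 through 239 are UTF-8 lead bytes (0xE2 to 0xEF), not a character range. Those lead bytes start every three-byte sequence from U+2000 to U+FFFF, which covers far more than Chinese. Here is a minimal sketch of testing code points instead. It assumes the standard Unicode blocks CJK Unified Ideographs (U+4E00 to U+9FFF) and Extension A (U+3400 to U+4DBF), which are not taken from this thread. The names CJK_RANGES, first_byte and cjk_ideograph? are made up for illustration, and the \u escapes need Ruby 1.9 or later.

```ruby
# Assumed ranges (not from the thread): CJK Unified Ideographs Extension A
# and the main CJK Unified Ideographs block.
CJK_RANGES = [0x3400..0x4DBF, 0x4E00..0x9FFF]

# Hypothetical helper: first byte of the UTF-8 encoding, which is what
# o[0] returned in the Ruby 1.8 console session.
def first_byte(str)
  str.unpack("C*").first
end

# Hypothetical helper: true if the first character's code point falls in
# one of the assumed CJK ideograph ranges.
def cjk_ideograph?(str)
  code_point = str.unpack("U*").first
  CJK_RANGES.any? { |range| range.include?(code_point) }
end

zhong  = "\u4E2D"  # a Chinese character
period = "\u3002"  # ideographic full stop
dash   = "\u2014"  # em dash, not specific to Chinese

[zhong, period, dash].each do |s|
  code_point = s.unpack("U*").first
  puts "first byte #{first_byte(s)}, U+#{'%04X' % code_point}, ideograph? #{cjk_ideograph?(s)}"
end
```

The em dash has first byte 226 yet is not an ideograph, so the first byte alone cannot separate Chinese characters from other text. Checking the code point against explicit ranges is the safer test.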