Issue #13241 has been updated by Jan Lelis.


Great idea, I'd love to have such capabilities built into the language!

I've recently built this for scripts, blocks, and general categories at the Ruby level (see https://github.com/janlelis/unicode-scripts), so let me share some thoughts on the API:

- I think it should always be *plural methods* that return a list of the properties used in the string, since Ruby does not distinguish between single characters and strings. The first example would then rather be: `"Aあア".scripts  => [:hiragana, :katakana, :latin]` (like the fourth example). I find it better to always return an array than to be surprised that only the first character is considered.
- By the same reasoning, I would go for having only a `properties` method, and no singular `property` method.
- Although I kind of like the `.properties([:script, :general_category])` API, it can be a little confusing when combined with the proposed *plural methods* approach: it implicitly switches its mode of operation to character by character, solely because the passed argument is an array. I'd suggest making this explicit, maybe by using another method such as `.each_properties`, by just going with `each_char.properties` (which probably cannot be optimized properly), or by using a keyword argument like `by_char: true`.
- Should there be only a `.properties` method (which could be used with scripts, blocks, general categories, etc.) or should there also be individual methods (like `.scripts`, `.blocks`, …)? I think both ways would be acceptable, but I like the idea of having individual methods for the most important properties.
- A little more bikeshedding: Maybe the properties should be returned as strings instead of symbols. They represent some kind of data, so to me it feels like strings are the more appropriate choice. Another example: if we have such functionality for blocks as well, "Miscellaneous Mathematical Symbols-B" would have to be returned as a symbol, which just does not look so good. This is only about the values returned; all method arguments would still be symbols/keyword arguments.
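
To make the points above a bit more concrete, here is a minimal sketch of how a plural method with an explicit `by_char:` keyword and string return values might behave. It is only an illustration built on the existing `\p{...}` regex support: `scripts_of` and `SCRIPT_PATTERNS` are hypothetical names (not part of Ruby or of the unicode-scripts gem), and only four scripts are listed to keep it short.

```ruby
# Hypothetical lookup table: script name (as a string) => regex testing for it.
# A real implementation would cover every Unicode script, not just these four.
SCRIPT_PATTERNS = {
  "Latin"    => /\p{Latin}/,
  "Hiragana" => /\p{Hiragana}/,
  "Katakana" => /\p{Katakana}/,
  "Han"      => /\p{Han}/,
}.freeze

# Hypothetical helper: returns the sorted names of the listed scripts that
# occur anywhere in the string. With by_char: true it returns one such array
# per character instead, so the mode switch is explicit.
def scripts_of(string, by_char: false)
  if by_char
    string.each_char.map { |char| scripts_of(char) }
  else
    SCRIPT_PATTERNS.select { |_name, pattern| pattern.match?(string) }.keys.sort
  end
end

# scripts_of("Aあア")                # => ["Hiragana", "Katakana", "Latin"]
# scripts_of("Aあア", by_char: true) # => [["Latin"], ["Hiragana"], ["Katakana"]]
```

Characters outside the listed scripts (spaces, digits, punctuation) simply yield an empty array in this sketch; a built-in method would presumably report something like Common for them.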

What do you all think?

----------------------------------------
Feature #13241: Method(s) to access Unicode properties for characters/strings
https://bugs.ruby-lang.org/issues/13241#change-63091

* Author: Martin Dürst
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
[This is currently an exploratory proposal.]

Onigmo allows Unicode properties in regular expressions. With this, it's e.g. possible to check whether a string contains some Hiragana:

```
"ABC あ DEF" =~ /\p{hiragana}/
```
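
For comparison, a few other existing String methods accept the same kind of pattern. This is just a small sketch of what is already possible today; the variable name `text` is only for illustration.

```ruby
text = "ABC あ DEF"

first_index = text =~ /\p{hiragana}/       # character index of the first match
contains    = text.match?(/\p{hiragana}/)  # boolean test, does not set $~
matches     = text.scan(/\p{hiragana}/)    # all matching characters

p first_index  # prints 4
p contains     # prints true
p matches      # prints ["あ"]
```

All of these answer "does this string contain property X?", but none of them answers "which property does this character have?".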

However, it is currently impossible to ask for e.g. the script of a character. I propose to add a method (or some methods) to String to be able to get such properties. Various (to some extent conflicting) examples:

```
"Aあア".script => :latin # returns script of first character only

"Aあア".script => [:latin, :hiragana, :katakana] # returns array of property values

"Aあア".property(:script) => :latin # returns specified property of first character only

"Aあア".property(:script) => [:latin, :hiragana, :katakana] # returns array of specified properties' values

"Aあア".properties([:script, :general_category]) => [[:latin, :Lu], [:hiragana, :Lo], [:katakana, :Lo]]
                        # returns arrays of property values, one array per character
```
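
As a rough illustration of the last variant, the per-character result can be approximated today by probing each character with `\p{...}` patterns. This is only a sketch under assumptions: `char_properties`, `lookup`, `SCRIPT_REGEXPS` and `CATEGORY_REGEXPS` are hypothetical names, and the tables cover just a few scripts and general categories.

```ruby
# Hypothetical, deliberately incomplete tables: property value => regex.
SCRIPT_REGEXPS = {
  latin:    /\p{Latin}/,
  hiragana: /\p{Hiragana}/,
  katakana: /\p{Katakana}/,
  han:      /\p{Han}/,
}.freeze

CATEGORY_REGEXPS = {
  Lu: /\p{Lu}/,  # uppercase letter
  Ll: /\p{Ll}/,  # lowercase letter
  Lo: /\p{Lo}/,  # other letter
  Nd: /\p{Nd}/,  # decimal digit
}.freeze

# Returns the first table key whose pattern matches the character, or nil
# if the character is not covered by the table.
def lookup(char, table)
  table.each { |name, pattern| return name if pattern.match?(char) }
  nil
end

# Returns one [script, general_category] pair per character.
def char_properties(string)
  string.each_char.map do |char|
    [lookup(char, SCRIPT_REGEXPS), lookup(char, CATEGORY_REGEXPS)]
  end
end

# char_properties("Aあア")
# => [[:latin, :Lu], [:hiragana, :Lo], [:katakana, :Lo]]
```

Trying one regex per candidate value is slow, which is part of why direct access to the property data would be preferable.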

The interface is still in flux, comments welcome!

Implementation depends on #13240.


In Python, such functionality (however, quite limited in property coverage, and not directly on String) is available in the standard library (see https://docs.python.org/3/library/unicodedata.html).



-- 
https://bugs.ruby-lang.org/