ICU Text Transforms in Cocoa

Datetime: 2016-08-23 01:39:51          Topic: Cocoa

ICU string transforms are cool. The ICU libraries provide a bunch of powerful text transformations that are very useful for processing user input, especially if your code needs to handle languages other than English and scripts other than Latin. For instance, you could transliterate a text written in Simplified Chinese to Latin characters, strip accents and other diacritical marks, delete invisible characters, and convert the input to lowercase before you feed the normalized string into your database’s search API, all in a single line of code.

On Apple platforms, string transforms have been exposed through the CFStringTransform function in Core Foundation for a long time. Read Mattt Thompson’s excellent overview on NSHipster for more on this API.

With iOS 9 and OS X 10.11, string transforms have made the jump to the Foundation framework. Documentation for the new stringByApplyingTransform(_:reverse:) method on NSString is still missing, but the docs for CFStringTransform tell you what you need to know, and Nate Cook shows a few examples in this NSHipster article. Here’s how you would do the conversion from Chinese to Latin:

import Foundation
let shanghai = "上海"
shanghai.stringByApplyingTransform(NSStringTransformToLatin,
    reverse: false) // returns "shàng hǎi"

So far, so good. Apple currently provides constants for 16 possible transforms. Most of these are script transliterations; the others let you strip combining marks and diacritics from the input, or convert characters to their code point numbers or official Unicode names. Additionally, most transforms can be reversed via the second argument to stringByApplyingTransform. This is already very powerful, especially when you chain multiple transformations (first transliterate, then strip diacritical marks).
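Chaining the built-in transforms might look like the following minimal sketch. It assumes a newer Swift toolchain, where the method is spelled applyingTransform(_:reverse:) and the constants are members of StringTransform; with the spelling used above you would call stringByApplyingTransform with NSStringTransformToLatin and NSStringTransformStripDiacritics instead.

```swift
import Foundation

let shanghai = "上海"

// Step 1: transliterate to Latin. The result is optional because a
// transform can fail.
let latin = shanghai.applyingTransform(.toLatin, reverse: false)

// Step 2: strip the tone marks that the transliteration introduced.
let plain = latin?.applyingTransform(.stripDiacritics, reverse: false)

print(plain ?? "transform failed") // "shang hai"
```

Each step returns an optional, so optional chaining keeps the pipeline short while still surfacing a failed transform as nil.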

Freeform Transforms

What I never realized, although it is mentioned both in the CFStringTransform documentation and in the NSHipster article, is that you can even go a step further. ICU defines its own syntax for specifying a transform, and if you pass a string conforming to this syntax to stringByApplyingTransform or CFStringTransform, it just works.

The documentation for this in the ICU User Guide is very good and includes lots of examples. I encourage you to check it out. Here are some of my own examples:

Convert to lowercase.

Input | Transform | Result
HELLO WORLD | Lower | hello world

Convert only vowels to lowercase. The square brackets specify a filter.

Input | Transform | Result
HELLO WORLD | [AEIOU] Lower | HeLLo WoRLD
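In code, the two examples above could be run like this. It is a sketch in newer Swift syntax; icuTransform is a hypothetical helper name of my own, not part of Foundation. It wraps the raw ICU transform string in a StringTransform value and applies it.

```swift
import Foundation

// Hypothetical convenience wrapper: applies a freeform ICU transform
// string to the input. Returns nil if the transform cannot be applied.
func icuTransform(_ input: String, _ rules: String) -> String? {
    let transform = StringTransform(rawValue: rules)
    return input.applyingTransform(transform, reverse: false)
}

let lowered = icuTransform("HELLO WORLD", "Lower")
let vowelsLowered = icuTransform("HELLO WORLD", "[AEIOU] Lower")

print(lowered ?? "transform failed")       // "hello world"
print(vowelsLowered ?? "transform failed") // "HeLLo WoRLD"
```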

Convert to Latin, then to ASCII, then to lowercase. Separate multiple rules with semicolons. The Latin-ASCII step removes diacritical marks and will also try to convert symbols and punctuation from outside the ASCII range to their nearest ASCII equivalent.

Input | Transform | Result
上海 | Any-Latin; Latin-ASCII; Lower | shang hai
København | Any-Latin; Latin-ASCII; Lower | kobenhavn
กรุงเทพมหานคร | Any-Latin; Latin-ASCII; Lower | krungthephmhankhr
Αθήνα | Any-Latin; Latin-ASCII; Lower | athena
“Æ « © 1984” | Any-Latin; Latin-ASCII; Lower | "ae << (c) 1984"

Delete punctuation. The Remove rule can be very powerful. The filter (in brackets) can either be a set of characters the rule should apply to (see above) or, as in this case, a named Unicode character category.

Input | Transform | Result
“Make it so,” said Picard. | [:Punctuation:] Remove | Make it so said Picard

Delete everything that is not a letter. Use a caret ^ to negate a filter.

Input | Transform | Result
5 plus 6 equals 11 👍! | [:^Letter:] Remove | plusequals
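Here is a small sketch of both Remove examples, again assuming the newer Swift spelling applyingTransform(_:reverse:). The variable names are mine; the transform strings are the ones from the tables above.

```swift
import Foundation

// A positive filter: remove every character in the Punctuation category.
let removePunctuation = StringTransform(rawValue: "[:Punctuation:] Remove")
let quote = "“Make it so,” said Picard."
let withoutPunctuation = quote.applyingTransform(removePunctuation, reverse: false)

// A negated filter: remove every character that is NOT a letter,
// which also drops spaces, digits, and the emoji.
let keepLetters = StringTransform(rawValue: "[:^Letter:] Remove")
let sum = "5 plus 6 equals 11 👍!"
let lettersOnly = sum.applyingTransform(keepLetters, reverse: false)

print(withoutPunctuation ?? "transform failed") // "Make it so said Picard"
print(lettersOnly ?? "transform failed")        // "plusequals"
```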

Convert to typographical punctuation. The Publishing rule converts straight punctuation marks into their typographical equivalents.

Input | Transform | Result
"How's it going?" | Publishing | “How’s it going?”

Convert to hex representation. Several formats are supported. The default format is Java. Note that Java outputs UTF-16 code units (the emoji is encoded in two parts) while the other formats output code points.

Input | Transform | Result
😃! | Hex | \uD83D\uDE03\u0021
😃! | Hex/Java | \uD83D\uDE03\u0021
😃! | Hex/Unicode | U+1F603U+0021
😃! | Hex/Perl | \x{1F603}\x{21}
😃! | Hex/XML | &#x1F603;&#x21;
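To compare the formats side by side, you could loop over the transform IDs. This is a minimal sketch in newer Swift syntax; the encoded dictionary is just an illustration of mine for collecting the results.

```swift
import Foundation

let text = "😃!"
let formats = ["Hex/Java", "Hex/Unicode", "Hex/Perl", "Hex/XML"]

// Apply each hex variant to the same input and collect the results.
var encoded: [String: String] = [:]
for id in formats {
    let transform = StringTransform(rawValue: id)
    encoded[id] = text.applyingTransform(transform, reverse: false)
}

// Print one line per format, in the order listed above.
for id in formats {
    print(id, encoded[id] ?? "unsupported")
}
```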

Normalize to different normalization forms.

Input | Transform | Result
é | NFD; Hex/Unicode | U+0065U+0301
é | NFC; Hex/Unicode | U+00E9
2⁸ | NFKD | 28
2⁸ | NFKC | 28

Imagine having to write all of this yourself.

I learned this from Florian and Daniel’s Core Data book. They explain how you can use string transforms to normalize the search terms a user has entered before feeding them to the database. This can vastly improve search performance and yield better search results.
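A rough sketch of that idea, in newer Swift syntax: normalizedSearchTerm is a hypothetical helper of my own (it is not taken from the book). It runs the compound Any-Latin; Latin-ASCII; Lower transform from above, so that differently spelled inputs end up as the same search key.

```swift
import Foundation

// Hypothetical helper: reduces a search term to lowercase ASCII.
// Falls back to the unmodified term if the transform fails.
func normalizedSearchTerm(_ term: String) -> String {
    let rules = StringTransform(rawValue: "Any-Latin; Latin-ASCII; Lower")
    return term.applyingTransform(rules, reverse: false) ?? term
}

// What the user typed and what is stored map to the same key.
let typed = normalizedSearchTerm("kobenhavn")
let stored = normalizedSearchTerm("København")
print(typed == stored) // true
```

You would apply the same function to the stored text when indexing it and to the query at search time, so that both sides are compared in normalized form.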




