🔫 Unicode Support Shootout

👍 The Good, the Bad, & the (mostly) Ugly 👎

Tom Christiansen
<tchrist@perl.com>

These slides are at http://training.perl.com/OSCON2011/index.html

What is this?

🔎 Scope of This Talk

The Unicode Standard

A Sample Unicode Corpus

The PubMed Central Open Access corpus comprises about 11G of Unicode data spread across some 200k files. Here are its 60 most commonly occurring non‐ASCII code points:

   1  18.55%  18.55%  U+02013 ‹–›  GC=Pd    EN DASH                               31   0.49%  88.28%  U+02264 ‹≤›  GC=Sm    LESS-THAN OR EQUAL TO
   2   7.42%  25.97%  U+000A0  ⧆   GC=Zs    NO-BREAK SPACE                        32   0.44%  88.72%  U+000AE ‹®›  GC=So    REGISTERED SIGN
   3   7.03%  33.01%  U+000B1 ‹±›  GC=Sm    PLUS-MINUS SIGN                       33   0.43%  89.15%  U+000E4 ‹ä›  GC=Ll    LATIN SMALL LETTER A WITH DIAERESIS
   4   5.46%  38.47%  U+02212 ‹−›  GC=Sm    MINUS SIGN                            34   0.42%  89.57%  U+02020 ‹†›  GC=Po    DAGGER
   5   4.20%  42.66%  U+02003  ⧆   GC=Zs    EM SPACE                              35   0.41%  89.98%  U+003B4 ‹δ›  GC=Ll    GREEK SMALL LETTER DELTA
   6   3.68%  46.35%  U+003BC ‹μ›  GC=Ll    GREEK SMALL LETTER MU                 36   0.37%  90.35%  U+000E1 ‹á›  GC=Ll    LATIN SMALL LETTER A WITH ACUTE
   7   3.62%  49.97%  U+003B2 ‹β›  GC=Ll    GREEK SMALL LETTER BETA               37   0.34%  90.69%  U+02192 ‹→›  GC=Sm    RIGHTWARDS ARROW
   8   3.57%  53.53%  U+003B1 ‹α›  GC=Ll    GREEK SMALL LETTER ALPHA              38   0.33%  91.02%  U+000ED ‹í›  GC=Ll    LATIN SMALL LETTER I WITH ACUTE
   9   3.43%  56.96%  U+0200A  ⧆   GC=Zs    HAIR SPACE                            39   0.31%  91.33%  U+003C3 ‹σ›  GC=Ll    GREEK SMALL LETTER SIGMA
  10   3.22%  60.18%  U+000B0 ‹°›  GC=So    DEGREE SIGN                           40   0.30%  91.62%  U+000C5 ‹Å›  GC=Lu    LATIN CAPITAL LETTER A WITH RING ABOVE
  11   2.93%  63.11%  U+02009  ⧆   GC=Zs    THIN SPACE                            41   0.29%  91.92%  U+003BB ‹λ›  GC=Ll    GREEK SMALL LETTER LAMDA
  12   2.62%  65.73%  U+02019 ‹’›  GC=Pf    RIGHT SINGLE QUOTATION MARK           42   0.29%  92.21%  U+000F3 ‹ó›  GC=Ll    LATIN SMALL LETTER O WITH ACUTE
  13   2.51%  68.24%  U+02032 ‹′›  GC=Po    PRIME                                 43   0.26%  92.47%  U+02122 ‹™›  GC=So    TRADE MARK SIGN
  14   2.44%  70.68%  U+000D7 ‹×›  GC=Sm    MULTIPLICATION SIGN                   44   0.26%  92.73%  U+02236 ‹∶›  GC=Sm    RATIO
  15   2.04%  72.72%  U+0201D ‹”›  GC=Pf    RIGHT DOUBLE QUOTATION MARK           45   0.22%  92.95%  U+003C7 ‹χ›  GC=Ll    GREEK SMALL LETTER CHI
  16   2.04%  74.76%  U+0201C ‹“›  GC=Pi    LEFT DOUBLE QUOTATION MARK            46   0.21%  93.17%  U+02021 ‹‡›  GC=Po    DOUBLE DAGGER
  17   1.54%  76.30%  U+00394 ‹Δ›  GC=Lu    GREEK CAPITAL LETTER DELTA            47   0.21%  93.37%  U+003C4 ‹τ›  GC=Ll    GREEK SMALL LETTER TAU
  18   1.42%  77.71%  U+000B5 ‹µ›  GC=Ll    MICRO SIGN                            48   0.20%  93.57%  U+003B8 ‹θ›  GC=Ll    GREEK SMALL LETTER THETA
  19   1.34%  79.05%  U+003B3 ‹γ›  GC=Ll    GREEK SMALL LETTER GAMMA              49   0.20%  93.77%  U+003B5 ‹ε›  GC=Ll    GREEK SMALL LETTER EPSILON
  20   1.21%  80.26%  U+000E9 ‹é›  GC=Ll    LATIN SMALL LETTER E WITH ACUTE       50   0.18%  93.95%  U+02026 ‹…›  GC=Po    HORIZONTAL ELLIPSIS
  21   1.15%  81.41%  U+02014 ‹—›  GC=Pd    EM DASH                               51   0.16%  94.11%  U+02211 ‹∑›  GC=Sm    N-ARY SUMMATION
  22   1.14%  82.55%  U+02018 ‹‘›  GC=Pi    LEFT SINGLE QUOTATION MARK            52   0.15%  94.26%  U+003C0 ‹π›  GC=Ll    GREEK SMALL LETTER PI
  23   1.00%  83.54%  U+000A9 ‹©›  GC=So    COPYRIGHT SIGN                        53   0.15%  94.40%  U+000EF ‹ï›  GC=Ll    LATIN SMALL LETTER I WITH DIAERESIS
  24   0.71%  84.25%  U+02265 ‹≥›  GC=Sm    GREATER-THAN OR EQUAL TO              54   0.15%  94.55%  U+000A7 ‹§›  GC=So    SECTION SIGN
  25   0.60%  84.85%  U+000F6 ‹ö›  GC=Ll    LATIN SMALL LETTER O WITH DIAERESIS   55   0.15%  94.70%  U+02005  ⧆   GC=Zs    FOUR-PER-EM SPACE
  26   0.60%  85.45%  U+000B7 ‹·›  GC=Po    MIDDLE DOT                            56   0.14%  94.84%  U+003C9 ‹ω›  GC=Ll    GREEK SMALL LETTER OMEGA
  27   0.60%  86.05%  U+02022 ‹•›  GC=Po    BULLET                                57   0.14%  94.98%  U+000E8 ‹è›  GC=Ll    LATIN SMALL LETTER E WITH GRAVE
  28   0.59%  86.64%  U+0223C ‹∼›  GC=Sm    TILDE OPERATOR                        58   0.13%  95.12%  U+000F8 ‹ø›  GC=Ll    LATIN SMALL LETTER O WITH STROKE
  29   0.57%  87.22%  U+003BA ‹κ›  GC=Ll    GREEK SMALL LETTER KAPPA              59   0.13%  95.24%  U+003C1 ‹ρ›  GC=Ll    GREEK SMALL LETTER RHO
  30   0.57%  87.79%  U+000FC ‹ü›  GC=Ll    LATIN SMALL LETTER U WITH DIAERESIS   60   0.13%  95.37%  U+000E3 ‹ã›  GC=Ll    LATIN SMALL LETTER A WITH TILDE 
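
A tally like the one above could be produced along these lines. This is only a minimal sketch, not the tool used for the talk: the function names `top_non_ascii` and `format_row` are hypothetical, and it assumes the text has already been decoded into a Python 3.3 or later string.

```python
import unicodedata
from collections import Counter

def top_non_ascii(text, limit=60):
    """Rank the non-ASCII code points in text by frequency.

    Returns tuples of (rank, share %, cumulative %, U+XXXXX, GC, name).
    Percentages are relative to all non-ASCII code points in text.
    """
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for rank, (ch, n) in enumerate(counts.most_common(limit), start=1):
        share = 100.0 * n / total
        cumulative += share
        rows.append((rank, share, cumulative,
                     "U+%05X" % ord(ch),
                     unicodedata.category(ch),
                     unicodedata.name(ch, "<unnamed>")))
    return rows

def format_row(row):
    """Lay one row out roughly like the table above."""
    rank, share, cumulative, cp, gc, name = row
    return "%4d %6.2f%% %6.2f%%  %s  GC=%s    %s" % (
        rank, share, cumulative, cp, gc, name)

if __name__ == "__main__":
    sample = "5\u20137 \u00b1 2 \u03bcm, 3\u20134 \u00b0C"
    for row in top_non_ascii(sample):
        print(format_row(row))
```

For a corpus of this size one would feed the counter file by file rather than hold everything in memory; the ranking logic stays the same.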

PMC: Astral Projections

Here are the corpus’s code points above the BMP, in order of popularity. Note 👽 Area 51!

  1  U+01D49E ‹𝒞›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL C        31  U+01D4C2 ‹𝓂›  GC=Ll    MATHEMATICAL SCRIPT SMALL M
  2  U+01D4AF ‹𝒯›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL T        32  U+01D54D ‹𝕍›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL V
  3  U+01D4AE ‹𝒮›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL S        33  U+01D4B6 ‹𝒶›  GC=Ll    MATHEMATICAL SCRIPT SMALL A
  4  U+01D49F ‹𝒟›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL D        34  U+01D4BE ‹𝒾›  GC=Ll    MATHEMATICAL SCRIPT SMALL I
  5  U+01D4B3 ‹𝒳›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL X        35  U+01D4CC ‹𝓌›  GC=Ll    MATHEMATICAL SCRIPT SMALL W
  6  U+01D4A9 ‹𝒩›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL N        36  U+01D516 ‹𝔖›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL S
  7  U+01D4AB ‹𝒫›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL P        37  U+01D4CF ‹𝓏›  GC=Ll    MATHEMATICAL SCRIPT SMALL Z
  8  U+01D4A2 ‹𝒢›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL G        38  U+01D53B ‹𝔻›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL D
  9  U+01D49C ‹𝒜›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL A        39  U+01D54B ‹𝕋›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL T
 10  U+01D53C ‹𝔼›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL E 40  U+01D4BB ‹𝒻›  GC=Ll    MATHEMATICAL SCRIPT SMALL F
 11  U+01D4AA ‹𝒪›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL O        41  U+01D4CA ‹𝓊›  GC=Ll    MATHEMATICAL SCRIPT SMALL U
 12  U+01D4A5 ‹𝒥›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL J        42  U+01D507 ‹𝔇›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL D
 13  U+01D4A6 ‹𝒦›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL K        43  U+01D542 ‹𝕂›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL K
 14  U+01D4B1 ‹𝒱›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL V        44  U+01D546 ‹𝕆›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL O
 15  U+01D4B2 ‹𝒲›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL W        45  U+01D4BD ‹𝒽›  GC=Ll    MATHEMATICAL SCRIPT SMALL H
 16  U+01D4B4 ‹𝒴›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL Y        46  U+01D4C5 ‹𝓅›  GC=Ll    MATHEMATICAL SCRIPT SMALL P
 17  U+01D4B5 ‹𝒵›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL Z        47  U+01D505 ‹𝔅›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL B
 18  U+01D4B0 ‹𝒰›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL U        48  U+01D50E ‹𝔎›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL K
 19  U+01D4AC ‹𝒬›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL Q        49  U+01D541 ‹𝕁›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL J
 20  U+01D54A ‹𝕊›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL S 50  U+01D543 ‹𝕃›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL L
 21  U+01D539 ‹𝔹›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL B 51  U+100002  ⧆   GC=Co    <private use code point in plane 16>
 22  U+01D5A7 ‹𝖧›  GC=Lu    MATHEMATICAL SANS-SERIF CAPITAL H    52  U+01D4B8 ‹𝒸›  GC=Ll    MATHEMATICAL SCRIPT SMALL C
 23  U+01D517 ‹𝔗›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL T       53  U+01D4C1 ‹𝓁›  GC=Ll    MATHEMATICAL SCRIPT SMALL L
 24  U+01D4C3 ‹𝓃›  GC=Ll    MATHEMATICAL SCRIPT SMALL N          54  U+01D53D ‹𝔽›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL F
 25  U+01D535 ‹𝔵›  GC=Ll    MATHEMATICAL FRAKTUR SMALL X         55  U+01D53E ‹𝔾›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL G
 26  U+01D4BF ‹𝒿›  GC=Ll    MATHEMATICAL SCRIPT SMALL J          56  U+01D54C ‹𝕌›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL U
 27  U+01D540 ‹𝕀›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL I 57  U+01D6A4 ‹𝚤›  GC=Ll    MATHEMATICAL ITALIC SMALL DOTLESS I
 28  U+01D465 ‹𝑥›  GC=Ll    MATHEMATICAL ITALIC SMALL X          58  U+01D7D9 ‹𝟙›  GC=Nd    MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
 29  U+01D4CE ‹𝓎›  GC=Ll    MATHEMATICAL SCRIPT SMALL Y
 30  U+01D538 ‹𝔸›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL A 
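
Picking out the astral code points is a matter of filtering on values above U+FFFF. The sketch below is an illustration under stated assumptions, not the script behind this slide: `astral_code_points` is a hypothetical name, and the code assumes Python 3.3 or later, where iterating over a string yields whole code points rather than surrogates.

```python
import unicodedata
from collections import Counter

def astral_code_points(text):
    """List code points above the BMP in text, most frequent first.

    Returns tuples of (rank, U+XXXXXX, GC, name). Private-use code
    points have no character name, so they get a placeholder instead.
    """
    counts = Counter(ch for ch in text if ord(ch) > 0xFFFF)
    rows = []
    for rank, (ch, _n) in enumerate(counts.most_common(), start=1):
        cp = ord(ch)
        gc = unicodedata.category(ch)
        if gc == "Co":
            # The plane number is everything above the low 16 bits.
            name = "<private use code point in plane %d>" % (cp >> 16)
        else:
            name = unicodedata.name(ch, "<unnamed>")
        rows.append((rank, "U+%06X" % cp, gc, name))
    return rows

if __name__ == "__main__":
    sample = "\U0001D49E \U0001D49E \U00100002 plus some BMP text \u2013"
    for rank, cp, gc, name in astral_code_points(sample):
        print("%3d  %s  GC=%s    %s" % (rank, cp, gc, name))
```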

PMC: Graphemes & Ideographs

Feature Support Summary

Here’s what each language natively supports in its standard distribution. A white check mark isn’t quite all there, and a superscripted plus says it goes beyond what’s stated. 🌟

Unicode       𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉       ᴘʜᴘ     Go      💎 Ruby  🐍 Python       Java    🐪 Perl
Internally    UCS‐2 or UTF‐16  UTF‐8⁻  UTF‐8   varies   UCS‐2 or UCS‐4  UTF‐16  UTF‐8⁺
Identifiers
Casefolding   none             simple  simple  full     none            simple  full
Casemapping   simple           simple  simple  full     simple          full    full
Graphemes
Normalization ─⁺
UCA Collation ✔⁺
Named Characters
Properties two (non‐regex)⁻ three (non‐regex)⁻ two⁺ every⁺

Unicode Properties in Regexes

Property  Javascript  Javascript w/XRegExp  ᴘʜᴘ5 w/PCRE  Go5 w/RE2  Ruby1.9  Python3.2  Python3.2 w/regex  Java1.6  Java1.7  ICU  Perl
General Category
Script
Two‐Parters
Long Names
Loose Matching
Name/Value Aliases
User‐Defined
UTS#18 RL 1.2
UTS#18 RL 1.2a ?
UTS#18 RL 2.2
UTS#18 RL 2.7

Conservation of Pain & Suffering?

Trends Across Languages

Javascript ≠ Java’s Crypt

The UTF‐16 née UCS‐2 Curse

Testing for UTF‐16 vs UCS‐2

Javascript Regexes 👎

The XRegExp Plugin (cont...)

PHP

PHP: PCRE Issues

PHP: The Ugly News

PHP Titlecasing

Go Figure!

Go: The Good News

Go: The Bad News

Go: The Ugly News

Ruby: The Bad News

Ruby: The Ugly News

Ruby: The Good News

Ruby: The Best News

Ruby Regexes: Good News...

Ruby Regexes: ...and Bad News

Python: (Some) Good News

Python: Not Good News

Python Regexes: The Bad News

Python Regexes: The Good News

Python: The Ugly Bits

Testing for UCS‐2 vs UCS‐4

Fixing Python Redirection Troubles

Python: A Complete Solution?

Even 🐍ᵛ³ + regex cannot fix all 🐍ᵛ² Unicode problems, because 🐍’s character model is inherently broken. It completely violates UTS#18’s Level 1 requirement for regular expressions — the minimal level for useful Unicode support — that the engine support Unicode characters as basic logical units independent of serialization like UTF‑*:

 % python3.2
 Python 3.2 (r32:88445, Jul 21 2011, 14:44:19)
 [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import re
 >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
 >>> print(g)
 ᾲ
 >>> print(re.search(r'\w', g))
 <_sre.SRE_Match object at 0x10051f988>
 >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
 >>> print(p)
 𝒫
 >>> print(re.search(r'\w', p))
 None
 >>> print(re.search(r'..', p))   # ← 𝙏𝙃𝙄𝙎 𝙄𝙎 𝙏𝙃𝙀 𝙑𝙄𝙊𝙇𝘼𝙏𝙄𝙊𝙉 𝙍𝙄𝙂𝙃𝙏 𝙃𝙀𝙍𝙀
 <_sre.SRE_Match object at 0x10051f988>
 >>> print(len(chr(0x1D4AB)))
 2
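
The session above comes from a narrow (UCS‐2) build of Python 3.2, where an astral code point is stored as two UTF‐16 code units. As a hedged sketch rather than anything from the talk, the following checks which kind of build is running and counts both code points and UTF‐16 code units. The helper names are hypothetical. `sys.maxunicode` is 0xFFFF on narrow builds and 0x10FFFF on wide ones, and Python 3.3 and later always report 0x10FFFF.

```python
import sys

def is_narrow_build():
    """True on a narrow (UCS-2) build, where len() counts UTF-16 code units."""
    return sys.maxunicode == 0xFFFF

def utf16_code_units(s):
    """Number of 16-bit code units needed to store s as UTF-16.

    Assumes s contains no lone surrogates, which cannot be encoded.
    """
    return len(s.encode("utf-16-le")) // 2

def code_points(s):
    """Number of Unicode code points in s, counted via UTF-32."""
    return len(s.encode("utf-32-le")) // 4

p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"   # U+1D4AB, above the BMP
print("narrow build:     ", is_narrow_build())
print("len(p):           ", len(p))
print("UTF-16 code units:", utf16_code_units(p))
print("code points:      ", code_points(p))
```

Wherever `len(p)` agrees with the code‐unit count instead of the code‐point count, patterns such as `..` will match halves of a single character, as in the transcript.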

Java

Java: The Ugly News

Java: The Bad and Half‐bad

Java Tips

More Java Tips

Java Safe Streaming

Here’s an open3‐style procedure for Java, using the constructors that take an explicit UTF‐8 encoder or decoder:

 // Needs java.io.* and java.nio.charset.Charset; exec() throws IOException.
 Process
 slave_process = Runtime.getRuntime().exec("perl -CS script args");

 OutputStream
 __bytes_into_his_stdin  = slave_process.getOutputStream();

 OutputStreamWriter
   chars_into_his_stdin  = new OutputStreamWriter(
                             __bytes_into_his_stdin,
         /* DO NOT OMIT! */  Charset.forName("UTF-8").newEncoder()
                         );

 InputStream
 __bytes_from_his_stdout = slave_process.getInputStream();

 InputStreamReader
   chars_from_his_stdout = new InputStreamReader(
                             __bytes_from_his_stdout,
         /* DO NOT OMIT! */  Charset.forName("UTF-8").newDecoder()
                         );

 InputStream
 __bytes_from_his_stderr = slave_process.getErrorStream();

 InputStreamReader
   chars_from_his_stderr = new InputStreamReader(
                             __bytes_from_his_stderr,
         /* DO NOT OMIT! */  Charset.forName("UTF-8").newDecoder()
                         ); 

Java Regexes: (Finally) Some Good News

Java Regexes: Still a Ways to Go

The JDK7 Regex Revolution

JDK7: And That’s Not All

Perl Unicode Gotchas

Perl Unicode Specialties

🏁 And Our Winner Is...

Good Luck 🍀 and Good Night 🌃

Appendix 1: Font suggestions

Appendix 2: Tools

Contact Information