👍 The Good, the Bad, & the (mostly) Ugly 👎
These slides are at http://training.perl.com/OSCON2011/index.html
They were made with Pod::S5 by Tom Linden, which in turn uses S5 by Eric Meyer. Slideshow controls appear if you hover near the bottom right corner, but I find it easier to use keystroke commands to navigate.
The PubMed Central Open Access corpus comprises around 11G of Unicode data spread across around 200k different files. Here are its most commonly occurring non‐ASCII code points:
     1 18.55% 18.55% U+02013 ‹–› GC=Pd EN DASH
     2  7.42% 25.97% U+000A0 ⧆ GC=Zs NO-BREAK SPACE
     3  7.03% 33.01% U+000B1 ‹±› GC=Sm PLUS-MINUS SIGN
     4  5.46% 38.47% U+02212 ‹−› GC=Sm MINUS SIGN
     5  4.20% 42.66% U+02003 ⧆ GC=Zs EM SPACE
     6  3.68% 46.35% U+003BC ‹μ› GC=Ll GREEK SMALL LETTER MU
     7  3.62% 49.97% U+003B2 ‹β› GC=Ll GREEK SMALL LETTER BETA
     8  3.57% 53.53% U+003B1 ‹α› GC=Ll GREEK SMALL LETTER ALPHA
     9  3.43% 56.96% U+0200A ⧆ GC=Zs HAIR SPACE
    10  3.22% 60.18% U+000B0 ‹°› GC=So DEGREE SIGN
    11  2.93% 63.11% U+02009 ⧆ GC=Zs THIN SPACE
    12  2.62% 65.73% U+02019 ‹’› GC=Pf RIGHT SINGLE QUOTATION MARK
    13  2.51% 68.24% U+02032 ‹′› GC=Po PRIME
    14  2.44% 70.68% U+000D7 ‹×› GC=Sm MULTIPLICATION SIGN
    15  2.04% 72.72% U+0201D ‹”› GC=Pf RIGHT DOUBLE QUOTATION MARK
    16  2.04% 74.76% U+0201C ‹“› GC=Pi LEFT DOUBLE QUOTATION MARK
    17  1.54% 76.30% U+00394 ‹Δ› GC=Lu GREEK CAPITAL LETTER DELTA
    18  1.42% 77.71% U+000B5 ‹µ› GC=Ll MICRO SIGN
    19  1.34% 79.05% U+003B3 ‹γ› GC=Ll GREEK SMALL LETTER GAMMA
    20  1.21% 80.26% U+000E9 ‹é› GC=Ll LATIN SMALL LETTER E WITH ACUTE
    21  1.15% 81.41% U+02014 ‹—› GC=Pd EM DASH
    22  1.14% 82.55% U+02018 ‹‘› GC=Pi LEFT SINGLE QUOTATION MARK
    23  1.00% 83.54% U+000A9 ‹©› GC=So COPYRIGHT SIGN
    24  0.71% 84.25% U+02265 ‹≥› GC=Sm GREATER-THAN OR EQUAL TO
    25  0.60% 84.85% U+000F6 ‹ö› GC=Ll LATIN SMALL LETTER O WITH DIAERESIS
    26  0.60% 85.45% U+000B7 ‹·› GC=Po MIDDLE DOT
    27  0.60% 86.05% U+02022 ‹•› GC=Po BULLET
    28  0.59% 86.64% U+0223C ‹∼› GC=Sm TILDE OPERATOR
    29  0.57% 87.22% U+003BA ‹κ› GC=Ll GREEK SMALL LETTER KAPPA
    30  0.57% 87.79% U+000FC ‹ü› GC=Ll LATIN SMALL LETTER U WITH DIAERESIS
    31  0.49% 88.28% U+02264 ‹≤› GC=Sm LESS-THAN OR EQUAL TO
    32  0.44% 88.72% U+000AE ‹®› GC=So REGISTERED SIGN
    33  0.43% 89.15% U+000E4 ‹ä› GC=Ll LATIN SMALL LETTER A WITH DIAERESIS
    34  0.42% 89.57% U+02020 ‹†› GC=Po DAGGER
    35  0.41% 89.98% U+003B4 ‹δ› GC=Ll GREEK SMALL LETTER DELTA
    36  0.37% 90.35% U+000E1 ‹á› GC=Ll LATIN SMALL LETTER A WITH ACUTE
    37  0.34% 90.69% U+02192 ‹→› GC=Sm RIGHTWARDS ARROW
    38  0.33% 91.02% U+000ED ‹í› GC=Ll LATIN SMALL LETTER I WITH ACUTE
    39  0.31% 91.33% U+003C3 ‹σ› GC=Ll GREEK SMALL LETTER SIGMA
    40  0.30% 91.62% U+000C5 ‹Å› GC=Lu LATIN CAPITAL LETTER A WITH RING ABOVE
    41  0.29% 91.92% U+003BB ‹λ› GC=Ll GREEK SMALL LETTER LAMDA
    42  0.29% 92.21% U+000F3 ‹ó› GC=Ll LATIN SMALL LETTER O WITH ACUTE
    43  0.26% 92.47% U+02122 ‹™› GC=So TRADE MARK SIGN
    44  0.26% 92.73% U+02236 ‹∶› GC=Sm RATIO
    45  0.22% 92.95% U+003C7 ‹χ› GC=Ll GREEK SMALL LETTER CHI
    46  0.21% 93.17% U+02021 ‹‡› GC=Po DOUBLE DAGGER
    47  0.21% 93.37% U+003C4 ‹τ› GC=Ll GREEK SMALL LETTER TAU
    48  0.20% 93.57% U+003B8 ‹θ› GC=Ll GREEK SMALL LETTER THETA
    49  0.20% 93.77% U+003B5 ‹ε› GC=Ll GREEK SMALL LETTER EPSILON
    50  0.18% 93.95% U+02026 ‹…› GC=Po HORIZONTAL ELLIPSIS
    51  0.16% 94.11% U+02211 ‹∑› GC=Sm N-ARY SUMMATION
    52  0.15% 94.26% U+003C0 ‹π› GC=Ll GREEK SMALL LETTER PI
    53  0.15% 94.40% U+000EF ‹ï› GC=Ll LATIN SMALL LETTER I WITH DIAERESIS
    54  0.15% 94.55% U+000A7 ‹§› GC=So SECTION SIGN
    55  0.15% 94.70% U+02005 ⧆ GC=Zs FOUR-PER-EM SPACE
    56  0.14% 94.84% U+003C9 ‹ω› GC=Ll GREEK SMALL LETTER OMEGA
    57  0.14% 94.98% U+000E8 ‹è› GC=Ll LATIN SMALL LETTER E WITH GRAVE
    58  0.13% 95.12% U+000F8 ‹ø› GC=Ll LATIN SMALL LETTER O WITH STROKE
    59  0.13% 95.24% U+003C1 ‹ρ› GC=Ll GREEK SMALL LETTER RHO
    60  0.13% 95.37% U+000E3 ‹ã› GC=Ll LATIN SMALL LETTER A WITH TILDE
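A tally like the one above can be sketched in a few lines of Python. This is only an illustration, not the tool that produced those numbers: the function names `tally_non_ascii` and `report` and the sample string are made up for the example, and the names and categories printed depend on the Unicode version your Python ships with.

```python
import unicodedata
from collections import Counter

def tally_non_ascii(text):
    """Count each code point above U+007F in the given text."""
    return Counter(ch for ch in text if ord(ch) > 0x7F)

def report(counts):
    """Return rows of (rank, share %, cumulative %, code point, GC, name)."""
    total = sum(counts.values())
    running = 0
    rows = []
    for rank, (ch, n) in enumerate(counts.most_common(), start=1):
        running += n
        rows.append((rank, 100 * n / total, 100 * running / total,
                     "U+%05X" % ord(ch), unicodedata.category(ch),
                     unicodedata.name(ch, "<unnamed>")))
    return rows

# Hypothetical sample text standing in for a corpus file.
sample = "5\u20137 \u00b1 2 \u03bcm, 9\u201311"
for row in report(tally_non_ascii(sample)):
    print("%2d %6.2f%% %6.2f%% %s GC=%s %s" % row)
```

For a real corpus you would feed each decoded file through `tally_non_ascii` and add the counters together before calling `report`.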
Here are that corpus’s code points above the BMP, in order of popularity. Note 👽 Area 51 !
     1 U+01D49E ‹𝒞› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
     2 U+01D4AF ‹𝒯› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
     3 U+01D4AE ‹𝒮› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
     4 U+01D49F ‹𝒟› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
     5 U+01D4B3 ‹𝒳› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
     6 U+01D4A9 ‹𝒩› GC=Lu MATHEMATICAL SCRIPT CAPITAL N
     7 U+01D4AB ‹𝒫› GC=Lu MATHEMATICAL SCRIPT CAPITAL P
     8 U+01D4A2 ‹𝒢› GC=Lu MATHEMATICAL SCRIPT CAPITAL G
     9 U+01D49C ‹𝒜› GC=Lu MATHEMATICAL SCRIPT CAPITAL A
    10 U+01D53C ‹𝔼› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL E
    11 U+01D4AA ‹𝒪› GC=Lu MATHEMATICAL SCRIPT CAPITAL O
    12 U+01D4A5 ‹𝒥› GC=Lu MATHEMATICAL SCRIPT CAPITAL J
    13 U+01D4A6 ‹𝒦› GC=Lu MATHEMATICAL SCRIPT CAPITAL K
    14 U+01D4B1 ‹𝒱› GC=Lu MATHEMATICAL SCRIPT CAPITAL V
    15 U+01D4B2 ‹𝒲› GC=Lu MATHEMATICAL SCRIPT CAPITAL W
    16 U+01D4B4 ‹𝒴› GC=Lu MATHEMATICAL SCRIPT CAPITAL Y
    17 U+01D4B5 ‹𝒵› GC=Lu MATHEMATICAL SCRIPT CAPITAL Z
    18 U+01D4B0 ‹𝒰› GC=Lu MATHEMATICAL SCRIPT CAPITAL U
    19 U+01D4AC ‹𝒬› GC=Lu MATHEMATICAL SCRIPT CAPITAL Q
    20 U+01D54A ‹𝕊› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL S
    21 U+01D539 ‹𝔹› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL B
    22 U+01D5A7 ‹𝖧› GC=Lu MATHEMATICAL SANS-SERIF CAPITAL H
    23 U+01D517 ‹𝔗› GC=Lu MATHEMATICAL FRAKTUR CAPITAL T
    24 U+01D4C3 ‹𝓃› GC=Ll MATHEMATICAL SCRIPT SMALL N
    25 U+01D535 ‹𝔵› GC=Ll MATHEMATICAL FRAKTUR SMALL X
    26 U+01D4BF ‹𝒿› GC=Ll MATHEMATICAL SCRIPT SMALL J
    27 U+01D540 ‹𝕀› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL I
    28 U+01D465 ‹𝑥› GC=Ll MATHEMATICAL ITALIC SMALL X
    29 U+01D4CE ‹𝓎› GC=Ll MATHEMATICAL SCRIPT SMALL Y
    30 U+01D538 ‹𝔸› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL A
    31 U+01D4C2 ‹𝓂› GC=Ll MATHEMATICAL SCRIPT SMALL M
    32 U+01D54D ‹𝕍› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL V
    33 U+01D4B6 ‹𝒶› GC=Ll MATHEMATICAL SCRIPT SMALL A
    34 U+01D4BE ‹𝒾› GC=Ll MATHEMATICAL SCRIPT SMALL I
    35 U+01D4CC ‹𝓌› GC=Ll MATHEMATICAL SCRIPT SMALL W
    36 U+01D516 ‹𝔖› GC=Lu MATHEMATICAL FRAKTUR CAPITAL S
    37 U+01D4CF ‹𝓏› GC=Ll MATHEMATICAL SCRIPT SMALL Z
    38 U+01D53B ‹𝔻› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL D
    39 U+01D54B ‹𝕋› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL T
    40 U+01D4BB ‹𝒻› GC=Ll MATHEMATICAL SCRIPT SMALL F
    41 U+01D4CA ‹𝓊› GC=Ll MATHEMATICAL SCRIPT SMALL U
    42 U+01D507 ‹𝔇› GC=Lu MATHEMATICAL FRAKTUR CAPITAL D
    43 U+01D542 ‹𝕂› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL K
    44 U+01D546 ‹𝕆› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL O
    45 U+01D4BD ‹𝒽› GC=Ll MATHEMATICAL SCRIPT SMALL H
    46 U+01D4C5 ‹𝓅› GC=Ll MATHEMATICAL SCRIPT SMALL P
    47 U+01D505 ‹𝔅› GC=Lu MATHEMATICAL FRAKTUR CAPITAL B
    48 U+01D50E ‹𝔎› GC=Lu MATHEMATICAL FRAKTUR CAPITAL K
    49 U+01D541 ‹𝕁› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL J
    50 U+01D543 ‹𝕃› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL L
    51 U+100002 ⧆ GC=Co <private use code point in plane 16>
    52 U+01D4B8 ‹𝒸› GC=Ll MATHEMATICAL SCRIPT SMALL C
    53 U+01D4C1 ‹𝓁› GC=Ll MATHEMATICAL SCRIPT SMALL L
    54 U+01D53D ‹𝔽› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL F
    55 U+01D53E ‹𝔾› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL G
    56 U+01D54C ‹𝕌› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL U
    57 U+01D6A4 ‹𝚤› GC=Ll MATHEMATICAL ITALIC SMALL DOTLESS I
    58 U+01D7D9 ‹𝟙› GC=Nd MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
Code points with Bidi_Class=Nonspacing_Mark (BC=NSM for short) are here in abundance: ◌̂, ◌̅, ◌̄, ◌̇, ◌̃, ◌⃗, ◌⃞, ◌̲, ◌̀, ◌͂, ◌́, ◌̊, ◌̆, ◌︀, ◌̵, ◌̑, ◌̧, ◌⃑, ◌̣, ◌̈, ◌̣, ◌̷, ◌́, ◌̋, ◌̌, ◌͘, ◌⃛, ◌֗, ◌̨, ◌̓, ◌̈́, ◌ْ, and ◌ྞ.
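If you want to check the Bidi_Class of a mark yourself, Python's standard `unicodedata` module exposes it. The sketch below is an illustration only; the helper name `bidi_report` is invented here, and the five marks are just a handful picked from the list above.

```python
import unicodedata

def bidi_report(chars):
    """Return (code point, Bidi_Class, General_Category) for each character."""
    return [("U+%04X" % ord(ch),
             unicodedata.bidirectional(ch),
             unicodedata.category(ch)) for ch in chars]

# Circumflex, overline, right arrow above, enclosing square, cedilla.
for row in bidi_report("\u0302\u0305\u20D7\u20DE\u0327"):
    print(*row)
```

Note that the enclosing square has General Category Me rather than Mn, yet its Bidi_Class is still NSM, which is why the bidi property is the handier filter here.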
Here’s what each language natively supports in its standard distribution. A white check mark isn’t quite all there, and a superscripted plus says it goes beyond what’s stated. 🌟
Unicode | 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 | ᴘʜᴘ | Go | 💎 Ruby | 🐍 Python | ☕ Java | 🐪 Perl |
---|---|---|---|---|---|---|---|
Internally | UCS‐2 or UTF‐16 | UTF‐8⁻ | UTF‐8 | varies | UCS‐2 or UCS‐4 | UTF‐16 | UTF‐8⁺ |
Identifiers | ─ | ✔ | ✔ | ✔ | ✅∓ | ✔ | ✔ |
Casefolding | none | simple | simple | full | none | simple | full |
Casemapping | simple | simple | simple∓ | full | simple | full | full |
Graphemes | ─ | ✅ | ─ | ─ | ─ | ─ | ✔ |
Normalization | ─ | ✔ | ─⁺ | ─ | ✔ | ✔ | ✔ |
UCA Collation | ─ | ─ | ─ | ─ | ─ | ─ | ✔⁺ |
Named Characters | ─ | ─ | ─ | ─ | ✅ | ─ | ✔⁺ |
Properties | ─ | two | (non‐regex)⁻ | three | (non‐regex)⁻ | two⁺ | every⁺ |
Property | Javascript | Javascript w/ XRegExp | ᴘʜᴘ5 w/ PCRE | Go5 w/ RE2 | Ruby1.9 | Python3.2 | Python3.2 w/ regex | Java1.6 | Java1.7 | ICU | Perl |
---|---|---|---|---|---|---|---|---|---|---|---|
General Category | ─ | ✔ | ✔ | ✔ | ✔ | ─ | ✔ | ✔ | ✔ | ✔ | ✔ |
Script | ─ | ✔ | ✔ | ✔ | ✔ | ─ | ✔ | ─ | ✔ | ✔ | ✔ |
Two‐Parters | ─ | ─ | ─ | ─ | ─ | ─ | ✔ | ─ | ✅⁻ | ✔ | ✔ |
Long Names | ─ | ─ | ─ | ─ | ─ | ─ | ✔ | ─ | ─ | ✔ | ✔ |
Loose Matching | ─ | ─ | ─ | ─ | ─ | ─ | ✔ | ─ | ✅ | ✔ | ✔ |
Name/Value Aliases | ─ | ─ | ─ | ─ | ─ | ─ | ✔ | ─ | ✅ | ✔ | ✔ |
User‐Defined | ─ | ─ | ─ | ─ | ─ | ─ | ─ | ─ | ─ | ─ | ✔ |
UTS#18 RL 1.2 | ─ | ─ | ─ | ─ | ─ | ─ | ✔ | ─ | ✔ | ✔ | ✔ |
UTS#18 RL 1.2a | ─ | ─ | ─? | ─ | ✅ | ✅ | ✔⁺ | ─ | ✔ | ✔ | ✔ |
UTS#18 RL 2.2 | ─ | ─ | ✅ | ─ | ─ | ─ | ✔ | ─ | ─ | ✔ | ✔ |
UTS#18 RL 2.7 | ─ | ─ | ─ | ─ | ─ | ─ | ✅ | ─ | ─ | ✔ | ✔ |
http://unicode.org/charts/case/ http://unicode.org/charts/case/chart_Latin.html
Javascript has no named characters like \N{WHITE SMILING FACE}; you must use magic numbers like \u263A instead. Notice that’s not enough digits.
With only \uXXXX, Javascript supports only ¹⁄₁₇ᵗʰ of Unicode, the 5.88% that occupies Plane Zero.
charCodeAt and fromCharCode only ever deal with 16‐bit quantities, not with real, 21‐bit Unicode code points.
To get MATHEMATICAL SCRIPT CAPITAL A, you have to specify not one character but two “char units”: "\uD835\uDC9C". 😱
    // ERROR!!
    document.write(String.fromCharCode(0x1D49C));
    // needed bogosity
    document.write(String.fromCharCode(0xD835, 0xDC9C));
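Where do those two magic numbers come from? Here is the standard UTF-16 surrogate arithmetic as a small Python sketch. The function name `to_surrogates` is made up for the illustration; the arithmetic itself is the one the UTF-16 encoding form defines.

```python
def to_surrogates(cp):
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                 # 20 bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D49C)        # MATHEMATICAL SCRIPT CAPITAL A
print("%04X %04X" % (high, low))
```

This is the bookkeeping a 16-bit string model pushes onto the programmer for every character outside Plane Zero.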
    var div = document.createElement("DIV");
    div.innerHTML = "&#x10000;";
    var isUtf16 = div.firstChild.nodeValue.charCodeAt(0) == 0xd800;
None of \w, \d, &c work on trans‐ASCII Unicode as required by Unicode’s UTS #18 Regular Expressions in its Annex C: Compatibility Properties. This is a severe hardship.
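To see what Unicode-aware shortcuts buy you, here is a small Python 3 sketch, offered only as a contrast and not as a claim of full Annex C conformance (Python's `re` is an approximation). In Python 3, `str` patterns are Unicode-aware by default, and `re.ASCII` switches back to the ASCII-only sense. The sample words are arbitrary.

```python
import re

# Latin with ñ, a Greek word, and two Devanagari digits.
words = ["ni\u00f1o", "\u03b4\u03b9\u03ac\u03bf\u03bb\u03bf", "\u0968\u0966"]

for w in words:
    unicode_sense = re.fullmatch(r"\w+", w) is not None
    ascii_sense = re.fullmatch(r"\w+", w, re.ASCII) is not None
    print(ascii(w), "unicode:", unicode_sense, "ascii-only:", ascii_sense)
```

An ASCII-only \w rejects all three words, which is exactly the hardship described above.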
You cannot write a range like [𝒜-𝒵] since that gets misinterpreted as [\uD835\uDC9C-\uD835\uDCB5]. Actually, you won’t even get that far:
    new RegExp("[𝒜-𝒵]", "i") returned a SyntaxError: RegExp constructor: invalid regular expression
The XRegExp plugin provides \p{PROP} and \P{PROP}: http://xregexp.com/plugins/
General Category abbreviation like \p{Sc}
Script name like \p{Latin}
Block property like \p{InMathematicalAlphanumericSymbols} (except that one won’t work: it lies outside the ¹⁄₁₇ᵗʰ of Unicode Javascript can even think about)
    XRegExp("[^\\p{Latin}\\p{Common}]");
The XRegExp plugin doesn’t even support basic properties like Alphabetic, Uppercase, and Lowercase, all required by UTS#18 for Level 1 Unicode Support.
The preg_* functions indeed use PCRE.
Use /u on your patterns to get the Unicode character sense instead of the “byte” sense.
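The difference between the character sense and the byte sense is easy to show. The sketch below uses Python rather than PHP, purely as an analogy: a `str` pattern plays the role of a /u pattern, and a `bytes` pattern plays the role of a pattern without it. The variable names are just for the example.

```python
import re

text = "\u00f1"                    # ñ: one code point
raw = text.encode("utf-8")         # the same ñ as two bytes, C3 B1

char_match = re.fullmatch(r".", text)    # character sense: "." is one code point
byte_match = re.fullmatch(rb".", raw)    # byte sense: "." is one byte
two_bytes = re.fullmatch(rb"..", raw)    # it takes two dots to cover ñ

print(char_match is not None, byte_match is not None, two_bytes is not None)
```

In the byte sense a single dot can split a character in half, which is the kind of bug the /u flag exists to prevent.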
\w, \d, &c to work the way UTS#18 says they must.
General Category abbreviations and the Script properties.
\X is the old broken (?>\PM\pM*), which is 🛂 not 🛂 a Unicode grapheme cluster as currently defined in UTS#18 & UTS#29.
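To make the gap concrete, here is a rough Python sketch of what the old base-plus-marks definition does. It is an emulation written for this illustration (the name `legacy_clusters` is made up), not PCRE's code, and it treats a leading mark with no base as its own cluster, which the real pattern would simply fail to match.

```python
import unicodedata

def legacy_clusters(text):
    """Group each non-mark character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch       # attach the mark to the preceding cluster
        else:
            clusters.append(ch)      # start a new cluster
    return clusters

print(len(legacy_clusters("e\u0301x")))       # base + acute, then x
print(len(legacy_clusters("\r\n")))           # CR LF
print(len(legacy_clusters("\u1100\u1161")))   # Hangul jamo L + V
```

The old definition handles the accented letter fine, but it splits CR LF and the conjoining jamo pair in two, whereas UTS#29 treats each of those as a single grapheme cluster.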
Avoid strlen, strstr, strpos, and substr; use mb_strlen, mb_strstr, mb_strpos, and mb_substr instead.
Titlecasing correctly converts ǳur "\x{1F3}ur" to ǲur "\x{1F2}ur":
    # PASS
    mb_convert_case("ǳur", MB_CASE_TITLE, "UTF-8");
But you cannot get ß "\x{DF}" to properly become Ss, ﬁle "\x{FB01}le" to become Fi, nor ﬅop "\x{FB05}op" to become Stop.
    # FAIL
    mb_convert_case("ﬅop", MB_CASE_TITLE, "UTF-8");
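For comparison, this is what full (one-to-many) casemapping looks like when a language does implement it. The sketch assumes CPython 3.3 or later, which postdates the versions surveyed in these slides; the sample strings are the same ones used above, written as escapes so nothing gets lost in transit.

```python
# ǳur, ß, ﬁle, and ﬅop, spelled with escapes.
samples = ["\u01F3ur", "\u00DF", "\uFB01le", "\uFB05op"]

# Titlecase and uppercase each sample with the built-in string methods.
results = {s: (s.title(), s.upper()) for s in samples}

for s, (title, upper) in results.items():
    print(ascii(s), "->", ascii(title), ascii(upper))
```

The point is that the result can be longer than the input: a simple one-to-one mapping has nowhere to put the second letter of Ss, Fi, or St.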
exp/regexp. It provides for (simple) Unicode casefolding and includes access to the Unicode General Category and Script properties.
String functions like upcase or capitalize won’t even look at anything but ASCII.
You may think things like \p{Lu} are fine in your regexes because they work on your system, under a UTF‐8 locale. Now. But move the script to something laboring under CP1252 and suddenly you have an error!
/u regex flag.
    #!/usr/bin/env ruby
    #coding: utf-8

    #!/usr/bin/env ruby
    #coding: utf-8
    𝔘𝔫𝔦𝔠𝔬𝔡𝔢 = "super"; puts 𝔘𝔫𝔦𝔠𝔬𝔡𝔢
    niño = "boy"; puts niño
    both = 𝔘𝔫𝔦𝔠𝔬𝔡𝔢 + niño; puts both # Yep, it’s superboy!
Use the unicode_utils gem, available via gem install unicode_utils or from http://unicode-utils.rubyforge.org/
    #!/usr/bin/env ruby
    #coding: utf-8
    require 'unicode_utils'
    puts "lower is " + UnicodeUtils.downcase(str)
    puts "title is " + UnicodeUtils.titlecase(str)
    puts "upper is " + UnicodeUtils.upcase(str)
    % ruby -le 'print " WEISS" =~ /weiß/i'
    1

    #!/usr/bin/env ruby
    #coding: utf-8
    str = " ᾲ στο διάολο";
    puts str =~ /ᾺΙ ΣΤΟ ΔΙΆΟΛΟ/i ? "PASS" : "FAIL";

    PASS
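Both Ruby tests pass because of full casefolding, where one character may fold to several. Here is a minimal sketch of the same comparisons in Python, assuming CPython 3.3 or later (where `str.casefold` exists, so newer than the Python covered in these slides). The helper name `same_ignoring_case` is invented for the example.

```python
def same_ignoring_case(a, b):
    """Compare two strings under full Unicode casefolding."""
    return a.casefold() == b.casefold()

pairs = [
    ("WEISS", "wei\u00df"),        # ß folds to ss
    ("\u1FB2", "\u1FBA\u0399"),    # ᾲ versus capital alpha with varia + capital iota
    ("po\u017Ft", "POST"),         # long s folds to s
]
for a, b in pairs:
    print(ascii(a), ascii(b), same_ignoring_case(a, b), a.lower() == b.lower())
```

Comparing with `lower()` instead gets all three wrong, which is the difference between casemapping and casefolding. Note that casefolding alone does no normalization, so canonically equivalent spellings still need NFD or NFC first.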
General Category and Script.
\w, \d, &c won’t work right, and even \p{alphabetic} is broken.
\w won’t work except on ASCII, even with the /u flag.
\N{GREEK SMALL LETTER FINAL SIGMA} for U+03C2, ς.
\N{GREEK SMALL LETTER STIGMA} for U+03DB, ϛ.
\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}.
    % perl5.14.0 -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")'
    Ā̀
    % perl5.14.0 -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")' | uniquote -v
    \N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}
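That last name is a Unicode named sequence: one name that stands for two code points. As a cross-check outside Perl, here is a small Python sketch. It assumes CPython 3.3 or later, where `unicodedata.lookup` resolves named sequences as well as ordinary character names.

```python
import unicodedata

# A named sequence resolves to more than one code point.
seq = unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
print([unicodedata.name(ch) for ch in seq])

# An ordinary character name resolves to exactly one.
sigma = unicodedata.lookup("GREEK SMALL LETTER FINAL SIGMA")
print("U+%04X" % ord(sigma))
```

Printing the constituent names plays the same role as piping through uniquote -v above.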
Python’s re library is really quite deficient from a Unicode perspective.
If you use re.U, then your \w, \d, &c actually do work according to UTS#18. Mostly.
The re library doesn’t offer any Unicode properties.
LATIN SMALL LETTER LONG S ſ (as in the old‐style poſt) and either of the regular versions, whether S or s.
COMBINING GREEK YPOGEGRAMMENI, ◌ͅ, and any other iota like GREEK CAPITAL LETTER IOTA Ι, GREEK SMALL LETTER IOTA ι, or GREEK PROSGEGRAMMENI ι.
re module Python comes with, for both v2 and v3. http://pypi.python.org/pypi/regex
Although identifiers may be named écran or el_niño, or even 東京 or 文字化け, the UCS‐2 Curse prevents having one named 𝔘𝔫𝔦𝔠𝔬𝔡𝔢, let alone 𐐔𐐯𐑅𐐨𐑉𐐯𐐻 or 𐌰𐍄𐍄𐌰‿𐌿𐌽𐍃𐌰𐍂‿𐌸𐌿‿𐌹𐌽‿𐌷𐌹𐌼𐌹𐌽𐌰𐌼. (This has just been fixed [2011-08-11].)
Atta_unsar_þu_in_himinam is ok, though.
    import sys
    wide_enough = (sys.maxunicode >= 0x10FFFF)
    print("my largest character is", sys.maxunicode)
On a wide build, sys.maxunicode is 0x10FFFF, whereas it is otherwise only 0xFFFF.
    if not wide_enough:
        raise Exception("Narrow build lacks full Unicode support")
Set the PYTHONIOENCODING environment variable so it stops guessing. (This is like setting PERL_UNICODE to S.)
$ export PYTHONIOENCODING=utf8
stdin and stdout encoding errors raise exceptions, but those on stderr do not.
    import io
    import sys
    for s in ("stdin", "stdout", "stderr"):
        setattr(sys, s, io.TextIOWrapper(getattr(sys, s).detach(), encoding="utf8"))
{!}python from within vi on the pod source, darn it!!
Even 🐍ᵛ³ + regex cannot fix all 🐍ᵛ² Unicode problems, because 🐍’s character model is inherently broken. It completely violates UTS#18’s Level 1 requirement for regular expressions (the minimal level for useful Unicode support) that the engine support Unicode characters as basic logical units independent of serialization like UTF‑*:
    % python3.2
    Python 3.2 (r32:88445, Jul 21 2011, 14:44:19)
    [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
    >>> print(g)
    ᾲ
    >>> print(re.search(r'\w', g))
    <_sre.SRE_Match object at 0x10051f988>
    >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
    >>> print(p)
    𝒫
    >>> print(re.search(r'\w', p))
    None
    >>> print(re.search(r'..', p)) # ← 𝙏𝙃𝙄𝙎 𝙄𝙎 𝙏𝙃𝙀 𝙑𝙄𝙊𝙇𝘼𝙏𝙄𝙊𝙉 𝙍𝙄𝙂𝙃𝙏 𝙃𝙀𝙍𝙀
    <_sre.SRE_Match object at 0x10051f988>
    >>> print(len(chr(0x1D4AB)))
    2
A Unicode code point does not fit in a char, nor in a Character for that matter. 🚽
Use an int or a String: 👉 “int is the new char.”
chars do not deal with Unicode characters: String.length must be replaced by codePointCount, charAt by codePointAt, &c.
Java has no named characters like \N{PILE OF POO}, so you have to put all these magic numbers in your code. But which magic numbers? It’s a 💩 situation.
\uXXXX.
% javac -encoding UTF-8 SomeanNoyinGanDillegiBlenAme.java
std{in,out,err}, your code is not portable.
% java -Dfile.encoding=UTF-8 SomeanNoyinGanDillegiBlenAme
    InputStreamReader(InputStream in)
    InputStreamReader(InputStream in, Charset cs)
    InputStreamReader(InputStream in, CharsetDecoder dec)
    InputStreamReader(InputStream in, String charsetName)
    OutputStreamWriter(OutputStream out)
    OutputStreamWriter(OutputStream out, Charset cs)
    OutputStreamWriter(OutputStream out, CharsetEncoder enc)
    OutputStreamWriter(OutputStream out, String charsetName)
Here’s an open3 type procedure for Java, with correct constructors:
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.Charset;

    // Spawn the child, then wrap each of its three streams with an explicit UTF-8 codec.
    Process slave_process = Runtime.getRuntime().exec("perl -CS script args");

    OutputStream __bytes_into_his_stdin = slave_process.getOutputStream();
    OutputStreamWriter chars_into_his_stdin = new OutputStreamWriter(
        __bytes_into_his_stdin,
        /* DO NOT OMIT! */ Charset.forName("UTF-8").newEncoder()
    );

    InputStream __bytes_from_his_stdout = slave_process.getInputStream();
    InputStreamReader chars_from_his_stdout = new InputStreamReader(
        __bytes_from_his_stdout,
        /* DO NOT OMIT! */ Charset.forName("UTF-8").newDecoder()
    );

    InputStream __bytes_from_his_stderr = slave_process.getErrorStream();
    InputStreamReader chars_from_his_stderr = new InputStreamReader(
        __bytes_from_his_stderr,
        /* DO NOT OMIT! */ Charset.forName("UTF-8").newDecoder()
    );
Pattern class for regexes, and would embellish any uses of Character and String with strong provisos.
NativeRegEx.cpp in the public Android repository.
The new \x{XXXXXX} regex escape gets around The UTF‐16 Curse. The previously broken [𝒜-𝒵] can now be written [\x{1D49C}-\x{1D4B5}].
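The same range is a good smoke test for any regex engine that claims to work in code points rather than code units. Here it is as a Python sketch, assuming CPython 3.3 or later (on the narrow builds discussed earlier it would misbehave in much the same way as Javascript). The name `script_capitals` is just for the example.

```python
import re

# Character class from MATHEMATICAL SCRIPT CAPITAL A through Z.
script_capitals = re.compile("[\U0001D49C-\U0001D4B5]")

for ch in ["\U0001D4AB", "\U0001D4B5", "P"]:
    print("U+%05X" % ord(ch), script_capitals.fullmatch(ch) is not None)
```

If the engine really treats the class as a range of code points, each script capital matches as a single character and the plain ASCII P does not.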
The UNICODE_CHARACTER_CLASSES compilation flag changes the charclass abbreviations like \w &c, and also the POSIX ones like \p{alpha}, to make them do what UTS#18 says they’re supposed to do.
There is also a "(?U)" flag embeddable in the pattern to do the same thing.
Because String.equalsIgnoreCase() still erroneously treats strings as UCS‐2 sequences, not as UTF‐16, it fails to notice that 𐐔𐐇𐐝𐐀𐐡𐐇𐐓 and 𐐼𐐯𐑅𐐨𐑉𐐯𐐻 are the same. Use ICU’s CaseInsensitiveString.equals() instead for correct behavior.
\p{script=SCRIPTNAME}, also usable as \p{isSCRIPTNAME} or \p{SCRIPTNAME}.
Other_Alphabetic.
Set your PERL_UNICODE envariable to "SA", or in certain very limited & one‐shot cases, even to "SAD".
    use v5.14;                              # many, 𝘮𝘢𝘯𝘺 Unicode fixes
    use utf8;                               # so source code is UTF-8
    use strict;
    use warnings;
    use autodie;                            ## 𝘽𝙐𝙂! conflicts with 𝘶𝘴𝘦 𝘰𝘱𝘦𝘯 below!!
    use charnames qw< :full >;              # standard \N{charnames}
    use open qw< :std :encoding(UTF-8) >;   # default streams
\N{NAMED CHARACTER}: you can create your own customized versions of those, too. I use this all the time.
The Unicode::Tussle bundle from CPAN will really make your Unicode life a lot easier.
http://search.cpan.org/perldoc?Unicode::Tussle
    % sudo perl -MCPAN -e 'install Unicode::Tussle'
𝔢𝔵𝔢𝔲𝔫𝔱 𝔬𝔪𝔫𝔢𝔰
🍸