Feb. 2016

Last Modified: Sun Feb 7 02:37:05 UTC 2016

Charあらららcter Widths 2016-02-07 [Sun] 10:15

So nobody wants to hear my rant about the Unicode character widths, sigh...

The "character width" is the number of columns (cells) that one character takes up when it's displayed on a terminal. Traditionally, most Japanese hiragana and kanji were considered "full-width (zenkaku)" while ASCII characters were "half-width (hankaku)". The concepts of "zenkaku" and "hankaku" date back to the old DOS era. But actually, there's so much more. According to this Standard, there are six types of character widths defined in Unicode. Namely, "Fullwidth", "Halfwidth", "Wide", "Narrow", "Ambiguous", and "Neutral". What the heck. The full per-character specification is found here, and you'll find it kinda random because the widths were determined mostly on a historical basis. Apparently no one cared to follow this completely when writing their terminals or terminal-based apps.

Since I use Emacs on tmux on xterm using Kappa-UCS, the character widths that each program expects have to be perfectly aligned, otherwise you'll get garbled output on the window. And you can see it's unrealistic to expect that all four of the above implementations agree on the width of every character. Kappa probably follows the old DOS convention, but unfortunately some of its characters are in the "Ambiguous" category in today's Standard, so their treatment is inconsistent. Xterm has a few options to handle character widths, and some of them eventually rely on the system (glibc)'s wcwidth(3). Where glibc gets its data is unclear, but this SO question and the look of the current source code suggest it's based on the Standard. Now, tmux apparently uses rather inaccurate data based on this wcwidth implementation or something.

Then, there's this Emacs dude. Its character width is determined by the char-width-table elisp variable, which is defined in share/lisp/international/characters.el, but it changes depending on the current-language-environment.
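To make the mess a bit more concrete, here is a minimal Python sketch that looks up the East Asian Width category of a few characters and guesses a column count from it. The naive_width function and its ambiguous_wide flag are names I made up for illustration; this is not what xterm, tmux, glibc or Emacs actually do, and it ignores combining and control characters entirely.

```python
import unicodedata

def naive_width(ch, ambiguous_wide=False):
    """Guess how many terminal columns `ch` takes, using only its
    East Asian Width category. Hypothetical helper, not a real wcwidth."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("F", "W"):        # Fullwidth and Wide: two columns
        return 2
    if eaw == "A":               # Ambiguous: depends on who you ask
        return 2 if ambiguous_wide else 1
    return 1                     # Halfwidth, Narrow, Neutral: one column

samples = [
    "a",        # ASCII letter
    "\u3042",   # HIRAGANA LETTER A
    "\uff71",   # HALFWIDTH KATAKANA LETTER A
    "\uff21",   # FULLWIDTH LATIN CAPITAL LETTER A
    "\u03b1",   # GREEK SMALL LETTER ALPHA
]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.east_asian_width(ch),
        naive_width(ch),
        naive_width(ch, ambiguous_wide=True),
    )
```

The Greek alpha is the troublemaker here: it lands in the "Ambiguous" category, so its width flips between 1 and 2 depending on the flag. That is exactly the kind of character where two programs stacked on top of each other can disagree.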

This is too much crap to handle.

By the way, what kind of list has "ASCII", "Arabic", "Chinese-BIG5", "Chinese-CNS", "Cyrillic-ALT", "English", "Japanese", "Latin-1", "Russian", "UTF-8" and "Windows-1255" in the same place? I'ma go ahead and smash my head against the wall for a while.


Yusuke Shinyama
Document ID: 7b9d1e5cecc907bcc427592facaf7ac4