UTF-8 support
Goal description
Current support for UTF-8, while generally good, is woefully inadequate in some areas. Thus, let's completely eradicate mojibake. The scope of this release goal is encoding, input, and display only; finer points of Unicode are out of scope.
There are four sub-goals:
- all programs should, in their default configuration, accept UTF-8 input and pass it through uncorrupted. Having to specify the encoding manually is acceptable only in a programmatic interface; GUI, std{in,out,err}, command line and plain files should work with nothing but LC_CTYPE.
- all GUI/curses/etc programs should be able to display UTF-8 output where appropriate
- all file names in source and binary packages must be valid UTF-8
- all text files should be encoded in UTF-8
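To illustrate the first sub-goal, here is a minimal sketch (in Python, assuming a POSIX system where `locale.nl_langinfo` is available) of a filter that honors LC_CTYPE instead of assuming ASCII or ISO-8859-1:

```python
import locale
import sys

def locale_encoding() -> str:
    """Return the character encoding implied by LC_CTYPE, e.g. "UTF-8"."""
    # The empty string asks for the environment's locale rather than the
    # default "C" locale the interpreter starts in.
    locale.setlocale(locale.LC_CTYPE, "")
    return locale.nl_langinfo(locale.CODESET)

def passthrough(raw: bytes) -> bytes:
    """Decode and re-encode with the locale's encoding; under a UTF-8
    locale this is the identity for any well-formed input."""
    enc = locale_encoding()
    return raw.decode(enc).encode(enc)

if __name__ == "__main__":
    # A trivial cat(1)-like filter that never mangles UTF-8 under a
    # UTF-8 locale, with no encoding option needed.
    sys.stdout.buffer.write(passthrough(sys.stdin.buffer.read()))
```

The point of the sketch is that no flag or configuration file is consulted: the locale alone determines the encoding, which is the behavior the sub-goal asks for.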
A more detailed description has been posted in a debian-devel thread.
Current status
- a number of programs mangle UTF-8 encoded data. Sometimes only 7-bit ASCII is supported; more often, the program defaults to some specific ancient encoding such as ISO-8859-1, even if LC_ALL/LC_CTYPE/LANG are set properly. For example, /usr/bin/mysql will corrupt data even if the schema was explicitly declared as using UTF-8, unless you request the encoding in every invocation.
- almost all GTK 2/GTK 3/Qt programs already display UTF-8 properly; those using traditional toolkits often don't. The same applies to curses programs.
- Lintian already has a tag, file-name-is-not-valid-UTF-8, but for binary packages only; just 5 packages fail it. There's no such tag for source packages yet, although this has been noted as problematic, e.g. for the UDD import. A somewhat different proposal is tracked at #701081.
- a number of Debian-specific files (such as changelog, copyright, etc) already require UTF-8.
- the contents of other files vary widely. Using "file" for classification, around 3k binary packages contain text files in ancient encodings. It is hard to tell what is a text file and what is not.
- perl 5.18 requires declaring an encoding for POD content that uses more than 7 bits. This tends to cause a FTBFS, so it has already been dealt with almost everywhere. As perl sources are text, this release goal would add the requirement of using specifically UTF-8; this is easy to do automatically.
- perl files with no POD are easy to detect as well
- most other scripting languages can be detected via hashbang.
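As a rough illustration of the detection problem above, a heuristic in the spirit of "file" might look like the sketch below (the function name and rules are hypothetical and far cruder than what file(1) actually does):

```python
def classify(data: bytes) -> str:
    """Crudely classify file contents as "binary", "utf-8", or
    "legacy-8bit". Hypothetical helper, not file(1)'s real logic."""
    if b"\x00" in data:
        # NUL bytes almost never occur in text files.
        return "binary"
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"  # pure 7-bit ASCII also lands here, which is fine
    except UnicodeDecodeError:
        # Not valid UTF-8, yet NUL-free: likely some legacy 8-bit encoding.
        return "legacy-8bit"

print(classify("naïve".encode("utf-8")))    # utf-8
print(classify("naïve".encode("latin-1")))  # legacy-8bit
```

Because UTF-8 is strict about continuation bytes, a successful strict decode is strong evidence the file really is UTF-8; the hard, unsolved part remains deciding whether the bytes are text at all.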
How to help
- Configure daemons to use UTF-8 locale or otherwise let them process arbitrary data by default.
- Make sure all programs run by a user obey the locale settings if the locale uses UTF-8. Support for ancient encodings is at most a wishlist issue, as their use is nearly completely gone.
- Make non-UTF-8 file names a hard reject.
- Figure out ways to convert shipped text files. The conversion itself should be easy to do via a debhelper program; the hard part is detecting what is text and what is not. Preferably, at least all of /usr/share/doc/ should be converted; /*/*bin/ is somewhat less useful but is low-hanging fruit. A file type that can be detected automatically can then be checked for UTF-8 validity.
- Manually list text files that need conversion.
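For the conversion step, a minimal idempotent recode could be sketched as follows (the function name is hypothetical, and the legacy source encoding is assumed to be already known, e.g. listed manually per package as suggested above):

```python
def recode_to_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Transcode bytes from a known legacy encoding to UTF-8, iconv-style.
    Hypothetical helper; blindly assuming ISO-8859-1 for files that are
    really in some other legacy encoding would still mangle them."""
    try:
        # Already-valid UTF-8 (including pure ASCII) is left untouched,
        # so running the conversion twice is harmless.
        data.decode("utf-8", errors="strict")
        return data
    except UnicodeDecodeError:
        return data.decode(source_encoding).encode("utf-8")
```

Such a helper would be the easy half of the debhelper approach described above; the per-file decision of whether and from what to convert remains the manual or heuristic part.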
Advocates
Adam Borowski <kilobyte@angband.pl>