= UTF-8 support =

== Goal description ==

Current support for UTF-8, while generally good, is woefully inadequate in some areas. Thus, let's completely eradicate [[https://en.wikipedia.org/wiki/Mojibake|mojibake]]. The scope of this release goal is encoding, input and display only; finer points of Unicode are out of scope.

There are four sub-goals:
 * all programs should, in their default configuration, accept UTF-8 input and pass it through uncorrupted. Having to manually specify the encoding is acceptable only in a programmatic interface; GUI/std{in,out,err}/command line/plain files should work with nothing but LC_CTYPE.
 * all GUI/curses/etc. programs should be able to display UTF-8 output where appropriate
 * all file names in source and binary packages must be valid UTF-8
 * all text files should be encoded in UTF-8

A more detailed description has been posted in a debian-devel [[https://lists.debian.org/debian-devel/2013/08/msg00217.html|thread]].

== Current status ==

 * a number of programs mangle UTF-8 encoded data. Sometimes only 7-bit ASCII is supported; more often, the program defaults to some specific ancient encoding like ISO-8859-1, even if LC_ALL/LC_CTYPE/LANG are properly set. For example, /usr/bin/mysql will corrupt data even if the schema was manually declared as using UTF-8, unless you request the encoding in every invocation.
 * almost all GTK-2/GTK-3/Qt programs already display UTF-8 properly; those using traditional toolkits often don't. The same applies to curses programs.
 * Lintian already has a tag ''file-name-is-not-valid-UTF-8'', for binary packages only. Only 5 packages fail here. There's no tag for source packages yet, although this was noted as problematic e.g. for the UDD import.
 * There's a somewhat different proposal at [[http://bugs.debian.org/701081|#701081]].
 * a number of Debian-specific files (such as changelog, copyright, etc.) already require UTF-8.
 * the contents of other files vary widely.
Using "file" for classification, around 3k binary packages contains text files with ancient encodings. It is hard to tell what is a text file, what is not. * perl 5.18 requires declaring encoding for POD content that's >7 bits. This tends to cause a FTBFS and thus has been nearly already dealt with. As perl sources are text, this release goal would add the requirement of using specifically UTF-8, this is easy to do automatically. * perl files with no pod are easy to detect as well * most other scripting languages can be detected via hashbang. == How to help == * Configure daemons to use UTF-8 locale or otherwise let them process arbitrary data by default. * Make sure all programs ran by an user obey locale settings if the locale uses UTF-8. Support for ancient encodings is at most a wishlist issue as their use is nearly completely gone. * Make non-UTF-8 file names a hard reject. * Figure out ways to convert shipped text files. This should be easy to do via a debhelper program, the hard part is in detecting what is text and what is not. Preferably, at least all of /usr/share/doc/ should be converted; /*/*bin/ is somewhat less useful but is a low-hanging fruit. A file type that can be automatically detected can then be checked for UTF-ness. * Manually list text files that need conversion. == Advocates == * Adam Borowski <>