Most of this is currently outdated. We had an i18n BoF during DebConf8, where we discussed how this can actually work and be accepted by all of Debian, including the archive, dpkg and so on. The notes from this BoF are available at i18n/TranslationDebsDebconfMeeting.
Subsequent to that meeting, there was also a Debian QA / FTPMaster meeting in Extremadura in 2008 at which various elements of the Translation Deb support were agreed and collated into a DEP - Debian Enhancement Proposal.
As such, all future discussion of Translation Debs, TDebs and related issues needs to be done via the DEP:
Please see DEP-4 for the current TDeb status and discussion.
http://dep.debian.net/deps/dep4/
The following is retained for historical purposes only.
Translation debs, or tdebs, are a concept aimed at solving the i18n/TranslationDataDistribution problem in Debian. It was discussed at the I18N meeting of 2006 in Extremadura, Spain, and at DebConf7 during the Wacky Ideas BoF. Here is the implementation approach of the concept. Other discussed implementations are available at i18n/TranslationDebsProposals.
Contents
- Please see DEP-4 for the current TDeb status and discussion.
- Implementation design
  - Definition of tdebs
  - Dependency handling
  - How do tdebs relate to regular debs? How to update translations in stable?
  - Supplemental tools / Changes to handle tdebs / Clarifications
  - Changes needed
- Problems, Questions and Discussions about this proposal
  - Size of the resulting Packages files would be huge
  - Why not just add hooks to dpkg to not install locales that are not interesting to the user?
  - Maybe better to use language packs
  - Mirrors might drop some languages
  - Some translation material might be arch specific, so placing it in an arch all tdeb is wrong
Implementation design
Definition of tdebs
An ancillary package that contains all localization information that corresponds to a {language,package} pair. It is also possible to group multiple translations of a package in one single package. The archive is a regular Debian package (with a different suffix).
This could contain all the types of localization material:
- .mo files /usr/share/locale/${LANG}/LC_MESSAGES/*.mo
- localized man pages
- any other localized material like audio, video and images that correspond to the package in question
Note: the po-debconf localization material is somewhat special, since there are certain situations that need to be handled: preinst and postrm scripts need to be localized, so the debconf translation needs to be in place and configured already when the aforementioned scripts are run. For this reason, their inclusion in the tdebs should be postponed until the problems are solved. (I welcome people to reiterate the problems we discussed about po-debconf or any problems regarding split po-debconf translations.)
The filename format would be something like:
$PACK_$VER_all.$LANG.tdeb
So for the Romanian localization material for wormux_0.7.4-3, the tdeb package name would be wormux_0.7.4-3_all.ro.tdeb.
For the cases where multiple languages are grouped in one single tdeb, there is the possibility to use a different language identifier in the name (possibly the concatenated language codes, or special identifiers like weeu for Western Europe).
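The naming scheme above can be illustrated with a small sketch. The helper names (tdeb_filename, parse_tdeb_filename) are hypothetical and not part of any existing tool; the sketch only assumes that neither the package name nor the version contains an underscore, while the language code may (e.g. pt_BR).

```python
import re

# Hypothetical helpers for the proposed $PACK_$VER_all.$LANG.tdeb scheme.
# Assumption: package names and versions never contain "_",
# but language identifiers such as pt_BR may.
TDEB_RE = re.compile(r"^(?P<pack>[^_]+)_(?P<ver>[^_]+)_all\.(?P<lang>[^.]+)\.tdeb$")


def tdeb_filename(pack: str, ver: str, lang: str) -> str:
    """Build the tdeb file name for a {package, version, language} triple."""
    return f"{pack}_{ver}_all.{lang}.tdeb"


def parse_tdeb_filename(name: str) -> tuple[str, str, str]:
    """Split a tdeb file name back into (package, version, language)."""
    match = TDEB_RE.match(name)
    if match is None:
        raise ValueError(f"not a tdeb file name: {name!r}")
    return match["pack"], match["ver"], match["lang"]


if __name__ == "__main__":
    print(tdeb_filename("wormux", "0.7.4-3", "ro"))
    print(parse_tdeb_filename("wormux_0.7.4-3_all.pt_BR.tdeb"))
```

A grouped tdeb would simply use its group identifier (for example weeu) in the language position; the sketch does not try to expand such identifiers into individual languages.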
Dependency handling
- tdebs are marked as automatically installed dependency of the main packages, but they themselves really depend on the debs.
- trying to install a tdeb without the .deb should normally fail.
- installing a deb via apt/aptitude/synaptic/any_other_aptitude_like_tool should result in installation of the deb and all the tdebs available for that version and should mark all tdebs as automated dependency.
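The three rules above can be sketched as a toy model. This is not how apt or dpkg are implemented; the Tdeb and State classes and both functions are hypothetical names used only to show the intended behaviour: installing a deb pulls in the tdebs for exactly that version and marks them as automatic, and a tdeb without its deb is refused unless forced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tdeb:
    package: str   # the deb this translation belongs to
    version: str   # the deb version it was built for
    lang: str


class MissingDebError(RuntimeError):
    """Raised when a tdeb is installed without its deb."""


class State:
    def __init__(self):
        self.debs = {}      # package name -> installed version
        self.tdebs = set()  # installed tdebs
        self.auto = set()   # tdebs marked as automatically installed


def install_deb(state, package, version, available):
    """Install a deb, then pull in every tdeb available for that version."""
    state.debs[package] = version
    pulled = [t for t in available
              if t.package == package and t.version == version]
    for tdeb in pulled:
        state.tdebs.add(tdeb)
        state.auto.add(tdeb)  # tdebs are automatic dependencies
    return pulled


def install_tdeb(state, tdeb, force=False):
    """Install a single tdeb; fail if its deb is absent, unless forced."""
    if tdeb.package not in state.debs and not force:
        raise MissingDebError(f"{tdeb.package} is not installed")
    state.tdebs.add(tdeb)


if __name__ == "__main__":
    available = [Tdeb("wormux", "0.7.4-3", "ro"), Tdeb("wormux", "0.7.4-3", "fr")]
    state = State()
    print(install_deb(state, "wormux", "0.7.4-3", available))
```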
How do tdebs relate to regular debs? How to update translations in stable?
Since translations are (usually) not the cause of application problems it would be nice to allow translation updates even after the release.
The following diagram shows how tdebs result from a package that is released (in stable) and translations are updated later.
  regular_deb-src -+-(dpkg-bp)-+--> .deb packages (current debs without l10n material) (tdeb-ized)
                   |           |
                   |           +--> .tdeb packages (as many as l10n material exists)
                   |
                   +--(dpkg-gentdebsrc)---> (.dsc + .tar.gz) = tdeb-source package(s?) (for translation updates)
                                                 |
                                                 +--(dpkg-bp)--> .tdeb packages (newer)
dpkg-gentdebsrc is a tool that needs to be created. It creates a whole new debian source package which contains only l10n material. This new source, if compiled with dpkg-buildpackage, should generate a new set of tdebs that should supersede the initially generated ones.
Note: it is not clear whether generating a source tdeb for each language would be a good thing, but if done so, each translation maintainer would get the opportunity to maintain their own language's translation.
Supplemental tools / Changes to handle tdebs / Clarifications
Aptitude/Apt/dpkg must be modified to allow installation of the binary and source tdeb packages (see explanations above and examples). (Eg.: apt-get source --l10n ro wormux should do the right thing)
- By default, tdebs are not visible in searches, views, etc. Users need to force the display by adding an option like "--l10n ro", or some menu option.
- dpkg-gentdebsrc - a helper command that creates source tdeb packages needs to be created (some people suggested that the tdeb-source packages could be made by hand in the beginning, but I believe that this could result in bad tdebs-sources; that can lead to many mistakes and packages of poor quality)
- the main deb package will no longer contain l10n material
- Q: how does one ensure smooth upgrades for translations without losing them (e.g.: from etch to lenny without losing translations)?
- A: providing material in both the base package and the tdebs; files are handled via diversions (? am I missing something ?)
Changes needed
Changes in dpkg
New location in /var/lib/dpkg/info/l10n/$LANG (or /var/lib/dpkg/info/tdeb/$LANG) which contains the .list files.
Properties of the tdebs:
- the tdebs can have all types of maintainer scripts provided by the main package (filename /var/lib/dpkg/info/l10n/package.{pre,post}{inst,rm}),
- or the tdebs themselves can provide the maintainer scripts for the tdebs (filename /var/lib/dpkg/info/l10n/$LANG/package.{pre,post}{inst,rm}).
- the tdeb maintainer scripts override the scripts provided by the main package if tdeb_version >> package_version. If the tdeb wants to stop providing its own maintainer scripts, it must provide empty scripts.
- /var/lib/dpkg/status is modified to add a field "Installed-Translations:" that would consist of a comma separated list of translations the current package has installed.
Installation of tdeb packages will not be different in any way, but they must take place after the regular deb was installed (to have maintainer scripts available). dpkg will refuse to install a tdeb if the deb is not present (can be forced).
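To make the proposed bookkeeping concrete, here is a minimal sketch that reads the suggested "Installed-Translations:" field from a status stanza and derives where the per-language .list file would live. The stanza content is invented for illustration, the function names are hypothetical, and the path layout ($LANG directory containing package.list) is one reading of the proposal, not an existing dpkg feature.

```python
def parse_stanza(text):
    """Parse one 'Key: value' control stanza into a dict.

    Continuation lines (starting with a space or tab) are appended
    to the previous field.
    """
    fields = {}
    key = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and key is not None:
            fields[key] += "\n" + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields


def installed_translations(stanza):
    """Return the languages listed in the proposed Installed-Translations field."""
    raw = parse_stanza(stanza).get("Installed-Translations", "")
    return [lang.strip() for lang in raw.split(",") if lang.strip()]


def list_file_path(package, lang, base="/var/lib/dpkg/info/l10n"):
    """Assumed location of the file list of a tdeb (layout is an assumption)."""
    return f"{base}/{lang}/{package}.list"


if __name__ == "__main__":
    stanza = (
        "Package: wormux\n"
        "Status: install ok installed\n"
        "Version: 0.7.4-3\n"
        "Installed-Translations: ro, fr, pt_BR\n"
    )
    for lang in installed_translations(stanza):
        print(lang, list_file_path("wormux", lang))
```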
Discussion
Without further changes, would there be a way to remove a tdeb? -- FilipusKlutiero
Dpkg will have a separate modifier parameter --lang <lang>, so you can act on a tdeb instead of the main package; the packages' information is in a separate place, so current dpkg will not touch that info.
OK. I guess that means the section is not really complete [yet]... -- FilipusKlutiero
Changes in apt and archive
- The l10n material is selected by default for the default language. Adding other languages is done via the "EXTRALANGS" variable in /etc/default/locale. There is no need of any other supplemental enabling/disabling/configuration.
- The tdebs will live in the common pool section.
- Apt will take care to install all selected and available tdebs for a given installed package, after the deb was installed. Dependency resolving is done in apt libraries, NOT in the applications/frontends.
Selecting translations
/etc/apt/sources.list does not contain any changes
Translations to be installed are selected via the LANG variable in /etc/default/locale and the new EXTRALANGS variable in the same file.
Example for a system mainly localized in Romanian with French and Brazilian Portuguese extra translations:
$ cat /etc/default/locale
LANG=ro_RO.UTF-8
EXTRALANGS='fr pt_BR'
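As a rough sketch of how a tool could derive the list of wanted translations from that file: the function names below are hypothetical, and the rule for turning LANG into a language code (dropping the codeset and modifier, so ro_RO.UTF-8 becomes ro_RO) is an assumption. How ro_RO would then be matched against a tdeb named ro is left open by the proposal.

```python
import shlex


def read_shell_vars(text):
    """Read simple KEY=value lines, honouring shell-style quoting."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex.split removes the quotes around values such as 'fr pt_BR'
        values[key.strip()] = " ".join(shlex.split(raw))
    return values


def selected_languages(text):
    """Default language first, then EXTRALANGS, without duplicates."""
    values = read_shell_vars(text)
    langs = []
    # Assumption: strip codeset (.UTF-8) and modifier (@euro) from LANG.
    default = values.get("LANG", "").split(".", 1)[0].split("@", 1)[0]
    if default and default not in ("C", "POSIX"):
        langs.append(default)
    for extra in values.get("EXTRALANGS", "").split():
        if extra not in langs:
            langs.append(extra)
    return langs


if __name__ == "__main__":
    sample = "LANG=ro_RO.UTF-8\nEXTRALANGS='fr pt_BR'\n"
    print(selected_languages(sample))
```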
Archive layout is not relevant and not bound to this
Advantages
- mirror layout is not bound in any way (we can decide about this later, and it does not impose restrictions on other things)
- no need of empty Packages files
- default system language is pulled on upgrade from old to new apt
- does not pollute /etc/apt/sources.list with incompatible lines
- /etc/default/locale is better suited for this kind of thing
Disadvantages
- multi-language systems will need to define EXTRALANGS before upgrade
- need to wait for a new release to enable translations by default due to previous point
Discussion
The first Google hit for "/etc/default/locale" is from sco.com. Since that file is apparently a standard Unix file, I doubt that it's a good idea to use it only for APT. I think that an APT configuration parameter such as APT::Default-Translations would be better. When this is not defined, APT could rely on /etc/default/locale or /etc/locale.gen. -- FilipusKlutiero
We already have /etc/default/locale and it was introduced recently (after or with the release of Sarge), so I am not that sure it is a standard Unix file.
Format of the Translations file
The file is similar to a Packages file, except there is no description and other needless information.
Example for a Romanian tdeb for wormux 0.7.4-3, translation updated for the first time, version is 0.7.4-3+t1.
Package: wormux
Installed-Size: $IS
Size: $S
Maintainer: Debian Games Team <pkg-games-devel @ lists.alioth.debian.org>
Architecture: all
Source: wormux-tdebs
Version: 0.7.4-3+t1
Filename: pool/main/w/wormux/wormux_0.7.4-3+t1_all.ro.tdeb
MD5sum: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SHA1: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SHA256: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
or, if there will be a tdeb-source for each language:
Package: wormux
Installed-Size: $IS
Size: $S
Maintainer: Debian L10N Romanian <debian-l10n-romanian @ lists.debian.org>
Architecture: all
Source: wormux-tdeb-ro
Version: 0.7.4-3+t1
Filename: pool/main/w/wormux/wormux_0.7.4-3+t1_all.ro.tdeb
MD5sum: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SHA1: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SHA256: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Notes:
- Package is the deb package for which the translation is provided (do we need this?)
- The Maintainer field could change (a translation maintainership team) after release.
- The Source field could change at different points in time (e.g.: "Debian Games Team" for 0.7.4-3, "Debian L10N Romanian" for 0.7.4-3+t1)
- The language name can be decided from the place in the archive.
- Depends: is not needed; it can be inferred.
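A minimal sketch of how a client could read such a Translations entry and infer what the notes say can be inferred. All function names are hypothetical. Two assumptions are made that the proposal does not fix: the language is taken from the file name suffix (the notes only say it follows from the place in the archive), and a trailing "+tN" in the version marks a translation-only update of the deb version before it.

```python
def parse_translations(text):
    """Split a Translations-style file into one dict per stanza."""
    entries = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            entries.append(fields)
    return entries


def language_of(entry):
    """Language code taken from a name ending in _all.<lang>.tdeb."""
    basename = entry["Filename"].rsplit("/", 1)[-1]
    if not basename.endswith(".tdeb") or "_all." not in basename:
        raise ValueError(f"unexpected file name: {basename!r}")
    stem = basename[: -len(".tdeb")]
    return stem.rsplit("_all.", 1)[-1]


def base_version(entry):
    """Deb version the tdeb belongs to: the part before a trailing +tN."""
    version = entry["Version"]
    head, sep, tail = version.rpartition("+t")
    return head if sep and tail.isdigit() else version


if __name__ == "__main__":
    sample = (
        "Package: wormux\n"
        "Architecture: all\n"
        "Source: wormux-tdebs\n"
        "Version: 0.7.4-3+t1\n"
        "Filename: pool/main/w/wormux/wormux_0.7.4-3+t1_all.ro.tdeb\n"
    )
    entry = parse_translations(sample)[0]
    print(entry["Package"], base_version(entry), language_of(entry))
```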
Discussion
Problems, Questions and Discussions about this proposal
Size of the resulting Packages files would be huge
EddyP: well, no, because the tdebs are in a separate section (another source must be added in sources.list), so it is a separate file. Indeed, there could be many Translations files, one per language, but their sizes would depend on the available l10n material for each package. See the archive format above.
Why not just add hooks to dpkg to not install locales that are not interesting to the user?
EddyP : because this is what localepurge is doing and is a hack. Also, having tdebs would allow updates of translations after releases. Note that l10n material is more than just .mo files and localized man pages; it can contain material like audio files for a game with speech in a certain language or similar material. Also, because of economic reasons we should prefer slimmer packages and less bandwidth usage.
Maintainers do not have to care that much about translation updates and can delegate that to translation teams; translators are empowered to backport translations and will be able to make translations in stable releases.
Maybe better to use language packs
RaphaelHertzog: I edited the UsefulImprovements page to add a note concerning mirrors. They don't necessarily like less download if it means more files to download, because each file costs a disk seek which is time lost during which they can't send any data out. There's a reason why Ubuntu has big "language packs" instead of many small files and I believe that you must take that into account as well. As time goes, we can take less care of the local disk usage and accept some middle ground. IMO we should group translations even if it means that we have on disk some useless translations. It's already the case... and I won't be bothered to have some not used french translation instead of having all translations in all languages. Some intelligent grouping should be possible (all of essential, then all of standard, then the rest grouped by logical groups maybe our official sections).
- EddyP: I haven't thought of the more accesses issue until now
- EddyP: about the grouping, that is not such a great idea in contrast to 1 tdeb/lang/pack or grouping more languages per package:
- the main reason why one tdeb per language would be a good idea is that it would allow updates of specific translations
- no patch in libc is needed
Ubuntu has such a change in libc to use alternative places for the mo files. Choosing one or the other is not as straightforward as one might think: you have to make sure the translation you choose is newer than the other AND that it is compatible with the current version of the app in the deb. There were some really nasty problems because of this: segfaults, since the number of params in the program did not fit the ones in the mo files (https://launchpad.net/ubuntu/+source/language-pack-es-base/+bug/42264); also, some problems seem to occur due to bundling (https://launchpad.net/ubuntu/+source/language-pack-es-base/+bug/52267)
- The translations belong to tdebs or to no other binary package (regular debs would not have them) so there is no overlap
making big native language-pack uploads due to small deltas (po files are text and it is most likely that just small parts will be updated) doesn't seem like a good idea bandwidth-wise. Don't forget, Ubuntu is not stripping translations from packages, they are pulling them together from Rosetta; we would like to have tdebs built initially together with the base package, then later treat them separately (see schematic above)
tdebs will contain all types of localization material, not only mo files (see the definition of tdebs)
I completely agree with EddyP. While intelligent grouping would be good, I disagree that we can have efficient grouping. The only criterion that comes to mind is the one you mention, priority, yet default Debian installs have extra packages and setting/adjusting priorities is increasingly done semi-randomly or not at all. I would speculate that under 5% of Debian installs do not have an optional or extra package. -- FilipusKlutiero
Are you saying that disk read is the bottleneck for a significant number of Debian mirrors? -- FilipusKlutiero
This is a non-issue. The size of the archive and the number of files in it is such that the bigger the file the more it is likely to be fragmented on the disk, making the disk possibly seek as much for a big file as with a bunch of smaller files. Plus, the traffic on mirrors is significant enough that disks are already seeking a lot. Do you think a mirror has to serve only 1 file at a time? -- MikeHommey
I have no figures for all mirrors. My remarks are based on a lengthy discussion with one of the admins of ftp.fr.debian.org (François Pétillon, see one of his mails on debian-devel; he works for Free.fr, a big French ISP). He raised the point in particular for CD images: he really prefers that people download ISO files instead of using jigdo to regenerate the ISO, because his resources are better used this way. And yes, he's limited not by the bandwidth but by the disk capacity. Even with striping and/or mirroring, the disk is the bottleneck. Even with 16Gb of RAM, when you have hundreds of Gb of files to serve, the cache in RAM doesn't help much. -- RaphaelHertzog
Thanks. According to wp:Hard disk, random access time is on average about 10 ms. The same time spent doing a transfer should feed on average about 1 MB. Supposing that full CDs contain about 700 packages, half of the disk IO needed to serve a full CD to jigdo is spent in seeks. These numbers would explain why admins of disk IO-restricted mirrors complain about jigdo, so I guess the reality is in this order of magnitude. What this means is that, to compensate for the disadvantage of having to perform an extra seek to serve a package, splitting the l10n material would need to reduce the package [compressed] size by at least 1 MB. So, TDebs would still be an advantage for mirrors in the case of office suites, Iceweasel, Amarok and KDE (in the way this is currently done, that is kde-i18n-foo packages for the complete official KDE) for example. How much do a seek and a MB of bandwidth consumption cost on average to Debian mirrors? -- FilipusKlutiero
For our mirror (debian.tu-bs.de) disk seeks and disk bandwidth are the bottleneck. We were using raid5 before, and have switched to raid1 because it was too slow. Even now we can't fill the 1GBit link serving debian. When the whole file fits into memory (like a Kanotix release or the wikipedia DVD) the server can send 3-5 times the normal debian load (~200MBit). -- JanLübbe
- that is definitely a point, but you are forgetting that the file can be in memory only because those files are downloaded frequently in a short period of time (shortly after the image release), so they remain cached from one request to the other. There isn't a way that I can think of to obtain something similar for debian packages (current ones) in a constant manner. Not sure if tdebs would make the problem worse in a significant way (one or more orders of magnitude).
- please bear in mind that most mirrors would most likely serve the same tdebs in a short period of time, thus they would most likely be in cache; more than that, it is possible that many of the non-local tdebs (e.g.: tdebs with German localization material on a French mirror) will never be touched. Plus, if the mirroring is done right, mirrors might choose not to mirror those languages which do not make sense for a mirror.
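The seek-versus-transfer estimate from the discussion above can be checked with a short calculation. The two constants come from that comment (about 10 ms per seek, about 1 MB transferred in the same time); the 700 MB size of a full CD is an assumption added here to make the arithmetic explicit, and the function names are hypothetical.

```python
SEEK_SECONDS = 0.010             # average random access time quoted above
BYTES_PER_SEEK_TIME = 1_000_000  # about 1 MB transferred in one seek time


def throughput_bytes_per_second():
    """Sequential throughput implied by the two figures above."""
    return BYTES_PER_SEEK_TIME / SEEK_SECONDS


def seek_fraction(n_files, total_bytes):
    """Share of disk time spent seeking when serving n_files files."""
    seek_time = n_files * SEEK_SECONDS
    transfer_time = total_bytes / throughput_bytes_per_second()
    return seek_time / (seek_time + transfer_time)


def break_even_bytes(extra_seeks=1):
    """Size reduction needed to pay for the given number of extra seeks."""
    return extra_seeks * SEEK_SECONDS * throughput_bytes_per_second()


if __name__ == "__main__":
    # Assumed: a full CD is about 700 MB and holds about 700 packages.
    print(seek_fraction(700, 700 * 1_000_000))
    print(break_even_bytes(1))
```

Under these figures one seek costs as much disk time as one average package, which is where the "half of disk IO" and "at least 1 MB" numbers in the comment come from.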
Mirrors might drop some languages
EddyP: If one mirror chooses to not mirror a language, since there is no special line in apt to indicate the place where to get those translations there is a need to either specify a new mirror just to get the tdebs or some mechanism/setting in /etc/default/locale. Which would be the best approach?
Some translation material might be arch specific, so placing it in an arch all tdeb is wrong
EddyP: Frans was pointing out during Neil's talk at FOSDEM 2008 (lo-res video) that there might be cases where a translation changes with the arch.
- I fail to see such a case fall outside of one of the two cases below. Please contradict me with other cases, if any.
- the translated material is actually part of an arch all package, see the D-I manual; it is generated/picked at build time
Solution: the package is already arch all, so this is not a real problem
- some strings might be applicable only to an arch, not to all; the translations are picked at run time. Since I don't know how other translation systems work, besides gettext, I will talk only about gettext. If you know other systems that behave differently from gettext in a way that would invalidate my points below, please comment.
- since building a .po with X valid translated strings into a .mo results in the same X number of .mo entries, independent of the arch (there's no mechanism to specify "this is irrelevant for arch A"), all X translations will be present in the .mo files, independent of the build machine. Some strings might never be hit on an arch, but they will be available
- there is indeed the case that the reference ID for the translation is not unique; workarounds include:
- leave that in the arch specific package, like it is now; we don't lose anything
- hack the ID generation to add the arch somehow
- Frans pointed out the case of choose-mirror which builds the list of mirrors (which contain/use country names - which are translatable). So, a binary for arch X will contain the translated list of country names that have at least one mirror for that arch.
- this could probably be dealt with so that the arch all tdeb ships all translations, while at build time the arch specific package has a list of the countries to display; at runtime, choose-mirror just picks the country names that are indicated in the list present in the .deb
- note that, AFAIK, choose-mirror uses po-debconf, which isn't yet handled by this proposal due to really problematic corner cases for availability of the translations during preinst/install/upgrade
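The workaround suggested for the choose-mirror case can be sketched as follows. This is a toy illustration, not choose-mirror's actual code: the data and the function name are invented, with the translation table standing in for what an arch all tdeb would ship and the country list standing in for what the arch specific .deb would ship.

```python
# Shipped by the (hypothetical) arch all tdeb: every translated country name.
ALL_TRANSLATIONS = {
    "ro": {"France": "Franța", "Germany": "Germania", "Japan": "Japonia"},
}

# Shipped by the (hypothetical) arch specific deb: countries that have
# at least one mirror for this architecture.
COUNTRIES_WITH_MIRROR = ["France", "Germany"]


def localized_countries(lang, countries, translations):
    """Pick, at run time, only the names listed by the arch specific package.

    Untranslated names, or an unknown language, fall back to the original.
    """
    table = translations.get(lang, {})
    return [table.get(country, country) for country in countries]


if __name__ == "__main__":
    print(localized_countries("ro", COUNTRIES_WITH_MIRROR, ALL_TRANSLATIONS))
```

The tdeb stays identical on every architecture; only the short country list differs per arch, so nothing arch specific ends up in the translation package.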