Differences between revisions 13 and 14
Revision 13 as of 2007-02-21 00:01:58
Size: 20100
Editor: EddyPetrisor
Comment: one more thing about advantages over language-packs
Revision 14 as of 2007-02-21 06:02:13
Size: 20673
Comment: Problems, Questions and Discussions about this proposal

Translation debs, or tdebs, are a concept aimed at solving the I18n/TranslationDataDistribution problem in Debian. The concept was discussed at the 2006 I18N meeting in Extremadura, Spain. This page collects possible implementations of the concept. Other implementations that were discussed should be added to the page by their authors, along with the pros and cons of each approach.



Aigarius proposal

TDeb structure

TDebs could be made to be the same format as the regular deb files or be simple tar.gz (or tar.bz2) archives.

Dpkg level changes

A new folder would be introduced - /var/lib/dpkg/info/tdebs/ . This folder would contain .list files for tdebs and would be considered for removing files of tdebs and (optionally) for conflict resolution. Allowing tdebs to overwrite files from their respective base packages might ease the transition.

A new hook script would be added - /var/lib/dpkg/info/*.posttrans . If a package has special i18n requirements and some commands need to be run after installation of a translation package, then it could provide this script. It would be called by dpkg after installation (or removal) of a translation package with appropriate parameters (language iso code, for example).

EddyP: I wonder if there could be cases where a single posttrans wouldn't fit all the tdebs.

/var/lib/dpkg/status is modified to add a field "Installed-Translations:" that would consist of a comma separated list of translations the current package has installed.

When "dpkg --install somepackage-1.0-4.ru.tdeb" is run, then dpkg determines the base package name, asserts that base package is installed, unpacks the tdeb, puts its file list into /var/lib/dpkg/info/tdebs/somepackage.ru.list, adds the language to "Installed-Translations" field in status file and runs /var/lib/dpkg/info/somepackage.posttrans if it exists.
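The bookkeeping part of these steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, nothing here is real dpkg code, and only the paths and the "Installed-Translations" field come from the proposal above.

```python
from pathlib import PurePosixPath

# Directory proposed above for the per-translation .list files.
TDEB_INFO_DIR = PurePosixPath("/var/lib/dpkg/info/tdebs")


def list_file_for(package: str, lang: str) -> PurePosixPath:
    """Return the proposed location of the file list for one translation."""
    return TDEB_INFO_DIR / f"{package}.{lang}.list"


def add_installed_translation(field_value: str, lang: str) -> str:
    """Add a language to a comma separated Installed-Translations value.

    Hypothetical helper: the language is appended only if it is not
    already listed, and existing entries keep their order.
    """
    langs = [item.strip() for item in field_value.split(",") if item.strip()]
    if lang not in langs:
        langs.append(lang)
    return ", ".join(langs)
```

For the example above, list_file_for("somepackage", "ru") yields /var/lib/dpkg/info/tdebs/somepackage.ru.list, and add_installed_translation("de, fr", "ru") yields "de, fr, ru".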

APT level changes

A configuration file /etc/apt/languages.list would be introduced that lists language codes for which translations must be installed.

Translations will be installed upon installation or upgrade of a package (for one package) or upon an upgrade or dist-upgrade (for all packages).

Translation packages will be in no way included in the dependency calculations. Packages for selected languages for all installed packages will be installed. All other translations will be removed.

Downloading and parsing of the Translations file from the mirror (see below) will need to be added. It would be preferable to parse this file sequentially, without holding all of it in memory, to keep memory use low. Alternatively, fetching could be implemented without an index.

Mirror level changes

A Translations file will need to be added at the same level as Sources files are now. The Translations file could be as simple as "packagename-version: comma separated list of iso codes of available translations". It would also be possible to avoid needing such a file and simply construct request URLs from known components: package name, version and language. A 404 error would indicate the absence of such a translation.

Translations could be located in the package pool in a separate subdirectory of a directory of the package, for example /debian/pool/main/s/sb/sbackup/tdebs/
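Both options can be sketched briefly in Python. This is a rough illustration, not a specification: the function names are hypothetical, the index format is the one-line-per-package form suggested above, and the tdeb filename pattern follows the "somepackage-1.0-4.ru.tdeb" example from the dpkg section.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_translations(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (packagename-version, [language codes]) one line at a time.

    Works on any iterable of lines (for example an open file), so the
    whole index never has to be held in memory.
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the LAST colon: a version may contain an epoch colon,
        # while language codes never contain one.
        key, sep, codes = line.rpartition(":")
        if not sep:
            continue  # assumption: silently skip malformed lines
        yield key.strip(), [c.strip() for c in codes.split(",") if c.strip()]


def tdeb_url(package_pool_url: str, package: str, version: str, lang: str) -> str:
    """Build a request URL without any index (a 404 would mean 'no such tdeb').

    package_pool_url is the package's own pool directory; the tdebs/
    subdirectory is the one proposed above.
    """
    return f"{package_pool_url.rstrip('/')}/tdebs/{package}-{version}.{lang}.tdeb"
```

Passing an open file object to iter_translations() gives the sequential parsing mentioned in the APT section, since only one line is processed at a time.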

Archive maintenance changes

(I know little of this, so this will need corrections)

TDebs could be created either by stripping translations from existing packages (temporary solution), by using the (not ready yet) Big Universal Debian i18n System, or by manual uploads. Translations could be extracted from packages either at build time (with the help of some debhelper) or even after the upload, just prior to installing the package into the archive.

Did I forget anything?


EddyP's proposal

Definition of tdebs

An ancillary package that contains all localization information that corresponds to a {language,package} pair. The archive is a regular Debian package (with a different suffix).

This could contain all the types of localization material:

  • .mo files /usr/share/locale/${LANG}/LC_MESSAGES/*.mo
  • localized man pages
  • any other localized material like audio, video and images that correspond to the package in question

Note: the po-debconf localization material is somewhat special since there are certain situations that need to be handled: preinst and postrm scripts need to be localized, so the debconf translation needs to be in place and configured already when the aforementioned scripts are run. For this reason, its inclusion in the tdebs should be postponed until the problems are solved. (I welcome people to reiterate the problems we discussed about po-debconf or any problems regarding split po-debconf translations.)

The filename format would be something like:

$PACK_$VER_all.$LANG.tdeb

So for the Romanian localization material for wormux_0.7.4-3, the tdeb package name would be wormux_0.7.4-3_all.ro.tdeb.
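A minimal sketch of how a tool might take such a filename apart, assuming the $PACK_$VER_all.$LANG.tdeb format above. The function and type names are hypothetical. The parsing relies on the fact that Debian package names and versions cannot contain underscores, while a language code such as pt_BR can.

```python
from typing import NamedTuple


class TdebName(NamedTuple):
    package: str
    version: str
    arch: str
    lang: str


def parse_tdeb_filename(filename: str) -> TdebName:
    """Split $PACK_$VER_$ARCH.$LANG.tdeb into its components."""
    suffix = ".tdeb"
    if not filename.endswith(suffix):
        raise ValueError(f"not a tdeb filename: {filename!r}")
    stem = filename[: -len(suffix)]
    # At most two splits, so underscores in the language (pt_BR) survive.
    parts = stem.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected tdeb filename: {filename!r}")
    package, version, rest = parts
    # The first dot separates the architecture from the language code.
    arch, sep, lang = rest.partition(".")
    if not (sep and arch and lang):
        raise ValueError(f"missing architecture or language: {filename!r}")
    return TdebName(package, version, arch, lang)
```

For example, parse_tdeb_filename("wormux_0.7.4-3_all.ro.tdeb") returns the package "wormux", version "0.7.4-3", architecture "all" and language "ro".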

Dependency handling

  • tdebs are marked as automatically installed dependencies of the main packages, but they themselves really depend on the debs.
  • trying to install a tdeb without the .deb should normally fail.
  • installing a deb via apt/aptitude/synaptic/any_other_aptitude_like_tool should result in installation of the deb and all the tdebs available for that version, and should mark all tdebs as automatic dependencies.

How do tdebs relate to regular debs? How to update translations in stable?

Since translations are (usually) not the cause of application problems it would be nice to allow translation updates even after the release.

The following diagram shows how tdebs result from a package that is released (in stable) and translations are updated later.

regular_deb-src -+-(dpkg-bp)-+--> .deb packages (current debs without l10n material)
 (tdeb-ized)     |           |
                 |           +--> .tdeb packages (as many as l10n material exists)
                 |
                 +--(dpkg-gentdebsrc)---> (.dsc + .tar.gz) = tdeb-source package(s?) (for translation updates)
                                                                  |
                                                                  +--(dpkg-bp)--> .tdeb packages (newer)

dpkg-gentdebsrc is a tool that needs to be created. It creates a whole new debian source package which contains only l10n material. This new source, if compiled with dpkg-buildpackage, should generate a new set of tdebs that should supersede the initially generated ones.

Note: it is not clear whether generating a source tdeb for each language would be a good thing, but if it were done, each translation team would get the opportunity to maintain its own language's translation.

Supplemental tools / Changes to handle tdebs / Clarifications

  • Aptitude/Apt/dpkg must be modified to allow installation of the binary and source tdeb packages (see explanations above and examples). (Eg.: apt-get source --l10n ro wormux should do the right thing)

  • By default, tdebs are not visible in searches, views, etc. Users need to force the display by adding an option like "--l10n ro", or some menu option.
  • dpkg-gentdebsrc - a helper command that creates source tdeb packages needs to be created (some people suggested that the tdeb-source packages could be made by hand in the beginning, but I believe that this could result in bad tdebs-sources; that can lead to many mistakes and packages of poor quality)
  • the main deb package will no longer contain l10n material
    • Q: how does one ensure smooth upgrades for translations without losing them (e.g.: from etch to lenny without losing translations)?
    • A: by providing the material in both the base package and the tdebs, with the overlapping files handled via diversions (? am I missing something ?)

Changes needed

Changes in dpkg

New location in /var/lib/dpkg/info/l10n/$LANG (or /var/lib/dpkg/info/tdeb/$LANG) which contains the .list files.

This idea is almost the same as Aigars', except:

  • the tdebs can have all types of maintainer scripts provided by the main package (filename /var/lib/dpkg/info/l10n/package.{pre,post}{inst,rm})

  • or the tdebs themselves can provide the maintainer scripts for the tdebs (filename /var/lib/dpkg/info/l10n/$LANG/package.{pre,post}{inst,rm}).

  • the tdeb maintainer scripts override the scripts provided by the main package if tdeb_version >> package_version. If a tdeb wants to stop providing (or no longer use) tdeb maintainer scripts, it must provide empty scripts.

Installation of tdeb packages will not be different in any way, but it must take place after the regular deb has been installed (to have maintainer scripts available). dpkg will refuse to install a tdeb if the deb is not present (this can be forced). The Installed-Translations: field is populated as Aigars proposed.

Changes in apt and archive

  • The l10n material is selected by adding new sources in /etc/apt/sources.list. There is no need of any supplemental enabling/disabling/configuration.
  • The tdebs will live in the common pool section.
  • Apt will take care to install all available tdebs for a given installed package, after the deb was installed. Dependency resolving is done in apt libraries, NOT in the applications/frontends.

Selecting translations -- Entirely new (incompatible) approach

/etc/apt/sources.list can contain new sources which implicitly select desired languages:

tdeb-$LANG http://ftp.debian.org/debian/ etch main

Archive layout is: ftp.debian.org/debian/dists/l10n/etch/main/$LANG/Translations
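As a rough sketch, the mapping from such a sources.list line to the Translations file(s) it selects could look like the following Python. The function name is hypothetical and this is not existing apt behaviour. It only restates the line format and archive layout given above.

```python
from typing import List


def translations_urls(source_line: str) -> List[str]:
    """Map 'tdeb-$LANG URI DIST COMPONENT...' to Translations file URLs.

    Follows the proposed layout dists/l10n/$DIST/$COMPONENT/$LANG/Translations.
    """
    fields = source_line.split()
    prefix = "tdeb-"
    if len(fields) < 4 or not fields[0].startswith(prefix):
        raise ValueError(f"not a tdeb source line: {source_line!r}")
    lang = fields[0][len(prefix):]
    uri = fields[1].rstrip("/")
    dist = fields[2]
    components = fields[3:]
    return [
        f"{uri}/dists/l10n/{dist}/{component}/{lang}/Translations"
        for component in components
    ]
```

For "tdeb-ro http://ftp.debian.org/debian/ etch main" this gives a single URL, http://ftp.debian.org/debian/dists/l10n/etch/main/ro/Translations.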

Advantages
  • mirrors can easily exclude the translations since they are in a different section
  • no need of empty Packages files

Disadvantages
  • need to wait for a new release to enable translations by default (or don't we?)
  • mirrors can easily exclude the translations since they are in a different section and they will, by default :-(

  • somewhat redundant $DIST location / l10n files are not under the same tree as the regular Packages files (l10n/etch in contrast to etch)
  • changes somewhat the way paths are implemented since the "binary-$ARCH" bit is missing
  • can't support arch dependent l10n material
    • EddyP @ 2007-02-18: is all l10n material arch-independent? (maybe not for weird/non-gettext l10n material where the compiled content might be endian-dependent)

Selecting translations -- Semi-compatible approach

Since the first incompatible proposal will not offer support for smooth upgrades from the last release that does not support tdebs to the one that does due to changes in apt/dpkg, the following proposal is semi-compatible or semi-incompatible in that regard. It would allow addition of translation sources even on old systems without risks, but would allow a new syntax for tdebs for entirely compliant systems.

(Not sure if this is an unnecessary headache. Please talk some sense into me :-), if you care! )

/etc/apt/sources.list can contain new sources which implicitly select desired languages in the new or old form:

deb-tdeb-$LANG http://ftp.debian.org/debian/ etch main

deb http://ftp.debian.org/debian/ etch main/l10n/$LANG

Archive layout is: ftp.debian.org/debian/dists/etch/main/l10n/$LANG/binary-$ARCH/{Translations,Packages}

Advantages
  • no need to wait for a new release to enable translations by default
  • mirrors can't easily exclude the translations ;-)

Disadvantages
  • mirrors can't easily exclude the translations
  • the binary-$ARCH/Translations files might be all the same since most of the l10n material will be arch-independent
  • empty Packages files must be placed in all directories where Translations files are placed
  • the Packages files can't be used instead of the Translations files with legacy dpkg since dpkg does not know how to handle the translation files' naming format

EddyP @ 2007-02-18: Does this approach have any significant benefits over the compatible one?

Selecting translations -- Compatible approach

If the previous proposals are unacceptable due to backward compatibility reasons, a line like the following should be used

deb http://ftp.debian.org/debian/ etch l10n/$LANG/main/

Archive layout is: ftp.debian.org/debian/dists/etch/l10n/$LANG/main/{Translations,Packages}

The compatible archive should contain, in the expected place, a Packages file with no packages and a Translations file listing the corresponding tdebs.

Advantages
  • no need to wait for a new release to enable translations by default
  • mirrors can easily exclude the translations

Disadvantages
  • mirrors can easily exclude the translations :-(

  • the last slash is important and inconsistent with the current approach :-(

  • can't support arch dependent l10n material
    • EddyP @ 2007-02-18: is all l10n material arch-independent? (maybe not for weird/non-gettext l10n material where the compiled content might be endian-dependent)
  • the Packages files can't be used instead of the Translations files with legacy dpkg since dpkg does not know how to handle the translation files' naming format
  • empty Packages files must be placed in all directories where Translations files are placed so that incompatible apt doesn't complain when the Packages file is missing

Selecting translations -- No visible change approach

/etc/apt/sources.list does not need to contain any changes

Translations to be installed are selected via the LANG variable in /etc/default/locale and the new EXTRALANGS variable

Example for a system mainly localized in Romanian with French and Brazilian Portuguese extra translations:

$ cat /etc/default/locale
LANG=ro_RO.UTF-8
EXTRALANGS='fr pt_BR'
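A small Python sketch of how a tool might turn this file into the list of languages to fetch. The function names are hypothetical, and one detail is an assumption that the proposal does not settle: the LANG value is reduced to both the full locale (ro_RO) and the bare language code (ro), similar to the fallback gettext uses when looking up catalogs.

```python
import shlex
from typing import Dict, List


def parse_locale_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, honouring shell-style quoting."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        tokens = shlex.split(value)  # strips the quotes around 'fr pt_BR'
        settings[key.strip()] = tokens[0] if tokens else ""
    return settings


def wanted_languages(settings: Dict[str, str]) -> List[str]:
    """Return the language codes selected by LANG and EXTRALANGS."""
    langs: List[str] = []
    # Drop the encoding (.UTF-8) and any modifier (@euro) from LANG.
    locale = settings.get("LANG", "").split(".", 1)[0].split("@", 1)[0]
    if locale and locale not in ("C", "POSIX"):
        langs.append(locale)
        language = locale.split("_", 1)[0]
        if language != locale:
            langs.append(language)  # assumption: also try the bare language
    for extra in settings.get("EXTRALANGS", "").split():
        if extra not in langs:
            langs.append(extra)
    return langs
```

With the example file above, wanted_languages() returns ['ro_RO', 'ro', 'fr', 'pt_BR'].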

Archive layout is not relevant and not bound to this

Advantages
  • mirror layout is not bound in any way :-) (we can decide about it later, and it doesn't impose restrictions on other things)

  • no need of empty Packages files
  • default system language is pulled on upgrade from old to new apt
  • does not pollute /etc/apt/sources.list with incompatible lines
  • /etc/default/locale is better suited for this kind of things

Disadvantages
  • multi-language systems will need to define EXTRALANGS before upgrade
  • need to wait for a new release to enable translations by default

Format of the Translations file

The file is similar to a Packages file, except there is no description and other needless information.

Example for a Romanian tdeb for wormux 0.7.4-3, translation updated for the first time, version is 0.7.4-3+t1.

Package: wormux
Language: ro
Installed-Size: $IS
Size: $S
Maintainer: Debian Games Team <pkg-games-devel @ lists.alioth.debian.org>
Architecture: all
Source: wormux-tdebs
Version: 0.7.4-3+t1
Filename: pool/main/w/wormux/wormux_0.7.4-3+t1_all.ro.tdeb
Depends: wormux (= 0.7.4-3)
MD5sum: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SHA1: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SHA256: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

or, if there will be a tdeb-source for each language:

Package: wormux
Language: ro
Installed-Size: $IS
Size: $S
Maintainer: Debian L10N Romanian <debian-l10n-romanian @ lists.debian.org>
Architecture: all
Source: wormux-tdeb-ro
Version: 0.7.4-3+t1
Filename: pool/main/w/wormux/wormux_0.7.4-3+t1_all.ro.tdeb
Depends: wormux (= 0.7.4-3)
MD5sum: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SHA1: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SHA256: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Notes:

  • Package is the deb package for which the translation is provided (do we need this?)
  • The Maintainer field could change after release to a translation maintainership team (e.g.: "Debian Games Team" for 0.7.4-3, "Debian L10N Romanian" for 0.7.4-3+t1).
  • The Source field could likewise differ at different points in time.
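Since the proposed file reuses the "Field: value" stanza layout of a Packages file, reading it could be sketched in Python as below. The function names are hypothetical, and the parser is deliberately minimal: it does not handle continuation lines, which the examples above do not use.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_stanzas(text: str) -> List[Dict[str, str]]:
    """Split a Translations file into stanzas of field/value pairs."""
    stanzas: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current stanza.
            if current:
                stanzas.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


def exact_dependency(stanza: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (package, version) from a 'pkg (= version)' Depends field."""
    match = re.fullmatch(r"(\S+) \(= (\S+)\)", stanza.get("Depends", ""))
    return (match.group(1), match.group(2)) if match else None
```

For the Romanian example above, exact_dependency() would return ("wormux", "0.7.4-3"), which a tool could compare against the installed deb before installing the tdeb.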

Problems, Questions and Discussions about this proposal

Size of the resulting Packages files would be huge

EddyP: well, no, because the tdebs are in a separate section (another source must be added in sources.list), so it is a separate file. Indeed, there could be many Translations files, one per language, but their sizes would depend on the available l10n material for each package. See the archive format above, either the compatible or the incompatible proposal.

Why not just add hooks to dpkg to not install locales that are not interesting to the user?

EddyP: because this is what localepurge is doing, and it is a hack. Also, having tdebs would allow updates of translations after releases. Note that l10n material is more than just .mo files and localized man pages; it can contain material like audio files for a game with speech in a certain language, or similar material. Also, because of economic reasons (see the UsefulImprovements page) we should prefer slimmer packages and less bandwidth usage.

Maybe better to use language packs

RaphaelHertzog: I edited the UsefulImprovements page to add a note concerning mirrors. They don't necessarily like smaller downloads if it means more files to download, because each file costs a disk seek, which is time lost during which they can't send any data out. There's a reason why Ubuntu has big "language packs" instead of many small files, and I believe that you must take that into account as well. As time goes on, we can care less about local disk usage and accept some middle ground. IMO we should group translations even if it means that we have some useless translations on disk. It's already the case... and I won't be bothered by having some unused French translations, compared with having all translations in all languages. Some intelligent grouping should be possible (all of essential, then all of standard, then the rest grouped by logical groups, maybe our official sections).

  • EddyP: I haven't thought of the more accesses issue until now
  • EddyP: about the grouping, that is not such a great idea in contrast to 1 tdeb/lang/pack :
    • the main reason why one tdeb per language would be a good idea is that it would allow updates of specific translations
    • no patch in libc is needed
      • Ubuntu has such a change in libc to use alternative places for the mo files. Choosing one or the other is not as straightforward as one might think: you have to make sure the translation you choose is newer than the other AND that it is compatible with the current version of the app in the deb. There were some really nasty problems because of this: segfaults, since the number of params in the program did not fit the ones in the mo files (https://launchpad.net/ubuntu/+source/language-pack-es-base/+bug/42264). Also, some problems seem to occur due to bundling (https://launchpad.net/ubuntu/+source/language-pack-es-base/+bug/52267)

      • The translations belong to tdebs or to no other binary package (regular debs would not have them) so there is no overlap
      • making big native language-pack uploads due to small deltas (po files are text, and it is most likely that just small parts will be updated) doesn't seem like a good idea bandwidth-wise. Don't forget, Ubuntu is not stripping translations from packages, they are pulling them together from rosetta; we would like to have tdebs built initially together with the base package, then later treat them separately (see the schematic above)
      • tdebs will contain all types of localization material, not only mo files (see the definition of tdebs)
  • I completely agree with EddyP. While intelligent grouping would be good, I disagree that we can have efficient grouping. The only criterion that comes to mind is the one you mention, priority, yet default Debian installs have extra packages and setting/adjusting priorities is increasingly done semi-randomly or not at all. I would speculate that under 5% of Debian installs do not have an optional or extra package. -- FilipusKlutiero

  • Are you saying that disk read is the bottleneck for a significant number of Debian mirrors? -- FilipusKlutiero


Add your proposal