This page describes the proposed changes to apt-ftparchive to support the DebTorrent program.

Packages files

Packages files are currently used for DebTorrent metadata, since they contain all the necessary information: the file path, the file size, and the SHA1 sum of the file. However, this means that all files must be a single piece, as each piece needs an SHA1 sum generated for it. Though this does work for DebTorrent, the sharing of pieces is not as effective as it should be, since very large files (up to 160MB) will need to be downloaded in their entirety before they can be shared with others. A more effective sharing can be obtained by breaking up the files larger than a threshold into smaller pieces, and generate the SHA1 sum for the pieces and add it to the Packages file.

Proposal 1: add piece info to the Packages file

Break up files larger than a given threshold into multiple pieces, and generate an SHA1 sum for the pieces that is then included in the Packages file. This would mean adding a new field to the Packages file, "SHA1-pieces", which would be a list of SHA1 sums and piece sizes, with one piece per line. This new field would only be present in packages that have a file larger than a predetermined threshold. An example, using a piece size and threshold of 1 MB is below (some of the normal fields present in Packages files have been omitted for brevity).

Package: foo
Size: 5341873
SHA1: 38170c08cb458fd4879c34b6f608294c50312bbb
SHA1-pieces:
 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
 ccf271b7830882da1791852baeca1737fcbe4b90 98993

Package: bar
Size: 72856
SHA1: 9425fa8de16f6283365f6bee87f405da16a203e6

Package: ...

In theory, the maximum piece size and threshold could be chosen to be any number, though in practice power's of 2 with values of 256 KB, 512 KB, and 1 MB are reasonable values. The pieces should be multiples of 16384 (214), which is the default size of blocks transferred between peers in the protocol. However, any multiple of 16384 is a good choice for piece size, so for each package the piece size could be chosen such that all the pieces are roughly the same size, with the last piece being only slightly smaller. This is an improvement over BitTorrent's fixed piece size of powers of 2, which could lead to the last piece being very small.

The smaller the maximum piece size is set at, the more effective the file sharing will be, but the more information will need to be added to Packages files and the more pieces will need to be tracked by DebTorrent (memory issue).

Implications

The easiest is to pick a maximum piece size and use it for all packages. (However, it is possible to use a bigger maximum piece size for large packages, and a smaller one for small packages.) Assuming the format above is used (space, 40-byte hash, space, piece size), we can calculate the quantity of data that needs to be added to the Packages files. This gives the following for the packages in the amd64 Packages' files (percentages are the increase in file size as a percentage of the current file size):

Piece

main

contrib

non-free

Size

Pieces

Data Added

Pieces

Data Added

Pieces

Data Added

32 KB

490237

22902 KB (112.2%)

4483

209 KB (93.9%)

29390

1380 KB (472.8%)

64 KB

250871

11515 KB (56.4%)

2310

105 KB (47.3%)

14764

693 KB (237.6%)

128 KB

131865

5890 KB (28.9%)

1224

54 KB (24.2%)

7472

358 KB (122.7%)

256 KB

73131

2934 KB (14.4%)

699

28 KB (12.4%)

3806

181 KB (61.9%)

512 KB

44432

1441 KB (7.1%)

436

13 KB (6.1%)

1986

91 KB (31.2%)

1 MB

30628

695 KB (3.4%)

310

6 KB (2.9%)

1090

47 KB (16.0%)

The increase in file size is quite small (7% to 14%) for reasonable piece sizes of 256 or 512 KB. Non-free is slightly larger as it has many large packages, though it's still quite a small absolute increase.

Proposal 2: add new files containing only piece hashes

The information for packages broken up into pieces could be stored separately in a file for each Packages file. The file would contain the same SHA1-Pieces information as in the previous example. For example, the main/binary-alpha/Packages file would have alongside it a main/binary-alpha/Packages-extrapieces file containing (the same example as in the proposal 1, note that package bar does not appear in the file):

Package: foo
SHA1-pieces:
 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
 ccf271b7830882da1791852baeca1737fcbe4b90 98993

This method is currently being used as a temporary solution to this problem. The extrapieces files for all of etch can be seen at http://merkel.debian.org/~ajt/extrapieces/.

Implications

The file sizes for the new data to be added to the archive will be approximately the same as the extra data added to the Packages files in proposal 1 (i.e. approximately 1441 KB for the etch main binary-amd64 Packages-extrapieces file). There will be some additional (redundant) data needed in these files to specify the package name and file name. Examples for all of etch can be seen from the current extrapieces files for etch: http://merkel.debian.org/~ajt/extrapieces/ (note that these files are gzipped).

Some data for these new files will also need to be added to the Release file to enable hash checking of the data. This will add anywhere from 25% (if only one extrapieces file is used) to 75% (if 3 differently compressed version of the extrapieces file are used) to the Release file from etch.

Comparison of Proposals

The advantage of proposal 2 over proposal 1 is that size added per person accessing the archive is minimized, as non-DebTorrent users will not download additional information on pieces that they aren't interested in. It also reduces the amount of clutter in the Packages file, if someone is reading it.

The disadvantages of proposal 2, compared with proposal 1, are:

The biggest disadvantages are the first 2, because DebTorrent will use these files, not knowing they are out of sync or missing. This will result in the creation of torrents that don't match others, therefore fracturing the download population. For example, consider a DebTorrent user downloading the Packages information while an archive update is happening. He retrieve the new Packages file, but gets the old Packages-extrapieces file. If a new package version was introduced (for a package that is larger than the piece threshold) he will be missing the pieces information for that package, and so will his torrent. Since torrent users find each other by the hash of their torrent information, and this users hash will match (probably) no one, he will be unable to download any files from that torrent. Though this scenario may be unlikely, the situation could be caused by many other things: a mirror change (due to DNS) during an update, a stale web cache between him and the archive, a temporary loss of network connectivity during the extrapieces download. Though additional functionality could be added to deal with all of these problems, they are all solved by keeping the hashes in the Packages file, which is a far more robust solution.

Also, in the future there may be the possibility to enable DebTorrent downloading of Packages files as well. Keeping the pieces information in a separate file will mean extra complexity in trying to add this functionality, which could slow the development of this feature.

Other implications

Another factor to consider is the increased computation time needed to generate the hashes for these pieces. As the largest files in the archive are the ones that need hashes generated for their pieces, it turns out that most of the archive's size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 89% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each package, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update packages in the archive under this plan, than it did previously. This is assuming that the piece hashes can be cached like the package hashes are, which seems reasonable.