Differences between revisions 2 and 4 (spanning 2 versions)
Revision 2 as of 2007-05-03 02:53:04
Size: 6379
Editor: CameronDale
Comment: First draft without implications
Revision 4 as of 2007-05-05 07:22:35
Size: 11021
Editor: CameronDale
Comment: Add implications of adding to Release files
Deletions are marked like this. Additions are marked like this.
Line 36: Line 36:
The easiest is to pick a maximum piece size and use it
for all packages. (However, it is possible to use a bigger piece size for large
packages, and a smaller one for small packages.) Assuming the format above is used (space, 40-byte hash, space, piece size), we can calculate the quantity of data that needs to be added to the Packages files. This gives the
following for the packages in the amd64 Packages' files (percentages are the increase in file size as a percentage of the current file size):
Line 37: Line 41:
||<rowstyle="text-align: center;"> '''Piece''' ||<-2> '''main''' ||<-2> '''contrib''' ||<-2> '''non-free''' ||
||<rowstyle="text-align: center;"> '''Size''' || ''Pieces'' || ''Data Added'' || ''Pieces'' || ''Data Added'' || ''Pieces'' || ''Data Added'' ||
||<rowstyle="text-align: right;"> 32 KB || 490237 || 22902 KB (112.2%) || 4483 || 209 KB (93.9%) || 29390 || 1380 KB (472.8%) ||
||<rowstyle="text-align: right;"> 64 KB || 250871 || 11515 KB (56.4%) || 2310 || 105 KB (47.3%) || 14764 || 693 KB (237.6%) ||
||<rowstyle="text-align: right;"> 128 KB || 131865 || 5890 KB (28.9%) || 1224 || 54 KB (24.2%) || 7472 || 358 KB (122.7%) ||
||<rowstyle="text-align: right;"> 256 KB || 73131 || 2934 KB (14.4%) || 699 || 28 KB (12.4%) || 3806 || 181 KB (61.9%) ||
||<rowstyle="text-align: right;"> 512 KB || 44432 || 1441 KB (7.1%) || 436 || 13 KB (6.1%) || 1986 || 91 KB (31.2%) ||
||<rowstyle="text-align: right;"> 1 MB || 30628 || 695 KB (3.4%) || 310 || 6 KB (2.9%) || 1090 || 47 KB (16.0%) ||

The increase in file size is quite small for reasonable piece sizes of 256 or 512 KB. Non-free is slightly larger as it has many large packages, though it's still quite a small absolute increase.

The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest files in the archive are the ones that need hashes generated for their pieces, it turns out that most of the archive's size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 89% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each package, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update packages in the archive under this plan, than it did previously. This is assuming that the piece hashes can be cached like the package hashes are, which seems reasonable.
Line 103: Line 119:

Choosing a maximum piece size, and assuming the formats above are used, we can calculate the quantity of data that needs to be added for the two methods listed above. This gives the
following for the Release file in the amd64 unstable archive (percentages are the increase in file size as a percentage of the current file size):

||<rowstyle="text-align: center;"> '''Piece''' || '''Number of''' || '''All in Release''' ||<-2> '''Separate Files''' ||
||<rowstyle="text-align: center;"> '''Size''' || '''Pieces''' || ''Data Added'' || ''Files Needed'' || ''Total Size'' ||
||<rowstyle="text-align: right;"> 32 KB || 12971 || 613 KB (788.4%) || 159 || 604 KB ||
||<rowstyle="text-align: right;"> 64 KB || 6574 || 307 KB (394.5%) || 106 || 301 KB ||
||<rowstyle="text-align: right;"> 128 KB || 3399 || 158 KB (203.1%) || 79 || 154 KB ||
||<rowstyle="text-align: right;"> 256 KB || 1817 || 79 KB (101.3%) || 49 || 76 KB ||
||<rowstyle="text-align: right;"> 512 KB || 1036 || 41 KB (52.2%) || 42 || 39 KB ||
||<rowstyle="text-align: right;"> 1 MB || 649 || 22 KB (28.8%) || 42 || 20 KB ||

For the first method ("All in Release"), the percentage increase in file sizes can be quite large, especially for the smaller piece sizes. However, the absolute increase is still very small. For the second method ("Separate Files"), the total data added to the archive is similar, and therefore manageable. The number of files added can be minimized though by choosing a piece size of 512 KB or larger. Therefore, either method is suitable.

The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest Packages files are the ones that need hashes generated for their pieces, it again turns out that most of the size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 97% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each Packages file, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update Release files under this plan, than it did previously.

This page describes the proposed changes to apt-ftparchive to support the DebTorrent program.

Packages files

Motivation

Packages files are currently used for DebTorrent metadata, since they contain all the necessary information: the file path, the file size, and the SHA1 sum of the file. However, this means that all files must be a single piece, as each piece needs an SHA1 sum generated for it. Though this does work for DebTorrent, the sharing of pieces is not as effective as it should be, since very large files (up to 160MB) will need to be downloaded in their entirety before they can be shared with others. A more effective sharing can be obtained by breaking up the files larger than a threshold into smaller pieces, and generate the SHA1 sum for the pieces and add it to the Packages file.

Proposal

Break up files larger than a given threshold into multiple pieces, and generate an SHA1 sum for the pieces that is then included in the Packages file. This would mean adding a new field to the Packages file, "SHA1-pieces", which would be a list of SHA1 sums and piece sizes, with one piece per line. This new field would only be present in packages that have a file larger than a predetermined threshold. An example, using a piece size and threshold of 1 MB is below (some of the normal fields present in Packages files have been omitted for brevity).

Package: foo
Size: 5341873
SHA1: 38170c08cb458fd4879c34b6f608294c50312bbb
SHA1-pieces:
 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
 ccf271b7830882da1791852baeca1737fcbe4b90 98993

Package: bar
Size: 72856
SHA1: 9425fa8de16f6283365f6bee87f405da16a203e6

Package: ...

In theory, the piece size and threshold could be chosen to be any power of 2, but in practice values of 256 KB, 512 KB, and 1 MB are reasonable values. The smaller it is set at, the more effective the file sharing will be, but the more information will need to be added to Packages files and the more pieces will need to be tracked by DebTorrent (memory issue).

Implications

The easiest is to pick a maximum piece size and use it for all packages. (However, it is possible to use a bigger piece size for large packages, and a smaller one for small packages.) Assuming the format above is used (space, 40-byte hash, space, piece size), we can calculate the quantity of data that needs to be added to the Packages files. This gives the following for the packages in the amd64 Packages' files (percentages are the increase in file size as a percentage of the current file size):

Piece

main

contrib

non-free

Size

Pieces

Data Added

Pieces

Data Added

Pieces

Data Added

32 KB

490237

22902 KB (112.2%)

4483

209 KB (93.9%)

29390

1380 KB (472.8%)

64 KB

250871

11515 KB (56.4%)

2310

105 KB (47.3%)

14764

693 KB (237.6%)

128 KB

131865

5890 KB (28.9%)

1224

54 KB (24.2%)

7472

358 KB (122.7%)

256 KB

73131

2934 KB (14.4%)

699

28 KB (12.4%)

3806

181 KB (61.9%)

512 KB

44432

1441 KB (7.1%)

436

13 KB (6.1%)

1986

91 KB (31.2%)

1 MB

30628

695 KB (3.4%)

310

6 KB (2.9%)

1090

47 KB (16.0%)

The increase in file size is quite small for reasonable piece sizes of 256 or 512 KB. Non-free is slightly larger as it has many large packages, though it's still quite a small absolute increase.

The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest files in the archive are the ones that need hashes generated for their pieces, it turns out that most of the archive's size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 89% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each package, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update packages in the archive under this plan, than it did previously. This is assuming that the piece hashes can be cached like the package hashes are, which seems reasonable.

Release files

Motivation

In addition to the information above in the Packages files, it would be useful to have some information stored in the Release file. The best example of this is the URL for the DebTorrent tracker, which all downloaders will need to contact to begin the download. Though this information could be left to a default value, adding it here allows any person running a debian-style archive to specify their own tracker here, to both monitor their own downloaders and reduce the load on the official (default) debian tracker.

The Packages files themselves could also be downloaded through DebTorrent, further minimizing the load on mirrors. This is possible with the current information in the Release file, by treating each Packages file as a single piece with the SHA1 sum given in the Packages file. This would again suffer from the same ineffective downloading mentioned above for large files in the archive, as some of the Packages files are quite large (20 MB). Again, these large files could be broken down into multiple pieces, and SHA1 sums generated for the pieces, to make this downloading more effective.

Proposal

Add a new field to the Release file that specifies the tracker URL for the archive.

Origin: Debian
Label: Debian
Suite: unstable
Codename: sid
Tracker: http://dttracker.debian.net:6969/announce
Date: Tue, 01 May 2007 20:13:19 UTC
Architectures: alpha amd64 arm hppa hurd-i386 i386 ia64 m68k mips mipsel powerpc s390 sparc
Components: main contrib non-free
Description: Debian Unstable - Not Released

The second part could be accomplished in 2 ways:

All in the Release file
Add information to the Release file for the large Packages files broken up into multiple pieces, each with an SHA1 sum. This would be similar to the "SHA1-pieces" field added to the Packages files for large files. As SHA1 sums for Packages files are currently listed on a single line per file, this would probably mean adding a new section at the end of the Release file, where each Packages file larger than the threshold would have the following entries (for 1 MB pieces).
Path: main/binary-alpha/Packages
Size: 20497160
SHA1-pieces:
 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
 ...
 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
 ccf271b7830882da1791852baeca1737fcbe4b90 574216

Path: main/binary-alpha/Packages.gz
Size: 5969270
SHA1-pieces:
 9425fa8de16f6283365f6bee87f405da16a203e6 1048576
 ...
This has the advantage of being simpler to manage and read, and possibly simpler to implement, as all the information is stored in a single location. However, it will add a lot of size to the Release file, which would then need to be downloaded by everyone accessing the archive.
Separate files
The information for Packages files broken up into pieces could be stored separately in a file for each Packages file. The file would contain the same information as the Release file in the previous proposal. For eaxmple, the main/binary-alpha/Packages file would have alongside it a main/binary-alpha/Packages.pieces file containing (for 1 MB pieces)
SHA1-pieces:
 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
 ...
 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
 ccf271b7830882da1791852baeca1737fcbe4b90 574216

This adds the complexity of many additional files, but also some flexibility in terms of adding additional features in the future. The size added per person accessing the archive is also minimized, as they will not download additional information on architectures they aren't interested in, nor will downloaders not using DebTorrent need to download this information at all.

Implications

Choosing a maximum piece size, and assuming the formats above are used, we can calculate the quantity of data that needs to be added for the two methods listed above. This gives the following for the Release file in the amd64 unstable archive (percentages are the increase in file size as a percentage of the current file size):

Piece

Number of

All in Release

Separate Files

Size

Pieces

Data Added

Files Needed

Total Size

32 KB

12971

613 KB (788.4%)

159

604 KB

64 KB

6574

307 KB (394.5%)

106

301 KB

128 KB

3399

158 KB (203.1%)

79

154 KB

256 KB

1817

79 KB (101.3%)

49

76 KB

512 KB

1036

41 KB (52.2%)

42

39 KB

1 MB

649

22 KB (28.8%)

42

20 KB

For the first method ("All in Release"), the percentage increase in file sizes can be quite large, especially for the smaller piece sizes. However, the absolute increase is still very small. For the second method ("Separate Files"), the total data added to the archive is similar, and therefore manageable. The number of files added can be minimized though by choosing a piece size of 512 KB or larger. Therefore, either method is suitable.

The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest Packages files are the ones that need hashes generated for their pieces, it again turns out that most of the size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 97% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each Packages file, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update Release files under this plan, than it did previously.