Differences between revisions 5 and 6
Revision 5 as of 2007-05-05 22:29:18
Size: 11237
Editor: CameronDale
Comment: Fixed Release implications to include data about new entries in the Release file
Revision 6 as of 2007-06-15 20:58:32
Size: 9484
Editor: CameronDale
Comment: Add second proposal for Packages files (extrapieces files). Remove Release info.
Deletions are marked like this. Additions are marked like this.
Line 5: Line 5:
==== Motivation ====
Line 9: Line 7:
==== Proposal ==== == Proposal 1: add piece info to the Packages file ==
Line 32: Line 30:
In theory, the piece size and threshold could be chosen to be any power of 2, but in practice values of 256 KB, 512 KB, and 1 MB are reasonable values. The smaller it is set at, the more effective the file sharing will be, but the more information will need to be added to Packages files and the more pieces will need to be tracked by DebTorrent (memory issue). In theory, the maximum piece size and threshold could be chosen to be any number, though in practice power's of 2 with values of 256 KB, 512 KB, and 1 MB are reasonable values. The pieces should be multiples of 16384 (2^14^), which is the default size of blocks transferred between peers in the protocol. However, any multiple of 16384 is a good choice for piece size, so for each package the piece size could be chosen such that all the pieces are roughly the same size, with the last piece being only slightly smaller. This is an improvement over BitTorrent's fixed piece size of powers of 2, which could lead to the last piece being very small.

The smaller the maximum piece size
is set at, the more effective the file sharing will be, but the more information will need to be added to Packages files and the more pieces will need to be tracked by DebTorrent (memory issue).
Line 37: Line 37:
for all packages. (However, it is possible to use a bigger piece size for large for all packages. (However, it is possible to use a bigger maximum piece size for large
Line 50: Line 50:
The increase in file size is quite small for reasonable piece sizes of 256 or 512 KB. Non-free is slightly larger as it has many large packages, though it's still quite a small absolute increase. The increase in file size is quite small (7% to 14%) for reasonable piece sizes of 256 or 512 KB. Non-free is slightly larger as it has many large packages, though it's still quite a small absolute increase.
Line 52: Line 52:
The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest files in the archive are the ones that need hashes generated for their pieces, it turns out that most of the archive's size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 89% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each package, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update packages in the archive under this plan, than it did previously. This is assuming that the piece hashes can be cached like the package hashes are, which seems reasonable. == Proposal 2: add new files containing only piece hashes ==
Line 54: Line 54:
== Release files ==

==== Motivation ====

In addition to the information above in the Packages files, it would be useful to have some information stored in the Release file. The best example of this is the URL for the DebTorrent tracker, which all downloaders will need to contact to begin the download. Though this information could be left to a default value, adding it here allows any person running a debian-style archive to specify their own tracker here, to both monitor their own downloaders and reduce the load on the official (default) debian tracker.

The Packages files themselves could also be downloaded through DebTorrent, further minimizing the load on mirrors. This is possible with the current information in the Release file, by treating each Packages file as a single piece with the SHA1 sum given in the Packages file. This would again suffer from the same ineffective downloading mentioned above for large files in the archive, as some of the Packages files are quite large (20 MB). Again, these large files could be broken down into multiple pieces, and SHA1 sums generated for the pieces, to make this downloading more effective.

==== Proposal ====

Add a new field to the Release file that specifies the tracker URL for the archive.
The information for packages broken up into pieces could be stored separately in a file for each Packages file. The file would contain the same SHA1-Pieces information as in the previous example. For example, the main/binary-alpha/Packages file would have alongside it a main/binary-alpha/Packages-extrapieces file containing (the same example as in the proposal 1, note that package bar does not appear in the file):
Line 67: Line 57:
Origin: Debian
Label: Debian
Suite: unstable
Codename: sid
Tracker: http://dttracker.debian.net:6969/announce
Date: Tue, 01 May 2007 20:13:19 UTC
Architectures: alpha amd64 arm hppa hurd-i386 i386 ia64 m68k mips mipsel powerpc s390 sparc
Components: main contrib non-free
Description: Debian Unstable - Not Released
}}}

The second part could be accomplished in 2 ways:

 All in the Release file:: Add information to the Release file for the large Packages files broken up into multiple pieces, each with an SHA1 sum. This would be similar to the "SHA1-pieces" field added to the Packages files for large files. As SHA1 sums for Packages files are currently listed on a single line per file, this would probably mean adding a new section at the end of the Release file, where each Packages file larger than the threshold would have the following entries (for 1 MB pieces).

 {{{
Path: main/binary-alpha/Packages
Size: 20497160
Package: foo
Line 89: Line 62:
 ...
Line 92: Line 64:
 ccf271b7830882da1791852baeca1737fcbe4b90 574216

Path: main/binary-alpha/Packages.gz
Size: 5969270
SHA1-pieces:
 9425fa8de16f6283365f6bee87f405da16a203e6 1048576
 ...
 ccf271b7830882da1791852baeca1737fcbe4b90 98993
Line 101: Line 67:
 This has the advantage of being simpler to manage and read, and possibly simpler to implement, as all the information is stored in a single location. However, it will add a lot of size to the Release file, which would then need to be downloaded by everyone accessing the archive.

 Separate files:: The information for Packages files broken up into pieces could be stored separately in a file for each Packages file. The file would contain the same information as the Release file in the previous proposal. For eaxmple, the main/binary-alpha/Packages file would have alongside it a main/binary-alpha/Packages.pieces file containing (for 1 MB pieces)

 {{{
SHA1-pieces:
 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
 ...
 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
 ccf271b7830882da1791852baeca1737fcbe4b90 574216
}}}

 This adds the complexity of many additional files, but also some flexibility in terms of adding additional features in the future. The size added per person accessing the archive is also minimized, as they will not download additional information on architectures they aren't interested in, nor will downloaders not using DebTorrent need to download this information at all.
This method is currently being used as a temporary solution to this problem. The extrapieces files for all of etch can be seen at http://merkel.debian.org/~ajt/extrapieces/.
Line 120: Line 71:
Choosing a maximum piece size, and assuming the formats above are used, we can calculate the quantity of data that needs to be added for the two methods listed above. This gives the
following for the Release file in the amd64 unstable archive (percentages are the increase in file size as a percentage of the current file size):
The file sizes for the new data to be added to the archive will be approximately the same as the extra data added to the Packages files in proposal 1 (i.e. approximately 1441 KB for the etch main binary-amd64 Packages-extrapieces file). There will be some additional (redundant) data needed in these files to specify the package name and file name. Examples for all of etch can be seen from the current extrapieces files for etch: http://merkel.debian.org/~ajt/extrapieces/ (note that these files are gzipped).
Line 123: Line 73:
||<rowstyle="text-align: center;"> '''Piece''' || '''Number of''' || '''All in Release''' ||<-3> '''Separate Files''' ||
||<rowstyle="text-align: center;"> '''Size''' || '''Pieces''' || ''Data Added'' || ''Files Needed'' || ''Total New File Size'' || ''Data Added to Release File'' ||
||<rowstyle="text-align: right;"> 32 KB || 12971 || 613 KB (788.4%) || 159 || 604 KB || 14 KB (18.5%) ||
||<rowstyle="text-align: right;"> 64 KB || 6574 || 307 KB (394.5%) || 106 || 301 KB || 9 KB (12.0%) ||
||<rowstyle="text-align: right;"> 128 KB || 3399 || 158 KB (203.1%) || 79 || 154 KB || 7 KB (8.9%) ||
||<rowstyle="text-align: right;"> 256 KB || 1817 || 79 KB (101.3%) || 49 || 76 KB || 4 KB (5.4%) ||
||<rowstyle="text-align: right;"> 512 KB || 1036 || 41 KB (52.2%) || 42 || 39 KB || 4 KB (4.6%) ||
||<rowstyle="text-align: right;"> 1 MB || 649 || 22 KB (28.8%) || 42 || 20 KB || 4 KB (4.6%) ||
Some data for these new files will also need to be added to the Release file to enable hash checking of the data. This will add anywhere from 25% (if only one extrapieces file is used) to 75% (if 3 differently compressed version of the extrapieces file are used) to the Release file from etch.
Line 132: Line 75:
For the first method ("All in Release"), the percentage increase in file sizes can be quite large, especially for the smaller piece sizes. However, the absolute increase is still very small. For the second method ("Separate Files"), the total data added to the archive is similar, and therefore manageable, while the data added to the Release file to list these new files is very small. The number of files added can be minimized though by choosing a piece size of 512 KB or larger. Therefore, either method is suitable. == Comparison of Proposals ==
Line 134: Line 77:
The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest Packages files are the ones that need hashes generated for their pieces, it again turns out that most of the size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 97% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each Packages file, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update Release files under this plan, than it did previously. The advantage of proposal 2 over proposal 1 is that size added per person accessing the archive is minimized, as non-DebTorrent users will not download additional information on pieces that they aren't interested in. It also reduces the amount of clutter in the Packages file, if someone is reading it.

The disadvantages of proposal 2, compared with proposal 1, are:

 * introduces the possibility of Packages and Packages-extrapieces being out of sync
 * may require changes to other (mirroring) software to recognize these new files
 * adds complexity of many additional files to the Release tracking
 * adds complexity to the DebTorrent program, of having to download and parse 2 separate files to create a torrent

The biggest disadvantages are the first 2, because DebTorrent will use these files, not knowing they are out of sync or missing. This will result in the creation of torrents that don't match others, therefore fracturing the download population. For example, consider a DebTorrent user downloading the Packages information while an archive update is happening. He retrieve the new Packages file, but gets the old Packages-extrapieces file. If a new package version was introduced (for a package that is larger than the piece threshold) he will be missing the pieces information for that package, and so will his torrent. Since torrent users find each other by the hash of their torrent information, and this users hash will match (probably) no one, he will be unable to download any files from that torrent. Though this scenario may be unlikely, the situation could be caused by many other things: a mirror change (due to DNS) during an update, a stale web cache between him and the archive, a temporary loss of network connectivity during the extrapieces download. Though additional functionality could be added to deal with all of these problems, they are all solved by keeping the hashes in the Packages file, which is a far more robust solution.

Also, in the future there may be the possibility to enable DebTorrent downloading of Packages files as well. Keeping the pieces information in a separate file will mean extra complexity in trying to add this functionality, which could slow the development of this feature.

== Other implications ==

Another factor to consider is the increased computation time needed to generate the hashes for these pieces. As the largest files in the archive are the ones that need hashes generated for their pieces, it turns out that most of the archive's size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 89% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each package, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update packages in the archive under this plan, than it did previously. This is assuming that the piece hashes can be cached like the package hashes are, which seems reasonable.

This page describes the proposed changes to apt-ftparchive to support the DebTorrent program.

Packages files

Packages files are currently used for DebTorrent metadata, since they contain all the necessary information: the file path, the file size, and the SHA1 sum of the file. However, this means that all files must be a single piece, as each piece needs an SHA1 sum generated for it. Though this does work for DebTorrent, the sharing of pieces is not as effective as it should be, since very large files (up to 160MB) will need to be downloaded in their entirety before they can be shared with others. A more effective sharing can be obtained by breaking up the files larger than a threshold into smaller pieces, and generate the SHA1 sum for the pieces and add it to the Packages file.

Proposal 1: add piece info to the Packages file

Break up files larger than a given threshold into multiple pieces, and generate an SHA1 sum for the pieces that is then included in the Packages file. This would mean adding a new field to the Packages file, "SHA1-pieces", which would be a list of SHA1 sums and piece sizes, with one piece per line. This new field would only be present in packages that have a file larger than a predetermined threshold. An example, using a piece size and threshold of 1 MB is below (some of the normal fields present in Packages files have been omitted for brevity).

Package: foo
Size: 5341873
SHA1: 38170c08cb458fd4879c34b6f608294c50312bbb
SHA1-pieces:
 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
 ccf271b7830882da1791852baeca1737fcbe4b90 98993

Package: bar
Size: 72856
SHA1: 9425fa8de16f6283365f6bee87f405da16a203e6

Package: ...

In theory, the maximum piece size and threshold could be chosen to be any number, though in practice power's of 2 with values of 256 KB, 512 KB, and 1 MB are reasonable values. The pieces should be multiples of 16384 (214), which is the default size of blocks transferred between peers in the protocol. However, any multiple of 16384 is a good choice for piece size, so for each package the piece size could be chosen such that all the pieces are roughly the same size, with the last piece being only slightly smaller. This is an improvement over BitTorrent's fixed piece size of powers of 2, which could lead to the last piece being very small.

The smaller the maximum piece size is set at, the more effective the file sharing will be, but the more information will need to be added to Packages files and the more pieces will need to be tracked by DebTorrent (memory issue).

Implications

The easiest is to pick a maximum piece size and use it for all packages. (However, it is possible to use a bigger maximum piece size for large packages, and a smaller one for small packages.) Assuming the format above is used (space, 40-byte hash, space, piece size), we can calculate the quantity of data that needs to be added to the Packages files. This gives the following for the packages in the amd64 Packages' files (percentages are the increase in file size as a percentage of the current file size):

Piece

main

contrib

non-free

Size

Pieces

Data Added

Pieces

Data Added

Pieces

Data Added

32 KB

490237

22902 KB (112.2%)

4483

209 KB (93.9%)

29390

1380 KB (472.8%)

64 KB

250871

11515 KB (56.4%)

2310

105 KB (47.3%)

14764

693 KB (237.6%)

128 KB

131865

5890 KB (28.9%)

1224

54 KB (24.2%)

7472

358 KB (122.7%)

256 KB

73131

2934 KB (14.4%)

699

28 KB (12.4%)

3806

181 KB (61.9%)

512 KB

44432

1441 KB (7.1%)

436

13 KB (6.1%)

1986

91 KB (31.2%)

1 MB

30628

695 KB (3.4%)

310

6 KB (2.9%)

1090

47 KB (16.0%)

The increase in file size is quite small (7% to 14%) for reasonable piece sizes of 256 or 512 KB. Non-free is slightly larger as it has many large packages, though it's still quite a small absolute increase.

Proposal 2: add new files containing only piece hashes

The information for packages broken up into pieces could be stored separately in a file for each Packages file. The file would contain the same SHA1-Pieces information as in the previous example. For example, the main/binary-alpha/Packages file would have alongside it a main/binary-alpha/Packages-extrapieces file containing (the same example as in the proposal 1, note that package bar does not appear in the file):

Package: foo
SHA1-pieces:
 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
 ccf271b7830882da1791852baeca1737fcbe4b90 98993

This method is currently being used as a temporary solution to this problem. The extrapieces files for all of etch can be seen at http://merkel.debian.org/~ajt/extrapieces/.

Implications

The file sizes for the new data to be added to the archive will be approximately the same as the extra data added to the Packages files in proposal 1 (i.e. approximately 1441 KB for the etch main binary-amd64 Packages-extrapieces file). There will be some additional (redundant) data needed in these files to specify the package name and file name. Examples for all of etch can be seen from the current extrapieces files for etch: http://merkel.debian.org/~ajt/extrapieces/ (note that these files are gzipped).

Some data for these new files will also need to be added to the Release file to enable hash checking of the data. This will add anywhere from 25% (if only one extrapieces file is used) to 75% (if 3 differently compressed version of the extrapieces file are used) to the Release file from etch.

Comparison of Proposals

The advantage of proposal 2 over proposal 1 is that size added per person accessing the archive is minimized, as non-DebTorrent users will not download additional information on pieces that they aren't interested in. It also reduces the amount of clutter in the Packages file, if someone is reading it.

The disadvantages of proposal 2, compared with proposal 1, are:

  • introduces the possibility of Packages and Packages-extrapieces being out of sync
  • may require changes to other (mirroring) software to recognize these new files
  • adds complexity of many additional files to the Release tracking
  • adds complexity to the DebTorrent program, of having to download and parse 2 separate files to create a torrent

The biggest disadvantages are the first 2, because DebTorrent will use these files, not knowing they are out of sync or missing. This will result in the creation of torrents that don't match others, therefore fracturing the download population. For example, consider a DebTorrent user downloading the Packages information while an archive update is happening. He retrieve the new Packages file, but gets the old Packages-extrapieces file. If a new package version was introduced (for a package that is larger than the piece threshold) he will be missing the pieces information for that package, and so will his torrent. Since torrent users find each other by the hash of their torrent information, and this users hash will match (probably) no one, he will be unable to download any files from that torrent. Though this scenario may be unlikely, the situation could be caused by many other things: a mirror change (due to DNS) during an update, a stale web cache between him and the archive, a temporary loss of network connectivity during the extrapieces download. Though additional functionality could be added to deal with all of these problems, they are all solved by keeping the hashes in the Packages file, which is a far more robust solution.

Also, in the future there may be the possibility to enable DebTorrent downloading of Packages files as well. Keeping the pieces information in a separate file will mean extra complexity in trying to add this functionality, which could slow the development of this feature.

Other implications

Another factor to consider is the increased computation time needed to generate the hashes for these pieces. As the largest files in the archive are the ones that need hashes generated for their pieces, it turns out that most of the archive's size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 89% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each package, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update packages in the archive under this plan, than it did previously. This is assuming that the piece hashes can be cached like the package hashes are, which seems reasonable.