6379
Comment: First draft without implications
|
11021
Add implications of adding to Release files
|
Deletions are marked like this. | Additions are marked like this. |
Line 36: | Line 36: |
The easiest is to pick a maximum piece size and use it for all packages. (However, it is possible to use a bigger piece size for large packages, and a smaller one for small packages.) Assuming the format above is used (space, 40-byte hash, space, piece size), we can calculate the quantity of data that needs to be added to the Packages files. This gives the following for the packages in the amd64 Packages' files (percentages are the increase in file size as a percentage of the current file size): |
|
Line 37: | Line 41: |
||<rowstyle="text-align: center;"> '''Piece''' ||<-2> '''main''' ||<-2> '''contrib''' ||<-2> '''non-free''' || ||<rowstyle="text-align: center;"> '''Size''' || ''Pieces'' || ''Data Added'' || ''Pieces'' || ''Data Added'' || ''Pieces'' || ''Data Added'' || ||<rowstyle="text-align: right;"> 32 KB || 490237 || 22902 KB (112.2%) || 4483 || 209 KB (93.9%) || 29390 || 1380 KB (472.8%) || ||<rowstyle="text-align: right;"> 64 KB || 250871 || 11515 KB (56.4%) || 2310 || 105 KB (47.3%) || 14764 || 693 KB (237.6%) || ||<rowstyle="text-align: right;"> 128 KB || 131865 || 5890 KB (28.9%) || 1224 || 54 KB (24.2%) || 7472 || 358 KB (122.7%) || ||<rowstyle="text-align: right;"> 256 KB || 73131 || 2934 KB (14.4%) || 699 || 28 KB (12.4%) || 3806 || 181 KB (61.9%) || ||<rowstyle="text-align: right;"> 512 KB || 44432 || 1441 KB (7.1%) || 436 || 13 KB (6.1%) || 1986 || 91 KB (31.2%) || ||<rowstyle="text-align: right;"> 1 MB || 30628 || 695 KB (3.4%) || 310 || 6 KB (2.9%) || 1090 || 47 KB (16.0%) || The increase in file size is quite small for reasonable piece sizes of 256 or 512 KB. Non-free is slightly larger as it has many large packages, though it's still quite a small absolute increase. The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest files in the archive are the ones that need hashes generated for their pieces, it turns out that most of the archive's size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 89% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each package, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update packages in the archive under this plan, than it did previously. This is assuming that the piece hashes can be cached like the package hashes are, which seems reasonable. |
|
Line 103: | Line 119: |
Choosing a maximum piece size, and assuming the formats above are used, we can calculate the quantity of data that needs to be added for the two methods listed above. This gives the following for the Release file in the amd64 unstable archive (percentages are the increase in file size as a percentage of the current file size): ||<rowstyle="text-align: center;"> '''Piece''' || '''Number of''' || '''All in Release''' ||<-2> '''Separate Files''' || ||<rowstyle="text-align: center;"> '''Size''' || '''Pieces''' || ''Data Added'' || ''Files Needed'' || ''Total Size'' || ||<rowstyle="text-align: right;"> 32 KB || 12971 || 613 KB (788.4%) || 159 || 604 KB || ||<rowstyle="text-align: right;"> 64 KB || 6574 || 307 KB (394.5%) || 106 || 301 KB || ||<rowstyle="text-align: right;"> 128 KB || 3399 || 158 KB (203.1%) || 79 || 154 KB || ||<rowstyle="text-align: right;"> 256 KB || 1817 || 79 KB (101.3%) || 49 || 76 KB || ||<rowstyle="text-align: right;"> 512 KB || 1036 || 41 KB (52.2%) || 42 || 39 KB || ||<rowstyle="text-align: right;"> 1 MB || 649 || 22 KB (28.8%) || 42 || 20 KB || For the first method ("All in Release"), the percentage increase in file sizes can be quite large, especially for the smaller piece sizes. However, the absolute increase is still very small. For the second method ("Separate Files"), the total data added to the archive is similar, and therefore manageable. The number of files added can be minimized though by choosing a piece size of 512 KB or larger. Therefore, either method is suitable. The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest Packages files are the ones that need hashes generated for their pieces, it again turns out that most of the size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 97% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each Packages file, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update Release files under this plan, than it did previously. |
This page describes the proposed changes to apt-ftparchive to support the DebTorrent program.
Packages files
Motivation
Packages files are currently used for DebTorrent metadata, since they contain all the necessary information: the file path, the file size, and the SHA1 sum of the file. However, this means that all files must be a single piece, as each piece needs an SHA1 sum generated for it. Though this does work for DebTorrent, the sharing of pieces is not as effective as it should be, since very large files (up to 160MB) will need to be downloaded in their entirety before they can be shared with others. A more effective sharing can be obtained by breaking up the files larger than a threshold into smaller pieces, and generate the SHA1 sum for the pieces and add it to the Packages file.
Proposal
Break up files larger than a given threshold into multiple pieces, and generate an SHA1 sum for the pieces that is then included in the Packages file. This would mean adding a new field to the Packages file, "SHA1-pieces", which would be a list of SHA1 sums and piece sizes, with one piece per line. This new field would only be present in packages that have a file larger than a predetermined threshold. An example, using a piece size and threshold of 1 MB is below (some of the normal fields present in Packages files have been omitted for brevity).
Package: foo Size: 5341873 SHA1: 38170c08cb458fd4879c34b6f608294c50312bbb SHA1-pieces: e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576 ccf271b7830882da1791852baeca1737fcbe4b90 98993 Package: bar Size: 72856 SHA1: 9425fa8de16f6283365f6bee87f405da16a203e6 Package: ...
In theory, the piece size and threshold could be chosen to be any power of 2, but in practice values of 256 KB, 512 KB, and 1 MB are reasonable values. The smaller it is set at, the more effective the file sharing will be, but the more information will need to be added to Packages files and the more pieces will need to be tracked by DebTorrent (memory issue).
Implications
The easiest is to pick a maximum piece size and use it for all packages. (However, it is possible to use a bigger piece size for large packages, and a smaller one for small packages.) Assuming the format above is used (space, 40-byte hash, space, piece size), we can calculate the quantity of data that needs to be added to the Packages files. This gives the following for the packages in the amd64 Packages' files (percentages are the increase in file size as a percentage of the current file size):
Piece |
main |
contrib |
non-free |
|||
Size |
Pieces |
Data Added |
Pieces |
Data Added |
Pieces |
Data Added |
32 KB |
490237 |
22902 KB (112.2%) |
4483 |
209 KB (93.9%) |
29390 |
1380 KB (472.8%) |
64 KB |
250871 |
11515 KB (56.4%) |
2310 |
105 KB (47.3%) |
14764 |
693 KB (237.6%) |
128 KB |
131865 |
5890 KB (28.9%) |
1224 |
54 KB (24.2%) |
7472 |
358 KB (122.7%) |
256 KB |
73131 |
2934 KB (14.4%) |
699 |
28 KB (12.4%) |
3806 |
181 KB (61.9%) |
512 KB |
44432 |
1441 KB (7.1%) |
436 |
13 KB (6.1%) |
1986 |
91 KB (31.2%) |
1 MB |
30628 |
695 KB (3.4%) |
310 |
6 KB (2.9%) |
1090 |
47 KB (16.0%) |
The increase in file size is quite small for reasonable piece sizes of 256 or 512 KB. Non-free is slightly larger as it has many large packages, though it's still quite a small absolute increase.
The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest files in the archive are the ones that need hashes generated for their pieces, it turns out that most of the archive's size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 89% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each package, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update packages in the archive under this plan, than it did previously. This is assuming that the piece hashes can be cached like the package hashes are, which seems reasonable.
Release files
Motivation
In addition to the information above in the Packages files, it would be useful to have some information stored in the Release file. The best example of this is the URL for the DebTorrent tracker, which all downloaders will need to contact to begin the download. Though this information could be left to a default value, adding it here allows any person running a debian-style archive to specify their own tracker here, to both monitor their own downloaders and reduce the load on the official (default) debian tracker.
The Packages files themselves could also be downloaded through DebTorrent, further minimizing the load on mirrors. This is possible with the current information in the Release file, by treating each Packages file as a single piece with the SHA1 sum given in the Packages file. This would again suffer from the same ineffective downloading mentioned above for large files in the archive, as some of the Packages files are quite large (20 MB). Again, these large files could be broken down into multiple pieces, and SHA1 sums generated for the pieces, to make this downloading more effective.
Proposal
Add a new field to the Release file that specifies the tracker URL for the archive.
Origin: Debian Label: Debian Suite: unstable Codename: sid Tracker: http://dttracker.debian.net:6969/announce Date: Tue, 01 May 2007 20:13:19 UTC Architectures: alpha amd64 arm hppa hurd-i386 i386 ia64 m68k mips mipsel powerpc s390 sparc Components: main contrib non-free Description: Debian Unstable - Not Released
The second part could be accomplished in 2 ways:
- All in the Release file
- Add information to the Release file for the large Packages files broken up into multiple pieces, each with an SHA1 sum. This would be similar to the "SHA1-pieces" field added to the Packages files for large files. As SHA1 sums for Packages files are currently listed on a single line per file, this would probably mean adding a new section at the end of the Release file, where each Packages file larger than the threshold would have the following entries (for 1 MB pieces).
Path: main/binary-alpha/Packages Size: 20497160 SHA1-pieces: e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576 ... 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576 ccf271b7830882da1791852baeca1737fcbe4b90 574216 Path: main/binary-alpha/Packages.gz Size: 5969270 SHA1-pieces: 9425fa8de16f6283365f6bee87f405da16a203e6 1048576 ...
This has the advantage of being simpler to manage and read, and possibly simpler to implement, as all the information is stored in a single location. However, it will add a lot of size to the Release file, which would then need to be downloaded by everyone accessing the archive. - Separate files
- The information for Packages files broken up into pieces could be stored separately in a file for each Packages file. The file would contain the same information as the Release file in the previous proposal. For eaxmple, the main/binary-alpha/Packages file would have alongside it a main/binary-alpha/Packages.pieces file containing (for 1 MB pieces)
SHA1-pieces: e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576 ... 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576 ccf271b7830882da1791852baeca1737fcbe4b90 574216
This adds the complexity of many additional files, but also some flexibility in terms of adding additional features in the future. The size added per person accessing the archive is also minimized, as they will not download additional information on architectures they aren't interested in, nor will downloaders not using DebTorrent need to download this information at all.
Implications
Choosing a maximum piece size, and assuming the formats above are used, we can calculate the quantity of data that needs to be added for the two methods listed above. This gives the following for the Release file in the amd64 unstable archive (percentages are the increase in file size as a percentage of the current file size):
Piece |
Number of |
All in Release |
Separate Files |
|
Size |
Pieces |
Data Added |
Files Needed |
Total Size |
32 KB |
12971 |
613 KB (788.4%) |
159 |
604 KB |
64 KB |
6574 |
307 KB (394.5%) |
106 |
301 KB |
128 KB |
3399 |
158 KB (203.1%) |
79 |
154 KB |
256 KB |
1817 |
79 KB (101.3%) |
49 |
76 KB |
512 KB |
1036 |
41 KB (52.2%) |
42 |
39 KB |
1 MB |
649 |
22 KB (28.8%) |
42 |
20 KB |
For the first method ("All in Release"), the percentage increase in file sizes can be quite large, especially for the smaller piece sizes. However, the absolute increase is still very small. For the second method ("Separate Files"), the total data added to the archive is similar, and therefore manageable. The number of files added can be minimized though by choosing a piece size of 512 KB or larger. Therefore, either method is suitable.
The other factor is the increased computation time needed to generate the hashes for these pieces. As the largest Packages files are the ones that need hashes generated for their pieces, it again turns out that most of the size needs to be hashed to create the SHA1 values needed. Specifically, for a piece size of 512 KB, 97% of the archive (by size) needs to be rehashed. Since three hashes are currently calculated for each Packages file, and implementing this proposal will effectively add a fourth, a reasonable assumption is that it will take 33% more computation to update Release files under this plan, than it did previously.