Apt BitTorrent

This page is intended to collect information, ideas, problems, solutions and comments related to adding BitTorrent functionality to the downloading of package files by Apt.

Motivation

The benefits of this project are clear, both to the Debian project and its mirrors and to any other developer wanting to host a popular archive while keeping bandwidth costs down. Upon completion and wide-scale deployment of the service, the bandwidth and hardware costs of providing a very large Debian archive to hundreds of thousands of users will be dramatically reduced.

These costs are currently reduced by the voluntary mirroring system Debian uses to help distribute packages. This system has drawbacks, though, especially as the archive grows. Some mirrors already feel the burden of its size, which has led to the introduction of partial mirroring. The system also creates multiple central points of failure, since most users depend on a single mirror, and it does a poor job of evenly distributing load, since some mirrors are chosen more often than others. Finally, non-primary mirrors may be slow to pick up new versions of packages, and changes propagate slowly because users must update their sources.list files manually.

However, using a BitTorrent-like system, these voluntary mirrors could simply join the swarm of peers for the archive: mirroring only as much data and contributing only as much bandwidth as they can, providing multiple redundancies and being automatically load balanced by the system, and using the bandwidth savings to update their packages more frequently. This will further allow for future growth, both in the size of the archive and in the popularity of Debian.

History

This was originally proposed and accepted as a Google Summer of Code project in 2006. Unfortunately, the project failed due to time restrictions and the complexity of the project, but some good ideas came out of it.

A similar project was accepted for the 2007 Google Summer of Code, and is currently underway: [http://code.google.com/soc/debian/appinfo.html?csaid=46454031B77ABCBA]

Problems

Though the idea of implementing a BitTorrent-like solution to package distribution seems good, there are some problems with the current way that BitTorrent distributes files that make it unsuitable for a Debian archive. Overcoming these limitations will mostly require modifications to the BitTorrent client and protocol. The 2006 Google SoC project identified some of these concerns, and more have been added since (if you have a new problem, add it here).

The Protocol

Implementation Details

Solution Ideas

New or interesting ways of solving the problems above are needed. If you have an idea for a new solution, or just a way to improve a currently proposed solution, add it here. Don't worry if your solution solves one problem but adds another or conflicts with solving other problems, this is just a repository for solution ideas.

Protocol Solutions

  • many packages are too small relative to the piece size:

    • pieces could be made variable in size, so that a single small package can be its own piece, however small the package is
      • breaks completely with the BitTorrent model of pieces

    • packages could be ordered in such a way that the wasted bandwidth is minimal
      • ordered by dependency, so packages that are always installed together are more likely to be in the same piece
      • ordered by popularity using popularity-contest data, so packages that are popular are in the same piece
      • ordered by correlation with other packages using raw popularity-contest data, so packages that many people install together are in the same piece
      • how do updates to packages get ordered in new pieces?
  • there are too many packages to create individual torrents for each and the archive is too large to track efficiently as a single torrent:

    • create a torrent for each suite (stable/testing/unstable ...)
    • create a torrent for each architecture, and for architecture:all and source
    • create a torrent for each Packages file
  • all downloaders of a package need to be aware of all other downloaders:

    • allow the tracker to do cross-torrent communication
    • keep torrent boundaries consistent so that users in a torrent are only interested in users in the same torrent
  • the archive is frequently updated:
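
The variable-size-piece idea above can be sketched in a few lines. This is only an illustration of the concept, not DebTorrent code: each package at or below a chosen piece size becomes exactly one piece, and larger packages are split into fixed-size chunks. The package names and sizes are made up for the example.

```python
# Sketch: one piece per small package, fixed-size chunks for large packages.
# The package entries below are illustrative, not real archive data.

PIECE_SIZE = 256 * 1024  # chunk size used when a package must be split

def make_pieces(packages, piece_size=PIECE_SIZE):
    """Return a list of (package, offset, length) piece descriptors."""
    pieces = []
    for name, size in packages:
        if size <= piece_size:
            # a small package is a single piece, however small
            pieces.append((name, 0, size))
        else:
            # a large package is split into fixed-size chunks
            for offset in range(0, size, piece_size):
                pieces.append((name, offset, min(piece_size, size - offset)))
    return pieces

packages = [("tiny-doc", 12 * 1024), ("big-lib", 600 * 1024)]
pieces = make_pieces(packages)
# "tiny-doc" is one 12 KB piece; "big-lib" splits into 3 chunks of 256 KB or less
```

Because piece boundaries now coincide with package boundaries, a downloader never fetches bytes from a package it does not want, which is exactly the waste the ordering strategies below try to minimize for fixed-size pieces.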

Implementation Solutions

Full Proposals

If you have a full proposal for how to implement this functionality, possibly combining some of the ideas above, put it here. Make sure to list which problems it solves, and which it doesn't. Also, feel free to comment on others' proposals.

Proposal A

Implementation Steps

A good ordering of initial implementation steps might be (note that there is a working product at each step that can be tested):

  1. Implement a single variable size piece for each package (test using manually created .torrent files from Packages files for part of the archive).
  2. Add parsing of current Packages files so they can be used unmodified as torrent files.
  3. Implement proxy functionality to receive input from apt about which packages to download.
  4. Automate the process to run as a daemon started from /etc/init.d.

At this point, a public beta can be released to get feedback and usage information from general users.
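
Step 2 above relies on the fact that a Packages file already contains everything a torrent file needs (filename, size, checksum). As a rough sketch of that parsing step, the stanza format can be split on blank lines and colon-separated fields; the sample stanza and its SHA1 value below are made up for illustration.

```python
# Sketch of step 2: reading a Debian Packages file as torrent metadata.
# Packages files are RFC-822-style stanzas separated by blank lines.

def parse_packages(text):
    """Yield one dict of fields per package stanza."""
    for stanza in text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            if line[:1].isspace():
                continue  # continuation line of a multi-line field; skipped here
            key, _, value = line.partition(":")
            fields[key] = value.strip()
        yield fields

sample = """\
Package: hello
Filename: pool/main/h/hello/hello_2.10-2_amd64.deb
Size: 56132
SHA1: 0000000000000000000000000000000000000000
"""

entries = list(parse_packages(sample))
# entries[0]["Filename"] locates the .deb within the mirror tree,
# and Size/SHA1 play the role of piece length and piece hash
```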

Future steps could then be:

Pros

Cons

Comments

Analysis

Ordering Strategies for Fixed Size Pieces

I did some preliminary analysis on the efficiency of this strategy. Using my install as a test (amd64 desktop), I examined different ordering strategies for pieces in the main-amd64 Packages file (which accounts for 1486 out of the 1551 packages I have installed). The total size for this configuration is 1.1489 GB. I tried out different piece sizes from 32 KB to 512 KB, which gives the following number of pieces for the entire torrent:

  Piece Size   Number of Pieces
  32 KB        466630
  64 KB        233315
  128 KB       116658
  256 KB       58329
  512 KB       29165

I would say that over 100,000 pieces is getting quite large, and we would probably want to go with 256 KB or 512 KB piece sizes.

[http://www.cs.sfu.ca/~camerond/personal/dump/wasted_download.png]

The above graph shows the results for five different orderings. The best is sorting by Priority, then Section, then Source package (probably because I often install multiple packages built from the same source package, which is interesting in itself). Even the best strategy wastes at least 10% of the download at a 256 KB piece size.
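
The waste measured in the graph can be computed with a simple model: lay the packages out contiguously in a chosen order, mark every fixed-size piece touched by a wanted package, and subtract the useful bytes from the bytes downloaded. This is a sketch of that computation with made-up package sizes, not the script used for the graph.

```python
# Sketch of the waste computation: bytes downloaded from pieces that
# straddle unwanted packages, for a given ordering. Sizes are illustrative.

PIECE = 256 * 1024

def wasted_bytes(ordered_sizes, wanted, piece_size=PIECE):
    """ordered_sizes: list of (name, size) in layout order; wanted: set of names."""
    # record each package's byte span in the concatenated layout
    spans, pos = {}, 0
    for name, size in ordered_sizes:
        spans[name] = (pos, pos + size)
        pos += size
    total = pos
    # mark every piece touched by a wanted package
    needed = set()
    for name in wanted:
        start, end = spans[name]
        needed.update(range(start // piece_size, (end - 1) // piece_size + 1))
    downloaded = sum(min(piece_size, total - p * piece_size) for p in needed)
    useful = sum(size for name, size in ordered_sizes if name in wanted)
    return downloaded - useful

# with "b" between the two wanted packages, all of "b" is downloaded as waste
layout = [("a", 100 * 1024), ("b", 300 * 1024), ("c", 100 * 1024)]
waste = wasted_bytes(layout, {"a", "c"})
```

Reordering so that "a" and "c" are adjacent drops the waste sharply, which is the effect the Priority/Section/Source orderings exploit on real install sets.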

Variable Sized Pieces

I guess the easiest would be to pick one piece size and stick with it for everything, rather than using a bigger piece size for large packages and a smaller one for small packages. This gives the following for the 19091 packages in the same main-amd64 Packages file:

  Piece    Total Number   Number (%) of Packages that are
  Size     of Pieces      1 piece         2-8 pieces      > 8 pieces
  32 KB    476464         5388 (28.2%)    7789 (40.8%)    5914 (31.0%)
  64 KB    243926         8098 (42.4%)    7008 (36.7%)    3985 (20.9%)
  128 KB   128265         10814 (56.6%)   5906 (30.9%)    2371 (12.4%)
  256 KB   71209          13177 (69.0%)   4586 (24.0%)    1328 ( 7.0%)
  512 KB   43331          15106 (79.1%)   3321 (17.4%)     664 ( 3.5%)
  1 MB     29920          16720 (87.6%)   2053 (10.7%)     318 ( 1.7%)

This again suggests that 256 KB or 512 KB is the way to go: anything smaller creates too many pieces, and anything larger means almost every package is a single piece anyway.
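
The table above is straightforward to reproduce from the Size fields of a Packages file. As a sketch (with illustrative sizes standing in for real archive data), count how many pieces each package needs at a given piece size and bucket the packages accordingly:

```python
# Sketch: pieces-per-package histogram for one candidate piece size.
# The sizes list stands in for the Size fields of a real Packages file.
import math

def piece_histogram(sizes, piece_size):
    """Return (total_pieces, n_one_piece, n_2_to_8, n_more_than_8)."""
    one = few = many = total = 0
    for size in sizes:
        n = max(1, math.ceil(size / piece_size))
        total += n
        if n == 1:
            one += 1
        elif n <= 8:
            few += 1
        else:
            many += 1
    return total, one, few, many

sizes = [10 * 1024, 300 * 1024, 5 * 1024 * 1024]  # illustrative package sizes
stats = piece_histogram(sizes, 256 * 1024)
```

Running this over each candidate piece size against the real Size fields yields the rows of the table.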

Additional Comments