Differences between revisions 42 and 43
Revision 42 as of 2013-07-09 10:14:13
Size: 15935
Editor: GeoffSimmons
Comment: InterWiki.
Revision 43 as of 2013-08-04 16:59:10
Size: 15934
Comment: minor fix
Deletions are marked like this. Additions are marked like this.
Line 33: Line 33:
There is another program called apt-metalink [[https://github.com/tatsuhiro-t/apt-metalink]]. It uses DebianPkg:aria2 for download packages from multiple sources. Aria2 supports bittorrent too. There is another program called apt-metalink [[https://github.com/tatsuhiro-t/apt-metalink]]. It uses DebianPkg:aria2 to download packages from multiple sources. Aria2 supports bittorrent too.

Translation(s): English - Italiano


This page is intended to collect information, ideas and comments related to adding BitTorrent functionality to the downloading of package files by Apt.

Other pages:

Motivation

The benefits of this project are clear, both to the Debian project and its mirrors, as well as any other developer wanting to host a popular archive but concerned with bandwidth costs. Upon completion and widescale deployment of the service, the bandwidth and hardware costs of providing a very large Debian archive to hundreds of thousands of users will be dramatically reduced.

These costs are currently being reduced by the voluntary mirroring system Debian uses to help distribute packages. This system comes with some drawbacks though, especially as the size of the archive grows. Some mirrors are already feeling the burden of the size, which has led to the introduction of partial mirroring. It also creates multiple central points of failure for users, as most depend on a single mirror, and does a poor job of evenly distributing the load, as some mirrors may be chosen more often by users. Finally, non-primary mirrors may be slow to update to new versions of packages, and changes happen slowly as sources' lists must be updated manually by users.

However, using a BitTorrent-like system, these voluntary mirrors could simply join the swarm of peers for the archive: mirroring only as much data and contributing only as much bandwidth as they can, providing multiple redundancies and being automatically load balanced by the system, and using the bandwidth savings to update their packages more frequently. This will further allow for future growth, both in the size of the archive and in the popularity of Debian.

This was originally proposed and accepted as a Google Summer of Code project in 2006. Unfortunately, the project failed due to time restrictions and the complexity of the project, but some good ideas came out of it.

A similar project was accepted for the 2007 Google Summer of Code, and the resulting code can be found here

A program that is similar in idea to this one, apt-torrent, has been created by Arnaud Kyheng and is avaliable online: http://sianka.free.fr/. apt-torrent is different in that it doesn't make any changes to the BitTorrent protocol, but rather it provides a wrapper around the BitTornado package which it then uses to download the package files.

There is another program called apt-metalink https://github.com/tatsuhiro-t/apt-metalink. It uses aria2 to download packages from multiple sources. Aria2 supports bittorrent too.

Another derived BitTorrent protocol is gittorrent. Certainly people have talked about ways of using gittorrent to achieve the same thing as this proposal suggests - but it's probably a good idea that both efforts carry on in parallel.

Problems

Though the idea of implementing a BitTorrent-like solution to package distribution seems good, there are some problems with the current way that BitTorrent distributes files that make it unsuitable for a Debian archive. These limitations of the current BitTorrent systems will mostly require modifications to the client and protocol to improve. The 2006 Google SoC project identified some of these concerns, and more have been added since (if you have a new problem, add it here).

The Protocol

  • a lot of packages are too small:

    • an archive is made up of a distribution of file sizes, many of which are smaller than the smallest piece size used by BitTorrent (currently 32 KB).

    • if multiple packages are present in a single piece, then a peer could waste a lot of bandwidth just to get the single package they need from the piece
    • if pieces are too small, the downloading becomes inefficient as a larger percentage of bandwidth is wasted on overhead
  • there are too many packages to create individual torrents for each:

    • a user would need to participate in possibly thousands of torrents simultaneously, wasting resources
    • communication between the same users in many different torrents would be wasteful and hard to track to allow for fair downloading
  • the archive is too large to track efficiently as a single torrent:

    • a Debian archive is a possibly very large repository of packages, including many different versions and architectures
    • a normal user will only want to download a very small subset of the entire archive
    • it is normal in BitTorrent to download the entire torrent, however it is possible to download a subset of the files

  • all downloaders of a package need to be aware of all other downloaders:

    • multiple torrents could exist that contain the same packages (e.g. architecture:all packages, or the same version of packages in testing and unstable)
    • to be more efficient, some communication needs to occur between the users of different torrents containing the same package
  • the archive is frequently updated:

    • though only a very small portion of it at a time is updated, these updates occur on a daily basis
    • BitTorrent is currently not designed to handle updates to files, nor multiple versions of files

    • it may be possible to inform the client of an updated package when trying to download an outdated one
  • locality of transfers is not considered:

    • Locality of transfers is another problem that is not solved in the current BitTorrent protocol. Imagine a single torrent with 1000000+ downloaders distributed all across the world. When a BitTorrent client wants to download a chunk of data, it can not request it from all 1000000+ peers, so it selects a peer to request the piece from in a semi-random manner. Such strategy is simple algorithmically, but could easily create situations when a peer in USA would request a chunk of data from a peer in Australia despite the fact that another peer in the same state could have provided the same piece with a lower global network usage. This creates a situation where the same data is transferred repeatedly across the very busy intercontinental lines. To solve this problem some measure of locality needs to be introduced to the list of the users and the most local users need to be used first.

    • [lkcl]: locality of transfers is not debtorrent's problem to deal with. the internet can take care of itself. In fact, attempting to solve locality issues is second-guessing the layout of the internet and not trusting the bittorrent protocol to do its job. The bittorrent protocol selects the best places to download from. ISPs have created proxies, blockers and stutterers (dropping 50% of traffic) to "deal" with the "problem" of bittorrent, and attempting to second-guess what will be done in the future will only make the task that much more complex. The bottom line is that i believe that the issue of loading the internet should not be confused with the issue of loading the mirrors. The mirrors themselves should be running debtorrent (the client), publishing their (http-)mirrored content as their debtorrent cache.

Implementation Details

  • starting point:

    • as a good starting point, which currently available BitTorrent client should be used?

  • communication with apt:

    • how will apt tell the program to download a package?
    • how will the program return a package to apt?
    • BitTorrent downloads occur more slowly, so multiple package downloads may need to occur at once

    • the user needs to be aware that downloading is happening (such as by a progress bar)
    • duplication of downloaded files
  • archive updates:

    • how do archive updates happen?
  • mirror updates:

    • when new packages are available, how will mirrors be made aware of this so they can automatically fetch them?

The Solution

Here are some details on how the problems listed above are solved in the current implementation of the DebTorrent program.

Protocol Solutions

  • a lot of packages are too small:

    • pieces could be made a variable size, so that a piece could be for a single small package, even if the package size is very small.
    • using apt-metalink and aria2 you could download some files using http/ftp and the biggest using bittorrent, just make a metalink for each deb package and put in there (multiple) http/ftp and or bt urls.
  • there are too many packages to create individual torrents for each and the archive is too large to track efficiently as a single torrent:

    • create a torrent for each Packages file, or 2 if the Architecture:all packages are split out.

    • create a torrent representing subdirectories: map the concept of a hierarchical filesystem on top of torrents.
  • all downloaders of a package need to be aware of all other downloaders:

    • keep torrent boundaries consistent so that users in a torrent are only interested in users in the same torrent
  • the archive is frequently updated:

    • This issue is common to all update methods - ftp, http, rsync, debtorrent, and is being solved in a manner common to all update methods, by debian. Package update diffs are now generated on a daily basis. -- lkml
      • The frequently updated archive problem that debtorrent has is not the same. Since all files must be gathered together into a large torrent file, every time the archive is updated so will the torrent. This is not ideal. Updates to Packages.gz files with diffs are an unrelated issue. -- CameronDale

Implementation Solutions

  • starting point:

  • communication with apt:

    • implement a proxy, such as apt-proxy does, for apt to download through
    • integrate with apt as a new downloading protocol bt:// (like current http:// and ftp://)

  • archive updates:

  • mirror updates:

Full Proposal

Here are the details of how the DebTorrent program solves the problems above.

  • create a torrent for every combination of suite (stable/testing/unstable) and architecture, (possibly including separate ones for architecture:all and source)
  • communicate almost all torrent information in the Packages file
  • use a variable number of pieces and size of pieces
  • small packages are in their own single piece, hashed by their SHA1 value
  • large packages have multiple pieces, with the hashes stored separately (for now)
  • new/updated packages get new piece(s) which are automatically added to the torrent
  • pieces use unique numbers that are never reused over the life of the torrent
    • if a package is updated, the piece number larger than the current maximum one is assigned to it
      • [lkcl]: why?? packages are never updated. an archive is an archive! it is never "updated"! a package will have a new dpkg-allocated version number if it is updated.
        • while a specific version of the package is not updated, the package itself is, and a new .deb file is generated
        • in order for new .deb files to be downloaded, they need to be included in the torrent
        • the new .deb file can't reuse the old piece number(s), as the content is different and will hash differently
        • so to be included in the same torrent, which is the entire purpose of unique piece numbers, new piece numbers are added to the end of the torrent for the new .deb file -- CameronDale

    • allows for the use of bitfields along with updates
    • peers that don't have these new pieces will transmit shorter bitfields (receiving peers assume the peer doesn't have the pieces)
    • peers that aren't aware of the new pieces will get longer bitfields, and will ignore the new piece numbers they aren't aware of
    • see the proposal for maintaining unique piece numbers for more details

  • Interested state now indicates wanting pieces the other peer has
  • uninteresting connections are dropped to find interesting ones
  • new communication added: LOSTHAVE, indicating that the peer has lost the piece (due to cleaning or replacement with a new version)
  • seeding now means not currently downloading, and a peer can return to the downloading state later (when a new package is requested)

Implementation Steps

The initial implementation steps proceeded in this order:

  1. Version 0.1.0: Implement a single variable size piece for each package and add parsing of current Packages files so they can be used unmodified as torrent files.
  2. Version 0.1.1: Use data from dpkg-status to determine which packages/pieces to download.
  3. Version 0.1.2: Implement proxy functionality to receive input from apt about which packages to download, break up large packages into multiple pieces, and download rare pieces using a backup HTTP downloader.
  4. Version 0.1.3: Automate the process to run as a daemon started from /etc/init.d, and package the program as a debian binary package.

Future steps could then be:

  • Improve the interaction between apt and debtorrent to make the downloading more efficient and user-friendly
    • download all desired packages in parallel
    • send status information from debtorrent to apt about the progress of the download
    • requires a new debtorrent:// apt method
  • Use unique piece numbering for testing/unstable so a new torrent isn't created every day
    • use a new file that contains unique piece numbers for packages
  • Modifications to the Debian archive management software to allow for further enhancements to the client
    • add info to break up large packages into a number of pieces to the Packages files
    • add files for unique piece numbering
    • add torrent names to the Release files
  • Add communication protocols, such as ?LostHave or compressed ?BitFields, or the ability to drop uninteresting connections

  • Determine usage patterns from the public beta.
  • Based on measured usage patterns analyze ways to optimize sharing amongst different Packages files (arch:all, different versions, testing/unstable, testing/stable at release, etc ...)

Pros

  • solves the small package problem
  • solves the large number of torrents and large archive problem
  • mostly solves the communication between downloaders of the same package
    • if the same package version exists in more than one suites, communication between the downloaders will not occur

Cons