Differences between revisions 4 and 5
Revision 4 as of 2007-08-28 19:56:02
Size: 10567
Editor: CameronDale
Comment: Updated to current status, new status update proposal
Revision 5 as of 2007-11-03 19:47:29
Size: 10718
Editor: MichałPolitowski
Comment: a comment on status updates

This page details a proposed new APT method for communicating between APT and the DebTorrent program. More information on APT methods can be found in the [http://packages.debian.org/libapt-pkg-doc libapt-pkg-doc package].

The Original State of Communication

Originally, the DebTorrent program used APT's HTTP retrieval method to communicate with APT. It implements an almost complete proxy for downloading files from HTTP mirrors, with two exceptions. First, since it considers Packages files to be torrents, it notes when they are requested and starts the corresponding torrent. Second, when DebTorrent receives a request for a package file (which it identifies by extension), it finds the torrent that contains that file and begins to download it using the DebTorrent protocol (i.e. not using HTTP). Once the download is complete, it passes the file on to APT as if it had been downloaded directly from the HTTP mirror.

The major problems with this method are:

  • BitTorrent downloads start off slowly, so downloading multiple packages sequentially makes the complete download very slow
  • downloads occur in a random order, so requiring that packages be returned in the order they were requested delays packages that have already completed
  • the user needs to be aware that downloading is happening (such as by a progress bar), which is not possible due to the non-sequential downloading of files that occurs in BitTorrent

To solve the first problem of slow startup of downloads, multiple packages need to be downloaded at once from the same torrent, without waiting for one to finish before starting another. This could be as simple as telling APT to pipeline multiple requests to DebTorrent, which would alleviate some of the problem.

The second problem of ordered downloads requires a new transport method to solve.

The last problem is trickier, as APT is only aware of when downloads begin and end. Pipelined downloads returned in any order will help, since files will start and finish more often and the lack of progress will be less noticeable. However, BitTorrent downloads usually proceed in such a way that all the files complete at nearly the same time, at the end of the download. So even with pipelined downloads, it may appear to the user that nothing is happening (at which point they may abort the download), until everything suddenly arrives at the end.

HTTP/1.1 Pipelining

The HTTP protocol already has functionality built into it to allow for [http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html pipelining multiple requests] over the same connection. The current DebTorrent APT request listener implements HTTP/1.1, so the current http method of APT can be used to pipeline multiple requests. The APT http method has a configuration parameter (Acquire::http::Pipeline-Depth) that controls the maximum number of outstanding requests sent on a single connection. It defaults to 5, and the code caps the value at 10. The HTTP protocol also specifies that the files returned over a pipelined connection must be in the same order as they were requested, since the server response does not identify which file it belongs to, and APT's http method conforms to this.

The advantages of this are:

  • relatively simple to implement
  • no changes needed to the APT code
  • works with other package managers that don't use APT's methods

The disadvantages are:

  • can only pipeline 10 requests at a time
  • files must be returned in the order they were requested (though they may not finish downloading in that order)
  • no status updates are possible while the download is under way
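
The cost of the ordering requirement can be illustrated with a small sketch. The finish times below are made up for illustration, and the function name is hypothetical; the only assumption is that a file cannot be handed to APT until every file requested before it has also been returned.

```python
from itertools import accumulate


def in_order_delivery_times(finish_times):
    """Time at which each file can be returned when responses must follow
    request order: each file waits for the slowest file requested before it."""
    return list(accumulate(finish_times, max))


# Hypothetical finish times (seconds) for five pipelined requests, listed in
# request order. BitTorrent finishes them in an effectively random order.
finish_times = [40, 5, 12, 38, 7]

delivered = in_order_delivery_times(finish_times)
delays = [d - f for d, f in zip(delivered, finish_times)]

print(delivered)  # [40, 40, 40, 40, 40]
print(delays)     # [0, 35, 28, 2, 33]
```

In this example the first request happens to finish last, so every other file sits completed but undelivered until it arrives.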

A New APT Method: apt-transport-debtorrent

Instead of using APT's http method, a new debtorrent method can be used by APT to request downloads of files (and possibly other information, see the proposal below) from DebTorrent. This method is indicated in the sources.list file by "debtorrent://...". It is based on APT's http method, and will use a protocol similar to HTTP to communicate with DebTorrent, which will allow it to be accessed easily from other machines on the network. The debtorrent method will send all requests from APT to the DebTorrent program immediately, without waiting for any requests to complete. This "ultra-pipelining" will allow all the downloads to occur in parallel. The debtorrent method will also expect the downloaded files to be returned in an arbitrary order, and will pass them on in that order to the APT program (which supports an arbitrary ordering of returned packages).

This new method has been created as a separate package [http://debtorrent.alioth.debian.org apt-transport-debtorrent], and so does not require any major changes to the APT code. However, one minor change to the APT code is still needed, as the current APT code will only pass a maximum of 10 requests to a method at a time. In order for this method to improve on the HTTP/1.1 Pipelining solution, this limit will be raised to 1000 (this change has [http://codebrowse.launchpad.net/~mvo/apt/apt--mvo/revision/michael.vogt%40ubuntu.com-20070828081707-f2pgqh1tbijviq8w?start_revid=michael.vogt%40ubuntu.com-20070828085459-0bpn4rgejkkrayhc already been made] to the APT code, and should be in the next release after 0.7.6).

The advantages are:

  • pipelines a large number of requests at once
  • removes the requirement for ordered returned files
  • a very small number of changes are needed to the APT code
  • status updates may be possible using a further addition to the method (see proposal below)

The disadvantages are:

  • doesn't work with other package managers that don't use APT's methods

Proposal: Status Updates

The last problem, that of no status updates in APT while the download is underway, has not yet been solved. Using the debtorrent APT method described above does solve some of the problems with this lack of updates (since files are constantly being downloaded), but APT's indication of the download completion is inaccurate and there is no indication of how long the download will take to complete. As an example, here is some data on the inaccuracies reported by APT from a dist-upgrade using the debtorrent APT method, compared with the accurate value taken from the DebTorrent status page.

APT Reports    Actual Completion
10%            40%
20%            70%
30%            85%
40%            90%
50%            93%
60%            96%
70%            97.5%
80%            98%
90%            99%

These inaccuracies occur because pieces of a multiple piece file have been downloaded, but APT does not yet know about them since the file is not yet complete.
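
A minimal sketch of how this gap arises, assuming APT counts a file only once it is complete. The file sizes are invented for illustration, and the 512 KB piece size is the one mentioned further down this page.

```python
PIECE = 512 * 1024  # piece size in bytes (512 KB)


def progress(files):
    """files: list of (size_bytes, bytes_downloaded) tuples.
    Returns (fraction visible to APT, fraction actually downloaded)."""
    total = sum(size for size, _ in files)
    actual = sum(done for _, done in files)
    # Assumption: APT only learns about a file once it is complete.
    visible = sum(size for size, done in files if done >= size)
    return visible / total, actual / total


# Hypothetical state part-way through a download.
files = [
    (1 * PIECE, 1 * PIECE),  # small package, complete
    (8 * PIECE, 7 * PIECE),  # large package, one piece missing
    (6 * PIECE, 5 * PIECE),  # large package, one piece missing
]

apt, actual = progress(files)
print(f"APT sees {apt:.0%}, actually {actual:.0%}")  # APT sees 7%, actually 87%
```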

To fix this, status updates need to be passed from the DebTorrent client to the debtorrent method. These updates will use a new status code within the debtorrent APT method: a 1xx informational message that is not defined by the HTTP protocol. This message can be sent at any point between the messages that carry downloaded files from the DebTorrent client to the debtorrent APT method. There are two possibilities for its contents and use.

Updates when pieces download

This status message will be sent whenever a new piece of a multi-piece file is downloaded. It will contain the name of the file, and the size of the downloaded piece (but not the piece data itself). If the file was only a single piece, or if the piece is the final one needed to complete a multi-piece file, this message will not be sent, as instead the complete file will be sent.

The debtorrent APT method will use this message to fill the target files with empty data to indicate to the APT program that the file is partially downloaded. The file will be filled with all 0's, until the file itself is returned from the DebTorrent client, at which point the file will be rewritten with the actual data.
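
The placeholder handling described above might look roughly like the sketch below. The function names and the file name are hypothetical and are not taken from the debtorrent method's code. The sketch extends the file with truncate(), which on most systems zero-fills the new area; on filesystems that support sparse files this usually avoids writing the zeros to disk.

```python
import os
import tempfile


def mark_partial(path, piece_size):
    """Grow the placeholder file by piece_size zero bytes and return its
    new size, so that a size check sees the file as partially downloaded."""
    current = os.path.getsize(path) if os.path.exists(path) else 0
    with open(path, "ab") as f:
        f.truncate(current + piece_size)
    return current + piece_size


def store_complete(path, data):
    """Replace the zero-filled placeholder with the real file contents."""
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "example_1.0_all.deb")  # hypothetical file name
    mark_partial(target, 512 * 1024)
    mark_partial(target, 512 * 1024)
    print(os.path.getsize(target))  # 1048576
    store_complete(target, b"real package bytes")
    print(os.path.getsize(target))  # 18
```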

The advantages are:

  • APT will have some knowledge of the partial downloading of files
  • easy to implement
  • no changes needed to the APT code (only the separate debtorrent APT method)

The disadvantages are:

  • APT will need to monitor many files' sizes (previously only one file was monitored)
  • requires filling the larger files with data (almost) twice (increased hard disk usage)
    • but these files could be sparse, I think, which would mean almost no increase in disk usage and no filesystem fragmentation -- MichałPolitowski
  • may cause increased fragmentation of the file system due to writing many files incrementally
  • will still be slightly inaccurate and slow to update since pieces are large (512 KB)
    • with a 512kbps connection, a single piece download will take 8 seconds, so status updates will be 8 seconds apart
    • if multiple pieces are being downloaded simultaneously (which is almost always the case), the updates could be much further apart (though on average they would still be every 8 seconds)
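
The 8 second figure can be checked with a short calculation. This sketch assumes that simultaneous pieces share the link equally, which is a simplification; the function name is illustrative only.

```python
def piece_timing(piece_bytes, link_bps, parallel=1):
    """Return (seconds for one piece to finish, average seconds between
    piece completions) when `parallel` pieces share the link equally."""
    per_piece = piece_bytes * 8 / link_bps * parallel
    average_interval = per_piece / parallel
    return per_piece, average_interval


piece_bytes = 512 * 1024  # 512 KB piece
link_bps = 512 * 1000     # 512 kbps connection

single = piece_timing(piece_bytes, link_bps)
shared = piece_timing(piece_bytes, link_bps, parallel=4)

print([round(t, 1) for t in single])  # [8.2, 8.2]
print([round(t, 1) for t in shared])  # [32.8, 8.2]
```

With four pieces in flight, the first update arrives only after about 33 seconds, even though the average spacing between updates is unchanged.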

Updates of amount of download completed

This status message will be sent periodically from the DebTorrent client to the debtorrent APT method. It will contain the total number of bytes that have been downloaded for all the requests sent on this connection from APT. DebTorrent will need to track this count somehow.

The debtorrent APT method will use this information to send a download completion update to the main APT program using a new informational message added to the APT method communication protocol. The APT program will use this value instead of checking the size of the target files of the download to determine how much of the download is complete, and how fast it is progressing.
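
One way the per-connection byte count might be tracked is sketched below. This is not how DebTorrent works today (it does not track this information); the class and method names are hypothetical, and the sketch assumes each downloaded piece can be attributed to a single named file.

```python
class ConnectionProgress:
    """Counts downloaded bytes only for files requested on one connection."""

    def __init__(self):
        self.requested = set()
        self.downloaded = 0

    def request(self, name):
        """Record that APT asked for this file on this connection."""
        self.requested.add(name)

    def piece_done(self, name, nbytes):
        """Add a finished piece to the count if its file was requested here.
        Pieces of other files in the same torrent are ignored."""
        if name in self.requested:
            self.downloaded += nbytes

    def status(self):
        """Byte count to report in the periodic status message."""
        return self.downloaded


conn = ConnectionProgress()
conn.request("a.deb")
conn.piece_done("a.deb", 512 * 1024)
conn.piece_done("b.deb", 512 * 1024)  # not requested on this connection
print(conn.status())  # 524288
```

Keeping the counter on the connection object means a dropped and reopened connection starts again from zero, which leaves open the question of how re-requested files should be counted.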

The advantages are:

  • APT will have a very accurate and up-to-date indication of the status of the download
  • no file-system issues (monitoring, increased hard-disk usage or fragmentation)

The disadvantages are:

  • requires many changes to the APT code
    • implementing the new method message
    • using the download completion value instead of the file sizes
  • difficult to implement, as this information is not tracked by DebTorrent
    • it cannot be the total bytes downloaded, but must be just the number of bytes downloaded for files requested on this connection
    • what happens if the connection gets dropped, reopened, and all files get rerequested?