Goals

debci currently runs on only one machine using schroot, with no clear separation between "worker jobs" (running adt-run and the actual tests), metadata generation, and the web UI. We want to change this.

For that we need to strictly separate the test execution backends, the meta-data generation/UI/presentation, the data storage, and the policy (i. e. when to run which test), so that each of these can run on different machines, be redundant/have failover, and scale to our current needs.

Design

Data structure

We want to keep the log files and artifacts (called "results") of test runs around for some time. We want all test runs for the current distro development series, and maybe the most recent ones for all supported distro releases. As this is fairly precious, needs to be available at all times in order to not block package propagation, and needs to be written by all workers, this needs to be in a distributed no-SPOF network file system. We choose OpenStack Swift for that, as it is well supported, widely used, and packaged.

However, the interface to this storage is relatively small, and we might want to switch to other solutions in the future. Also, to support the current Debian setup, we want to keep supporting the case where everything runs on just one host, at least for the time being.

Thus the actual workers and the debci web UI will only operate on local file systems. This is necessary for creating meta-data, which needs a consistent snapshot of the current state; that is generally not possible on a decentralized network file system. Also, serving results through a web front end generally needs a local file system for efficiency. There will be small separate scripts to upload a locally generated results directory into Swift (for the workers) and to download the most recent N results for all packages/architectures/releases from Swift (for the web UI).
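A minimal sketch of the worker-side upload script, assuming python-swiftclient, OpenStack credentials in the usual OS_* environment variables, and a hypothetical container name "debci":

    # Sketch: push a locally generated results directory into Swift.
    # Assumes OS_AUTH_URL/OS_USERNAME/OS_PASSWORD/OS_TENANT_NAME in the
    # environment; the container name "debci" is only an example.
    import os
    from swiftclient.client import Connection

    def upload_results(local_dir, swift_prefix, container='debci'):
        conn = Connection(authurl=os.environ['OS_AUTH_URL'],
                          user=os.environ['OS_USERNAME'],
                          key=os.environ['OS_PASSWORD'],
                          tenant_name=os.environ['OS_TENANT_NAME'],
                          auth_version='2.0')
        conn.put_container(container)  # idempotent if it already exists
        for root, _, files in os.walk(local_dir):
            for name in files:
                path = os.path.join(root, name)
                # object names mirror the directory structure described below
                obj = swift_prefix + '/' + os.path.relpath(path, local_dir)
                with open(path, 'rb') as fd:
                    conn.put_object(container, obj, contents=fd)

    # e.g. upload_results('/tmp/adt-output',
    #                     'trusty/amd64/libp/libpng/20140321_130412')

The download side for the web UI would do the reverse: list the objects for the most recent N runs and fetch them into the local results tree.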

The data should be structured so that finding a particular result is efficient and easy, name collisions are avoided (as we cannot assume global locking), and results sort chronologically so that we can consider the N most recent ones. So the directory structure will be

    /<release>/<architecture>/<prefix>/<source package>/<YYYYMMDD_HHMMSS>/

If a runner has "platform tags" (see below for details), the structure additionally includes those tags.

prefix is the first letter of the source package name (or the first four letters if it starts with "lib"), as is usual for Debian-style archives

For example, the log files of one libpng run on trusty/amd64 would live in /trusty/amd64/libp/libpng/20140321_130412/
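A small sketch of how a runner could construct such a directory name, following the prefix and timestamp rules above (the function name is hypothetical):

    # Sketch: build the per-run result directory name from the rules above.
    from datetime import datetime

    def result_dir(release, arch, source, when=None):
        # "libpng" -> "libp", "glibc" -> "g"
        prefix = source[:4] if source.startswith('lib') else source[0]
        stamp = (when or datetime.utcnow()).strftime('%Y%m%d_%H%M%S')
        return '/'.join([release, arch, prefix, source, stamp])

    # result_dir('trusty', 'amd64', 'libpng')
    #   -> e.g. 'trusty/amd64/libp/libpng/20140321_130412'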

Job distribution/management

Queue structure

debci will be built around the notion of AMQP "job queues" provided by RabbitMQ. The rabbitmq-server package pretty much works out of the box, does not need much configuration, and does not have inordinately heavy dependencies. RabbitMQ provides failover with mirrored queues, so we do not have a single point of failure here.

We want to use a reasonably fine-grained queue structure so that we can support workers that serve only certain releases, architectures, virtualization servers, or platforms, where "platform" is a free-form set of tags that worker nodes can carry to describe their environment. For example:
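As an illustration, such queues could be declared with a recent pika client as sketched below; the queue names only show a possible release-architecture[-backend[-tags]] naming scheme and are not fixed by this design. Queue mirroring itself would typically be configured as a RabbitMQ HA policy on the server, not by the clients:

    # Sketch: declare fine-grained, durable job queues (hypothetical names).
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('rabbit.example.com'))
    ch = conn.channel()
    for name in ('debci-tests-trusty-amd64',
                 'debci-tests-trusty-i386',
                 'debci-tests-trusty-amd64-null-desktop-nvidia'):
        ch.queue_declare(queue=name, durable=True)  # survive broker restarts
    conn.close()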

Note: We may not want to support schroot runners at all, as LXC adds little overhead, provides much better isolation, and is available on all Linux architectures; but the design should be able to support them. We will have most workers serving the "simple" default environments (which can be brought up and torn down on the fly with juju or similar), and a few dedicated boxes which can serve the "null-desktop-nvidia" queues and similar.

Job creation

The requester (e. g. the debci batch runner or britney) downloads the test metadata (debian/tests/control) for a particular package, and decides which queues to put it in:

The queue request then just consists of the source package name, which is put into all queues selected above.
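A minimal sketch of such a request with pika (queue names as above are hypothetical; the timestamp property anticipates the "Health check" section below):

    # Sketch: the requester puts the source package name into each selected queue.
    import time
    import pika

    def request_test(channel, srcpkg, queues):
        for q in queues:
            channel.queue_declare(queue=q, durable=True)
            channel.basic_publish(
                exchange='',                 # default exchange routes by queue name
                routing_key=q,
                body=srcpkg,
                properties=pika.BasicProperties(
                    delivery_mode=2,                # persistent message
                    timestamp=int(time.time())))    # see "Health check" below

    conn = pika.BlockingConnection(pika.ConnectionParameters('rabbit.example.com'))
    request_test(conn.channel(), 'libpng', ['debci-tests-trusty-amd64'])
    conn.close()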

Job consumption

A worker node knows its supported architecture(s), the ADT schroots/containers/VMs it has available and their releases, or that it is a dedicated machine ("null runner") with a set of tags (e. g. "desktop" and "nvidia"). From that it determines which AMQP queues to subscribe to. Once it receives a test request, it runs it with the appropriate adt-run command, stores the log in the distributed file system, and then ACKs the AMQP request.
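A rough sketch of that consumption loop with a recent pika client; the adt-run invocation and the upload step are simplified placeholders, and upload_results refers to the hypothetical helper from the Swift sketch above:

    # Sketch: a worker consuming from the queues it determined above.
    import subprocess
    import pika

    def on_request(ch, method, properties, body):
        srcpkg = body.decode()
        # run the test; the exact adt-run/virt-server options depend on the
        # worker's backend configuration and are only illustrative here
        subprocess.call(['adt-run', srcpkg, '--output-dir', '/tmp/adt-output',
                         '---', 'adt-virt-lxc', 'adt-trusty-amd64'])
        # push logs/artifacts to the shared storage before acknowledging, so
        # that an unACKed request gets redelivered if the worker crashes here
        # upload_results('/tmp/adt-output', ...)   (see the Swift sketch above)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    conn = pika.BlockingConnection(pika.ConnectionParameters('rabbit.example.com'))
    ch = conn.channel()
    ch.basic_qos(prefetch_count=1)          # take one job at a time
    for q in ('debci-tests-trusty-amd64', 'debci-tests-trusty-i386'):
        ch.queue_declare(queue=q, durable=True)
        ch.basic_consume(queue=q, on_message_callback=on_request)
    ch.start_consuming()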

Health check

adt-run has built-in timeouts for everything, but sometimes they fail: we had cases where a heavily loaded host running several LXC or QEMU test beds in parallel encountered a hardware or kernel error and simply froze. RabbitMQ has no timeout for message ACKs (i. e. messages can be "held" by a worker node indefinitely), so we need an additional regular liveness check. It is unknown whether the TCP connection that the RabbitMQ server holds to the connected client will always eventually time out even in the absence of a TCP RST from the client; but even if it does, it seems like a good idea to have an extra safety net on top of the local adt-run timeouts.

The health checking controller node sends out a ping-like message to a fanout "healthcheck" AMQP exchange, with a unique correlation ID (the current timestamp should do) and a reply queue "healthcheck_reply" (the AMQP RPC pattern). The healthcheck monitor on each client node is expected to check whether the main worker process is running and whether there are any stuck adt-run processes (i. e. older than e. g. 12 hours); if there are none, it replies with its host name. If the whole node is down or the healthcheck monitor itself has crashed, it won't receive the message and thus won't send a reply either. The controller determines the set of "expected" workers from the shared file system, i. e. which worker nodes have submitted results in the past N days. Any worker which does not respond on the reply queue within 30 seconds gets a notification created for it.
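A sketch of the controller side of this RPC pattern with a recent pika client; each worker's healthcheck monitor is assumed to bind its own queue to the fanout exchange and reply with its host name:

    # Sketch: controller side of the health check (AMQP RPC pattern).
    import time
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('rabbit.example.com'))
    ch = conn.channel()
    ch.exchange_declare(exchange='healthcheck', exchange_type='fanout')
    ch.queue_declare(queue='healthcheck_reply')

    corr_id = str(int(time.time()))        # current timestamp as correlation ID
    ch.basic_publish(exchange='healthcheck', routing_key='',
                     properties=pika.BasicProperties(
                         correlation_id=corr_id,
                         reply_to='healthcheck_reply'),
                     body='ping')

    alive = set()
    # collect replies; stop once nothing has arrived for 30 seconds
    for method, properties, body in ch.consume('healthcheck_reply',
                                               inactivity_timeout=30):
        if method is None:
            break
        if properties.correlation_id == corr_id:
            alive.add(body.decode())
        ch.basic_ack(method.delivery_tag)
    ch.cancel()
    conn.close()

    # expected = worker host names that submitted results in the past N days;
    # create a notification for every worker in (expected - alive)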

The health checking controller also needs to inspect all AMQP queues for messages which have not been consumed for 3 hours, and notify about them. This means either that there is no worker which could serve that queue, or that all workers for that type of test are busy, and thus identifies a bottleneck. Note that AMQP requests do not carry a timestamp by default, so this needs to be added to the message.
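A sketch of such a queue inspection, assuming the requester sets the optional AMQP timestamp property as in the job creation sketch above:

    # Sketch: flag queues whose oldest message has been waiting for > 3 hours.
    import time

    def stale_queues(channel, queues, max_age=3 * 3600):
        stale = []
        for q in queues:
            method, properties, body = channel.basic_get(queue=q)
            if method is None:
                continue                    # queue is empty
            # requeue immediately; we only wanted to peek at the head
            channel.basic_nack(method.delivery_tag, requeue=True)
            if properties.timestamp and time.time() - properties.timestamp > max_age:
                stale.append(q)
        return stale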

Failure handling

In such a distributed system there are a lot of things which can go wrong. This section analyzes failure scenarios and how to handle them. We ignore failures of individual Swift nodes, as a proper Swift setup already includes multiple redundant storage nodes.

Worker node problems

Test-induced problems

AMQP problems

Controller problems

debci development tasks

( (./) : done in trunk, {*} : done in Martin's branch)

  1. (./) Remove assumption that tests run on the local box with schroot

    • lib/functions.sh accesses the schroot from outside → don't assume schroot; just download debci_suite's Sources.gz, cache it, and derive everything (sources, binaries, etc.) from that

    • list-base-system / list-dependencies don't translate to other (e. g. qemu) or remote runners → recent autopkgtest already records the package version and all packages/versions in the testbed as an artifact; split that in autopkgtest itself (./) and use these split files

    • debci-setup creates local schroot, but this should be in backends/schroot/create-testbed

  2. (./) Change data directory structure

    • keep one directory for each run, so that we can keep all artifacts (per-test stdout/err, summary, custom artifact files)
    • separate locally generated data (*.json, status, etc.) from autopkgtest-generated data (log, artifacts) if possible, so that we can more easily rebuild/update/refresh them
    • create migration script for existing data
  3. Remove synchronous waiting on adt-run
    • (./) debci-test waits for adt-run and then writes .json files → all .json generation needs to move to generate-index

    • debci-batch calls debci-test for all packages and then updates indexes → debci-batch does not work in a britney/CLI-controlled design anyway, so that's fine for now. But index generation needs to move to a cron job or daemon which runs it regularly, to pick up new log files from remote FSes

    • (./) debci-test has policy about when to run packages → with britney we need to run a package several times a day, and we sometimes want to retry tests; we should not put that policy into the worker logic; it should be in debci-batch only for the "batched run" use case

  4. Move from local synchronous adt-run execution to asynchronous distributed jobs with worker nodes
    • (./) Separate controller and worker, do all job requests/accepts through AMQP queues with RabbitMQ

    • Implement health checking
    • Integrate it into generate-index
    • Present it in the web UI
  5. Support distributed file system
    • Should support just one local file system for the easiest case
    • Workers store their output into a network file system. If that's currently down, cache it locally and flush it on the next run
    • Controller retrieves worker output from network fs. Regularly sync the data (last N runs only) from swift into the local file system, and generate metadata and serve web from that.
  6. UI improvements
    • (./) add link to test output directory, to see all artifacts

    • (./) parse run duration from adt log instead of measuring the test-package time, as that only makes sense for local runners

    • (./) add per-architecture view

  7. Configuration improvements
    • (./) should have a config/debci that gets sourced by lib/environment.sh to override default variable values without having to change git-tracked source

    • (./) add a config option for mirror; use it for debootstrap and local download of debci_suite's Sources.gz.