debci currently runs on only one machine in schroot, with no clear separation of "worker jobs" (running adt-run and the actual tests), metadata generation, and web UI. We want to make this:
- scalable: Running all tests on one machine won't suffice for a large number of tests or when packages with lots of reverse dependencies (hello eglibc) are uploaded.
- support multiple architectures: This necessarily means that test jobs need to be distributed to multiple external machines that provide test running for a particular architecture.
- support multiple backends: Most tests run fine in schroots, but some tests need higher isolation such as containers or a full VM. QEMU with KVM is only available on some architectures; on the others we want to run all tests which don't need QEMU and skip the rest (adt-run does that if the testbed cannot satisfy the test-specified isolation restriction).
- support running tests on more than one platform: Most package tests don't make assumptions about particular hardware, but in some cases, like the kernel, X.org, Mir, Wayland etc., we want to run them on e. g. machines with an NVidia, AMD, or Intel card. Or some tests may want to assume a fully running graphical desktop environment, while most tests just want to run in a minimal debootstrap-like environment.
- robust: If a host that provides a worker, job distribution/queueing, or log/result storage hangs or dies, or new workers are added or removed, the whole system should still continue to function, to avoid blocking packages in -proposed (Ubuntu) or unstable (Debian).
- flexible: At the moment, debci just runs all pending jobs in a single big batch; we want e. g. Britney or command line tools to be able to request tests, while still being able to run all tests in a batch.
For that we need to strictly separate the test execution backends, the meta-data generation/UI/presentation, the data storage, and the policy (i. e. when to run which test), so that each of these can run on different machines, be redundant/have failover, and scale to our current needs.
We want to keep the log files and artifacts (called "results") of test runs around for some time. We want all test runs for the current distro development series, and maybe the most recent ones for all supported distro releases. As this is fairly precious, needs to be available at all times in order to not block package propagation, and needs to be written by all workers, this needs to be in a distributed no-SPOF network file system. We choose OpenStack Swift for that, as it is well supported, widely used, and packaged.
However, the interface to this is relatively small, and we might want to switch to other solutions in the future. Also, to support the current Debian setup we want to support having everything run on just one host, at least for the time being.
Thus the actual workers and the debci web UI will only work on local file systems. This is necessary for creating meta-data, which needs a consistent snapshot of the current state; doing that is generally not possible on a decentralized network file system. Also, serving results on a web front end generally needs a local file system for efficiency. There will be small separate scripts which upload a locally generated results directory into swift (for the workers) and download the most recent N results for all packages/architectures/releases from swift (for the web UI).
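The worker-side upload script could look roughly like this; a sketch in Python, assuming python-swiftclient (the conn argument would be a swiftclient.client.Connection) and a hypothetical container name "autopkgtest-results", neither of which is fixed by this document:

```python
import os

def results_to_objects(results_dir):
    """Map the files under a local results directory to
    (swift object name, local path) pairs; the object name is just the
    path relative to the results root, e. g.
    "trusty/amd64/libp/libpng/20140321_130412/log"."""
    objects = []
    for root, _dirs, files in os.walk(results_dir):
        for f in files:
            path = os.path.join(root, f)
            objects.append((os.path.relpath(path, results_dir), path))
    return sorted(objects)

def upload_results(results_dir, conn, container="autopkgtest-results"):
    """Upload a finished run into Swift.
    conn: a swiftclient.client.Connection (assumption)."""
    conn.put_container(container)  # no-op if it already exists
    for name, path in results_to_objects(results_dir):
        with open(path, "rb") as f:
            conn.put_object(container, name, contents=f)
```

The corresponding download script would iterate over the container listing, keep only the last N timestamps per package, and write them into the web UI's local data directory.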
The data should be structured so that it is efficient and easy to find a particular result, avoids name collisions (as we cannot assume global locking), and has sortable results so that we can consider the N most recent ones. So the directory structure will be /<release>/<architecture>/<prefix>/<source package>/<YYYYMMDD_HHMMSS>/
If a runner has "platform tags" (see below for details), these tags additionally become part of the directory structure.
prefix is the first letter (or the first four letters if the name starts with "lib") of the source package name, as usual for Debian-style archives.
For example: /trusty/amd64/libp/libpng/20140321_130412/ contains the log files.
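The layout above can be expressed as a small helper; a sketch, with the timestamp format YYYYMMDD_HHMMSS taken from the example:

```python
def pkg_prefix(source):
    # Debian-style archive prefix: "libpng" -> "libp", "hello" -> "h"
    return source[:4] if source.startswith("lib") else source[:1]

def run_directory(release, arch, source, timestamp):
    # timestamps sort lexicographically, so "the N most recent runs" is
    # just the last N entries of a sorted directory listing
    return "/%s/%s/%s/%s/%s" % (release, arch, pkg_prefix(source),
                                source, timestamp)
```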
debci will be built around the notion of AMQP "job queues" provided by RabbitMQ. The rabbitmq-server package pretty much works right out of the box, does not need much configuration, and does not have inordinately heavy dependencies. RabbitMQ provides failover with mirrored queues, so that we don't have a single point of failure here.
We want to use a reasonably fine-grained queue structure so that we can support workers that serve only certain releases, architectures, virtualization servers, or platforms; where "platform" is a free-form set of tags that worker nodes can have to describe their environment. For example:
sid-i386-schroot (no platform means "any", i. e. usually a minimal debootstrap-like environment)
trusty-amd64-null-desktop_nvidia (i. e. running on bare metal without virtualization)
Note: We may want to not support schroot runners at all, as LXC does not add much overhead, provides much better isolation, and is available on all Linux architectures. But we want the design to be able to support them. We will have most workers serving the "simple" default environments (which can be brought up and down on the fly with juju or similar), and a few dedicated boxes which can serve the "trusty-amd64-null-desktop_nvidia" queues and similar.
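The queue naming in the examples above can be captured in one function; a sketch, where joining release, architecture, backend, and platform tags with "-" is an assumption matching the examples, not a fixed convention:

```python
def queue_name(release, arch, backend, platform_tags=()):
    """Build an AMQP queue name, e. g.
    queue_name("sid", "i386", "schroot") -> "sid-i386-schroot"
    queue_name("trusty", "amd64", "null", ["desktop_nvidia"])
        -> "trusty-amd64-null-desktop_nvidia"
    Tags are sorted so that requester and worker always agree on the name."""
    return "-".join([release, arch, backend] + sorted(platform_tags))
```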
The requester (e. g. the debci batch runner or britney) downloads the test metadata (debian/tests/control) for a particular package, and decides which queues to put it in:
- release: britney knows that, and the batch runner will need this as an argument/configuration
- architecture: the set of architectures to run a test on will be in a configuration file read by britney or debci
- virt server: requester looks at control's Restrictions (e. g. isolation-container or isolation-machine) and uses that runner; if not specified, use the "cheapest" runner available (schroot or LXC)
- platform tags: None by default (for the vast majority of packages); there needs to be a manually maintained mapping of source packages to these tags, e. g. "xorg-server: desktop_amd desktop_nvidia desktop_intel")
The queue request then just consists of the source package name, which will be put in all queues that came out of above list.
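The requester's decision logic above can be sketched as follows. The PLATFORM_TAGS mapping, the choice of LXC as the "cheapest" runner, and mapping isolation-machine to a hypothetical "qemu" backend are illustrative assumptions, not decisions made by this document:

```python
# manually maintained source package -> platform tags mapping (hypothetical)
PLATFORM_TAGS = {
    "xorg-server": ["desktop_amd", "desktop_nvidia", "desktop_intel"],
}

def pick_backend(restrictions):
    """Choose a virt server from debian/tests/control Restrictions."""
    if "isolation-machine" in restrictions:
        return "qemu"   # assumption: a QEMU/KVM runner provides machine isolation
    if "isolation-container" in restrictions:
        return "lxc"
    return "lxc"        # "cheapest" available runner (schroot would also do)

def queues_for(source, release, architectures, restrictions=()):
    """All queues a test request for `source` should be put into."""
    backend = pick_backend(restrictions)
    tags = PLATFORM_TAGS.get(source, [None])  # None: default minimal platform
    queues = []
    for arch in architectures:
        for tag in tags:
            parts = [release, arch, backend] + ([tag] if tag else [])
            queues.append("-".join(parts))
    return queues
```

The request body itself then only needs to carry the source package name, published once per queue in the returned list.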
A worker node knows its supported architecture(s), the ADT schroots/containers/VMs it has available and their releases, or that it is a dedicated machine ("null runner") with a set of tags (e. g. "desktop" and "nvidia"). From that it determines the corresponding AMQP queues and subscribes to them. Once it receives a test request, it runs it with the appropriate adt-run command, stores the log into the distributed file system, and then ACKs the AMQP request.
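A minimal sketch of that worker loop, assuming the pika RabbitMQ client (debci could use any AMQP library); run_test and upload_results are stubs for "run adt-run" and "store into the distributed file system":

```python
try:
    import pika          # only needed when actually talking to a broker
except ImportError:
    pika = None

def worker_queues(architectures, backends_releases):
    """Queues this node subscribes to, from its local capabilities.
    backends_releases: e. g. {"lxc": ["sid", "trusty"]}"""
    queues = []
    for arch in architectures:
        for backend, releases in sorted(backends_releases.items()):
            for release in releases:
                queues.append("-".join([release, arch, backend]))
    return queues

def run_test(source):            # placeholder: invoke adt-run
    raise NotImplementedError

def upload_results(results_dir):  # placeholder: copy results into Swift
    raise NotImplementedError

def worker_loop(queues, host="localhost"):
    conn = pika.BlockingConnection(pika.ConnectionParameters(host))
    ch = conn.channel()
    ch.basic_qos(prefetch_count=1)   # take only one test request at a time

    def on_request(channel, method, properties, body):
        results = run_test(body.decode())
        upload_results(results)
        # ack only after the log is safely stored; a crash before this point
        # makes the broker re-queue the request (see the failure cases below)
        channel.basic_ack(delivery_tag=method.delivery_tag)

    for q in queues:
        ch.queue_declare(queue=q, durable=True)
        ch.basic_consume(queue=q, on_message_callback=on_request)
    ch.start_consuming()
```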
adt-run has built-in timeouts for everything, but sometimes they fail: we had cases where a heavily loaded host running several LXC or QEMU test beds in parallel encountered a hardware or kernel error and just froze. RabbitMQ has no timeout for message acks (i. e. messages can be "held" by a worker node indefinitely), so we need an extra periodic liveness check. It is unknown whether the TCP connection that the Rabbit server holds to the connected client will always eventually time out even in the absence of a TCP RST from the client; but even if that's the case, it seems like a good idea to have an extra safety net on top of the local adt-run timeouts.
The health checking controller node sends out a ping-like message to a fanout "healthcheck" AMQP queue, with a unique correlation ID (the current timestamp should do) and a reply queue "healthcheck_reply". The healthcheck monitor on all client nodes is expected to check whether the main worker process is running and whether there are any stuck adt-run processes (i. e. older than e. g. 12 hours), and if there aren't any, reply with its host name. If the whole node is down or the healthcheck monitor itself has crashed, it won't receive the message and thus won't send a reply either. The controller determines the set of "expected" workers from the shared file system, i. e. which worker nodes have submitted results in the past N days. Any worker which does not respond on the reply queue within 30 seconds gets a notification created for it (this is the AMQP RPC pattern).
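The controller side of that RPC pattern could look like this; a sketch assuming pika, with the exchange and reply queue names taken from the text:

```python
import time
try:
    import pika          # only needed when actually talking to a broker
except ImportError:
    pika = None

def missing_workers(expected, replies):
    """Workers that did not answer the ping within the deadline and
    therefore get a notification created for them."""
    return sorted(set(expected) - set(replies))

def send_healthcheck(channel):
    """Publish a ping to the fanout "healthcheck" exchange; the current
    timestamp serves as the correlation ID."""
    corr_id = str(int(time.time()))
    channel.exchange_declare(exchange="healthcheck", exchange_type="fanout")
    channel.queue_declare(queue="healthcheck_reply")
    channel.basic_publish(
        exchange="healthcheck", routing_key="",
        properties=pika.BasicProperties(correlation_id=corr_id,
                                        reply_to="healthcheck_reply"),
        body=b"ping")
    return corr_id

def collect_replies(channel, corr_id, timeout=30):
    """Drain healthcheck_reply for `timeout` seconds; return the host names
    of workers that answered this ping round (stale replies are dropped)."""
    replies, deadline = [], time.time() + timeout
    while time.time() < deadline:
        method, props, body = channel.basic_get("healthcheck_reply")
        if method is None:
            time.sleep(1)
            continue
        if props.correlation_id == corr_id:
            replies.append(body.decode())
        channel.basic_ack(method.delivery_tag)
    return replies
```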
The health checking controller also needs to inspect all AMQP queues for messages which have not been consumed in 3 hours, and notify about them. This either means that there is no worker which could serve that queue, or that all workers for that type of test are busy, and thus identifies a bottleneck. Note that AMQP requests do not have a built-in time stamp, so one needs to be added to the message.
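Embedding the time stamp in the request and checking its age could be done like this; a sketch, where the JSON body layout ("package", "submit-time") is an assumption:

```python
import json, time

STALE_AFTER = 3 * 3600   # seconds; the "3 hours" threshold from the text

def make_request(source):
    """Serialize a test request; AMQP messages have no built-in time
    stamp, so we embed the submission time ourselves."""
    return json.dumps({"package": source, "submit-time": time.time()})

def stale(body, now=None):
    """True if this queued request has been waiting longer than the
    threshold, i. e. no worker could or would consume it."""
    now = time.time() if now is None else now
    return now - json.loads(body)["submit-time"] > STALE_AFTER
```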
In such a distributed system there are a lot of things which can go wrong. This section analyzes failure scenarios and how to handle them. We ignore failures of single swift nodes, as a proper swift setup already includes multiple redundant storage nodes.
Worker node problems
worker node crashes or is taken offline: If the worker wasn't currently running a test (i. e. there is no message which it has taken but not yet acked), there is no problem at all, and further requests will just be taken by other nodes. If there is a running test, there will never be an ack. If it crashes "properly", i. e. closing its AMQP TCP connection, the AMQP broker will immediately requeue that message.
worker node freezes: A worker which does not close its AMQP TCP connection, but does not make any actual progress (kernel crash, hw failure, huge load, etc.) will get picked up by the health check controller.
test hangs indefinitely:
- At the first stage, adt-run has timeouts for every operation (even the simplest ones) on the test bed (which is usually an LXC container or a QEMU VM). After the timeout for copying, building, or running the test is reached, the testbed is torn down and the test fails with an error code. At least in the first stage of the implementation it is recommended to rely on manually retrying tests after inspecting the failure reason; at a later stage some auto-retry might be considered once we have a machine-parseable reason (e. g. auto-retry on apt-get update hash sum mismatches).
- If that timeout fails, the health check monitor will flag the node for manual inspection/fixing.
- During that time, the node won't accept any other request (as that only happens after acking the current one), thus other requests get scheduled onto different workers.
test setup fails: Tests can fail with an "error" other than PASS/FAIL, if e. g. the test bed fails to start up. Eternal hangs are already covered above, otherwise adt-run will return with exit code 16 which should be considered an error case and cause a notification.
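A worker could classify adt-run results roughly like this. Only exit code 16 (error, causing a notification) is taken from this document; treating 0 as a pass and 4/6 as test failures follows the adt-run manual and should be verified against the deployed autopkgtest version:

```python
def classify(exit_code):
    """Map an adt-run exit code to a result category."""
    if exit_code == 0:
        return "pass"
    if exit_code in (4, 6):   # at least one test failed (per adt-run(1))
        return "fail"
    # 16: testbed setup failure; everything else is also treated as an
    # error case and should create a notification for manual inspection
    return "error"
```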
main AMQP broker goes down: A RabbitMQ slave will take over, become the new master, and have the same queue status. However, this will cause listening clients to get an EPIPE socket.error exception. A worker node which is currently running a test will get that exception when trying to acknowledge the request, and thus can't ack it. But as the rabbit connection got disconnected at the same time, the request for that test is still in the queue and will just be given to the next worker. So worker nodes need to intercept that error and try to reconnect to the list of AMQP brokers.
all AMQP brokers go down: Britney or another requester won't be able to connect to any broker any more, and thus won't be able to issue test requests. It just needs to keep trying until that works again, i. e. until the brokers got fixed. No data is lost, but packages won't get promoted.
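The reconnection behaviour described in the last two scenarios can be factored into one retry loop; a sketch where the connect callable is injected (e. g. lambda h: pika.BlockingConnection(pika.ConnectionParameters(h))) so the logic does not depend on a particular AMQP library:

```python
import time

def connect_any(brokers, connect, delay=10, sleep=time.sleep):
    """Try each configured broker in turn until one accepts the
    connection; if all are down, wait and retry forever.  Nothing is
    lost while waiting: requests stay queued on the broker side."""
    while True:
        for host in brokers:
            try:
                return connect(host)
            except Exception:
                continue     # this broker is down, try the next one
        sleep(delay)         # all brokers down: back off, then retry
```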
britney or CI request a test on an architecture or platform for which no workers are available: This gets picked up by the health monitor, as requests in these queues will never get consumed.
britney goes down: Britney is stateless, so whenever the problem is repaired it will just start again from cron and continue to request tests/look at acquired results.
debci development Tasks
Remove assumption that tests run on the local box with schroot
- lib/functions.sh accesses the schroot from outside → don't assume schroot; just download debci_suite's Sources.gz, cache it, and derive everything (sources, binaries, etc.) from that
- list-base-system / list-dependencies don't translate to other (e. g. qemu) or remote runners → recent autopkgtest already records the package version and all packages/versions in the testbed as an artifact; split that in autopkgtest itself and use these split files
- debci-setup creates local schroot, but this should be in backends/schroot/create-testbed
Change data directory structure
- keep one directory for each run, so that we can keep all artifacts (per-test stdout/err, summary, custom artifact files)
- separate locally generated data (*.json, status, etc.) from autopkgtest-generated data (log, artifacts) if possible, so that we can more easily rebuild/update/refresh them
- create migration script for existing data
Remove synchronous waiting on adt-run
- debci-test waits for adt-run and then writes .json files → all .json generation needs to move to generate-index
- debci-batch calls debci-test for all packages and then updates indexes → debci-batch does not work in a britney/CLI-controlled design anyway, so that's fine for now. But index generation needs to move to a cron job or daemon which runs it regularly, to pick up new log files from remote FSes
- debci-test has policy about when to run packages → with britney we need to run a package several times a day, and we sometimes want to retry tests; we should not put that policy into the worker logic; it should be in debci-batch only for the "batched run" use case
Move from local synchronous adt-run execution to asynchronous distributed jobs with worker nodes
- Separate controller and worker, do all job requests/accepts through AMQP queues with RabbitMQ
Implement health checking
- Integrate it into generate-index
- Present it in the web UI
Support distributed file system
- Should support just one local file system for the easiest case
- Workers store their output into a network file system. If that's currently down, cache it locally and flush it on the next run
- Controller retrieves worker output from network fs. Regularly sync the data (last N runs only) from swift into the local file system, and generate metadata and serve web from that.
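The "cache locally and flush on the next run" behaviour for workers could be sketched like this; the spool path is a hypothetical default, and the upload callable stands in for the Swift upload step:

```python
import os, shutil

def store_results(results_dir, upload, spool="/var/cache/debci/spool"):
    """Store one finished run.  upload: callable(dir) that raises when
    the network file system is down (e. g. the Swift upload helper).
    Returns True if the run was uploaded, False if it was spooled locally.
    The spool path is an illustrative assumption."""
    # first try to flush runs cached by earlier failed uploads
    if os.path.isdir(spool):
        for run in sorted(os.listdir(spool)):
            path = os.path.join(spool, run)
            try:
                upload(path)
            except Exception:
                break            # still down; keep the cache for next time
            shutil.rmtree(path)
    try:
        upload(results_dir)
        return True
    except Exception:
        # network FS down: park the run in the local spool
        os.makedirs(spool, exist_ok=True)
        shutil.move(results_dir,
                    os.path.join(spool, os.path.basename(results_dir)))
        return False
```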
UI improvements
- add link to test output directory, to see all artifacts
- parse run duration from adt log instead of measuring the test-package time, as that only makes sense for local runners
- add per-architecture view
Configuration improvements
- should have a config/debci that gets sourced by lib/environment.sh to override default variable values without having to change git-tracked source
- add a config option for the mirror; use it for debootstrap and the local download of debci_suite's Sources.gz.