Debian ROCm CI: Worker telemetry
Description of the project: The Debian ROCm Team operates ci.rocm.debian.net, a CI environment in which the ROCm software stack's unit tests are run on various AMD GPU architectures.
Workers run tests on bare metal, in containers, or in QEMU virtual machines. In all cases, collecting telemetry on GPU and system health during tests would help maintain system health at runtime and enable higher-level analysis over time and across system setups.
Confirmed Mentor: Christian Kastner
How to contact the mentor: ckk@debian.org
Confirmed co-mentors: Cordell Bloor
Difficulty level: Medium
Project size: Medium (175h)
Deliverables of the project:
- Agents for collecting telemetry on bare metal, in containers, and in virtual machines (KVM)
- Daemon for receiving collected telemetry
Desirable skills:
- Experience with Linux-based operating systems
- Some experience with Python
What the intern will learn:
- Working with isolation technologies (containers, KVM)
- Visualization of system health data
Application tasks:
- Research and basic familiarization with container and KVM isolation
- Propose design for daemon<->agent communication
- Implement daemon
- Implement agents for bare metal, containers, VMs
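
As a starting point for the daemon<->agent design task, the following is a minimal sketch of one possible wire format: each telemetry sample is sent as a single newline-terminated JSON object. Everything in it is an assumption made for illustration, not part of the project's design: the function names (encode_sample, decode_samples, demo_roundtrip), the field names, the worker name "worker-1", and the metric "gpu_temp_c" are all hypothetical. The transport between agent and daemon is left open here; the demo uses a local socket pair only so that the round trip can be run on a single machine.

```python
import json
import socket
import time


def encode_sample(worker, isolation, metrics, timestamp=None):
    """Serialize one telemetry sample as a newline-terminated JSON line.

    All field names are hypothetical; the real schema is part of the
    design work the applicant is asked to propose.
    """
    sample = {
        "worker": worker,
        "isolation": isolation,  # e.g. "bare-metal", "container", or "vm"
        "timestamp": time.time() if timestamp is None else timestamp,
        "metrics": metrics,
    }
    return json.dumps(sample, sort_keys=True).encode("utf-8") + b"\n"


def decode_samples(data):
    """Parse newline-delimited JSON bytes into a list of sample dicts."""
    return [json.loads(line) for line in data.splitlines() if line.strip()]


def demo_roundtrip():
    """Send one sample from an 'agent' socket to a 'daemon' socket."""
    agent_sock, daemon_sock = socket.socketpair()
    with agent_sock, daemon_sock:
        # Agent side: send a single sample, then signal end of stream.
        agent_sock.sendall(
            encode_sample("worker-1", "container", {"gpu_temp_c": 61.0},
                          timestamp=0.0)
        )
        agent_sock.shutdown(socket.SHUT_WR)

        # Daemon side: read until the agent closes its sending half.
        chunks = []
        while True:
            chunk = daemon_sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return decode_samples(b"".join(chunks))


if __name__ == "__main__":
    for sample in demo_roundtrip():
        print(sample)
```

A line-oriented format like this keeps the agent small and makes the stream easy to inspect by hand, but a proposal may well prefer a different encoding or transport for containers and VMs.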