Remote Direct Memory Access (RDMA) is a computer networking technology, usually implemented over high-speed, low-latency networks (also known as fabrics), that allows direct access to a remote host's memory, dramatically reducing latency and CPU overhead.
Use of RDMA usually requires specialized networking hardware that implements InfiniBand, Omni-Path, RoCE or iWARP protocols. Soft-RoCE provides RDMA features over a standard Ethernet NIC.
Upper-layer protocols (ULP) in the kernel implement RDMA-accelerated services such as IP (e.g. IPoIB) and storage (e.g. iSER, SRP). Applications are not required to be RDMA-aware in order to benefit from these kernel-provided RDMA services.
Application software may be RDMA-aware, either through use of an RDMA API (e.g. libibverbs, libfabric) or an RDMA-aware framework (e.g. openmpi). These applications benefit the most from a network that implements RDMA.
This document provides general guidance for administrators configuring RDMA on the Debian operating system. A full explanation of the benefits and implementation of RDMA is beyond the scope of this discussion.
Debian RDMA Packages
Kernel support for RDMA is maintained by the DebianKernel team. The kernel provides drivers for many models of RDMA hardware, as well as modules to provide ULP support.
Core user space packages for RDMA are maintained by Debian HPC.
Historically, the upstream source for the core RDMA packages has been Open Fabrics Enterprise Distribution (OFED). More recently, a large portion of this work has migrated to the rdma-core project.
Non-Free or Alternative RDMA Stacks
Vendors may provide their own distribution of hardware drivers and user space components that are tailored to their particular hardware. Some tasks, such as updating firmware, may require use of the vendor's software.
Also, a particular hardware component or feature may require use of the vendor's drivers and/or libraries. Please refer to the hardware manufacturer's documentation, if necessary.
Installation
RDMA support is included in the kernel and user space packages provided by the contrib and main components of the Debian release repositories.
Kernel Modules
The booted kernel's RDMA modules are found at the following locations under /lib/modules/`uname -r`:
- kernel/drivers/infiniband/hw/ - hardware device drivers
- kernel/drivers/infiniband/sw/ - software drivers (e.g. Soft-RoCE)
- kernel/drivers/infiniband/ulp/ - ULP modules
- kernel/drivers/staging/ - new drivers/modules, may be RDMA-related.
- updates - drivers/modules provided outside of the linux-image package, may be RDMA-related. Typically, these are vendor-provided.
Use of an alternative kernel or additional drivers may be required if the host's hardware is not supported by the modules included with the kernel (see Troubleshooting).
Use /sbin/modinfo for any given kernel module to view its documentation and configuration options.
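For example, the following commands list the hardware drivers shipped with the booted kernel and show the documentation for one of them (mlx5_ib is used purely as an illustration; substitute the module name for the installed hardware):

$ ls /lib/modules/$(uname -r)/kernel/drivers/infiniband/hw/
$ /sbin/modinfo mlx5_ib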
rdma-core
The rdma-core package configures the RDMA stack on Debian. It is possible to configure the operating system without use of rdma-core, but that is beyond the scope of this document.
rdma-core is in the standard Debian repositories for buster and later releases. Installation of rdma-core on Debian stretch requires use of the stretch-backports repository. For information on stretch-backports, please see the Debian Backports documentation.
For stretch:
sudo apt-get -t stretch-backports install rdma-core
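For buster and later releases, the package can be installed directly from the standard repositories:

sudo apt-get install rdma-core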
RDMA User Space Support
This step installs user space libraries and diagnostic applications that will allow testing of RDMA functionality.
For stretch:
sudo apt-get -t stretch-backports install libibverbs1 librdmacm1 \
  libibmad5 libibumad3 ibverbs-providers rdmacm-utils \
  infiniband-diags libfabric1 ibverbs-utils
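For buster and later releases, the same packages should be installable directly from the standard repositories, without the backports option:

sudo apt-get install libibverbs1 librdmacm1 \
  libibmad5 libibumad3 ibverbs-providers rdmacm-utils \
  infiniband-diags libfabric1 ibverbs-utils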
InfiniBand Subnet Manager
For Omni-Path installations, see Fabric Manager in the next section.
An InfiniBand network requires a subnet manager (SM) to orchestrate communications between hosts. There may be one or more SMs on any given network. If the Debian host needs to execute an SM, install opensm.
For stretch:
sudo apt-get -t stretch-backports install opensm
The opensm service should automatically start after installation and be enabled to start after system reboot.
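The service state can be checked with systemd; if the service was not started or enabled automatically, it can be started and enabled manually:

$ systemctl status opensm
$ sudo systemctl enable --now opensm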
opensm will provide basic management functionality with its stock configuration. Please refer to the opensm documentation for more information about its configuration.
Omni-Path Fabric Manager
An Omni-Path network requires a fabric manager (FM) to be present on the network, which provides functionality similar to that of the InfiniBand SM.
Note that an InfiniBand SM cannot be used to manage an Omni-Path network, and an FM cannot manage an InfiniBand network. Also, an Omni-Path FM cannot execute on the same host as an InfiniBand SM.
Debian does not currently provide a package for the Omni-Path FM. The options are:
- Execute an FM on the fabric switch, if supported.
- Execute an FM on a host that provides support for it.
- Install a vendor-provided FM package on the Debian host.
Please refer to the manufacturer's documentation regarding any of these options.
Initial Testing
Verify Loaded Kernel Modules
After installation of rdma-core, RDMA user space support and an SM or FM (if required), reboot the host.
After reboot, verify that the RDMA modules have been loaded, using lsmod:
$ lsmod | grep '\(^ib\|^rdma\)'
rdma_ucm               24576  0
ib_uverbs              65536  1 rdma_ucm
ib_iser                49152  0
rdma_cm                57344  3 ib_iser,rpcrdma,rdma_ucm
ib_umad                24576  0
ib_ipoib              114688  0
ib_cm                  45056  2 rdma_cm,ib_ipoib
rdmavt                 57344  1 hfi1
ib_core               208896  11 ib_iser,ib_cm,rdma_cm,ib_umad,ib_uverbs,rpcrdma,ib_ipoib,iw_cm,rdmavt,rdma_ucm,hfi1
The list may be different, depending upon the hardware that is installed. The example output is for a host with Omni-Path hardware. Note that ib_core is referred to by hfi1, the Omni-Path hardware driver. Different hardware would result in a different ib_core driver reference such as mlx4_ib, iw_cxgb3, etc.
The modules that should appear for any host, assuming a default configuration of rdma-core, are:
- rdma_ucm
- ib_uverbs
- ib_iser
- rdma_cm
- ib_umad
- ib_ipoib
- ib_cm
- ib_core
If these modules are not listed, that is likely an indication that the driver for the RDMA hardware did not properly load. See Troubleshooting.
Verify Port is Active
Use /usr/sbin/ibstat to verify that the hardware's port is active. The output will depend upon the type of hardware installed and network configuration. There may be multiple devices listed and/or multiple ports. The important things to note are:
- There is a device listed for the expected hardware.
- The expected port's State is Active. If the state is Polling, then wait a few seconds and try again.
- The expected port's SM lid has a non-zero value. If this is not the case, then there is a problem with the SM or FM.
$ /usr/sbin/ibstat
CA 'hfi1_0'
        CA type:
        Number of ports: 1
        Firmware version: 0.35
        Hardware version: 10
        Node GUID: 0x0011750101671ddd
        System image GUID: 0x0011750101671ddd
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 9
                LMC: 0
                SM lid: 1
                Capability mask: 0x00410020
                Port GUID: 0x0011750101671ddd
                Link layer: InfiniBand
RDMA Ping
This step verifies that basic RDMA communications work between applications executed upon different hosts.
Note: The UMAD device has default permissions 600. To enable non-root access to UMAD, execute sudo chmod go+rw /dev/infiniband/umad0. To make the configuration persistent, add the following configuration to /lib/udev/rules.d/90-rdma-umad.rules:
KERNEL=="umad*", NAME="infiniband/%k", MODE="0666"
Use ibping to test RDMA. Replace GUID with the Port GUID from ibstat output. This example shows how to execute the server and client on one host. Connectivity between each host should also be verified by executing the client and server on different hosts.
# launch server in background
$ /usr/sbin/ibping -S &

# launch client
$ /usr/sbin/ibping -G 0xGUID
Pong from myhost.(none) (Lid 9): time 0.304 ms
Pong from myhost.(none) (Lid 9): time 0.299 ms
Pong from myhost.(none) (Lid 9): time 0.305 ms
Configure IPoIB
IPoIB provides IP communications layered over the RDMA hardware's network. The ib_ipoib module must be loaded to enable this functionality.
Execute ip link. A listing for interface ib0 should appear, and the state should be DOWN. If the state is UP, that indicates IPoIB has already been configured for ib0.
IPoIB interfaces are configured in the same manner as other IP interfaces on Debian. Use ip to assign an address, in this case 10.20.0.34. The IPoIB network address should be different from any other network address configured for the host.
sudo ip address add 10.20.0.34/24 dev ib0
The IP address should now respond to pings. If there are other hosts configured with IPoIB, each interface's addresses should also be pingable.
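For example, using the address configured above and a hypothetical remote IPoIB address of 10.20.0.35:

$ ping -c 3 10.20.0.34
$ ping -c 3 10.20.0.35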
To make the address persistent, configure ib0 in /etc/network/interfaces.
auto ib0
iface ib0 inet static
    address 10.20.0.34
    netmask 255.255.255.0
    broadcast 10.20.0.255
Verify libibverbs
This step verifies that libibverbs has an appropriate provider for the installed hardware. libibverbs provides a standardized API for user space applications to interface with RDMA hardware. The libibverbs provider is responsible for interfacing with the specific hardware driver in the kernel.
Use ibv_devices to view the list of devices recognized by libibverbs. There should be an entry for the expected device, with a device name and node GUID matching the output of ibstat:
$ ibv_devices
    device                 node GUID
    ------              ----------------
    hfi1_0              0011750101671ddd
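ibv_devinfo (also provided by ibverbs-utils) prints additional detail for each device recognized by libibverbs, including the port state as seen through the verbs interface, and can be useful for further verification:

$ ibv_devinfo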
libibverbs Ping
This step verifies that applications using libibverbs can communicate when executing upon different hosts.
On the server host:
$ ibv_rc_pingpong
On the client host, replace SERVER_IP with the server host's IP address:
$ ibv_rc_pingpong SERVER_IP
Each process should produce output similar to:
  local address:  LID 0x0009, QPN 0x00000c, PSN 0x1417e9, GID ::
  remote address: LID 0x0009, QPN 0x00000a, PSN 0x4443f6, GID ::
8192000 bytes in 0.01 seconds = 7202.55 Mbit/sec
1000 iters in 0.01 seconds = 9.10 usec/iter
Troubleshooting
Failure to Load Hardware Driver
If the hardware device's driver cannot be loaded, review the output of dmesg and look for messages that start with the driver's name (e.g. mlx5_ib).
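For example, with mlx5 hardware (substitute the driver name for the installed hardware):

$ dmesg | grep -i mlx5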
There should be messages that indicate why the driver failed to load. If firmware failed to load, installation of non-free or vendor-supplied firmware may be required.
Other firmware messages may indicate that the hardware's onboard firmware needs to be updated or changed. Please refer to the manufacturer's documentation for this procedure.
While the latest release of Debian provides recent versions of hardware drivers from upstream, it is also possible that the hardware revision requires a newer or proprietary driver. Check apt for kernel releases newer than the one currently booted on the host, or review the manufacturer's documentation for driver support.
Non-Free Firmware
The hardware may require use of firmware provided by the Debian non-free component. More information about Debian firmware and non-free is available at Firmware.
Configure apt to include the non-free component and execute:
$ sudo apt-get update
$ sudo apt-get install firmware-linux-nonfree
After rebooting the host, if the hardware device is still not functioning correctly, it may require firmware provided by the vendor. Please refer to the manufacturer's documentation.
IPoIB Cannot Communicate Remotely
If a host's local ib0 interface is pingable but a remote host's IPoIB interface is unreachable via IP, one of the following may be at fault:
- Improper IP configuration on the local or remote host. This includes the network IP address, host IP address, netmask and routing. This would be resolved as with any IP interface (see the example check after this list).
- No, down, or misconfigured SM/FM. If required, an SM/FM must be functional and configured to allow IPoIB communications. This is only relevant to InfiniBand and Omni-Path networks.
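To help rule out the first cause, review the local interface address and routing (the interface name and addresses follow the earlier example; adjust as needed):

$ ip address show ib0
$ ip route show dev ib0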
Missing or Incompatible libibverbs Provider
The ibverbs-providers package includes libibverbs providers for many popular RDMA devices. If ibstat shows a device that is not listed in the output of ibv_devices, either the provider is missing or it is incompatible.
/etc/libibverbs.d contains a .driver file for each libibverbs provider. There should be a driver file with a name that matches the hardware in question (e.g. mlx5.driver for an mlx5 card).
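For example, to list the installed provider driver files:

$ ls /etc/libibverbs.d/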
If there is no suitable driver file, that indicates there is no libibverbs provider for the hardware. Refer to the manufacturer's documentation for obtaining and installing the libibverbs provider.
If the appropriate libibverbs driver file exists but the device is still not listed, the provider may be improperly installed. dpkg --verify ibverbs-providers may be used to verify the installation of ibverbs-providers, as shown below.
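The command prints nothing when the installed files match the package:

$ dpkg --verify ibverbs-providers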
If the provider is properly installed, the failure may indicate that the provider is incompatible with either libibverbs or the kernel's hardware driver.
Additional RDMA Packages
The following RDMA-aware packages and projects are useful for adding additional functionality to a host configured for RDMA:
- Debian HPC Packages - packages maintained by Debian HPC.
- glusterfs - high performance, clustered file system.
- ibacm - address and route resolution services for InfiniBand. Required for some ULPs (e.g. SRP) and applications. Not used with iWARP.
- Lustre - high performance, clustered, parallel file system, often used for checkpointing MPI jobs.
- OpenMPI - middleware for parallel execution of jobs across a cluster of servers.
- perftest - performance tests for libibverbs.
- qperf - measure network bandwidth and latency for sockets and RDMA.
- srptools - utilities for SRP administration.
See also
- Infiniband HOWTO - legacy Debian documentation. Much of it is obsolete, but it does provide additional information.
- Open Fabrics Enterprise Distribution - historical upstream for many RDMA packages.
- rdma-core - current upstream for many RDMA packages.