Remote Direct Memory Access (RDMA) is a computer networking technology usually implemented over high-speed, low-latency networks (aka fabrics) which allows for secure, direct access to a remote host's memory, dramatically reducing latency and CPU overhead.

Use of RDMA usually requires specialized networking hardware that implements InfiniBand, Omni-Path, RoCE or iWARP protocols. Soft-RoCE provides RDMA features on a standard Ethernet NIC.

Upper-layer protocols (ULP) in the kernel allow for RDMA-accelerated services such as TCP/IP (e.g. IPoIB) and storage (e.g. iSER, SRP).

Application software may be RDMA-aware, either through use of an RDMA API (e.g. libibverbs, libfabric) or an RDMA-aware framework (e.g. Open MPI).

This document serves to provide general guidance for an administrator's configuration of RDMA on the Debian operating system. A full explanation of the benefits and implementation of RDMA is beyond the scope of this discussion.

Debian RDMA Packages

Kernel support for RDMA is maintained by the Debian Kernel team. The kernel provides drivers for many RDMA hardware vendors, as well as modules to provide ULP support.

Core user space packages for RDMA are maintained by Debian HPC.

Historically, the upstream source for many RDMA packages has been the Open Fabrics Enterprise Distribution (OFED). More recently, a large portion of this work has migrated to rdma-core.

Non-Free or Alternative RDMA Stacks

Vendors may provide their own distribution of hardware drivers and user space components that are tailored to their particular hardware. Some tasks, such as updating firmware, may require use of the vendor's software.

A particular hardware component or feature may require use of the vendor's drivers and/or libraries. Please refer to the hardware manufacturer's documentation, if necessary.

Installation

RDMA support is included in the kernel and user space packages provided by the Debian contrib and main repositories.

Kernel Modules

The booted kernel's RDMA modules are found at the following locations under /lib/modules/`uname -r`:

Use of an alternative kernel or additional drivers may be required if the host's hardware is not supported by the modules included with the kernel (see Troubleshooting).

Use /sbin/modinfo for any given kernel module to view its configuration options.

rdma-core

rdma-core is a package that assists in the configuration of core RDMA functionality on Debian. It is possible to configure the operating system without rdma-core, but that is beyond the scope of this document.

Installation of rdma-core on Debian stretch requires use of the stretch-backports repository. rdma-core is in the standard Debian repositories for buster and later releases. For information on stretch-backports, please see the Debian Backports documentation.

For stretch:

 sudo apt-get -t stretch-backports install rdma-core 

RDMA User Space Support

This step installs user space libraries and diagnostic applications that will allow testing of RDMA functionality.

For stretch:

sudo apt-get -t stretch-backports install libibverbs1 librdmacm1 \
libibmad5 libibumad3 ibverbs-providers rdmacm-utils \
infiniband-diags libfabric1 ibverbs-utils

InfiniBand Subnet Manager

For Omni-Path installations, see Fabric Manager in the next section.

An InfiniBand network requires a subnet manager (SM) to orchestrate communications between hosts. There may be one or more SMs on any given network. If the Debian host needs to execute an SM, install opensm.

 sudo apt-get -t stretch-backports install opensm 

The opensm service should automatically start after installation and be enabled to start after system reboot.

opensm will provide basic management functionality with its stock configuration. Please refer to the opensm documentation for more information about its configuration.

Omni-Path Fabric Manager

An Omni-Path network requires a fabric manager (FM) to be present on the network, which provides functionality similar to that of the InfiniBand SM.

Note that an InfiniBand SM cannot be used to manage an Omni-Path network, and an FM cannot manage an InfiniBand network. Also, an Omni-Path FM cannot execute on the same host as an InfiniBand SM.

Debian does not currently provide a package for the Omni-Path FM. The options are:

Please refer to the manufacturer's documentation regarding any of these options.

Initial Testing

Verify Loaded Kernel Modules

After installation of rdma-core, RDMA user space support and an SM or FM (if required), the host should be rebooted.

First, verify that the RDMA modules have been loaded, using lsmod:

$ lsmod | grep '\(^ib\|^rdma\)'
rdma_ucm               24576  0
ib_uverbs              65536  1 rdma_ucm
ib_iser                49152  0
rdma_cm                57344  3 ib_iser,rpcrdma,rdma_ucm
ib_umad                24576  0
ib_ipoib              114688  0
ib_cm                  45056  2 rdma_cm,ib_ipoib
rdmavt                 57344  1 hfi1
ib_core               208896  11 ib_iser,ib_cm,rdma_cm,ib_umad,ib_uverbs,rpcrdma,ib_ipoib,iw_cm,rdmavt,rdma_ucm,hfi1

The list may differ, depending upon the hardware that is installed. The example output is for a host with Omni-Path hardware. Note that hfi1, the Omni-Path hardware driver, appears in the last column as one of the modules that use ib_core. Different hardware would show a different hardware driver there, such as mlx4_ib, iw_cxgb3, etc.

The modules that should appear for any host, assuming a default configuration of rdma-core, are:

If these modules are not listed, that is likely an indication that the driver for the RDMA hardware did not properly load. See Troubleshooting.
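The check can also be scripted. The following is a minimal sketch, not part of rdma-core: check_rdma_modules is a hypothetical helper that reads lsmod output on standard input and reports any named module that is absent. The module names in the usage comment are taken from the sample output above and are an assumption; adjust them to match the host's configuration.

```shell
# Report which of the named modules are absent from `lsmod` output.
# Reads lsmod-style text on stdin; module names are given as arguments.
# check_rdma_modules is a hypothetical helper name.
check_rdma_modules() {
    awk -v wanted="$*" '
        BEGIN { n = split(wanted, names, " ") }
        { loaded[$1] = 1 }            # first column is the module name
        END {
            missing = 0
            for (i = 1; i <= n; i++) {
                if (!(names[i] in loaded)) {
                    print "missing: " names[i]
                    missing = 1
                }
            }
            exit missing              # non-zero if anything is missing
        }'
}

# Example (module names are assumptions based on the sample output):
#   lsmod | check_rdma_modules ib_core ib_uverbs ib_umad rdma_cm rdma_ucm
```

The function prints nothing and returns success when every named module is loaded, so it can be used directly in an if statement.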

Verify Port is Active

Use /usr/sbin/ibstat to verify that the hardware's port is active. The output will depend upon the type of hardware installed and network configuration. There may be multiple devices listed and/or multiple ports. The important things to note are:

  1. There is a device listed for the expected hardware.
  2. The expected port's State is Active. If the state is Polling, then wait a few seconds and try again.
  3. The expected port's SM lid has a non-zero value. If this is not the case, then there is a problem with the SM or FM.

$ /usr/sbin/ibstat
CA 'hfi1_0'
        CA type:
        Number of ports: 1
        Firmware version: 0.35
        Hardware version: 10
        Node GUID: 0x0011750101671ddd
        System image GUID: 0x0011750101671ddd
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 9
                LMC: 0
                SM lid: 1
                Capability mask: 0x00410020
                Port GUID: 0x0011750101671ddd
                Link layer: InfiniBand
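
The three points listed above can be checked automatically. This is a rough sketch that assumes ibstat output in the layout shown; check_ibstat_ports is a hypothetical helper name, and the OK/CHECK labels are invented for illustration.

```shell
# Summarise port state from `ibstat` output read on stdin.
# Prints one line per port and exits non-zero if any port is not Active
# or reports an SM lid of 0. check_ibstat_ports is a hypothetical helper.
check_ibstat_ports() {
    awk '
        /^CA / { gsub("\047", "", $2); ca = $2 }            # strip the quotes
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:$/, "", port) }
        /^[ \t]*State:/ { state = $2 }
        /^[ \t]*SM lid:/ {
            smlid = $3
            ok = (state == "Active" && smlid + 0 != 0)
            printf "%s port %s: state=%s sm_lid=%s %s\n",
                   ca, port, state, smlid, (ok ? "OK" : "CHECK")
            if (!ok) bad = 1
        }
        END { exit bad + 0 }'
}

# Example:
#   /usr/sbin/ibstat | check_ibstat_ports
```

A port in the Polling state, or one with an SM lid of 0, is marked CHECK and makes the function return a non-zero status.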

RDMA Ping

This step verifies that basic RDMA communications work between applications executed upon different hosts.

Note: The UMAD port has default permissions 600. To enable non-root access to UMAD, execute sudo chmod go+rw /dev/infiniband/umad0. To make the configuration persistent, add the following configuration to /lib/udev/rules.d/90-rdma-umad.rules:

 KERNEL=="umad*", NAME="infiniband/%k", MODE="0666" 

Use ibping to test RDMA. Replace GUID with the Node GUID from ibstat output. This example shows how to execute the server and client on one host. Connectivity between each host should also be verified by executing the client and server on different hosts.

# launch server in background
$ /usr/sbin/ibping -S &
# launch client 
$ /usr/sbin/ibping -G 0xGUID
Pong from myhost.(none) (Lid 9): time 0.304 ms
Pong from myhost.(none) (Lid 9): time 0.299 ms
Pong from myhost.(none) (Lid 9): time 0.305 ms
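
Rather than copying the GUID by hand, it can be extracted from the ibstat output. The sketch below assumes the ibstat layout shown earlier; first_guid is a hypothetical helper name. It defaults to the Node GUID as described above, and the optional label argument allows the Port GUID to be selected instead, which may be needed on devices where the two differ.

```shell
# Print the first GUID with the given label from `ibstat` output (stdin).
# $1 is the label to look for: "Node GUID" (default) or "Port GUID".
# first_guid is a hypothetical helper name.
first_guid() {
    awk -v label="${1:-Node GUID}:" 'index($0, label) { print $NF; exit }'
}

# Example (requires an `ibping -S` server to be running):
#   guid=$(/usr/sbin/ibstat | first_guid)
#   /usr/sbin/ibping -c 3 -G "$guid"
```

The -c option limits ibping to a fixed number of packets, which is convenient when the test is run from a script.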

Configure IPoIB

IPoIB provides IP communications (including TCP) over the RDMA hardware's network. The ib_ipoib module must be loaded to enable this functionality.

Execute ip link. A listing for interface ib0 should appear, and the state should be DOWN. If the state is UP, that indicates IPoIB has already been configured for ib0.
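
The state can also be read in a script. This is a small sketch that assumes the one-line output format of ip -o link show; link_state is a hypothetical helper name.

```shell
# Print the operational state of an interface from `ip -o link show` output.
# Reads stdin; the interface name (for example ib0) is the only argument.
# Exits non-zero if the interface is not listed.
# link_state is a hypothetical helper name.
link_state() {
    awk -v ifname="$1:" '
        $2 == ifname {
            for (i = 3; i < NF; i++)
                if ($i == "state") { print $(i + 1); found = 1 }
        }
        END { exit !found }'
}

# Example:
#   ip -o link show | link_state ib0
```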

IPoIB interfaces are configured in the same manner as other network interfaces on Debian. Use ip to assign an address, in this case 10.20.0.34. The IPoIB network address should be different from any other network address configured for the host.

 sudo ip address add 10.20.0.34/24 dev ib0 

The IP address should now respond to pings. If there are other hosts configured with IPoIB, their interface's addresses should also be pingable.

To make the address persistent, configure ib0 in /etc/network/interfaces.

auto ib0
iface ib0 inet static
  address 10.20.0.34
  netmask 255.255.255.0
  broadcast 10.20.0.255

Verify libibverbs

This step verifies that libibverbs has an appropriate provider for the installed hardware. libibverbs provides a standardized API for user space applications to interface with RDMA hardware. The libibverbs provider is responsible for interfacing with the specific hardware driver in the kernel.

Use ibv_devices to view the list of recognized libibverbs devices. There should be an entry for the expected device, with a device name and node GUID matching the output of ibstat:

$ ibv_devices
    device                 node GUID
    ------              ----------------
    hfi1_0              0011750101671ddd
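
Note that ibstat prints GUIDs with a 0x prefix while ibv_devices does not. The sketch below normalizes the ibv_devices output so the two can be compared directly; ibv_device_guids is a hypothetical helper name, and the two-line header is assumed from the sample output above.

```shell
# List "device guid" pairs from `ibv_devices` output read on stdin,
# skipping the two header lines and adding the 0x prefix used by ibstat.
# ibv_device_guids is a hypothetical helper name.
ibv_device_guids() {
    awk 'NR > 2 && NF >= 2 { print $1, "0x" $2 }'
}

# Example:
#   ibv_devices | ibv_device_guids
```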

libibverbs Ping

This step verifies that libibverbs applications executing upon different hosts can communicate with each other.

On the server host:

$ ibv_rc_pingpong

On the client host, replace SERVER_IP with the server host's IP address:

$ ibv_rc_pingpong SERVER_IP

Each process should produce output similar to:

  local address:  LID 0x0009, QPN 0x00000c, PSN 0x1417e9, GID ::
  remote address: LID 0x0009, QPN 0x00000a, PSN 0x4443f6, GID ::
8192000 bytes in 0.01 seconds = 7202.55 Mbit/sec
1000 iters in 0.01 seconds = 9.10 usec/iter
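
When the test is repeated across several hosts, it may help to reduce the output to its two result figures. A minimal sketch, assuming the output format shown above; pingpong_summary is a hypothetical helper name.

```shell
# Extract bandwidth and per-iteration latency from ibv_rc_pingpong output.
# Reads stdin; exits non-zero if either figure is missing.
# pingpong_summary is a hypothetical helper name.
pingpong_summary() {
    awk '
        /Mbit\/sec/  { bw = $(NF - 1) }    # value precedes the unit
        /usec\/iter/ { lat = $(NF - 1) }
        END {
            if (bw == "" || lat == "") exit 1
            printf "bandwidth=%s Mbit/sec latency=%s usec/iter\n", bw, lat
        }'
}

# Example (on the client host):
#   ibv_rc_pingpong SERVER_IP | pingpong_summary
```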

Troubleshooting

Failure to Load Hardware Driver

If the hardware device's driver cannot be loaded, review the output of dmesg and look for messages that start with the driver's name (e.g. mlx5_ib).

There should be messages that indicate why the driver failed to load. If firmware failed to load, installation of non-free or vendor-supplied firmware may be required.
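
One way to narrow down the kernel log is sketched below. driver_messages is a hypothetical helper name, and the keywords used to flag lines (firmware, fail, error) are an assumption about what is worth a closer look, not an exhaustive list.

```shell
# Show kernel log lines that mention a driver, flagging the ones that
# also mention firmware, failures or errors with a leading "!! ".
# Reads `dmesg` output on stdin; the driver name is the only argument.
# driver_messages is a hypothetical helper name.
driver_messages() {
    awk -v drv="$1" '
        index($0, drv) {
            flag = (tolower($0) ~ /firmware|fail|error/) ? "!! " : "   "
            print flag $0
        }'
}

# Example (reading the kernel log may require root):
#   sudo dmesg | driver_messages mlx5_ib
```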

Other firmware messages may indicate that the hardware's onboard firmware needs to be updated or changed. Please refer to the manufacturer's documentation for this procedure.

While the latest release of Debian provides recent versions of hardware drivers from upstream, there is also the possibility that the hardware revision requires a newer or proprietary driver. Look in apt for more recent kernel releases than what is currently booted on the host, or review the manufacturer's documentation for driver support.

Non-Free Firmware

The hardware may require use of firmware provided by the Debian non-free component. More information about Debian firmware and non-free is available on the Firmware page of the Debian wiki.

Configure apt to include the non-free component and execute:

$ sudo apt-get update
$ sudo apt-get install firmware-linux-nonfree
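
To confirm that the non-free component is actually enabled, the apt sources can be inspected. The following is a sketch only: has_nonfree is a hypothetical helper name, and it understands only the one-line sources.list format, not deb822-style .sources files.

```shell
# Succeed if any active "deb" line read on stdin lists the non-free
# component. Commented-out lines and deb-src lines are ignored.
# has_nonfree is a hypothetical helper name.
has_nonfree() {
    awk '
        $1 == "deb" {
            # components follow the URI and suite; starting at field 4
            # also copes with an option block such as [arch=amd64]
            for (i = 4; i <= NF; i++)
                if ($i == "non-free") found = 1
        }
        END { exit !found }'
}

# Example:
#   cat /etc/apt/sources.list /etc/apt/sources.list.d/*.list 2>/dev/null \
#       | has_nonfree && echo "non-free is enabled"
```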

After rebooting the host, if the hardware device is still not functioning correctly, it may require firmware provided by the vendor. Please refer to the manufacturer's documentation.

IPoIB Cannot Communicate Remotely

If a host's local ib0 interface is pingable but a remote host's IPoIB interface is unreachable via TCP, one of the following may be at fault:

Missing or Incompatible libibverbs Provider

The ibverbs-providers package includes libibverbs providers for many popular RDMA devices. If ibstat shows a device that is not listed in the output for ibv_devices, that indicates either a missing or incompatible provider.
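
The comparison can be scripted along these lines. This is a sketch under stated assumptions: missing_verbs_devices is a hypothetical helper name, the device names are expected one per line as printed by ibstat -l, and the ibv_devices output is expected to start with two header lines.

```shell
# Print device names that appear in `ibstat -l` output but not in
# `ibv_devices` output.
# $1: file holding `ibstat -l` output (one device name per line)
# $2: file holding `ibv_devices` output (two header lines, then devices)
# missing_verbs_devices is a hypothetical helper name.
missing_verbs_devices() {
    awk '
        FILENAME == ARGV[1] {                 # first file: ibv_devices
            if (FNR > 2 && NF >= 2) verbs[$1] = 1
            next
        }
        NF && !($1 in verbs) { print $1 }     # second file: ibstat -l
    ' "$2" "$1"
}

# Example:
#   /usr/sbin/ibstat -l > /tmp/cas.txt
#   ibv_devices > /tmp/verbs.txt
#   missing_verbs_devices /tmp/cas.txt /tmp/verbs.txt
```

Any name that is printed is known to the kernel but has no usable libibverbs provider.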

/etc/libibverbs.d contains a .driver file for each libibverbs provider. There should be a driver file with a name that matches the hardware in question (e.g. mlx5.driver for an mlx5 card).

If there is no suitable driver file, that indicates there is no libibverbs provider for the hardware. Refer to the manufacturer's documentation for obtaining and installing the libibverbs provider.

If the appropriate libibverbs driver file exists, that could indicate that the provider is improperly installed. dpkg --verify ibverbs-providers may be used to verify the installation of ibverbs-providers.

If the provider is properly installed, that may indicate the provider is incompatible with either libibverbs or the kernel's hardware driver.

See also