Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

RDMA/rw: Fall back to direct SGE on MR pool exhaustion

When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs()
produces no coalescing: each scatterlist page maps 1:1 to a DMA
entry, so sgt.nents equals the raw page count. A 1 MB transfer
yields 256 DMA entries. If that count exceeds the device's
max_sgl_rd threshold (an optimization hint from mlx5 firmware),
rdma_rw_io_needs_mr() steers the operation into the MR
registration path. Each such operation consumes one or more MRs
from a pool sized at max_rdma_ctxs -- roughly one MR per
concurrent context. Under write-intensive workloads that issue
many concurrent RDMA READs, the pool is rapidly exhausted,
ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns
-EAGAIN. Upper layer protocols treat this as a fatal DMA mapping
failure and tear down the connection.
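
The arithmetic in the paragraph above can be checked with a small user-space sketch. It is a simplified model, not the kernel's rdma_rw_io_needs_mr(): the function names are hypothetical, the 4 KB page size is an assumption behind the "256 entries" figure, and treating max_sgl_rd == 0 as "no hint from the device" is also an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of DMA entries for a transfer when the
 * mapping does no coalescing, i.e. one entry per page (rounded up).
 */
size_t dma_entries_uncoalesced(size_t xfer_bytes, size_t page_size)
{
	assert(page_size > 0);
	return (xfer_bytes + page_size - 1) / page_size;
}

/*
 * Hypothetical helper: simplified threshold test. In this model a
 * max_sgl_rd of 0 means the device supplied no hint, so nothing is
 * steered to the MR path on that basis.
 */
bool exceeds_sgl_rd_hint(size_t dma_nents, size_t max_sgl_rd)
{
	return max_sgl_rd != 0 && dma_nents > max_sgl_rd;
}
```

For example, dma_entries_uncoalesced(1024 * 1024, 4096) evaluates to 256, and any nonzero hint below 256 would steer that transfer toward MR registration in this model.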

The max_sgl_rd check is a performance optimization, not a
correctness requirement: the device can handle large SGE counts
via direct posting, just less efficiently than with MR
registration. When the MR pool cannot satisfy a request, falling
back to the direct SGE (map_wrs) path avoids the connection
reset while preserving the MR optimization for the common case
where pool resources are available.

Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from
rdma_rw_init_mr_wrs() triggers direct SGE posting instead of
propagating the error. iWARP devices, which mandate MR
registration for RDMA READs, and force_mr debug mode continue
to treat -EAGAIN as terminal.
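
The fallback rule described above can be modeled in user space as follows. This is a sketch under stated assumptions, not the kernel code: struct model_qp, model_init_mr(), model_ctx_init() and the path enum are hypothetical stand-ins, and the MR pool is reduced to a simple counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Which posting path the model ended up using (hypothetical). */
enum model_path { PATH_NONE, PATH_MR, PATH_DIRECT_SGE };

/* Hypothetical stand-in for the QP state that matters to the fallback. */
struct model_qp {
	int  mrs_free;	/* MRs remaining in the pool */
	bool is_iwarp;	/* iWARP requires MRs for RDMA READs */
	bool force_mr;	/* debug mode: always use MRs */
};

/* Stand-in for the MR path: take one MR or report pool exhaustion. */
int model_init_mr(struct model_qp *qp)
{
	if (qp->mrs_free <= 0)
		return -EAGAIN;
	qp->mrs_free--;
	return 1;
}

/*
 * Stand-in for the context init decision: a result from the MR path is
 * final unless it is -EAGAIN on a device that does not require MRs, in
 * which case the operation falls back to direct SGE posting.
 */
int model_ctx_init(struct model_qp *qp, bool needs_mr, enum model_path *used)
{
	int ret;

	assert(qp && used);
	*used = PATH_NONE;

	if (needs_mr) {
		ret = model_init_mr(qp);
		if (ret != -EAGAIN || qp->is_iwarp || qp->force_mr) {
			if (ret >= 0)
				*used = PATH_MR;
			return ret;
		}
	}

	*used = PATH_DIRECT_SGE;
	return 1;
}
```

With one MR in the pool, a first call takes the MR path and a second call falls back to direct SGE posting; the same second call on a model QP marked is_iwarp or force_mr returns -EAGAIN instead.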

Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260313194201.5818-2-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Authored by Chuck Lever and committed by Leon Romanovsky
00da250c ef3b0674

+21 -6
drivers/infiniband/core/rw.c
 	if (rdma_rw_io_needs_mr(qp->device, port_num, dir, sg_cnt)) {
 		ret = rdma_rw_init_mr_wrs(ctx, qp, port_num, sg, sg_cnt,
 				sg_offset, remote_addr, rkey, dir);
-	} else if (sg_cnt > 1) {
-		ret = rdma_rw_init_map_wrs(ctx, qp, sg, sg_cnt, sg_offset,
-				remote_addr, rkey, dir);
-	} else {
-		ret = rdma_rw_init_single_wr(ctx, qp, sg, sg_offset,
-				remote_addr, rkey, dir);
+		/*
+		 * If MR init succeeded or failed for a reason other
+		 * than pool exhaustion, that result is final.
+		 *
+		 * Pool exhaustion (-EAGAIN) from the max_sgl_rd
+		 * optimization is recoverable: fall back to
+		 * direct SGE posting. iWARP and force_mr require
+		 * MRs unconditionally, so -EAGAIN is terminal.
+		 */
+		if (ret != -EAGAIN ||
+		    rdma_protocol_iwarp(qp->device, port_num) ||
+		    unlikely(rdma_rw_force_mr))
+			goto out;
 	}
 
+	if (sg_cnt > 1)
+		ret = rdma_rw_init_map_wrs(ctx, qp, sg, sg_cnt, sg_offset,
+				remote_addr, rkey, dir);
+	else
+		ret = rdma_rw_init_single_wr(ctx, qp, sg, sg_offset,
+				remote_addr, rkey, dir);
+
+out:
 	if (ret < 0)
 		goto out_unmap_sg;
 	return ret;