RDMA/core: add rdma_rw_max_send_wr() helper for SQ sizing

svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the
number of rdma_rw contexts (ctxts). This value is used to allocate the
Send CQ and to initialize the sc_sq_avail credit pool.

However, when the device uses memory registration for RDMA operations,
rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three
per context to account for REG and INV work requests. The Send CQ and
credit pool remain sized for only one work request per context,
causing Send Queue exhaustion under heavy NFS WRITE workloads.
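
The size of the mismatch can be sketched in plain C. This is an illustration only: the function names and the sample values in the note after the block are hypothetical, and the sketch assumes every context needs the full three work requests.

```c
/* Hypothetical model of the sizing mismatch described above. */

/* Pre-patch sizing: one Send WR per rdma_rw context. */
static unsigned int old_sq_depth(unsigned int rq_depth, unsigned int ctxts)
{
	return rq_depth + ctxts;
}

/* Sizing when each context needs RDMA + REG + INV work requests. */
static unsigned int mr_sq_depth(unsigned int rq_depth, unsigned int ctxts)
{
	return rq_depth + 3 * ctxts;
}

/* Send Queue entries that the pre-patch credit pool does not cover. */
static unsigned int sq_shortfall(unsigned int rq_depth, unsigned int ctxts)
{
	return mr_sq_depth(rq_depth, ctxts) - old_sq_depth(rq_depth, ctxts);
}
```

With an assumed rq_depth of 128 and 64 contexts, the pre-patch depth is 192 while the registration case needs 320, a gap of two entries per context.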

Introduce rdma_rw_max_send_wr() to compute the actual number of Send
Queue entries required for a given number of rdma_rw contexts. Upper
layer protocols call this helper before creating a Queue Pair so that
their Send CQs and credit accounting match the QP's true capacity.

Update svc_rdma_accept() to use rdma_rw_max_send_wr() when computing
sc_sq_depth, ensuring the credit pool reflects the work requests
that rdma_rw_init_qp() will reserve.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Authored by Chuck Lever and committed by Leon Romanovsky 5 months ago
afcae7d7 bea28ac1

3 changed files, +46 -17:

drivers/infiniband/core/rw.c
include/rdma/rw.h
net/sunrpc/xprtrdma/svc_rdma_transport.c

+38 -15

drivers/infiniband/core/rw.c

···
 }
 EXPORT_SYMBOL(rdma_rw_mr_factor);
 
+/**
+ * rdma_rw_max_send_wr - compute max Send WRs needed for RDMA R/W contexts
+ * @dev: RDMA device
+ * @port_num: port number
+ * @max_rdma_ctxs: number of rdma_rw_ctx structures
+ * @create_flags: QP create flags (pass IB_QP_CREATE_INTEGRITY_EN if
+ *                data integrity will be enabled on the QP)
+ *
+ * Returns the total number of Send Queue entries needed for
+ * @max_rdma_ctxs. The result accounts for memory registration and
+ * invalidation work requests when the device requires them.
+ *
+ * ULPs use this to size Send Queues and Send CQs before creating a
+ * Queue Pair.
+ */
+unsigned int rdma_rw_max_send_wr(struct ib_device *dev, u32 port_num,
+				 unsigned int max_rdma_ctxs, u32 create_flags)
+{
+	unsigned int factor = 1;
+	unsigned int result;
+
+	if (create_flags & IB_QP_CREATE_INTEGRITY_EN ||
+	    rdma_rw_can_use_mr(dev, port_num))
+		factor += 2;	/* reg + inv */
+
+	if (check_mul_overflow(factor, max_rdma_ctxs, &result))
+		return UINT_MAX;
+	return result;
+}
+EXPORT_SYMBOL(rdma_rw_max_send_wr);
+
 void rdma_rw_init_qp(struct ib_device *dev, struct ib_qp_init_attr *attr)
 {
-	u32 factor;
+	unsigned int factor = 1;
 
 	WARN_ON_ONCE(attr->port_num == 0);
 
 	/*
-	 * Each context needs at least one RDMA READ or WRITE WR.
-	 *
-	 * For some hardware we might need more, eventually we should ask the
-	 * HCA driver for a multiplier here.
-	 */
-	factor = 1;
-
-	/*
-	 * If the device needs MRs to perform RDMA READ or WRITE operations,
-	 * we'll need two additional MRs for the registrations and the
-	 * invalidation.
+	 * If the device uses MRs to perform RDMA READ or WRITE operations,
+	 * or if data integrity is enabled, account for registration and
+	 * invalidation work requests.
 	 */
 	if (attr->create_flags & IB_QP_CREATE_INTEGRITY_EN ||
 	    rdma_rw_can_use_mr(dev, attr->port_num))
-		factor += 2;	/* inv + reg */
+		factor += 2;	/* reg + inv */
 
 	attr->cap.max_send_wr += factor * attr->cap.max_rdma_ctxs;
 
 	/*
-	 * But maybe we were just too high in the sky and the device doesn't
-	 * even support all we need, and we'll have to live with what we get..
+	 * The device might not support all we need, and we'll have to
+	 * live with what we get.
 	 */
 	attr->cap.max_send_wr =
 		min_t(u32, attr->cap.max_send_wr, dev->attrs.max_qp_wr);
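
The helper's arithmetic can be tried outside the kernel with a small stand-in. This is a sketch under assumptions, not the kernel code: model_max_send_wr() is a hypothetical name, the needs_mr flag replaces both the IB_QP_CREATE_INTEGRITY_EN test and the rdma_rw_can_use_mr() call, and the GCC/Clang builtin __builtin_mul_overflow() is used in place of the kernel's check_mul_overflow().

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for rdma_rw_max_send_wr().
 * needs_mr models "integrity enabled or device uses MRs".
 */
static unsigned int model_max_send_wr(unsigned int max_rdma_ctxs,
				      bool needs_mr)
{
	unsigned int factor = 1;	/* one RDMA READ or WRITE WR */
	unsigned int result;

	if (needs_mr)
		factor += 2;		/* reg + inv */

	/* Saturate rather than wrap when the product does not fit. */
	if (__builtin_mul_overflow(factor, max_rdma_ctxs, &result))
		return UINT_MAX;
	return result;
}
```

For example, 64 contexts yield 64 entries without registration and 192 with it, and a context count large enough to overflow the multiplication yields UINT_MAX.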

include/rdma/rw.h

···

 unsigned int rdma_rw_mr_factor(struct ib_device *device, u32 port_num,
 		unsigned int maxpages);
+unsigned int rdma_rw_max_send_wr(struct ib_device *dev, u32 port_num,
+		unsigned int max_rdma_ctxs, u32 create_flags);
 void rdma_rw_init_qp(struct ib_device *dev, struct ib_qp_init_attr *attr);
 int rdma_rw_init_mrs(struct ib_qp *qp, struct ib_qp_init_attr *attr);
 void rdma_rw_cleanup_mrs(struct ib_qp *qp);

+6 -2

net/sunrpc/xprtrdma/svc_rdma_transport.c

···
 		newxprt->sc_max_bc_requests = 2;
 	}
 
-	/* Arbitrary estimate of the needed number of rdma_rw contexts.
+	/* Estimate the needed number of rdma_rw contexts. The maximum
+	 * Read and Write chunks have one segment each. Each request
+	 * can involve one Read chunk and either a Write chunk or Reply
+	 * chunk; thus a factor of three.
 	 */
 	maxpayload = min(xprt->xpt_server->sv_max_payload,
 			 RPCSVC_MAXPAYLOAD_RDMA);
···
 		rdma_rw_mr_factor(dev, newxprt->sc_port_num,
 				  maxpayload >> PAGE_SHIFT);
 
-	newxprt->sc_sq_depth = rq_depth +
+	newxprt->sc_sq_depth = rq_depth +
+		rdma_rw_max_send_wr(dev, newxprt->sc_port_num, ctxts, 0);
 	if (newxprt->sc_sq_depth > dev->attrs.max_qp_wr)
 		newxprt->sc_sq_depth = dev->attrs.max_qp_wr;
 	atomic_set(&newxprt->sc_sq_avail, newxprt->sc_sq_depth);
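
The depth calculation and the clamp to the device limit can be modeled as below. This is a hypothetical userspace sketch: model_sq_depth() and its parameters are illustrative names, ctxt_wrs stands for the value returned by rdma_rw_max_send_wr(), and the saturating addition is a choice made for this sketch rather than something taken from the patch.

```c
#include <limits.h>

/*
 * Hypothetical model of the sc_sq_depth computation: Receive Queue
 * depth plus the Send WRs reserved for rdma_rw contexts, capped at
 * the device's max_qp_wr.
 */
static unsigned int model_sq_depth(unsigned int rq_depth,
				   unsigned int ctxt_wrs,
				   unsigned int max_qp_wr)
{
	unsigned int depth;

	/* Sketch-only choice: saturate instead of wrapping on overflow. */
	if (ctxt_wrs > UINT_MAX - rq_depth)
		depth = UINT_MAX;
	else
		depth = rq_depth + ctxt_wrs;

	if (depth > max_qp_wr)
		depth = max_qp_wr;
	return depth;
}
```

With the assumed values rq_depth = 128 and ctxt_wrs = 192, a device limit of 1024 leaves the depth at 320, while a limit of 256 caps it at 256.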
