Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

SUNRPC: Allocate a separate Reply page array

struct svc_rqst uses a single dynamically-allocated page array
(rq_pages) for both the incoming RPC Call message and the outgoing
RPC Reply message. rq_respages is a sliding pointer into rq_pages
that each transport receive path must compute based on how many
pages the Call consumed. This boundary tracking is a source of
confusion and bugs, and prevents an RPC transaction from having
both a large Call and a large Reply simultaneously.

Allocate rq_respages as its own page array, eliminating the boundary
arithmetic. This decouples Call and Reply buffer lifetimes,
following the precedent set by rq_bvec (a separate dynamically-
allocated array for I/O vectors).
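
The difference between the two layouts can be sketched in userspace C. This is only an illustrative model, not kernel code: the function names and `SKETCH_PAGE_SIZE` are hypothetical, and the old-scheme arithmetic is modelled on the `1 + DIV_ROUND_UP(page_len, PAGE_SIZE)` computation that this patch removes from the UDP receive path.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size, for illustration only */

/*
 * Old scheme (modelled): Reply pages begin after the Call's head page
 * and after every page holding the Call's page data.
 */
static size_t old_reply_start(size_t call_page_len)
{
	return 1 + (call_page_len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Old scheme: the Reply gets whatever the Call left over in the shared array. */
static size_t old_reply_capacity(size_t maxpages, size_t call_page_len)
{
	size_t start;

	assert(maxpages > 0);
	start = old_reply_start(call_page_len);
	return start < maxpages ? maxpages - start : 0;
}

/* New scheme: the Reply array is separate, so the Call size is irrelevant. */
static size_t new_reply_capacity(size_t maxpages, size_t call_page_len)
{
	(void)call_page_len;	/* intentionally unused */
	return maxpages;
}
```

With a hypothetical 259-entry array and a 1MB Call, the modelled old scheme leaves 2 entries for the Reply, while the separate array leaves all 259.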

Each svc_rqst now pins twice as many pages as before. For a server
running 16 threads with a 1MB maximum payload, the additional cost
is roughly 16MB of pinned memory. The new dynamic svc thread count
facility keeps this overhead minimal on an idle server. A subsequent
patch in this series limits per-request repopulation to only the
pages released during the previous RPC, avoiding a full-array scan
on each call to svc_alloc_arg().
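
The 16MB estimate above can be checked with a small calculation. The helper below is a hypothetical userspace sketch; the 4KB page size and the choice of overhead pages per array are assumptions made for illustration, not values taken from this patch.

```c
#include <assert.h>

/*
 * Bytes pinned by one extra page array per thread.
 * overhead_pages is an assumed count of non-payload pages per array.
 */
static unsigned long long extra_pinned_bytes(unsigned int nthreads,
					     unsigned long long max_payload,
					     unsigned long overhead_pages,
					     unsigned long page_size)
{
	unsigned long long pages;

	assert(page_size > 0);
	/* Round the payload up to whole pages, then add the overhead pages. */
	pages = (max_payload + page_size - 1) / page_size + overhead_pages;
	return nthreads * pages * page_size;
}
```

For 16 threads, a 1MB payload, and 4KB pages, the payload pages alone account for exactly 16MB; assuming three overhead pages per array adds under 200KB on top of that, which is consistent with "roughly 16MB".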

Note: We've considered several alternatives to maintaining a full
second array. Each alternative reintroduces either boundary logic
complexity or I/O-path allocation pressure.

rq_next_page is initialized in svc_alloc_arg() and svc_process()
during Reply construction, and in svc_rdma_recvfrom() as a
precaution on error paths. Transport receive paths no longer compute
it from the Call size.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

+77 -56
+23 -24
include/linux/sunrpc/svc.h
···
 extern u32 svc_max_payload(const struct svc_rqst *rqstp);
 
 /*
- * RPC Requests and replies are stored in one or more pages.
- * We maintain an array of pages for each server thread.
- * Requests are copied into these pages as they arrive. Remaining
- * pages are available to write the reply into.
+ * RPC Call and Reply messages each have their own page array.
+ * rq_pages holds the incoming Call message; rq_respages holds
+ * the outgoing Reply message. Both arrays are sized to
+ * svc_serv_maxpages() entries and are allocated dynamically.
  *
- * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each server thread
- * needs to allocate more to replace those used in sending. To help keep track
- * of these pages we have a receive list where all pages initialy live, and a
- * send list where pages are moved to when there are to be part of a reply.
+ * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each
+ * server thread needs to allocate more to replace those used in
+ * sending.
  *
- * We use xdr_buf for holding responses as it fits well with NFS
- * read responses (that have a header, and some data pages, and possibly
- * a tail) and means we can share some client side routines.
+ * xdr_buf holds responses; the structure fits NFS read responses
+ * (header, data pages, optional tail) and enables sharing of
+ * client-side routines.
  *
- * The xdr_buf.head kvec always points to the first page in the rq_*pages
- * list. The xdr_buf.pages pointer points to the second page on that
- * list. xdr_buf.tail points to the end of the first page.
- * This assumes that the non-page part of an rpc reply will fit
- * in a page - NFSd ensures this. lockd also has no trouble.
+ * The xdr_buf.head kvec always points to the first page in the
+ * rq_*pages list. The xdr_buf.pages pointer points to the second
+ * page on that list. xdr_buf.tail points to the end of the first
+ * page. This assumes that the non-page part of an rpc reply will
+ * fit in a page - NFSd ensures this. lockd also has no trouble.
  */
 
 /**
···
  * Returns a count of pages or vectors that can hold the maximum
  * size RPC message for @serv.
  *
- * Each request/reply pair can have at most one "payload", plus two
- * pages, one for the request, and one for the reply.
- * nfsd_splice_actor() might need an extra page when a READ payload
- * is not page-aligned.
+ * Each page array can hold at most one payload plus two
+ * overhead pages (one for the RPC header, one for tail data).
+ * nfsd_splice_actor() might need an extra page when a READ
+ * payload is not page-aligned.
  */
 static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
 {
···
 	struct xdr_stream rq_res_stream;
 	struct folio *rq_scratch_folio;
 	struct xdr_buf rq_res;
-	unsigned long rq_maxpages; /* num of entries in rq_pages */
-	struct page * *rq_pages;
-	struct page * *rq_respages; /* points into rq_pages */
+	unsigned long rq_maxpages; /* entries per page array */
+	struct page * *rq_pages; /* Call buffer pages */
+	struct page * *rq_respages; /* Reply buffer pages */
 	struct page * *rq_next_page; /* next reply page to use */
-	struct page * *rq_page_end; /* one past the last page */
+	struct page * *rq_page_end; /* one past the last reply page */
 
 	struct folio_batch rq_fbatch;
 	struct bio_vec *rq_bvec;
+24 -5
net/sunrpc/svc.c
···
 {
 	rqstp->rq_maxpages = svc_serv_maxpages(serv);
 
-	/* rq_pages' last entry is NULL for historical reasons. */
+	/* +1 for a NULL sentinel readable by nfsd_splice_actor() */
 	rqstp->rq_pages = kcalloc_node(rqstp->rq_maxpages + 1,
 				       sizeof(struct page *),
 				       GFP_KERNEL, node);
 	if (!rqstp->rq_pages)
 		return false;
+
+	/* +1 for a NULL sentinel at rq_page_end (see svc_rqst_replace_page) */
+	rqstp->rq_respages = kcalloc_node(rqstp->rq_maxpages + 1,
+					  sizeof(struct page *),
+					  GFP_KERNEL, node);
+	if (!rqstp->rq_respages) {
+		kfree(rqstp->rq_pages);
+		rqstp->rq_pages = NULL;
+		return false;
+	}
 
 	return true;
 }
···
 {
 	unsigned long i;
 
-	for (i = 0; i < rqstp->rq_maxpages; i++)
-		if (rqstp->rq_pages[i])
-			put_page(rqstp->rq_pages[i]);
-	kfree(rqstp->rq_pages);
+	if (rqstp->rq_pages) {
+		for (i = 0; i < rqstp->rq_maxpages; i++)
+			if (rqstp->rq_pages[i])
+				put_page(rqstp->rq_pages[i]);
+		kfree(rqstp->rq_pages);
+	}
+
+	if (rqstp->rq_respages) {
+		for (i = 0; i < rqstp->rq_maxpages; i++)
+			if (rqstp->rq_respages[i])
+				put_page(rqstp->rq_respages[i]);
+		kfree(rqstp->rq_respages);
+	}
 }
 
 static void
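
The allocate-both-or-neither pattern in the hunk above can be tried outside the kernel. The following is a minimal userspace sketch under stated assumptions: `struct rqst_sketch` and both helpers are hypothetical names, `calloc()`/`free()` stand in for `kcalloc_node()`/`kfree()`, and plain heap blocks stand in for pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two page arrays in struct svc_rqst. */
struct rqst_sketch {
	size_t maxpages;
	void **pages;		/* models rq_pages (Call) */
	void **respages;	/* models rq_respages (Reply) */
};

/* Allocate both arrays, each with one extra zeroed slot for a NULL sentinel. */
static bool rqst_alloc_arrays(struct rqst_sketch *rq, size_t maxpages)
{
	assert(rq != NULL);
	rq->maxpages = maxpages;
	rq->respages = NULL;

	rq->pages = calloc(maxpages + 1, sizeof(void *));
	if (!rq->pages)
		return false;

	rq->respages = calloc(maxpages + 1, sizeof(void *));
	if (!rq->respages) {
		/* Unwind the first allocation so the caller sees neither array. */
		free(rq->pages);
		rq->pages = NULL;
		return false;
	}
	return true;
}

/* Release every populated slot, then the arrays; safe on partial setup. */
static void rqst_free_arrays(struct rqst_sketch *rq)
{
	size_t i;

	if (rq->pages) {
		for (i = 0; i < rq->maxpages; i++)
			free(rq->pages[i]);	/* free(NULL) is a no-op */
		free(rq->pages);
		rq->pages = NULL;
	}
	if (rq->respages) {
		for (i = 0; i < rq->maxpages; i++)
			free(rq->respages[i]);
		free(rq->respages);
		rq->respages = NULL;
	}
}
```

Resetting the first pointer to NULL on the failure path is what lets the free routine be called unconditionally, which mirrors the NULL checks added to the release path in this hunk.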
+26 -10
net/sunrpc/svc_xprt.c
···
 	}
 }
 
-static bool svc_alloc_arg(struct svc_rqst *rqstp)
+static bool svc_fill_pages(struct svc_rqst *rqstp, struct page **pages,
+			   unsigned long npages)
 {
-	struct xdr_buf *arg = &rqstp->rq_arg;
-	unsigned long pages, filled, ret;
+	unsigned long filled, ret;
 
-	pages = rqstp->rq_maxpages;
-	for (filled = 0; filled < pages; filled = ret) {
-		ret = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
+	for (filled = 0; filled < npages; filled = ret) {
+		ret = alloc_pages_bulk(GFP_KERNEL, npages, pages);
 		if (ret > filled)
 			/* Made progress, don't sleep yet */
 			continue;
···
 			set_current_state(TASK_RUNNING);
 			return false;
 		}
-		trace_svc_alloc_arg_err(pages, ret);
+		trace_svc_alloc_arg_err(npages, ret);
 		memalloc_retry_wait(GFP_KERNEL);
 	}
-	rqstp->rq_page_end = &rqstp->rq_pages[pages];
-	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
+	return true;
+}
+
+static bool svc_alloc_arg(struct svc_rqst *rqstp)
+{
+	struct xdr_buf *arg = &rqstp->rq_arg;
+	unsigned long pages;
+
+	pages = rqstp->rq_maxpages;
+
+	if (!svc_fill_pages(rqstp, rqstp->rq_pages, pages))
+		return false;
+	if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
+		return false;
+	rqstp->rq_next_page = rqstp->rq_respages;
+	rqstp->rq_page_end = &rqstp->rq_respages[pages];
+	/* svc_rqst_replace_page() dereferences *rq_next_page even
+	 * at rq_page_end; NULL prevents releasing a garbage page.
+	 */
+	rqstp->rq_page_end[0] = NULL;
 
 	/* Make arg->head point to first page and arg->pages point to rest */
 	arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
···
 	rqstp->rq_addrlen = dr->addrlen;
 	/* Save off transport header len in case we get deferred again */
 	rqstp->rq_daddr = dr->daddr;
-	rqstp->rq_respages = rqstp->rq_pages;
 	rqstp->rq_xprt_ctxt = dr->xprt_ctxt;
 
 	dr->xprt_ctxt = NULL;
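
The reset of the Reply cursor and the NULL sentinel in the hunk above can be modelled in userspace as follows. This is a sketch only: `struct reply_cursor`, `fill_slots()`, and `reply_cursor_reset()` are hypothetical names, `malloc()` stands in for page allocation, and the sketch simply chooses to populate empty slots rather than reproducing the retry loop in `svc_fill_pages()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of rq_respages, rq_next_page and rq_page_end. */
struct reply_cursor {
	void **respages;
	void **next_page;
	void **page_end;
};

/* Populate only the empty slots; already-filled slots are left alone. */
static bool fill_slots(void **slots, size_t nslots, size_t page_size)
{
	size_t i;

	for (i = 0; i < nslots; i++) {
		if (slots[i])
			continue;
		slots[i] = malloc(page_size);
		if (!slots[i])
			return false;
	}
	return true;
}

/*
 * Point the cursor at the start of the Reply array and plant a NULL
 * sentinel one past the last usable slot. The array must therefore
 * have nslots + 1 entries.
 */
static void reply_cursor_reset(struct reply_cursor *c, void **respages,
			       size_t nslots)
{
	assert(respages != NULL);
	c->respages = respages;
	c->next_page = respages;
	c->page_end = &respages[nslots];
	c->page_end[0] = NULL;
}
```

Because the cursor always starts at the first Reply slot, nothing in this model depends on how many slots the Call used, which is the property the patch relies on when it drops the per-transport computations.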
-6
net/sunrpc/svcsock.c
···
 
 	for (i = 0, t = 0; t < buflen; i++, t += PAGE_SIZE)
 		bvec_set_page(&bvec[i], rqstp->rq_pages[i], PAGE_SIZE, 0);
-	rqstp->rq_respages = &rqstp->rq_pages[i];
-	rqstp->rq_next_page = rqstp->rq_respages + 1;
 
 	iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen);
 	if (seek) {
···
 	if (len <= rqstp->rq_arg.head[0].iov_len) {
 		rqstp->rq_arg.head[0].iov_len = len;
 		rqstp->rq_arg.page_len = 0;
-		rqstp->rq_respages = rqstp->rq_pages+1;
 	} else {
 		rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
-		rqstp->rq_respages = rqstp->rq_pages + 1 +
-			DIV_ROUND_UP(rqstp->rq_arg.page_len, PAGE_SIZE);
 	}
-	rqstp->rq_next_page = rqstp->rq_respages+1;
 
 	if (serv->sv_stats)
 		serv->sv_stats->netudpcnt++;
+4 -11
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
···
 	unsigned int i;
 
 	/* Transfer the Read chunk pages into @rqstp.rq_pages, replacing
-	 * the rq_pages that were already allocated for this rqstp.
+	 * the receive buffer pages already allocated for this rqstp.
 	 */
-	release_pages(rqstp->rq_respages, ctxt->rc_page_count);
+	release_pages(rqstp->rq_pages, ctxt->rc_page_count);
 	for (i = 0; i < ctxt->rc_page_count; i++)
 		rqstp->rq_pages[i] = ctxt->rc_pages[i];
-
-	/* Update @rqstp's result send buffer to start after the
-	 * last page in the RDMA Read payload.
-	 */
-	rqstp->rq_respages = &rqstp->rq_pages[ctxt->rc_page_count];
-	rqstp->rq_next_page = rqstp->rq_respages + 1;
 
 	/* Prevent svc_rdma_recv_ctxt_put() from releasing the
 	 * pages in ctxt::rc_pages a second time.
···
 	struct svc_rdma_recv_ctxt *ctxt;
 	int ret;
 
-	/* Prevent svc_xprt_release() from releasing pages in rq_pages
-	 * when returning 0 or an error.
+	/* Precaution: a zero page count on error return causes
+	 * svc_rqst_release_pages() to release nothing.
 	 */
-	rqstp->rq_respages = rqstp->rq_pages;
 	rqstp->rq_next_page = rqstp->rq_respages;
 
 	rqstp->rq_xprt_ctxt = NULL;