Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bufmap: manage as folios, V2.

Thanks for the feedback from Dan Carpenter and Arnd Bergmann.

Dan suggested making the rollback loop in orangefs_bufmap_map
more robust.

Arnd caught a %ld format used for a size_t in
orangefs_bufmap_copy_to_iovec. He suggested %zd; I
used %zu, which I think is OK too.

Orangefs userspace allocates 40 megabytes at a page-aligned
address.

With this folio modification, the allocation is aligned to a
multiple of 2 megabytes:
posix_memalign(&ptr, 2097152, 41943040);

Then userspace tries to enable Huge Pages for the range:
madvise(ptr, 41943040, MADV_HUGEPAGE);

Userspace provides the address of the 40 megabyte allocation to
the Orangefs kernel module with an ioctl.

The kernel module initializes the memory as a "bufmap" with ten
4 megabyte "slots".

Traditionally, the slots are manipulated a page at a time.

This folio/bufmap modification manages the slots as folios, with
two 2 megabyte folios per slot, so data can be read into
and out of each slot a folio at a time.

This modification works fine even when orangefs userspace lacks
the THP-focused posix_memalign and madvise settings listed above;
in that case each slot can end up being made of page-sized folios.
It also works if some, but fewer than 20, hugepages are available. A message
is printed in the kernel ring buffer (dmesg) at userspace start
time that describes the folio/page ratio. As an example, I started
orangefs and saw "Grouped 2575 folios from 10240 pages" in the ring
buffer.

To get the optimum ratio, 20/10240, I use these settings before
I start the orangefs userspace:

echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/defrag
echo 30 > /proc/sys/vm/nr_hugepages

https://docs.kernel.org/admin-guide/mm/hugetlbpage.html discusses
hugepages and manipulating the /proc/sys/vm settings.

Comparing the performance between the page/bufmap and the folio/bufmap
is a mixed bag.

- The folio/bufmap version is about 8% faster at running through the
xfstest suite on my VMs.

- It is easy to construct an fio test that brings the page/bufmap
version to its knees on my dinky VM test system, with all bufmap
slots used and I/O timeouts cascading.

- Some smaller tests I did with fio that didn't overwhelm the
page/bufmap version showed no performance gain with the
folio/bufmap version on my VM.

I suspect this change will improve performance only in some use-cases.
I think it will be a gain when there are many concurrent IOs that
mostly fill the bufmap. I'm working up a gcloud test for that.

Reported-by: Dan Carpenter <error27@gmail.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>

+359 -44
fs/orangefs/orangefs-bufmap.c
···
 /* used to describe mapped buffers */
 struct orangefs_bufmap_desc {
 	void __user *uaddr;	/* user space address pointer */
-	struct page **page_array;	/* array of mapped pages */
-	int array_count;	/* size of above arrays */
-	struct list_head list_link;
+	struct folio **folio_array;
+	/*
+	 * folio_offsets could be needed when userspace sets custom
+	 * sizes in user_desc, or when folios aren't all backed by
+	 * 2MB THPs.
+	 */
+	size_t *folio_offsets;
+	int folio_count;
+	bool is_two_2mib_chunks;
 };

 static struct orangefs_bufmap {
···
 	int desc_count;
 	int total_size;
 	int page_count;
+	int folio_count;

 	struct page **page_array;
+	struct folio **folio_array;
 	struct orangefs_bufmap_desc *desc_array;

 	/* array to track usage of buffer descriptors */
···
 static void
 orangefs_bufmap_free(struct orangefs_bufmap *bufmap)
 {
+	int i;
+
+	if (!bufmap)
+		return;
+
+	for (i = 0; i < bufmap->desc_count; i++) {
+		kfree(bufmap->desc_array[i].folio_array);
+		kfree(bufmap->desc_array[i].folio_offsets);
+		bufmap->desc_array[i].folio_array = NULL;
+		bufmap->desc_array[i].folio_offsets = NULL;
+	}
 	kfree(bufmap->page_array);
 	kfree(bufmap->desc_array);
 	bitmap_free(bufmap->buffer_index_array);
···
 	bufmap->desc_count = user_desc->count;
 	bufmap->desc_size = user_desc->size;
 	bufmap->desc_shift = ilog2(bufmap->desc_size);
+	bufmap->page_count = bufmap->total_size / PAGE_SIZE;

-	bufmap->buffer_index_array = bitmap_zalloc(bufmap->desc_count, GFP_KERNEL);
+	bufmap->buffer_index_array =
+		bitmap_zalloc(bufmap->desc_count, GFP_KERNEL);
 	if (!bufmap->buffer_index_array)
 		goto out_free_bufmap;
···
 	if (!bufmap->desc_array)
 		goto out_free_index_array;

-	bufmap->page_count = bufmap->total_size / PAGE_SIZE;
-
 	/* allocate storage to track our page mappings */
 	bufmap->page_array =
 		kzalloc_objs(struct page *, bufmap->page_count);
 	if (!bufmap->page_array)
 		goto out_free_desc_array;

+	/* allocate folio array. */
+	bufmap->folio_array = kzalloc_objs(struct folio *, bufmap->page_count);
+	if (!bufmap->folio_array)
+		goto out_free_page_array;
+
 	return bufmap;

+out_free_page_array:
+	kfree(bufmap->page_array);
 out_free_desc_array:
 	kfree(bufmap->desc_array);
 out_free_index_array:
···
 	return NULL;
 }

-static int
-orangefs_bufmap_map(struct orangefs_bufmap *bufmap,
-		struct ORANGEFS_dev_map_desc *user_desc)
+static int orangefs_bufmap_group_folios(struct orangefs_bufmap *bufmap)
+{
+	int i = 0;
+	int f = 0;
+	int k;
+	int num_pages;
+	struct page *page;
+	struct folio *folio;
+
+	while (i < bufmap->page_count) {
+		page = bufmap->page_array[i];
+		folio = page_folio(page);
+		num_pages = folio_nr_pages(folio);
+		gossip_debug(GOSSIP_BUFMAP_DEBUG,
+			"%s: i:%d: num_pages:%d: \n", __func__, i, num_pages);
+
+		for (k = 1; k < num_pages; k++) {
+			if (bufmap->page_array[i + k] != folio_page(folio, k)) {
+				gossip_err("%s: bad match, i:%d: k:%d:\n",
+					__func__, i, k);
+				return -EINVAL;
+			}
+		}
+
+		bufmap->folio_array[f++] = folio;
+		i += num_pages;
+	}
+
+	bufmap->folio_count = f;
+	pr_info("%s: Grouped %d folios from %d pages.\n",
+		__func__,
+		bufmap->folio_count,
+		bufmap->page_count);
+	return 0;
+}
+
+static int orangefs_bufmap_map(struct orangefs_bufmap *bufmap,
+			struct ORANGEFS_dev_map_desc *user_desc)
 {
 	int pages_per_desc = bufmap->desc_size / PAGE_SIZE;
-	int offset = 0, ret, i;
+	int ret;
+	int i;
+	int j;
+	int current_folio;
+	int desc_pages_needed;
+	int desc_folio_count;
+	int remaining_pages;
+	int need_avail_min;
+	int pages_assigned_to_this_desc;
+	int allocated_descs = 0;
+	size_t current_offset;
+	size_t adjust_offset;
+	struct folio *folio;

 	/* map the pages */
 	ret = pin_user_pages_fast((unsigned long)user_desc->ptr,
-		bufmap->page_count, FOLL_WRITE, bufmap->page_array);
+		bufmap->page_count,
+		FOLL_WRITE,
+		bufmap->page_array);

 	if (ret < 0)
 		return ret;
···
 	if (ret != bufmap->page_count) {
 		gossip_err("orangefs error: asked for %d pages, only got %d.\n",
 			bufmap->page_count, ret);
-
 		for (i = 0; i < ret; i++)
 			unpin_user_page(bufmap->page_array[i]);
 		return -ENOMEM;
···
 	for (i = 0; i < bufmap->page_count; i++)
 		flush_dcache_page(bufmap->page_array[i]);

-	/* build a list of available descriptors */
-	for (offset = 0, i = 0; i < bufmap->desc_count; i++) {
-		bufmap->desc_array[i].page_array = &bufmap->page_array[offset];
-		bufmap->desc_array[i].array_count = pages_per_desc;
+	/*
+	 * Group pages into folios.
+	 */
+	ret = orangefs_bufmap_group_folios(bufmap);
+	if (ret)
+		goto unpin;
+
+	pr_info("%s: desc_size=%d bytes (%d pages per desc), total folios=%d\n",
+		__func__, bufmap->desc_size, pages_per_desc,
+		bufmap->folio_count);
+
+	current_folio = 0;
+	remaining_pages = 0;
+	current_offset = 0;
+	for (i = 0; i < bufmap->desc_count; i++) {
+		desc_pages_needed = pages_per_desc;
+		desc_folio_count = 0;
+		pages_assigned_to_this_desc = 0;
+		bufmap->desc_array[i].is_two_2mib_chunks = false;
+
+		/*
+		 * We hope there was enough memory that each desc is
+		 * covered by two THPs/folios, if not we want to keep on
+		 * working even if there's only one page per folio.
+		 */
+		bufmap->desc_array[i].folio_array =
+			kzalloc_objs(struct folio *, pages_per_desc);
+		if (!bufmap->desc_array[i].folio_array) {
+			ret = -ENOMEM;
+			goto unpin;
+		}
+
+		bufmap->desc_array[i].folio_offsets =
+			kzalloc_objs(size_t, pages_per_desc);
+		if (!bufmap->desc_array[i].folio_offsets) {
+			ret = -ENOMEM;
+			kfree(bufmap->desc_array[i].folio_array);
+			bufmap->desc_array[i].folio_array = NULL;
+			goto unpin;
+		}
+
 		bufmap->desc_array[i].uaddr =
-			(user_desc->ptr + (i * pages_per_desc * PAGE_SIZE));
-		offset += pages_per_desc;
+			user_desc->ptr + (size_t)i * bufmap->desc_size;
+
+		/*
+		 * Accumulate folios until desc is full.
+		 */
+		while (desc_pages_needed > 0) {
+			if (remaining_pages == 0) {
+				/* shouldn't happen. */
+				if (current_folio >= bufmap->folio_count) {
+					ret = -EINVAL;
+					goto unpin;
+				}
+				folio = bufmap->folio_array[current_folio++];
+				remaining_pages = folio_nr_pages(folio);
+				current_offset = 0;
+			} else {
+				folio = bufmap->folio_array[current_folio - 1];
+			}
+
+			need_avail_min =
+				min(desc_pages_needed, remaining_pages);
+			adjust_offset = need_avail_min * PAGE_SIZE;
+
+			bufmap->desc_array[i].folio_array[desc_folio_count] =
+				folio;
+			bufmap->desc_array[i].folio_offsets[desc_folio_count] =
+				current_offset;
+			desc_folio_count++;
+			pages_assigned_to_this_desc += need_avail_min;
+			desc_pages_needed -= need_avail_min;
+			remaining_pages -= need_avail_min;
+			current_offset += adjust_offset;
+		}
+
+		/* Detect optimal case: two 2MiB folios per 4MiB slot. */
+		if (desc_folio_count == 2 &&
+		    folio_nr_pages(bufmap->desc_array[i].folio_array[0]) == 512 &&
+		    folio_nr_pages(bufmap->desc_array[i].folio_array[1]) == 512) {
+			bufmap->desc_array[i].is_two_2mib_chunks = true;
+			gossip_debug(GOSSIP_BUFMAP_DEBUG, "%s: descriptor :%d: "
+				"optimal folio/page ratio.\n", __func__, i);
+		}
+
+		bufmap->desc_array[i].folio_count = desc_folio_count;
+		gossip_debug(GOSSIP_BUFMAP_DEBUG,
+			" descriptor %d: folio_count=%d, "
+			"pages_assigned=%d (should be %d)\n",
+			i, desc_folio_count, pages_assigned_to_this_desc,
+			pages_per_desc);
+
+		allocated_descs = i + 1;
 	}

 	return 0;
+unpin:
+	/*
+	 * rollback any allocations we got so far...
+	 * Memory pressure, like in generic/340, led me
+	 * to write the rollback this way.
+	 */
+	for (j = 0; j < allocated_descs; j++) {
+		if (bufmap->desc_array[j].folio_array) {
+			kfree(bufmap->desc_array[j].folio_array);
+			bufmap->desc_array[j].folio_array = NULL;
+		}
+		if (bufmap->desc_array[j].folio_offsets) {
+			kfree(bufmap->desc_array[j].folio_offsets);
+			bufmap->desc_array[j].folio_offsets = NULL;
+		}
+	}
+	unpin_user_pages(bufmap->page_array, bufmap->page_count);
+	return ret;
 }

 /*
  * orangefs_bufmap_initialize()
  *
  * initializes the mapped buffer interface
+ *
+ * user_desc is the parameters provided by userspace for the bufmap.
  *
  * returns 0 on success, -errno on failure
  */
···
 	int ret = -EINVAL;

 	gossip_debug(GOSSIP_BUFMAP_DEBUG,
-		"orangefs_bufmap_initialize: called (ptr ("
-		"%p) sz (%d) cnt(%d).\n",
+		"%s: called (ptr (" "%p) sz (%d) cnt(%d).\n",
+		__func__,
 		user_desc->ptr,
 		user_desc->size,
 		user_desc->count);
···
 	spin_unlock(&orangefs_bufmap_lock);

 	gossip_debug(GOSSIP_BUFMAP_DEBUG,
-		"orangefs_bufmap_initialize: exiting normally\n");
+		"%s: exiting normally\n", __func__);
 	return 0;

 out_unmap_bufmap:
···
 	size_t size)
 {
 	struct orangefs_bufmap_desc *to;
-	int i;
-
-	gossip_debug(GOSSIP_BUFMAP_DEBUG,
-		"%s: buffer_index:%d: size:%zu:\n",
-		__func__, buffer_index, size);
+	size_t remaining = size;
+	int folio_index = 0;
+	struct folio *folio;
+	size_t folio_offset;
+	size_t folio_avail;
+	size_t copy_amount;
+	size_t copied;
+	void *kaddr;
+	size_t half;
+	size_t first;
+	size_t second;

 	to = &__orangefs_bufmap->desc_array[buffer_index];
-	for (i = 0; size; i++) {
-		struct page *page = to->page_array[i];
-		size_t n = size;
-		if (n > PAGE_SIZE)
-			n = PAGE_SIZE;
-		if (copy_page_from_iter(page, 0, n, iter) != n)
+
+	/* shouldn't happen... */
+	if (size > 4194304)
+		pr_info("%s: size:%zu\n", __func__, size);
+
+	gossip_debug(GOSSIP_BUFMAP_DEBUG,
+		"%s: buffer_index:%d size:%zu folio_count:%d\n",
+		__func__,
+		buffer_index,
+		size,
+		to->folio_count);
+
+	/* Fast path: exactly two 2 MiB folios */
+	if (to->is_two_2mib_chunks && size <= 4194304) {
+		gossip_debug(GOSSIP_BUFMAP_DEBUG,
+			"%s: fastpath hit.\n", __func__);
+		half = 2097152;	/* 2 MiB */
+		first = min(size, half);
+		second = (size > half) ? size - half : 0;
+
+		/* First 2 MiB chunk */
+		kaddr = kmap_local_folio(to->folio_array[0], 0);
+		copied = copy_from_iter(kaddr, first, iter);
+		kunmap_local(kaddr);
+		if (copied != first)
 			return -EFAULT;
-		size -= n;
+
+		if (second == 0)
+			return 0;
+
+		/* Second 2 MiB chunk */
+		kaddr = kmap_local_folio(to->folio_array[1], 0);
+		copied = copy_from_iter(kaddr, second, iter);
+		kunmap_local(kaddr);
+		if (copied != second)
+			return -EFAULT;
+
+		return 0;
 	}
+
+	while (remaining > 0) {
+
+		if (unlikely(folio_index >= to->folio_count ||
+			     to->folio_array[folio_index] == NULL)) {
+			gossip_err("%s: "
+				"folio_index:%d: >= folio_count:%d: "
+				"(size %zu, buffer %d)\n",
+				__func__,
+				folio_index,
+				to->folio_count,
+				size,
+				buffer_index);
+			return -EFAULT;
+		}
+
+		folio = to->folio_array[folio_index];
+		folio_offset = to->folio_offsets[folio_index];
+		folio_avail = folio_nr_pages(folio) * PAGE_SIZE - folio_offset;
+		copy_amount = min(remaining, folio_avail);
+		kaddr = kmap_local_folio(folio, folio_offset);
+		copied = copy_from_iter(kaddr, copy_amount, iter);
+		kunmap_local(kaddr);
+
+		if (copied != copy_amount)
+			return -EFAULT;
+
+		remaining -= copied;
+		folio_index++;
+	}
+
 	return 0;
 }
···
 	size_t size)
 {
 	struct orangefs_bufmap_desc *from;
-	int i;
+	size_t remaining = size;
+	int folio_index = 0;
+	struct folio *folio;
+	size_t folio_offset;
+	size_t folio_avail;
+	size_t copy_amount;
+	size_t copied;
+	void *kaddr;
+	size_t half;
+	size_t first;
+	size_t second;

 	from = &__orangefs_bufmap->desc_array[buffer_index];
+
+	/* shouldn't happen... */
+	if (size > 4194304)
+		pr_info("%s: size:%zu\n", __func__, size);
+
 	gossip_debug(GOSSIP_BUFMAP_DEBUG,
-		"%s: buffer_index:%d: size:%zu:\n",
-		__func__, buffer_index, size);
+		"%s: buffer_index:%d size:%zu folio_count:%d\n",
+		__func__,
+		buffer_index,
+		size,
+		from->folio_count);

+	/* Fast path: exactly two 2 MiB folios */
+	if (from->is_two_2mib_chunks && size <= 4194304) {
+		gossip_debug(GOSSIP_BUFMAP_DEBUG,
+			"%s: fastpath hit.\n", __func__);
+		half = 2097152;	/* 2 MiB */
+		first = min(size, half);
+		second = (size > half) ? size - half : 0;
+		void *kaddr;
+		size_t copied;

-	for (i = 0; size; i++) {
-		struct page *page = from->page_array[i];
-		size_t n = size;
-		if (n > PAGE_SIZE)
-			n = PAGE_SIZE;
-		n = copy_page_to_iter(page, 0, n, iter);
-		if (!n)
+		/* First 2 MiB chunk */
+		kaddr = kmap_local_folio(from->folio_array[0], 0);
+		copied = copy_to_iter(kaddr, first, iter);
+		kunmap_local(kaddr);
+		if (copied != first)
 			return -EFAULT;
-		size -= n;
+
+		if (second == 0)
+			return 0;
+
+		/* Second 2 MiB chunk */
+		kaddr = kmap_local_folio(from->folio_array[1], 0);
+		copied = copy_to_iter(kaddr, second, iter);
+		kunmap_local(kaddr);
+		if (copied != second)
+			return -EFAULT;
+
+		return 0;
 	}
+
+	while (remaining > 0) {
+
+		if (unlikely(folio_index >= from->folio_count ||
+			     from->folio_array[folio_index] == NULL)) {
+			gossip_err("%s: "
+				"folio_index:%d: >= folio_count:%d: "
+				"(size %zu, buffer %d)\n",
+				__func__,
+				folio_index,
+				from->folio_count,
+				size,
+				buffer_index);
+			return -EFAULT;
+		}
+
+		folio = from->folio_array[folio_index];
+		folio_offset = from->folio_offsets[folio_index];
+		folio_avail = folio_nr_pages(folio) * PAGE_SIZE - folio_offset;
+		copy_amount = min(remaining, folio_avail);
+
+		kaddr = kmap_local_folio(folio, folio_offset);
+		copied = copy_to_iter(kaddr, copy_amount, iter);
+		kunmap_local(kaddr);
+
+		if (copied != copy_amount)
+			return -EFAULT;
+
+		remaining -= copied;
+		folio_index++;
+	}
+
 	return 0;
 }