Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

kexec: enable CMA based contiguous allocation

When booting a new kernel with kexec_file, the kernel picks a target
location that the kernel should live at, then allocates random pages,
checks whether any of those pages magically happens to coincide with a
target address range and if so, uses them for that range.

For every page allocated this way, it then creates a page list that the
relocation code - code that executes while all CPUs are off and we are
just about to jump into the new kernel - copies to their final memory
location. We cannot put them there before, because chances are pretty
good that at least some page in the target range is already in use by the
currently running Linux environment. Copying is happening from a single
CPU at RAM rate, which takes around 4-50 ms per 100 MiB.
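
As a rough worked example of the cost quoted above, the sketch below scales the 4-50 ms per 100 MiB range to a given payload size. The helper name, the struct, and the assumption that the cost scales linearly with size are illustrative and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Range quoted in the commit message: 4-50 ms per 100 MiB copied. */
#define COPY_MS_PER_100MIB_MIN 4UL
#define COPY_MS_PER_100MIB_MAX 50UL

struct copy_estimate {
	unsigned long min_ms;
	unsigned long max_ms;
};

/* Hypothetical helper: assumes the copy cost scales linearly with size. */
static struct copy_estimate estimate_copy_time(unsigned long payload_mib)
{
	struct copy_estimate e;

	e.min_ms = payload_mib * COPY_MS_PER_100MIB_MIN / 100;
	e.max_ms = payload_mib * COPY_MS_PER_100MIB_MAX / 100;
	return e;
}
```

Under that assumption, a 300 MiB payload would spend roughly 12 to 150 ms copying in the hot phase.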

All of this is inefficient and error prone.

To successfully kexec, we need to quiesce all devices of the outgoing
kernel so they don't scribble over the new kernel's memory. We have seen
cases where that does not happen properly (*cough* GIC *cough*) and hence
the new kernel was corrupted. This started a month-long journey to root
cause failing kexecs to eventually see memory corruption, because the new
kernel was corrupted severely enough that it could not emit output to tell
us about the fact that it was corrupted. By allocating memory for the
next kernel from a memory range that is guaranteed scribbling free, we can
boot the next kernel up to a point where it is at least able to detect
corruption and maybe even stop it before it becomes severe. This
increases the chance for successful kexecs.

Since kexec got introduced, Linux has gained the CMA framework which can
perform physically contiguous memory mappings, while keeping that memory
available for movable memory when it is not needed for contiguous
allocations. The default CMA allocator is for DMA allocations.

This patch adds logic to the kexec file loader to attempt to place the
target payload at a location allocated from CMA. If successful, it uses
that memory range directly instead of creating copy instructions during
the hot phase. To ensure that there is a safety net in case anything goes
wrong with the CMA allocation, it also adds a flag for user space to force
disable CMA allocations.
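
A minimal user-space sketch of how the opt-out might be used, assuming the raw kexec_file_load(2) syscall is invoked directly. The flag values are the ones listed in include/uapi/linux/kexec.h in this patch; the helper names kexec_file_flags() and load_kernel() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flag values as listed in include/uapi/linux/kexec.h in this patch. */
#ifndef KEXEC_FILE_NO_INITRAMFS
#define KEXEC_FILE_NO_INITRAMFS 0x00000004
#endif
#ifndef KEXEC_FILE_NO_CMA
#define KEXEC_FILE_NO_CMA 0x00000010
#endif

/* Hypothetical helper: build the flags word for kexec_file_load(2). */
static unsigned long kexec_file_flags(bool have_initramfs, bool allow_cma)
{
	unsigned long flags = 0;

	if (!have_initramfs)
		flags |= KEXEC_FILE_NO_INITRAMFS;
	if (!allow_cma)
		flags |= KEXEC_FILE_NO_CMA;	/* ask for the copy-based path */
	return flags;
}

/*
 * Hypothetical wrapper around the raw syscall. It needs CAP_SYS_BOOT and
 * returns 0 on success or -1 with errno set. cmdline_len must include the
 * terminating NUL byte.
 */
long load_kernel(int kernel_fd, int initrd_fd, const char *cmdline,
		 unsigned long cmdline_len, unsigned long flags)
{
	assert(cmdline != NULL);
#ifdef SYS_kexec_file_load
	return syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
		       cmdline_len, cmdline, flags);
#else
	errno = ENOSYS;	/* syscall number not known on this platform */
	return -1;
#endif
}
```

Passing `kexec_file_flags(true, false)` as the flags argument would request a load that skips the CMA attempt, which is the safety net described above.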

Using CMA allocations has two advantages:

1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the
hot phase.
2) More robust. Even if by accident some page is still in use for DMA,
the new kernel image will be safe from that access because it resides
in a memory region that is considered allocated in the old kernel, so
the new kernel has a chance to reinitialize that component.

Link: https://lkml.kernel.org/r/20250610085327.51817-1-graf@amazon.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Alexander Graf and committed by Andrew Morton
07d24902 ed4f142f

+156 -11
+1
arch/riscv/kernel/kexec_elf.c
···
 		kbuf.buf_align = PMD_SIZE;
 		kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
 		kbuf.memsz = ALIGN(kernel_len, PAGE_SIZE);
+		kbuf.cma = NULL;
 		kbuf.top_down = false;
 		ret = arch_kexec_locate_mem_hole(&kbuf);
 		if (!ret) {
+10
include/linux/kexec.h
···

 typedef unsigned long kimage_entry_t;

+/*
+ * This is a copy of the UAPI struct kexec_segment and must be identical
+ * to it because it gets copied straight from user space into kernel
+ * memory. Do not modify this structure unless you change the way segments
+ * get ingested from user space.
+ */
 struct kexec_segment {
 	/*
 	 * This pointer can point to user memory if kexec_load() system
···
  * @buf_align:	Minimum alignment needed.
  * @buf_min:	The buffer can't be placed below this address.
  * @buf_max:	The buffer can't be placed above this address.
+ * @cma:	CMA page if the buffer is backed by CMA.
  * @top_down:	Allocate from top of memory.
  * @random:	Place the buffer at a random position.
  */
···
 	unsigned long buf_align;
 	unsigned long buf_min;
 	unsigned long buf_max;
+	struct page *cma;
 	bool top_down;
 #ifdef CONFIG_CRASH_DUMP
 	bool random;
···

 	unsigned long nr_segments;
 	struct kexec_segment segment[KEXEC_SEGMENT_MAX];
+	struct page *segment_cma[KEXEC_SEGMENT_MAX];

 	struct list_head control_pages;
 	struct list_head dest_pages;
···
 	 */
 	unsigned int hotplug_support:1;
 #endif
+	unsigned int no_cma:1;

 #ifdef ARCH_HAS_KIMAGE_ARCH
 	struct kimage_arch arch;
+1
include/uapi/linux/kexec.h
···
 #define KEXEC_FILE_ON_CRASH	0x00000002
 #define KEXEC_FILE_NO_INITRAMFS	0x00000004
 #define KEXEC_FILE_DEBUG	0x00000008
+#define KEXEC_FILE_NO_CMA	0x00000010

 /* These values match the ELF architecture values.
  * Unless there is a good reason that should continue to be the case.
+1 -1
kernel/kexec.c
···
 		goto out;

 	for (i = 0; i < nr_segments; i++) {
-		ret = kimage_load_segment(image, &image->segment[i]);
+		ret = kimage_load_segment(image, i);
 		if (ret)
 			goto out;
 	}
+92 -8
kernel/kexec_core.c
···
 #include <linux/hugetlb.h>
 #include <linux/objtool.h>
 #include <linux/kmsg_dump.h>
+#include <linux/dma-map-ops.h>

 #include <asm/page.h>
 #include <asm/sections.h>
···
 	kimage_free_pages(page);
 }

+static void kimage_free_cma(struct kimage *image)
+{
+	unsigned long i;
+
+	for (i = 0; i < image->nr_segments; i++) {
+		struct page *cma = image->segment_cma[i];
+		u32 nr_pages = image->segment[i].memsz >> PAGE_SHIFT;
+
+		if (!cma)
+			continue;
+
+		arch_kexec_pre_free_pages(page_address(cma), nr_pages);
+		dma_release_from_contiguous(NULL, cma, nr_pages);
+		image->segment_cma[i] = NULL;
+	}
+
+}
+
 void kimage_free(struct kimage *image)
 {
 	kimage_entry_t *ptr, entry;
···

 	/* Free the kexec control pages... */
 	kimage_free_page_list(&image->control_pages);
+
+	/* Free CMA allocations */
+	kimage_free_cma(image);

 	/*
 	 * Free up any temporary buffers allocated. This might hit if
···
 	return page;
 }

-static int kimage_load_normal_segment(struct kimage *image,
-					 struct kexec_segment *segment)
+static int kimage_load_cma_segment(struct kimage *image, int idx)
 {
+	struct kexec_segment *segment = &image->segment[idx];
+	struct page *cma = image->segment_cma[idx];
+	char *ptr = page_address(cma);
+	unsigned long maddr;
+	size_t ubytes, mbytes;
+	int result = 0;
+	unsigned char __user *buf = NULL;
+	unsigned char *kbuf = NULL;
+
+	if (image->file_mode)
+		kbuf = segment->kbuf;
+	else
+		buf = segment->buf;
+	ubytes = segment->bufsz;
+	mbytes = segment->memsz;
+	maddr = segment->mem;
+
+	/* Then copy from source buffer to the CMA one */
+	while (mbytes) {
+		size_t uchunk, mchunk;
+
+		ptr += maddr & ~PAGE_MASK;
+		mchunk = min_t(size_t, mbytes,
+			       PAGE_SIZE - (maddr & ~PAGE_MASK));
+		uchunk = min(ubytes, mchunk);
+
+		if (uchunk) {
+			/* For file based kexec, source pages are in kernel memory */
+			if (image->file_mode)
+				memcpy(ptr, kbuf, uchunk);
+			else
+				result = copy_from_user(ptr, buf, uchunk);
+			ubytes -= uchunk;
+			if (image->file_mode)
+				kbuf += uchunk;
+			else
+				buf += uchunk;
+		}
+
+		if (result) {
+			result = -EFAULT;
+			goto out;
+		}
+
+		ptr += mchunk;
+		maddr += mchunk;
+		mbytes -= mchunk;
+
+		cond_resched();
+	}
+
+	/* Clear any remainder */
+	memset(ptr, 0, mbytes);
+
+out:
+	return result;
+}
+
+static int kimage_load_normal_segment(struct kimage *image, int idx)
+{
+	struct kexec_segment *segment = &image->segment[idx];
 	unsigned long maddr;
 	size_t ubytes, mbytes;
 	int result;
···
 	ubytes = segment->bufsz;
 	mbytes = segment->memsz;
 	maddr = segment->mem;
+
+	if (image->segment_cma[idx])
+		return kimage_load_cma_segment(image, idx);

 	result = kimage_set_destination(image, maddr);
 	if (result < 0)
···
 }

 #ifdef CONFIG_CRASH_DUMP
-static int kimage_load_crash_segment(struct kimage *image,
-					struct kexec_segment *segment)
+static int kimage_load_crash_segment(struct kimage *image, int idx)
 {
 	/* For crash dumps kernels we simply copy the data from
 	 * user space to it's destination.
 	 * We do things a page at a time for the sake of kmap.
 	 */
+	struct kexec_segment *segment = &image->segment[idx];
 	unsigned long maddr;
 	size_t ubytes, mbytes;
 	int result;
···
 }
 #endif

-int kimage_load_segment(struct kimage *image,
-				struct kexec_segment *segment)
+int kimage_load_segment(struct kimage *image, int idx)
 {
 	int result = -ENOMEM;

 	switch (image->type) {
 	case KEXEC_TYPE_DEFAULT:
-		result = kimage_load_normal_segment(image, segment);
+		result = kimage_load_normal_segment(image, idx);
 		break;
 #ifdef CONFIG_CRASH_DUMP
 	case KEXEC_TYPE_CRASH:
-		result = kimage_load_crash_segment(image, segment);
+		result = kimage_load_crash_segment(image, idx);
 		break;
 #endif
 	}
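
The idea behind loading straight into a contiguous destination can be sketched in user space as follows. This is a simplified model and not the kernel function: it assumes a page-aligned destination, a 4 KiB page size, and plain memcpy() instead of copy_from_user(); all names in it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Copy `bufsz` source bytes into a contiguous destination of `memsz`
 * bytes, one page-sized chunk at a time, and zero whatever the source
 * does not cover. Returns 0 on success, -1 if the source is too large.
 */
static int load_contiguous_segment(unsigned char *dst, size_t memsz,
				   const unsigned char *src, size_t bufsz)
{
	size_t remaining_mem = memsz;
	size_t remaining_src = bufsz;

	if (bufsz > memsz)
		return -1;

	while (remaining_mem) {
		size_t mchunk = remaining_mem < MODEL_PAGE_SIZE ?
				remaining_mem : MODEL_PAGE_SIZE;
		size_t uchunk = remaining_src < mchunk ? remaining_src : mchunk;

		if (uchunk)
			memcpy(dst, src, uchunk);		/* payload bytes */
		memset(dst + uchunk, 0, mchunk - uchunk);	/* zero padding */

		dst += mchunk;
		src += uchunk;
		remaining_mem -= mchunk;
		remaining_src -= uchunk;
	}
	return 0;
}

/* Usage example: a 5000-byte payload placed in a three-page segment. */
static int demo_segment_copy(void)
{
	static unsigned char src[5000], dst[3 * MODEL_PAGE_SIZE];

	memset(src, 0xAB, sizeof(src));
	memset(dst, 0xFF, sizeof(dst));
	if (load_contiguous_segment(dst, sizeof(dst), src, sizeof(src)))
		return -1;
	return dst[0] == 0xAB && dst[4999] == 0xAB &&
	       dst[5000] == 0 && dst[sizeof(dst) - 1] == 0;
}
```

Because the destination is already the final location, nothing is queued for the relocation code; the data is in place once the loop finishes.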
+50 -1
kernel/kexec_file.c
···
 #include <linux/kernel_read_file.h>
 #include <linux/syscalls.h>
 #include <linux/vmalloc.h>
+#include <linux/dma-map-ops.h>
 #include "kexec_internal.h"

 #ifdef CONFIG_KEXEC_SIG
···
 		ret = 0;
 	}

+	image->no_cma = !!(flags & KEXEC_FILE_NO_CMA);
+
 	if (cmdline_len) {
 		image->cmdline_buf = memdup_user(cmdline_ptr, cmdline_len);
 		if (IS_ERR(image->cmdline_buf)) {
···
 			      i, ksegment->buf, ksegment->bufsz, ksegment->mem,
 			      ksegment->memsz);

-		ret = kimage_load_segment(image, &image->segment[i]);
+		ret = kimage_load_segment(image, i);
 		if (ret)
 			goto out;
 	}
···
 	return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }

+static int kexec_alloc_contig(struct kexec_buf *kbuf)
+{
+	size_t nr_pages = kbuf->memsz >> PAGE_SHIFT;
+	unsigned long mem;
+	struct page *p;
+
+	/* User space disabled CMA allocations, bail out. */
+	if (kbuf->image->no_cma)
+		return -EPERM;
+
+	/* Skip CMA logic for crash kernel */
+	if (kbuf->image->type == KEXEC_TYPE_CRASH)
+		return -EPERM;
+
+	p = dma_alloc_from_contiguous(NULL, nr_pages, get_order(kbuf->buf_align), true);
+	if (!p)
+		return -ENOMEM;
+
+	pr_debug("allocated %zu DMA pages at 0x%lx", nr_pages, page_to_boot_pfn(p));
+
+	mem = page_to_boot_pfn(p) << PAGE_SHIFT;
+
+	if (kimage_is_destination_range(kbuf->image, mem, mem + kbuf->memsz)) {
+		/* Our region is already in use by a statically defined one. Bail out. */
+		pr_debug("CMA overlaps existing mem: 0x%lx+0x%lx\n", mem, kbuf->memsz);
+		dma_release_from_contiguous(NULL, p, nr_pages);
+		return -EBUSY;
+	}
+
+	kbuf->mem = page_to_boot_pfn(p) << PAGE_SHIFT;
+	kbuf->cma = p;
+
+	arch_kexec_post_alloc_pages(page_address(p), (int)nr_pages, 0);
+
+	return 0;
+}
+
 /**
  * kexec_locate_mem_hole - find free memory for the purgatory or the next kernel
  * @kbuf:	Parameters for the memory search.
···
 	ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback);
 	if (ret <= 0)
 		return ret;
+
+	/*
+	 * Try to find a free physically contiguous block of memory first. With that, we
+	 * can avoid any copying at kexec time.
+	 */
+	if (!kexec_alloc_contig(kbuf))
+		return 0;

 	if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
···
 	/* Ensure minimum alignment needed for segments. */
 	kbuf->memsz = ALIGN(kbuf->memsz, PAGE_SIZE);
 	kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);
+	kbuf->cma = NULL;

 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
 	ret = arch_kexec_locate_mem_hole(kbuf);
···
 	ksegment->bufsz = kbuf->bufsz;
 	ksegment->mem = kbuf->mem;
 	ksegment->memsz = kbuf->memsz;
+	kbuf->image->segment_cma[kbuf->image->nr_segments] = kbuf->cma;
 	kbuf->image->nr_segments++;
 	return 0;
 }
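
The decision flow of kexec_alloc_contig() above can be modelled in user space roughly as follows. The error codes mirror the ones in the diff; the allocator callback, the range table, and every type and function name here are hypothetical stand-ins for the kernel's CMA allocator and kimage_is_destination_range().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum image_type { IMAGE_DEFAULT, IMAGE_CRASH };

/* Half-open physical range [start, end) already claimed by a segment. */
struct range {
	unsigned long start;
	unsigned long end;
};

struct request {
	enum image_type type;
	bool no_cma;		/* mirrors the KEXEC_FILE_NO_CMA flag */
	unsigned long memsz;
};

/* Hypothetical stand-in for the CMA allocator; returns 0 on failure. */
typedef unsigned long (*contig_alloc_fn)(unsigned long size);

static unsigned long alloc_at_1g(unsigned long size)
{
	(void)size;
	return 0x40000000UL;
}

static unsigned long alloc_fails(unsigned long size)
{
	(void)size;
	return 0;
}

static bool overlaps(const struct range *busy, size_t n,
		     unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < n; i++)
		if (start < busy[i].end && busy[i].start < end)
			return true;
	return false;
}

/* Returns 0 and sets *mem on success; a negative errno means "fall back". */
static int try_contig(const struct request *req, contig_alloc_fn alloc,
		      const struct range *busy, size_t nbusy,
		      unsigned long *mem)
{
	unsigned long addr;

	if (req->no_cma || req->type == IMAGE_CRASH)
		return -EPERM;

	addr = alloc(req->memsz);
	if (!addr)
		return -ENOMEM;

	/* The kernel code releases the allocation before bailing out here. */
	if (overlaps(busy, nbusy, addr, addr + req->memsz))
		return -EBUSY;

	*mem = addr;
	return 0;
}

/* Usage example: the chosen address, or 0 if the caller must fall back. */
static unsigned long place_or_zero(const struct request *req,
				   contig_alloc_fn alloc,
				   const struct range *busy, size_t nbusy)
{
	unsigned long mem = 0;

	return try_contig(req, alloc, busy, nbusy, &mem) ? 0 : mem;
}
```

Any non-zero return leaves the request untouched, so the caller can continue with the existing hole search exactly as before the patch.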
+1 -1
kernel/kexec_internal.h
···
 int sanity_check_segment_list(struct kimage *image);
 void kimage_free_page_list(struct list_head *list);
 void kimage_free(struct kimage *image);
-int kimage_load_segment(struct kimage *image, struct kexec_segment *segment);
+int kimage_load_segment(struct kimage *image, int idx);
 void kimage_terminate(struct kimage *image);
 int kimage_is_destination_range(struct kimage *image,
 				unsigned long start, unsigned long end);
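
One way to read the switch from a segment pointer to a segment index is that the index lets the loader consult a second array kept in parallel with segment[]. The sketch below models that with hypothetical user-space types; it illustrates the indexing pattern only and is not the kernel's data layout.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SEGMENT_MAX 16	/* assumption: small fixed table for the model */

struct model_segment {
	unsigned long mem;
	unsigned long memsz;
};

struct model_image {
	size_t nr_segments;
	struct model_segment segment[MODEL_SEGMENT_MAX];
	/* Parallel to segment[]: non-NULL when that segment is CMA backed. */
	const void *segment_cma[MODEL_SEGMENT_MAX];
};

enum load_path { LOAD_VIA_PAGE_LIST, LOAD_DIRECT };

/* Record a segment and its optional backing; returns its index or -1 if full. */
static int model_add_segment(struct model_image *img, unsigned long mem,
			     unsigned long memsz, const void *cma)
{
	size_t idx = img->nr_segments;

	if (idx >= MODEL_SEGMENT_MAX)
		return -1;
	img->segment[idx].mem = mem;
	img->segment[idx].memsz = memsz;
	img->segment_cma[idx] = cma;	/* stored before the count is bumped */
	img->nr_segments++;
	return (int)idx;
}

/* The same index selects both the segment and its backing entry. */
static enum load_path model_load_path(const struct model_image *img, int idx)
{
	return img->segment_cma[idx] ? LOAD_DIRECT : LOAD_VIA_PAGE_LIST;
}

/* Usage example: one ordinary segment and one contiguous-backed segment. */
static int demo_paths(void)
{
	static char backing[4096];
	struct model_image img = { 0 };
	int a = model_add_segment(&img, 0x100000UL, 4096, NULL);
	int b = model_add_segment(&img, 0x200000UL, 4096, backing);

	return model_load_path(&img, a) == LOAD_VIA_PAGE_LIST &&
	       model_load_path(&img, b) == LOAD_DIRECT;
}
```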