Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

x86/efi: defer freeing of boot services memory

efi_free_boot_services() frees memory occupied by EFI_BOOT_SERVICES_CODE
and EFI_BOOT_SERVICES_DATA using memblock_free_late().

There are two issues with that: memblock_free_late() should be used for
memory allocated with memblock_alloc(), while memory reserved with
memblock_reserve() should be freed with free_reserved_area().

More acutely, with CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
efi_free_boot_services() is called before deferred initialization of the
memory map is complete.

Benjamin Herrenschmidt reports that this causes a leak of ~140MB of
RAM on EC2 t3a.nano instances, which only have 512MB of RAM.

If the freed memory resides in areas whose memory map is still
uninitialized, it won't actually be freed, because
memblock_free_late() calls memblock_free_pages() and the latter skips
uninitialized pages.
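
The leak described above can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: struct toy_page,
toy_free_range() and toy_count_reserved() are hypothetical names invented
for illustration and are not kernel APIs. The model captures just one
behaviour from the description, namely that a free pass which skips
uninitialized pages leaves them reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory map entry; not the kernel's struct page. */
struct toy_page {
	bool initialized;	/* has deferred init reached this page yet? */
	bool reserved;		/* still withheld from the allocator? */
};

/*
 * Try to free pages [first, first + count). Pages whose map entry is not
 * initialized yet are skipped and therefore stay reserved.
 * Returns the number of pages actually freed.
 */
static size_t toy_free_range(struct toy_page *pages, size_t first, size_t count)
{
	size_t freed = 0;

	assert(pages != NULL);
	for (size_t i = first; i < first + count; i++) {
		if (!pages[i].initialized)
			continue;	/* skipped: this page is never revisited */
		pages[i].reserved = false;
		freed++;
	}
	return freed;
}

/* Count the pages that are still reserved, i.e. leaked in this model. */
static size_t toy_count_reserved(const struct toy_page *pages, size_t count)
{
	size_t reserved = 0;

	assert(pages != NULL);
	for (size_t i = 0; i < count; i++)
		if (pages[i].reserved)
			reserved++;
	return reserved;
}
```

In this model, freeing eight reserved pages while only the first three are
initialized frees three and leaves five reserved, even after the remaining
entries are initialized later.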

Using free_reserved_area() at this point is also problematic because
__free_page() accesses the buddy of the freed page, and that again might
end up in an uninitialized part of the memory map.

Delaying the entire efi_free_boot_services() could be problematic
because, in addition to freeing boot services memory, it updates
efi.memmap without any synchronization, and that's undesirable late in
boot when there is concurrency.

A more robust approach is to defer only the freeing of the EFI boot
services memory.

Split efi_free_boot_services() in two. First, efi_unmap_boot_services()
collects the ranges that should be freed into an array; then
efi_free_boot_services() frees them later, after deferred init is
complete.
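
The collect-now, free-later split can be sketched in plain userspace C.
This is not the kernel code: struct range_queue and the range_queue_*()
helpers are hypothetical names, and unlike the patch this sketch tracks an
explicit count instead of a zero terminator. It only shows the shape of
the two phases: an early pass that records ranges, and a late pass that
releases them once it is safe to do so.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a queued physical range. */
struct freeable_range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

struct range_queue {
	struct freeable_range *ranges;
	size_t count;
	size_t capacity;
};

/* Set up storage for at most max_ranges entries. Returns 0 on success. */
static int range_queue_init(struct range_queue *q, size_t max_ranges)
{
	q->ranges = calloc(max_ranges ? max_ranges : 1, sizeof(*q->ranges));
	q->count = 0;
	q->capacity = q->ranges ? max_ranges : 0;
	return q->ranges ? 0 : -1;
}

/* Phase 1 (early): record the range instead of freeing it right away. */
static int range_queue_add(struct range_queue *q, uint64_t start, uint64_t size)
{
	assert(q != NULL);
	if (!q->ranges || q->count >= q->capacity)
		return -1;
	q->ranges[q->count].start = start;
	q->ranges[q->count].end = start + size;
	q->count++;
	return 0;
}

/*
 * Phase 2 (late): hand every queued range to release(), if one is given,
 * then drop the queue. Returns the total number of bytes covered.
 */
static uint64_t range_queue_drain(struct range_queue *q,
				  void (*release)(uint64_t start, uint64_t end))
{
	uint64_t freed = 0;

	assert(q != NULL);
	for (size_t i = 0; i < q->count; i++) {
		if (release)
			release(q->ranges[i].start, q->ranges[i].end);
		freed += q->ranges[i].end - q->ranges[i].start;
	}
	free(q->ranges);
	q->ranges = NULL;
	q->count = 0;
	q->capacity = 0;
	return freed;
}
```

A caller would run range_queue_add() from the early pass and
range_queue_drain() from a later hook, which is the role the patch gives
to an initcall.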

Link: https://lore.kernel.org/all/ec2aaef14783869b3be6e3c253b2dcbf67dbc12a.camel@kernel.crashing.org
Fixes: 916f676f8dc0 ("x86, efi: Retain boot service code until after switching to virtual mode")
Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Authored by Mike Rapoport (Microsoft), committed by Ard Biesheuvel
a4b0bf6a 6de23f81

+55 -6
+1 -1
arch/x86/include/asm/efi.h
@@ -138,7 +138,7 @@
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
 extern void efi_delete_dummy_variable(void);
 extern void efi_crash_gracefully_on_page_fault(unsigned long phys_addr);
-extern void efi_free_boot_services(void);
+extern void efi_unmap_boot_services(void);
 
 void arch_efi_call_virt_setup(void);
 void arch_efi_call_virt_teardown(void);
+1 -1
arch/x86/platform/efi/efi.c
@@ -836,7 +836,7 @@
 	}
 
 	efi_check_for_embedded_firmwares();
-	efi_free_boot_services();
+	efi_unmap_boot_services();
 
 	if (!efi_is_mixed())
 		efi_native_runtime_setup();
+52 -3
arch/x86/platform/efi/quirks.c
@@ -341,7 +341,7 @@
 
 		/*
 		 * Because the following memblock_reserve() is paired
-		 * with memblock_free_late() for this region in
+		 * with free_reserved_area() for this region in
 		 * efi_free_boot_services(), we must be extremely
 		 * careful not to reserve, and subsequently free,
 		 * critical regions of memory (like the kernel image) or
@@ -404,16 +404,32 @@
 		pr_err("Failed to unmap VA mapping for 0x%llx\n", va);
 }
 
-void __init efi_free_boot_services(void)
+struct efi_freeable_range {
+	u64 start;
+	u64 end;
+};
+
+static struct efi_freeable_range *ranges_to_free;
+
+void __init efi_unmap_boot_services(void)
 {
 	struct efi_memory_map_data data = { 0 };
 	efi_memory_desc_t *md;
 	int num_entries = 0;
+	int idx = 0;
+	size_t sz;
 	void *new, *new_md;
 
 	/* Keep all regions for /sys/kernel/debug/efi */
 	if (efi_enabled(EFI_DBG))
 		return;
+
+	sz = sizeof(*ranges_to_free) * efi.memmap.nr_map + 1;
+	ranges_to_free = kzalloc(sz, GFP_KERNEL);
+	if (!ranges_to_free) {
+		pr_err("Failed to allocate storage for freeable EFI regions\n");
+		return;
+	}
 
 	for_each_efi_memory_desc(md) {
 		unsigned long long start = md->phys_addr;
@@ -471,7 +487,15 @@
 			start = SZ_1M;
 		}
 
-		memblock_free_late(start, size);
+		/*
+		 * With CONFIG_DEFERRED_STRUCT_PAGE_INIT parts of the memory
+		 * map are still not initialized and we can't reliably free
+		 * memory here.
+		 * Queue the ranges to free at a later point.
+		 */
+		ranges_to_free[idx].start = start;
+		ranges_to_free[idx].end = start + size;
+		idx++;
 	}
 
 	if (!num_entries)
@@ -511,6 +535,31 @@
 		return;
 	}
 }
+
+static int __init efi_free_boot_services(void)
+{
+	struct efi_freeable_range *range = ranges_to_free;
+	unsigned long freed = 0;
+
+	if (!ranges_to_free)
+		return 0;
+
+	while (range->start) {
+		void *start = phys_to_virt(range->start);
+		void *end = phys_to_virt(range->end);
+
+		free_reserved_area(start, end, -1, NULL);
+		freed += (end - start);
+		range++;
+	}
+	kfree(ranges_to_free);
+
+	if (freed)
+		pr_info("Freeing EFI boot services memory: %ldK\n", freed / SZ_1K);
+
+	return 0;
+}
+arch_initcall(efi_free_boot_services);
 
 /*
  * A number of config table entries get remapped to virtual addresses
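
The consumer loop in this file stops at the first entry whose start is
zero, so the array depends on a zeroed terminator entry after the last
queued range. As a side note on C operator precedence only, and not a
statement about what the kernel tree does beyond the hunk shown here, the
sketch below compares the size expression as written with one that leaves
room for a whole extra entry. The helper names are hypothetical, and
struct freeable_range is a userspace stand-in with the same two 64-bit
fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in with the same layout as the two-field range struct. */
struct freeable_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Size expression as written in the hunk: multiplication binds tighter
 * than addition, so this is nr_map entries plus one byte.
 */
static size_t size_as_written(size_t nr_map)
{
	assert(nr_map < SIZE_MAX / sizeof(struct freeable_range) - 1);
	return sizeof(struct freeable_range) * nr_map + 1;
}

/* Size that leaves room for one full zeroed terminator entry. */
static size_t size_with_terminator(size_t nr_map)
{
	assert(nr_map < SIZE_MAX / sizeof(struct freeable_range) - 1);
	return sizeof(struct freeable_range) * (nr_map + 1);
}
```

The difference only matters if every descriptor ends up queued. In the
common case fewer ranges are queued than there are map entries, and the
unused zeroed entries already act as the terminator.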
+1 -1
drivers/firmware/efi/mokvar-table.c
@@ -85,7 +85,7 @@
  * as an alternative to ordinary EFI variables, due to platform-dependent
  * limitations. The memory occupied by this table is marked as reserved.
  *
- * This routine must be called before efi_free_boot_services() in order
+ * This routine must be called before efi_unmap_boot_services() in order
  * to guarantee that it can mark the table as reserved.
  *
  * Implicit inputs: