Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

resource: improve child resource handling in release_mem_region_adjustable()

When a memory block is removed via try_remove_memory(), it eventually
reaches release_mem_region_adjustable(). The current implementation
assumes that when a busy memory resource is split into two, all child
resources remain in the lower address range.

This simplification causes problems when child resources actually belong
to the upper split. For example:

* Initial memory layout:
  lsmem
  RANGE                                  SIZE STATE  REMOVABLE BLOCK
  0x0000000000000000-0x00000002ffffffff   12G online yes       0-95

* /proc/iomem
  00000000-2dfefffff : System RAM
    158834000-1597b3fff : Kernel code
    1597b4000-159f50fff : Kernel data
    15a13c000-15a218fff : Kernel bss
  2dff00000-2ffefffff : Crash kernel
  2fff00000-2ffffffff : System RAM

* After offlining and removing range
  0x150000000-0x157ffffff
  lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED
  (output according to upcoming lsmem changes with the configured column:
  s390)
  RANGE                                  SIZE STATE   BLOCK CONFIGURED
  0x0000000000000000-0x000000014fffffff  5.3G online  0-41  yes
  0x0000000150000000-0x0000000157ffffff  128M offline 42    no
  0x0000000158000000-0x00000002ffffffff  6.6G online  43-95 yes

The iomem resource gets split into two entries, but kernel code, kernel
data, and kernel bss remain attached to the lower resource [0–5376M]
instead of the correct upper resource [5504M–12288M].
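
The bracketed MiB values follow directly from the hex ranges in the lsmem
output. The sketch below redoes that arithmetic and checks which side of the
removed block the "Kernel code" entry starts on. The helper names to_mib()
and starts_above_hole() are hypothetical and exist only for this
illustration; they are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Addresses copied from the lsmem and /proc/iomem output above. */
#define REMOVED_START 0x150000000ULL /* first byte of the removed block */
#define REMOVED_END   0x157ffffffULL /* last byte of the removed block */
#define RAM_END       0x2ffffffffULL /* last byte reported by lsmem */
#define KERNEL_CODE   0x158834000ULL /* start of the "Kernel code" entry */

/* Hypothetical helper: convert a byte address to whole MiB (1 MiB = 2^20). */
static uint64_t to_mib(uint64_t addr)
{
	return addr >> 20;
}

/*
 * Hypothetical helper: true if a child starting at @start begins above the
 * removed block whose last byte is @hole_end.
 */
static bool starts_above_hole(uint64_t start, uint64_t hole_end)
{
	assert(hole_end < UINT64_MAX); /* the upper half must be non-empty */
	return start > hole_end;
}
```

With these values, to_mib(REMOVED_START) is 5376, to_mib(REMOVED_END + 1) is
5504 and to_mib(RAM_END + 1) is 12288, and the kernel code entry starts above
the removed block, so it belongs to the upper resource.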

As a result, WARN_ON() triggers in release_mem_region_adjustable()
("Usecase: split into two entries - we need a new resource")
------------[ cut here ]------------
WARNING: CPU: 5 PID: 858 at kernel/resource.c:1486
release_mem_region_adjustable+0x210/0x280
Modules linked in:
CPU: 5 UID: 0 PID: 858 Comm: chmem Not tainted 6.17.0-rc2-11707-g2c36aaf3ba4e
Hardware name: IBM 3906 M04 704 (z/VM 7.3.0)
Krnl PSW : 0704d00180000000 0000024ec0dae0e4
(release_mem_region_adjustable+0x214/0x280)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000000 00000002ffffafc0 fffffffffffffff0 0000000000000000
000000014fffffff 0000024ec2257608 0000000000000000 0000024ec2301758
0000024ec22680d0 00000000902c9140 0000000150000000 00000002ffffafc0
000003ffa61d8d18 0000024ec21fb478 0000024ec0dae014 000001cec194fbb0
Krnl Code: 0000024ec0dae0d8: af000000 mc 0,0
0000024ec0dae0dc: a7f4ffc1 brc 15,0000024ec0dae05e
#0000024ec0dae0e0: af000000 mc 0,0
>0000024ec0dae0e4: a5defffd llilh %r13,65533
0000024ec0dae0e8: c04000c6064c larl %r4,0000024ec266ed80
0000024ec0dae0ee: eb1d400000f8 laa %r1,%r13,0(%r4)
0000024ec0dae0f4: 07e0 bcr 14,%r0
0000024ec0dae0f6: a7f4ffc0 brc 15,0000024ec0dae076

[<0000024ec0dae0e4>] release_mem_region_adjustable+0x214/0x280
([<0000024ec0dadf3c>] release_mem_region_adjustable+0x6c/0x280)
[<0000024ec10a2130>] try_remove_memory+0x100/0x140
[<0000024ec10a4052>] __remove_memory+0x22/0x40
[<0000024ec18890f6>] config_mblock_store+0x326/0x3e0
[<0000024ec11f7056>] kernfs_fop_write_iter+0x136/0x210
[<0000024ec1121e86>] vfs_write+0x236/0x3c0
[<0000024ec11221b8>] ksys_write+0x78/0x110
[<0000024ec1b6bfbe>] __do_syscall+0x12e/0x350
[<0000024ec1b782ce>] system_call+0x6e/0x90
Last Breaking-Event-Address:
[<0000024ec0dae014>] release_mem_region_adjustable+0x144/0x280
---[ end trace 0000000000000000 ]---

Also, resource adjustment doesn't happen and stale resources still cover
[0-12288M]. Later, memory re-add fails in register_memory_resource() with
-EBUSY.

That is, /proc/iomem still shows:
  00000000-2dfefffff : System RAM
    158834000-1597b3fff : Kernel code
    1597b4000-159f50fff : Kernel data
    15a13c000-15a218fff : Kernel bss
  2dff00000-2ffefffff : Crash kernel
  2fff00000-2ffffffff : System RAM

Enhance release_mem_region_adjustable() to reassign child resources to the
correct parent after a split. Children are now assigned based on their
actual range: If they fall within the lower split, keep them in the lower
parent. If they fall within the upper split, move them to the upper
parent.

Kernel code/data/bss regions are not offlined, so they will always reside
entirely within one parent and never span across both.
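
The "never span across both" assumption can be spelled out as a three-way
classification of a child against the removed block. This is only a sketch:
the enum and classify_child() are hypothetical names, and the patch itself
compares only the child's start address against the split address instead of
checking for overlap.

```c
#include <stdint.h>

/* Hypothetical result type for this illustration. */
enum child_side {
	CHILD_LOWER,		/* entirely below the removed block */
	CHILD_UPPER,		/* entirely above the removed block */
	CHILD_OVERLAPS_HOLE,	/* touches the removed block: not handled */
};

/* All bounds are inclusive, matching the ranges shown in /proc/iomem. */
static enum child_side classify_child(uint64_t child_start, uint64_t child_end,
				      uint64_t hole_start, uint64_t hole_end)
{
	if (child_end < hole_start)
		return CHILD_LOWER;
	if (child_start > hole_end)
		return CHILD_UPPER;
	return CHILD_OVERLAPS_HOLE;
}
```

For the second example in this message (removed range
0x1e8000000-0x1efffffff), the kernel code entry 1f94f8000-1fa477fff
classifies as CHILD_UPPER.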

Output after the enhancement:
* Initial state /proc/iomem (before removal of memory block):
  00000000-2dfefffff : System RAM
    1f94f8000-1fa477fff : Kernel code
    1fa478000-1fac14fff : Kernel data
    1fae00000-1faedcfff : Kernel bss
  2dff00000-2ffefffff : Crash kernel
  2fff00000-2ffffffff : System RAM

* Offline and remove 0x1e8000000-0x1efffffff memory range
* /proc/iomem
  00000000-1e7ffffff : System RAM
  1f0000000-2dfefffff : System RAM
    1f94f8000-1fa477fff : Kernel code
    1fa478000-1fac14fff : Kernel data
    1fae00000-1faedcfff : Kernel bss
  2dff00000-2ffefffff : Crash kernel
  2fff00000-2ffffffff : System RAM

Link: https://lkml.kernel.org/r/20250912123021.3219980-1-sumanthk@linux.ibm.com
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Sumanth Korikkar and committed by Andrew Morton
eea5706c 5ea8ab7f

+45 -5
kernel/resource.c
···
 EXPORT_SYMBOL(__release_region);
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
+static void append_child_to_parent(struct resource *new_parent, struct resource *new_child)
+{
+	struct resource *child;
+
+	child = new_parent->child;
+	if (child) {
+		while (child->sibling)
+			child = child->sibling;
+		child->sibling = new_child;
+	} else {
+		new_parent->child = new_child;
+	}
+	new_child->parent = new_parent;
+	new_child->sibling = NULL;
+}
+
+/*
+ * Reparent all child resources that no longer belong to "low" after a split to
+ * "high". Note that "high" does not have any children, because "low" is the
+ * original resource and "high" is a new resource. Treat "low" as the original
+ * resource being split and defer its range adjustment to __adjust_resource().
+ */
+static void reparent_children_after_split(struct resource *low,
+					  struct resource *high,
+					  resource_size_t split_addr)
+{
+	struct resource *child, *next, **p;
+
+	p = &low->child;
+	while ((child = *p)) {
+		next = child->sibling;
+		if (child->start > split_addr) {
+			/* unlink child */
+			*p = next;
+			append_child_to_parent(high, child);
+		} else {
+			p = &child->sibling;
+		}
+	}
+}
+
 /**
  * release_mem_region_adjustable - release a previously reserved memory region
  * @start: resource start address
···
  * is released from a currently busy memory resource. The requested region
  * must either match exactly or fit into a single busy resource entry. In
  * the latter case, the remaining resource is adjusted accordingly.
- * Existing children of the busy memory resource must be immutable in the
- * request.
  *
  * Note:
  * - Additional release conditions, such as overlapping region, can be
  *   supported after they are confirmed as valid cases.
- * - When a busy memory resource gets split into two entries, the code
- *   assumes that all children remain in the lower address entry for
- *   simplicity. Enhance this logic when necessary.
+ * - When a busy memory resource gets split into two entries, its children are
+ *   reassigned to the correct parent based on their range. If a child memory
+ *   resource overlaps with more than one parent, enhance the logic as needed.
  */
 void release_mem_region_adjustable(resource_size_t start, resource_size_t size)
 {
···
 			new_res->parent = res->parent;
 			new_res->sibling = res->sibling;
 			new_res->child = NULL;
+			reparent_children_after_split(res, new_res, end);
 
 			if (WARN_ON_ONCE(__adjust_resource(res, res->start,
 							   start - res->start)))
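
The list handling in the two new helpers can be tried outside the kernel with
a small userspace model. This is a sketch under stated assumptions, not the
kernel code: struct res is a simplified stand-in for struct resource, the
function names are shortened, and split_addr is assumed to be the inclusive
last address of the removed range, as the `>` comparison in the patch
suggests. demo_split() is a hypothetical driver that rebuilds the first
example from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct resource (assumption). */
struct res {
	uint64_t start, end;	/* inclusive bounds */
	struct res *parent, *sibling, *child;
};

/* Append @new_child to the end of @new_parent's child list. */
static void append_child(struct res *new_parent, struct res *new_child)
{
	struct res **p = &new_parent->child;

	while (*p)
		p = &(*p)->sibling;
	*p = new_child;
	new_child->parent = new_parent;
	new_child->sibling = NULL;
}

/* Move every child of @low that starts above @split_addr over to @high. */
static void reparent_after_split(struct res *low, struct res *high,
				 uint64_t split_addr)
{
	struct res *child, *next, **p = &low->child;

	while ((child = *p)) {
		next = child->sibling;
		if (child->start > split_addr) {
			*p = next;	/* unlink child from low */
			append_child(high, child);
		} else {
			p = &child->sibling;
		}
	}
}

/* Count the children currently linked under @parent. */
static size_t count_children(const struct res *parent)
{
	size_t n = 0;

	for (const struct res *c = parent->child; c; c = c->sibling)
		n++;
	return n;
}

/*
 * Hypothetical driver: one System RAM resource with the three kernel
 * children from the first example, split around 0x150000000-0x157ffffff.
 * Returns the number of children that end up under the upper resource.
 */
static size_t demo_split(void)
{
	struct res low  = { .start = 0x0, .end = 0x2dfefffffULL };
	struct res code = { .start = 0x158834000ULL, .end = 0x1597b3fffULL };
	struct res data = { .start = 0x1597b4000ULL, .end = 0x159f50fffULL };
	struct res bss  = { .start = 0x15a13c000ULL, .end = 0x15a218fffULL };
	struct res high = { .start = 0x158000000ULL, .end = 0x2dfefffffULL };

	append_child(&low, &code);
	append_child(&low, &data);
	append_child(&low, &bss);

	reparent_after_split(&low, &high, 0x157ffffffULL);
	low.end = 0x14fffffffULL; /* the kernel defers this to __adjust_resource() */

	/* Sibling order is preserved and nothing is left under the lower half. */
	assert(low.child == NULL);
	assert(high.child == &code && code.sibling == &data);
	assert(data.sibling == &bss && bss.sibling == NULL);
	assert(bss.parent == &high);
	return count_children(&high);
}
```

Walking the list through a pointer to the link field (`**p`) lets a child be
unlinked without tracking the previous sibling separately, which is the same
idiom the patch uses.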