Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE

Patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when
advised", v5.

This will allow individual processes to opt-out of THP = "always" into THP
= "madvise", without affecting other workloads on the system. This has
been extensively discussed on the mailing list and has been summarized
very well by David in the first patch which also includes the links to
alternatives, please refer to the first patch commit message for the
motivation for this series.

Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this,
along with the MMF changes.

Patch 2 is a cleanup patch for tva_flags that will allow the forced
collapse case to be transmitted to vma_thp_disabled (which is done in
patch 3).

Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE.

Patches 6-7 implement the selftests for PR_SET_THP_DISABLE, covering both
disabling THPs completely (the old behaviour) and enabling them only when
advised (PR_THP_DISABLE_EXCEPT_ADVISED).


This patch (of 7):

People want to make use of more THPs, for example, moving from the "never"
system policy to "madvise", or from "madvise" to "always".

While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: individual processes may need to
opt-out from this behavior for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out.

The following scenarios are imaginable:

(1) Switch from "never" system policy to "madvise"/"always", but keep THPs
disabled for selected workloads.

(2) Stay at "never" system policy, but enable THPs for selected
workloads, making only these workloads use the "madvise" or "always"
policy.

(3) Switch from "madvise" system policy to "always", but keep the
"madvise" policy for selected workloads: allocate THPs only when
advised.

(4) Stay at "madvise" system policy, but enable THPs even when not advised
for selected workloads -- "always" policy.

One can emulate (2) through (1), by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't want
THPs. It requires configuring all workloads, but that is a user-space
problem to sort out.

(4) can be emulated through (3) in a similar way.

Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready yet
(e.g., Redis) were able to just disable THPs completely. Redis
still implements the option to use this interface to disable THPs
completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy. That
essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising. Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes can no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that
poses a problem for relevant workloads, because "no THPs" is certainly
worse than "THPs only when advised".

Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless
explicitly advised by the app through MADV_HUGEPAGE"? *maybe*, but this
would change the documented semantics quite a bit, and the versatility to
use it for debugging purposes, so I am not 100% sure that is what we want
-- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *fewer* THPs for a process.

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

Previously, it would return 1 if THPs were disabled completely. Now
it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
was set.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
the semantics clearly.

Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
disabled completely

Only indicating that THPs are disabled when they are really disabled
completely, not only partially.

For now, we don't add another interface to obtain whether THPs
are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
ever required, we could add a new entry.
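
From user space, points (A) and (B) could be exercised roughly as follows.
This is a sketch under assumptions: the helper names are made up, and the
fallback defines are only there for userspace headers that predate this
patch (the values match include/uapi/linux/prctl.h as changed here).

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/*
 * Request "THPs only when advised" for the calling process.
 * Returns 0 on success or a negative errno. Kernels without this patch
 * reject a non-zero arg3 with EINVAL.
 * (Hypothetical helper name.)
 */
static int thp_only_when_advised(void)
{
	if (prctl(PR_SET_THP_DISABLE, 1UL,
		  (unsigned long)PR_THP_DISABLE_EXCEPT_ADVISED, 0UL, 0UL))
		return -errno;
	return 0;
}

/*
 * Query the current setting: 0 (not disabled), 1 (disabled completely),
 * 3 (disabled except when advised), or a negative errno.
 * (Hypothetical helper name.)
 */
static int thp_disable_state(void)
{
	int ret = prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);

	return ret < 0 ? -errno : ret;
}
```

The sketch deliberately does not fall back to a complete disable when the
flag is rejected, since that would silently change the semantics the caller
asked for.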

The documented semantics in the man page for PR_SET_THP_DISABLE "is
inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained. This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g., systemd
where we fork() a helper process to then exec()).

For now, MADV_COLLAPSE will *fail* in regions that have neither
VM_HUGEPAGE nor VM_NOHUGEPAGE set. As MADV_COLLAPSE is clear advice that
user space thinks a THP is a good idea, we'll enable that separately next
(requiring a bit of cleanup first).

There is currently no way to prevent a process from issuing
PR_SET_THP_DISABLE itself to re-enable THPs. There are no known users
that re-enable them, and doing so is against the purpose of the original
interface. So if ever required, we could investigate forbidding
re-enabling, or making this somehow configurable.

Link: https://lkml.kernel.org/r/20250815135549.130506-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20250815135549.130506-2-usamaarif642@gmail.com
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by David Hildenbrand and committed by Andrew Morton
9dc21bbd e338d835

+82 -29
+3 -2
Documentation/filesystems/proc.rst
···
 HugetlbPages                size of hugetlb memory portions
 CoreDumping                 process's memory is currently being dumped
                             (killing the process may lead to a corrupted core)
-THP_enabled                 process is allowed to use THP (returns 0 when
-                            PR_SET_THP_DISABLE is set on the process
+THP_enabled                 process is allowed to use THP (returns 0 when
+                            PR_SET_THP_DISABLE is set on the process to disable
+                            THP completely, not just partially)
 Threads                     number of threads
 SigQ                        number of signals queued/max. number for queue
 SigPnd                      bitmap of pending signals for the thread
+1 -1
fs/proc/array.c
···
 	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);

 	if (thp_enabled)
-		thp_enabled = !mm_flags_test(MMF_DISABLE_THP, mm);
+		thp_enabled = !mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }

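
A monitoring tool could read the resulting field back from
/proc/<pid>/status. The sketch below is only an illustration (both function
names are assumptions). Note that with this change the field stays at 1 for
a process that set PR_THP_DISABLE_EXCEPT_ADVISED, so it cannot be used to
detect the partial mode.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the THP_enabled value from text in /proc/<pid>/status format.
 * Returns 0 or 1, or -1 if the field is absent (e.g., on older kernels).
 * (Hypothetical helper name.)
 */
static int parse_thp_enabled(const char *status)
{
	static const char key[] = "THP_enabled:";
	const char *line = status;
	int val;

	while (line) {
		if (!strncmp(line, key, sizeof(key) - 1))
			return sscanf(line + sizeof(key) - 1, "%d", &val) == 1 ?
			       val : -1;
		/* Advance to the start of the next line, if there is one. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Report THP_enabled for the calling process, or -1 on any failure. */
static int self_thp_enabled(void)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_thp_enabled(buf);
}
```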
+15 -5
include/linux/huge_mm.h
···
 	(transparent_hugepage_flags &					\
 	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))

+/*
+ * Check whether THPs are explicitly disabled for this VMA, for example,
+ * through madvise or prctl.
+ */
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
 		vm_flags_t vm_flags)
 {
+	/* Are THPs disabled for this VMA? */
+	if (vm_flags & VM_NOHUGEPAGE)
+		return true;
+	/* Are THPs disabled for all VMAs in the whole process? */
+	if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, vma->vm_mm))
+		return true;
 	/*
-	 * Explicitly disabled through madvise or prctl, or some
-	 * architectures may disable THP for some mappings, for
-	 * example, s390 kvm.
+	 * Are THPs disabled only for VMAs where we didn't get an explicit
+	 * advise to use them?
 	 */
-	return (vm_flags & VM_NOHUGEPAGE) ||
-	       mm_flags_test(MMF_DISABLE_THP, vma->vm_mm);
+	if (vm_flags & VM_HUGEPAGE)
+		return false;
+	return mm_flags_test(MMF_DISABLE_THP_EXCEPT_ADVISED, vma->vm_mm);
 }

 static inline bool thp_disabled_by_hw(void)
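
The order of checks in the new vma_thp_disabled() can be modeled in user
space to see which combinations end up with THPs disabled. The structs below
are hypothetical stand-ins for the kernel's VMA flags and mm flags, not
kernel types.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-VMA flags that matter here. */
struct vma_model {
	bool vm_hugepage;	/* VM_HUGEPAGE, set via MADV_HUGEPAGE */
	bool vm_nohugepage;	/* VM_NOHUGEPAGE, set via MADV_NOHUGEPAGE */
};

/* Hypothetical stand-in for the two per-process mm flags. */
struct mm_model {
	bool disable_completely;	/* MMF_DISABLE_THP_COMPLETELY */
	bool disable_except_advised;	/* MMF_DISABLE_THP_EXCEPT_ADVISED */
};

/* Same decision order as the patched vma_thp_disabled(). */
static bool model_vma_thp_disabled(const struct vma_model *vma,
				   const struct mm_model *mm)
{
	if (vma->vm_nohugepage)
		return true;	/* per-VMA opt-out always wins */
	if (mm->disable_completely)
		return true;	/* process-wide complete disable */
	if (vma->vm_hugepage)
		return false;	/* explicitly advised: THPs stay allowed */
	return mm->disable_except_advised;
}
```

In this model, an advised VMA is only affected by the complete disable,
while a VMA without any advice is disabled by either mm flag.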
+5 -8
include/linux/mm_types.h
···
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */

-/*
- * This one-shot flag is dropped due to necessity of changing exe once again
- * on NFS restore
- */
-//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
+#define MMF_HUGE_ZERO_FOLIO	18	/* mm has ever used the global huge zero folio */

 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
-#define MMF_HUGE_ZERO_FOLIO	23	/* mm has ever used the global huge zero folio */
-#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
-#define MMF_DISABLE_THP_MASK	BIT(MMF_DISABLE_THP)
+#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except when advised (e.g., VM_HUGEPAGE) */
+#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
+#define MMF_DISABLE_THP_MASK	(BIT(MMF_DISABLE_THP_COMPLETELY) | \
+				 BIT(MMF_DISABLE_THP_EXCEPT_ADVISED))
 #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
 #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
 /*
+10
include/uapi/linux/prctl.h
···

 #define PR_GET_TID_ADDRESS	40

+/*
+ * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
+ * is reserved, so PR_GET_THP_DISABLE can return "1 | flags", to effectively
+ * return "1" when no flags were specified for PR_SET_THP_DISABLE.
+ */
 #define PR_SET_THP_DISABLE	41
+/*
+ * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
+ * VM_HUGEPAGE).
+ */
+# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
 #define PR_GET_THP_DISABLE	42

 /*
+47 -12
kernel/sys.c
···
 	return sizeof(mm->saved_auxv);
 }

+static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	struct mm_struct *mm = current->mm;
+
+	if (arg2 || arg3 || arg4 || arg5)
+		return -EINVAL;
+
+	/* If disabled, we return "1 | flags", otherwise 0. */
+	if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
+		return 1;
+	else if (mm_flags_test(MMF_DISABLE_THP_EXCEPT_ADVISED, mm))
+		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
+	return 0;
+}
+
+static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
+				 unsigned long arg4, unsigned long arg5)
+{
+	struct mm_struct *mm = current->mm;
+
+	if (arg4 || arg5)
+		return -EINVAL;
+
+	/* Flags are only allowed when disabling. */
+	if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+		return -EINVAL;
+	if (mmap_write_lock_killable(current->mm))
+		return -EINTR;
+	if (thp_disable) {
+		if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
+			mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
+			mm_flags_set(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
+		} else {
+			mm_flags_set(MMF_DISABLE_THP_COMPLETELY, mm);
+			mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
+		}
+	} else {
+		mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
+		mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
+	}
+	mmap_write_unlock(current->mm);
+	return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
···
 			return -EINVAL;
 		return task_no_new_privs(current) ? 1 : 0;
 	case PR_GET_THP_DISABLE:
-		if (arg2 || arg3 || arg4 || arg5)
-			return -EINVAL;
-		error = !!mm_flags_test(MMF_DISABLE_THP, me->mm);
+		error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
 		break;
 	case PR_SET_THP_DISABLE:
-		if (arg3 || arg4 || arg5)
-			return -EINVAL;
-		if (mmap_write_lock_killable(me->mm))
-			return -EINTR;
-		if (arg2)
-			mm_flags_set(MMF_DISABLE_THP, me->mm);
-		else
-			mm_flags_clear(MMF_DISABLE_THP, me->mm);
-		mmap_write_unlock(me->mm);
+		error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
 		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
 	case PR_MPX_DISABLE_MANAGEMENT:
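
The argument validation and the "1 | flags" return encoding of the two new
helpers can be checked with a small user-space model. Locking and the unused
arg4/arg5 checks are left out, and the struct and function names are
hypothetical, so treat this as a sketch of the state machine rather than the
kernel code.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/* Hypothetical stand-in for the two mm flags the prctl touches. */
struct thp_mm_flags {
	bool completely;	/* MMF_DISABLE_THP_COMPLETELY */
	bool except_advised;	/* MMF_DISABLE_THP_EXCEPT_ADVISED */
};

/* Models prctl_set_thp_disable(): at most one of the flags ends up set. */
static int model_set_thp_disable(struct thp_mm_flags *mm, bool thp_disable,
				 unsigned long flags)
{
	/* Flags are only allowed when disabling, and only the known one. */
	if ((!thp_disable && flags) ||
	    (flags & ~(unsigned long)PR_THP_DISABLE_EXCEPT_ADVISED))
		return -EINVAL;
	mm->completely = thp_disable &&
			 !(flags & PR_THP_DISABLE_EXCEPT_ADVISED);
	mm->except_advised = thp_disable &&
			     (flags & PR_THP_DISABLE_EXCEPT_ADVISED);
	return 0;
}

/* Models prctl_get_thp_disable(): "1 | flags" if disabled, otherwise 0. */
static int model_get_thp_disable(const struct thp_mm_flags *mm)
{
	if (mm->completely)
		return 1;
	if (mm->except_advised)
		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
	return 0;
}
```

Because bit 0 is reserved for "disabled", callers that only test the return
value of PR_GET_THP_DISABLE for non-zero keep working in both modes.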
+1 -1
mm/khugepaged.c
···
 static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
 {
 	return hpage_collapse_test_exit(mm) ||
-	       mm_flags_test(MMF_DISABLE_THP, mm);
+	       mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }

 static bool hugepage_pmd_enabled(void)