commits
Pull x86 fixes from Ingo Molnar:
- Fix SEV guest boot failures in certain circumstances, due to
very early code relying on a BSS-zeroed variable that isn't
actually zeroed yet an may contain non-zero bootup values
Move the variable into the .data section go gain even earlier
zeroing
- Expose & allow the IBPB-on-Entry feature on SNP guests, which
was not properly exposed to guests due to initial implementational
caution
- Fix O= build failure when CONFIG_EFI_SBAT_FILE is using relative
file paths
- Fix the various SNC (Sub-NUMA Clustering) topology enumeration
bugs/artifacts (sched-domain build errors mostly).
SNC enumeration data got more complicated with Granite Rapids X
(GNR) and Clearwater Forest X (CWF), which exposed these bugs
and made their effects more serious
- Also use the now sane(r) SNC code to fix resctrl SNC detection bugs
- Work around a historic libgcc unwinder bug in the vdso32 sigreturn
code (again), which regressed during an overly aggressive recent
cleanup of DWARF annotations
* tag 'x86-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/vdso32: Work around libgcc unwinder bug
x86/resctrl: Fix SNC detection
x86/topo: Fix SNC topology mess
x86/topo: Replace x86_has_numa_in_package
x86/topo: Add topology_num_nodes_per_package()
x86/numa: Store extra copy of numa_nodes_parsed
x86/boot: Handle relative CONFIG_EFI_SBAT_FILE file paths
x86/sev: Allow IBPB-on-Entry feature for SNP guests
x86/boot/sev: Move SEV decompressor variables into the .data section
Pull timer fix from Ingo Molnar:
"Make clock_adjtime() syscall timex validation slightly more permissive
for auxiliary clocks, to not reject syscalls based on the status field
that do not try to modify the status field.
This makes the ABI behavior in clock_adjtime() consistent with
CLOCK_REALTIME"
* tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix timex status validation for auxiliary clocks
The unwinder code in libgcc has a long standing bug which causes it to
fail to pick up the signal frame CFI flag. This is a generic bug
across all platforms.
It affects the __kernel_sigreturn and __kernel_rt_sigreturn vdso entry
points on i386. The x86-64 kernel doesn't provide a sigreturn stub,
and so there is no kernel-provided code that is affected on x86-64.
libgcc does have a legacy fallback path which happens to work as long
as the bytes immediately before each of the sigreturn functions fall
outside any function. This patch adds a nop before the ALIGN to each
of the sigreturn stubs to ensure that this is, indeed, the case.
The rest of the patch is just a comment which documents the invariants
that need to be maintained for this legacy path to work correctly.
This is a manifest bug: in the current vdso, __kernel_vsyscall is a
multiple of 16 bytes long and thus __kernel_sigreturn does not have
any padding in front of it.
Closes: https://lore.kernel.org/lkml/f3412cc3e8f66d1853cc9d572c0f2fab076872b1.camel@xry111.site
Fixes: 884961618ee5 ("x86/entry/vdso32: Remove open-coded DWARF in sigreturn.S")
Reported-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124050
Link: https://patch.msgid.link/20260227010308.310342-1-hpa@zytor.com
Pull scheduler fix from Ingo Molnar:
"Fix a DL scheduler bug that may corrupt internal metrics during PI and
setscheduler() syscalls, resulting in kernel warnings and misbehavior.
Found during stress-testing"
* tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
The timekeeping_validate_timex() function validates the timex status
of an auxiliary system clock even when the status is not to be changed,
which causes unexpected errors for applications that make read-only
clock_adjtime() calls, or set some other timex fields, but without
clearing the status field.
Do the AUX-specific status validation only when the modes field contains
ADJ_STATUS, i.e. the application is actually trying to change the
status. This makes the AUX-specific clock_adjtime() behavior consistent
with CLOCK_REALTIME.
Fixes: 4eca49d0b621 ("timekeeping: Prepare do_adtimex() for auxiliary clocks")
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260225085231.276751-1-mlichvar@redhat.com
Now that the x86 topology code has a sensible nodes-per-package
measure, that does not depend on the online status of CPUs, use this
to divinate the SNC mode.
Note that when Cluster on Die (CoD) is configured on older systems this
will also show multiple NUMA nodes per package. Intel Resource Director
Technology is incomaptible with CoD. Print a warning and do not use the
fixup MSR_RMID_SNC_CONFIG.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Link: https://patch.msgid.link/aaCxbbgjL6OZ6VMd@agluck-desk3
Link: https://patch.msgid.link/20260303110100.367976706@infradead.org
Saves two function calls, and one stac/clac pair.
stac/clac is rather expensive on older cpus like Zen 2.
A synthetic network stress test gives a ~1.5% increase of pps
on AMD Zen 2.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running stress-ng --schedpolicy 0 on an RT kernel on a big machine
might lead to the following WARNINGs (edited).
sched: DL de-boosted task PID 22725: REPLENISH flag missing
WARNING: CPU: 93 PID: 0 at kernel/sched/deadline.c:239 dequeue_task_dl+0x15c/0x1f8
... (running_bw underflow)
Call trace:
dequeue_task_dl+0x15c/0x1f8 (P)
dequeue_task+0x80/0x168
deactivate_task+0x24/0x50
push_dl_task+0x264/0x2e0
dl_task_timer+0x1b0/0x228
__hrtimer_run_queues+0x188/0x378
hrtimer_interrupt+0xfc/0x260
...
The problem is that when a SCHED_DEADLINE task (lock holder) is
changed to a lower priority class via sched_setscheduler(), it may
fail to properly inherit the parameters of potential DEADLINE donors
if it didn't already inherit them in the past (shorter deadline than
donor's at that time). This might lead to bandwidth accounting
corruption, as enqueue_task_dl() won't recognize the lock holder as
boosted.
The scenario occurs when:
1. A DEADLINE task (donor) blocks on a PI mutex held by another
DEADLINE task (holder), but the holder doesn't inherit parameters
(e.g., it already has a shorter deadline)
2. sched_setscheduler() changes the holder from DEADLINE to a lower
class while still holding the mutex
3. The holder should now inherit DEADLINE parameters from the donor
and be enqueued with ENQUEUE_REPLENISH, but this doesn't happen
Fix the issue by introducing __setscheduler_dl_pi(), which detects when
a DEADLINE (proper or boosted) task gets setscheduled to a lower
priority class. In case, the function makes the task inherit DEADLINE
parameters of the donoer (pi_se) and sets ENQUEUE_REPLENISH flag to
ensure proper bandwidth accounting during the next enqueue operation.
Fixes: 2279f540ea7d ("sched/deadline: Fix priority inheritance with multiple scheduling classes")
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260302-upstream-fix-deadline-piboost-b4-v3-1-6ba32184a9e0@redhat.com
Per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in
SNC-3 mode"), the original crazy SNC-3 SLIT table was:
node distances:
node 0 1 2 3 4 5
0: 10 15 17 21 28 26
1: 15 10 15 23 26 23
2: 17 15 10 26 23 21
3: 21 28 26 10 15 17
4: 23 26 23 15 10 15
5: 26 23 21 17 15 10
And per:
https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/
The suggestion was to average the off-trace clusters to restore sanity.
However, 4d6dd05d07d0 implements this under various assumptions:
- anything GNR/CWF with numa_in_package;
- there will never be more than 2 packages;
- the off-trace cluster will have distance >20
And then HPE shows up with a machine that matches the
Vendor-Family-Model checks but looks like this:
Here's an 8 socket (2 chassis) HPE system with SNC enabled:
node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0: 10 12 16 16 16 16 18 18 40 40 40 40 40 40 40 40
1: 12 10 16 16 16 16 18 18 40 40 40 40 40 40 40 40
2: 16 16 10 12 18 18 16 16 40 40 40 40 40 40 40 40
3: 16 16 12 10 18 18 16 16 40 40 40 40 40 40 40 40
4: 16 16 18 18 10 12 16 16 40 40 40 40 40 40 40 40
5: 16 16 18 18 12 10 16 16 40 40 40 40 40 40 40 40
6: 18 18 16 16 16 16 10 12 40 40 40 40 40 40 40 40
7: 18 18 16 16 16 16 12 10 40 40 40 40 40 40 40 40
8: 40 40 40 40 40 40 40 40 10 12 16 16 16 16 18 18
9: 40 40 40 40 40 40 40 40 12 10 16 16 16 16 18 18
10: 40 40 40 40 40 40 40 40 16 16 10 12 18 18 16 16
11: 40 40 40 40 40 40 40 40 16 16 12 10 18 18 16 16
12: 40 40 40 40 40 40 40 40 16 16 18 18 10 12 16 16
13: 40 40 40 40 40 40 40 40 16 16 18 18 12 10 16 16
14: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 10 12
15: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 12 10
10 = Same chassis and socket
12 = Same chassis and socket (SNC)
16 = Same chassis and adjacent socket
18 = Same chassis and non-adjacent socket
40 = Different chassis
Turns out, the 'max 2 packages' thing is only relevant to the SNC-3 parts, the
smaller parts do 8 sockets (like usual). The above SLIT table is sane, but
violates the previous assumptions and trips a WARN.
Now that the topology code has a sensible measure of nodes-per-package, we can
use that to divinate the SNC mode at hand, and only fix up SNC-3 topologies.
There is a 'healthy' amount of paranoia code validating the assumptions on the
SLIT table, a simple pr_err(FW_BUG) print on failure and a fallback to using
the regular table. Lets see how long this lasts :-)
Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Reported-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110100.238361290@infradead.org
Pull SCSI fixes from James Bottomley:
"Two core changes and the rest in drivers, one core change to quirk the
behaviour of the Iomega Zip drive and one to fix a hang caused by tag
reallocation problems, which has mostly been seen by the iscsi client.
Note the latter fixes the problem but still has a slight sysfs memory
leak, so will be amended in the next pull request (once we've run the
fix for the fix through our testing)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: Fix recursive locking in __configfs_open_file()
scsi: devinfo: Add BLIST_SKIP_IO_HINTS for Iomega ZIP
scsi: mpi3mr: Clear reset history on ready and recheck state after timeout
scsi: core: Fix refcount leak for tagset_refcnt
Pull kvm fixes from Paolo Bonzini:
"Arm:
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature
set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered... [his words, not mine]
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Generic:
- Remove internal Kconfigs that are now set on all architectures
- Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all
architectures finally enable it in Linux 7.0"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: always define KVM_CAP_SYNC_MMU
KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: arm64: Deduplicate ASID retrieval code
irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag
KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs
KVM: arm64: Fix protected mode handling of pages larger than 4kB
KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type
KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state()
KVM: arm64: Fix ID register initialization for non-protected pKVM guests
KVM: arm64: Optimise away S1POE handling when not supported by host
KVM: arm64: Hide S1POE from guests when not supported by the host
.. with the brand spanking new topology_num_nodes_per_package().
Having the topology setup determine this value during MADT/SRAT parsing before
SMP bringup avoids having to detect this situation when building the SMP
topology masks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110100.123701837@infradead.org
Pull fbdev fix from Helge Deller:
"Silence build error in au1100fb driver found by kernel test robot"
* tag 'fbdev-for-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: au1100fb: Fix build on MIPS64
In flush_write_buffer, &p->frag_sem is acquired and then the loaded store
function is called, which, here, is target_core_item_dbroot_store(). This
function called filp_open(), following which these functions were called
(in reverse order), according to the call trace:
down_read
__configfs_open_file
do_dentry_open
vfs_open
do_open
path_openat
do_filp_open
file_open_name
filp_open
target_core_item_dbroot_store
flush_write_buffer
configfs_write_iter
target_core_item_dbroot_store() tries to validate the new file path by
trying to open the file path provided to it; however, in this case, the bug
report shows:
db_root: not a directory: /sys/kernel/config/target/dbroot
indicating that the same configfs file was tried to be opened, on which it
is currently working on. Thus, it is trying to acquire frag_sem semaphore
of the same file of which it already holds the semaphore obtained in
flush_write_buffer(), leading to acquiring the semaphore in a nested manner
and a possibility of recursive locking.
Fix this by modifying target_core_item_dbroot_store() to use kern_path()
instead of filp_open() to avoid opening the file using filesystem-specific
function __configfs_open_file(), and further modifying it to make this fix
compatible.
Reported-by: syzbot+f6e8174215573a84b797@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f6e8174215573a84b797
Tested-by: syzbot+f6e8174215573a84b797@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com>
Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Link: https://patch.msgid.link/20260216062002.61937-1-activprithvi@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull debugobjects fix from Thomas Gleixner:
"A single fix for debugobjects.
The deferred page initialization prevents debug objects from
allocating slab pages until the initialization is complete. That
causes depletion of the pool and disabling of debugobjects.
The reason is that debugobjects uses __GFP_HIGH for allocations as it
might be invoked from arbitrary contexts. When PREEMPT_COUNT is
disabled there is no way to know whether the context is safe to set
__GFP_KSWAPD_RECLAIM.
This worked until v6.18. Since then allocations w/o a reclaim flag
cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(),
which returns early when deferred page initialization has not yet
completed.
Work around that when PREEMPT_COUNT is enabled as the preempt counter
allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when
the context is preemtible. When PREEMPT_COUNT is disabled the context
is unknown and the reclaim bit can't be set because the caller might
hold locks which might deadlock in the allocator.
That makes debugobjects depend on PREEMPT_COUNT ||
!DEFERRED_STRUCT_PAGE_INIT, which limits the coverage slightly, but
keeps it functional for most cases"
* tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobject: Make it work with deferred page initialization - again
KVM/arm64 fixes for 7.0, take #1
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered...
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Use the MADT and SRAT table data to compute __num_nodes_per_package.
Specifically, SRAT has already been parsed in x86_numa_init(), which is called
before acpi_boot_init() which parses MADT. So both are available in
topology_init_possible_cpus().
This number is useful to divinate the various Intel CoD/SNC and AMD NPS modes,
since the platforms are failing to provide this otherwise.
Doing it this way is independent of the number of online CPUs and
other such shenanigans.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110100.004091624@infradead.org
Pull parisc fixes from Helge Deller:
"While testing Sasha Levin's 'kallsyms: embed source file:line info in
kernel stack traces' patch series, which increases the typical kernel
image size, I found some issues with the parisc initial kernel mapping
which may prevent the kernel to boot.
The three small patches here fix this"
* tag 'parisc-for-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix initial page table creation for boot
parisc: Check kernel mapping earlier at bootup
parisc: Increase initial mapping to 64 MB with KALLSYMS
Fix an error reported by the kernel test robot:
au1100fb.c: error: implicit declaration of function 'KSEG1ADDR'; did you mean 'CKSEG1ADDR'?
arch/mips/include/asm/addrspace.h defines KSEG1ADDR only for 32 bit
configurations. So provide its compile-test stub also for 64bit mips builds.
Fixes: 6f366e86481a ("fbdev: au1100fb: Make driver compilable on non-mips platforms")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202603042127.PT6LuKqi-lkp@intel.com/
Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
The Iomega ZIP 100 (Z100P2) can't process IO Advice Hints Grouping mode
page query. It immediately switches to the status phase 0xb8 after
receiving the subpage code 0x05 of MODE_SENSE_10 command, which fails
imm_out() and turns into DID_ERROR of this command, which leads to unusable
device. This was tested with an Iomega ZIP 100 (Z100P2) connected with a
StarTech PEX1P2 AX99100 PCIe parallel port card.
Prior to this fix, Test Unit Ready fails and the drive can't be used:
IMM: returned SCSI status b8
sd 7:0:6:0: [sdh] Test Unit Ready failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Signed-off-by: Florian Fuchs <fuchsfl@gmail.com>
Link: https://patch.msgid.link/20260227181823.892932-1-fuchsfl@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull x86 fixes from Ingo Molnar:
- Fix speculative safety in fred_extint()
- Fix __WARN_printf() trap in early_fixup_exception()
- Fix clang-build boot bug for unusual alignments, triggered by
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y
- Replace the final few __ASSEMBLY__ stragglers that snuck in lately
into non-UAPI x86 headers and use __ASSEMBLER__ consistently (again)
* tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/headers: Replace __ASSEMBLY__ stragglers with __ASSEMBLER__
x86/cfi: Fix CFI rewrite for odd alignments
x86/bug: Handle __WARN_printf() trap in early_fixup_exception()
x86/fred: Correct speculative safety in fred_extint()
debugobjects uses __GFP_HIGH for allocations as it might be invoked
within locked regions. That worked perfectly fine until v6.18. It still
works correctly when deferred page initialization is disabled and works
by chance when no page allocation is required before deferred page
initialization has completed.
Since v6.18 allocations w/o a reclaim flag cause new_slab() to end up in
alloc_frozen_pages_nolock_noprof(), which returns early when deferred
page initialization has not yet completed. As the deferred page
initialization takes quite a while the debugobject pool is depleted and
debugobjects are disabled.
This can be worked around when PREEMPT_COUNT is enabled as that allows
debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when the context
is preemtible. When PREEMPT_COUNT is disabled the context is unknown and
the reclaim bit can't be set because the caller might hold locks which
might deadlock in the allocator.
In preemptible context the reclaim bit is harmless and not a performance
issue as that's usually invoked from slow path initialization context.
That makes debugobjects depend on PREEMPT_COUNT || !DEFERRED_STRUCT_PAGE_INIT.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/87pl6gznti.ffs@tglx
KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always
available. Move the definition from individual architectures to common
code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We currently have three versions of the ASID retrieval code, one
in the S1 walker, and two in the VNCR handling (although the last
two are limited to the EL2&0 translation regime).
Make this code common, and take this opportunity to also simplify
the code a bit while switching over to the TTBRx_EL1_ASID macro.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260225104718.14209-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
The topology setup code needs to know the total number of physical
nodes enumerated in SRAT; however NUMA_EMU can cause the existing
numa_nodes_parsed bitmap to be fictitious. Therefore, keep a copy of
the bitmap specifically to retain the physical node count.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110059.889884023@infradead.org
Pull bpf fixes from Alexei Starovoitov:
- Fix u32/s32 bounds when ranges cross min/max boundary (Eduard
Zingerman)
- Fix precision backtracking with linked registers (Eduard Zingerman)
- Fix linker flags detection for resolve_btfids (Ihor Solodrai)
- Fix race in update_ftrace_direct_add/del (Jiri Olsa)
- Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
resolve_btfids: Fix linker flags detection
selftests/bpf: add reproducer for spurious precision propagation through calls
bpf: collect only live registers in linked regs
Revert "selftests/bpf: Update reg_bound range refinement logic"
selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary
bpf: Fix u32/s32 bounds when ranges cross min/max boundary
bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim
ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
The KERNEL_INITIAL_ORDER value defines the initial size (usually 32 or
64 MB) of the page table during bootup. Up until now the whole area was
initialized with PTE entries, but there was no check if we filled too
many entries. Change the code to fill up with so many entries that the
"_end" symbol can be reached by the kernel, but not more entries than
actually fit into the initial PTE tables.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v6.0+
The driver retains reset history even after the IOC has successfully
reached the READY state. That leaves stale reset information active during
normal operation and can mislead recovery and diagnostics. In addition, if
the IOC becomes READY just as the ready timeout loop exits, the driver
still follows the failure path and may retry or report failure incorrectly.
Clear reset history once READY is confirmed so driver state matches actual
IOC status. After the timeout loop, recheck the IOC state and treat READY
as success instead of failing.
Signed-off-by: Ranjan Kumar <ranjan.kumar@broadcom.com>
Link: https://patch.msgid.link/20260225082622.82588-1-ranjan.kumar@broadcom.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull timer fix from Ingo Molnar:
"Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for
the common HZ=100, 250 or 1000 cases. Only use a function call for odd
HZ values like HZ=300 that generate more code.
The function call overhead showed up in performance tests of the TCP
code"
* tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
After converting the __ASSEMBLY__ statements to __ASSEMBLER__ in
commit 24a295e4ef1ca ("x86/headers: Replace __ASSEMBLY__ with
__ASSEMBLER__ in non-UAPI headers"), some new code has been
added that uses __ASSEMBLY__ again. Convert these stragglers, too.
This is a mechanical patch, done with a simple "sed -i" command.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251218182029.166993-1-thuth@redhat.com
All architectures now use MMU notifier for KVM page table management.
Remove the Kconfig symbol and the code that is used when it is
disabled.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It appears that a !! became ! during a cleanup, resulting in inverted
logic when detecting if a host GICv5 implementation is capable of
virtualization.
Re-add the missing !, fixing the behaviour.
Fixes: 3227c3a89d65f ("irqchip/gic-v5: Check if impl is virt capable")
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260225083130.3378490-1-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
CONFIG_EFI_SBAT_FILE can be a relative path. When compiling using a different
output directory (O=) the build currently fails because it can't find the
filename set in CONFIG_EFI_SBAT_FILE:
arch/x86/boot/compressed/sbat.S: Assembler messages:
arch/x86/boot/compressed/sbat.S:6: Error: file not found: kernel.sbat
Add $(srctree) as include dir for sbat.o.
[ bp: Massage commit message. ]
Fixes: 61b57d35396a ("x86/efi: Implement support for embedding SBAT data for x86")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/f4eda155b0cef91d4d316b4e92f5771cb0aa7187.1772047658.git.jstancek@redhat.com
Pull RCU selftest fixes from Boqun Feng:
"Fix a regression in RCU torture test pre-defined scenarios caused by
commit 7dadeaa6e851 ("sched: Further restrict the preemption modes")
which limits PREEMPT_NONE to architectures that do not support
preemption at all and PREEMPT_VOLUNTARY to those architectures that do
not yet have PREEMPT_LAZY support.
Since major architectures (e.g. x86 and arm64) no longer support
CONFIG_PREEMPT_NONE and CONFIG_PREEMPT_VOLUNTARY, using them in
rcutorture, rcuscale, refscale, and scftorture pre-defined scenarios
causes config checking errors.
Switch these kconfigs to PREEMPT_LAZY"
* tag 'rcu-fixes.v7.0-20260307a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
scftorture: Update due to x86 not supporting none/voluntary preemption
refscale: Update due to x86 not supporting none/voluntary preemption
rcuscale: Update due to x86 not supporting none/voluntary preemption
rcutorture: Update due to x86 not supporting none/voluntary preemption
The "|| echo -lzstd" default makes zstd an unconditional link
dependency of resolve_btfids. On systems where libzstd-dev is not
installed and pkg-config fails, the linker fails:
ld: cannot find -lzstd: No such file or directory
libzstd is a transitive dependency of libelf, so the -lzstd flag is
strictly necessary only for static builds [1].
Remove ZSTD_LIBS variable, and instead set LIBELF_LIBS depending on
whether the build is static or not. Use $(HOSTPKG_CONFIG) as primary
source of the flags list.
Also add a default value for HOSTPKG_CONFIG in case it's not built via
the toplevel Makefile. Pass it from selftests/bpf too.
[1] https://lore.kernel.org/bpf/4ff82800-2daa-4b9f-95a9-6f512859ee70@linux.dev/
Reported-by: BPF CI Bot (Claude Opus 4.6) <bot+bpf-ci@kernel.org>
Reported-by: Vitaly Chikunov <vt@altlinux.org>
Closes: https://lore.kernel.org/bpf/aaWqMcK-2AQw5dx8@altlinux.org/
Fixes: 4021848a903e ("selftests/bpf: Pass through build flags to bpftool and resolve_btfids")
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Reviewed-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260305014730.3123382-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The check if the initial mapping is sufficient needs to happen much
earlier during bootup. Move this test directly to the start_parisc()
function and use native PDC iodc functions to print the warning, because
panic() and printk() are not functional yet.
This fixes boot when enabling various KALLSYSMS options which need
much more space.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v6.0+
This leak will cause a hang when tearing down the SCSI host. For example,
iscsid hangs with the following call trace:
[130120.652718] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
PID: 2528 TASK: ffff9d0408974e00 CPU: 3 COMMAND: "iscsid"
#0 [ffffb5b9c134b9e0] __schedule at ffffffff860657d4
#1 [ffffb5b9c134ba28] schedule at ffffffff86065c6f
#2 [ffffb5b9c134ba40] schedule_timeout at ffffffff86069fb0
#3 [ffffb5b9c134bab0] __wait_for_common at ffffffff8606674f
#4 [ffffb5b9c134bb10] scsi_remove_host at ffffffff85bfe84b
#5 [ffffb5b9c134bb30] iscsi_sw_tcp_session_destroy at ffffffffc03031c4 [iscsi_tcp]
#6 [ffffb5b9c134bb48] iscsi_if_recv_msg at ffffffffc0292692 [scsi_transport_iscsi]
#7 [ffffb5b9c134bb98] iscsi_if_rx at ffffffffc02929c2 [scsi_transport_iscsi]
#8 [ffffb5b9c134bbf0] netlink_unicast at ffffffff85e551d6
#9 [ffffb5b9c134bc38] netlink_sendmsg at ffffffff85e554ef
Fixes: 8fe4ce5836e9 ("scsi: core: Fix a use-after-free")
Cc: stable@vger.kernel.org
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260223232728.93350-1-junxiao.bi@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull scheduler fixes from Ingo Molnar:
- Fix zero_vruntime tracking when there's a single task running
- Fix slice protection logic
- Fix the ->vprot logic for reniced tasks
- Fix lag clamping in mixed slice workloads
- Fix objtool uaccess warning (and bug) in the
!CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining,
which triggers with older compilers
- Fix a comment in the rseq registration rseq_size bound check code
- Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes
differently, which special size we now reached naturally and want to
avoid. The visible ugliness of the new reserved field will be avoided
the next time the RSEQ area is extended.
* tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: slice ext: Ensure rseq feature size differs from original rseq size
rseq: Clarify rseq registration rseq_size bound check comment
sched/core: Fix wakeup_preempt's next_class tracking
rseq: Mark rseq_arm_slice_extension_timer() __always_inline
sched/fair: Fix lag clamp
sched/eevdf: Update se->vprot in reweight_entity()
sched/fair: Only set slice protection at pick time
sched/fair: Fix zero_vruntime tracking
For common cases (HZ=100, 250 or 1000), these helpers are at most one
multiply, so there is no point calling a tiny function.
Keep them out of line for HZ=300 and others.
This saves cycles in TCP fast path, among other things.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/8 grow/shrink: 25/89 up/down: 530/-3474 (-2944)
...
nla_put_msecs 193 - -193
message_stats_print 2131 920 -1211
Total: Before=25365208, After=25362264, chg -0.01%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260210170226.57209-1-edumazet@google.com
Rustam reported his clang builds did not boot properly; turns out his
.config has: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y set.
Fix up the FineIBT code to deal with this unusual alignment.
Fixes: 931ab63664f0 ("x86/ibt: Implement FineIBT")
Reported-by: Rustam Kovhaev <rkovhaev@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Rustam Kovhaev <rkovhaev@gmail.com>
Pull i2c fix from Wolfram Sang:
- imx: preserve error state during SMBus block read length handling
* tag 'i2c-for-6.19-final' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: imx: preserve error state in block data length handler
Commit 0c4762e26879 ("KVM: arm64: nv: Avoid NV stage-2 code when NV is
not supported") added an early return to several functions in
arch/arm64/kvm/nested.c to prevent a UBSAN shift-out-of-bounds error
when accessing the pgt union for non-nested VMs.
However, this early return was inadvertently applied to
kvm_arch_flush_shadow_all() as well, causing it to skip the call to
kvm_uninit_stage2_mmu(kvm) for all non-nested VMs.
For pKVM, skipping this teardown means the host never unshares the
guest's memory with the EL2 hypervisor. When the host kernel later
recycles these leaked pages for a new VM, it attempts to re-share them.
The hypervisor correctly rejects this with -EPERM, triggering a host
WARN_ON and hanging the guest.
Fix this by dropping the early return from kvm_arch_flush_shadow_all().
The for-loop guarding the nested MMU cleanup already bounds itself when
nested_mmus_size == 0, allowing execution to proceed to
kvm_uninit_stage2_mmu() as intended.
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/60916cb6-f460-4751-b910-f63c58700ad0@sirena.org.uk/
Fixes: 0c4762e26879 ("KVM: arm64: nv: Avoid NV stage-2 code when NV is not supported")
Signed-off-by: Fuad Tabba <tabba@google.com>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://patch.msgid.link/20260222083352.89503-1-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The SEV-SNP IBPB-on-Entry feature does not require a guest-side
implementation. It was added in Zen5 h/w, after the first SNP Zen
implementation, and thus was not accounted for when the initial set of SNP
features were added to the kernel.
In its abundant precaution, commit
8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support")
included SEV_STATUS' IBPB-on-Entry bit as a reserved bit, thereby masking
guests from using the feature.
Allow guests to make use of IBPB-on-Entry when supported by the hypervisor, as
the bit is now architecturally defined and safe to expose.
Fixes: 8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@kernel.org
Link: https://patch.msgid.link/20260203222405.4065706-2-kim.phillips@amd.com
Pull tracing fixes from Steven Rostedt:
- Fix possible NULL pointer dereference in trace_data_alloc()
On the trace_data_alloc() error path, it can call trigger_data_free()
with a NULL pointer. This used to be a kfree() but was changed to
trigger_data_free() to clean up any partial initialization. The issue
is that trigger_data_free() does not expect a NULL pointer. Have
trigger_data_free() return safely on NULL pointer.
- Fix multiple events on the command line and bootconfig
If multiple events are enabled on the command line separately and not
grouped, only the last event gets enabled. That is:
trace_event=sched_switch trace_event=sched_waking
will only enable sched_waking whereas:
trace_event=sched_switch,sched_waking
will enable both.
The bootconfig makes it even worse as the second way is the more
common method.
The issue is that a temporary buffer is used to store the events to
enable later in boot. Each time the cmdline callback is called, it
overwrites what was previously there.
Have the callback append the next value (delimited by a comma) if the
temporary buffer already has content.
- Fix command line trace_buffer_size if >= 2G
The logic to allocate the trace buffer uses "int" for the size
parameter in the command line code causing overflow issues if more
that 2G is specified.
* tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
tracing: Fix enabling multiple events on the kernel command line and bootconfig
tracing: Add NULL pointer check to trigger_data_free()
As of v7.0-rc1, architectures that support preemption, including x86 and
arm64, no longer support CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY.
Attempting to build kernels with these two Kconfig options results in
.config errors. This commit therefore switches such scftorture scenarios
to CONFIG_PREEMPT_LAZY.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Link: https://patch.msgid.link/20260303235903.1967409-4-paulmck@kernel.org
Eduard Zingerman says:
====================
bpf: Fix precision backtracking bug with linked registers
Emil Tsalapatis reported a verifier bug hit by the scx_lavd sched_ext
scheduler. The essential part of the verifier log looks as follows:
436: ...
// checkpoint hit for 438: (1d) if r7 == r8 goto ...
frame 3: propagating r2,r7,r8
frame 2: propagating r6
mark_precise: frame3: last_idx ...
mark_precise: frame3: regs=r2,r7,r8 stack= before 436: ...
mark_precise: frame3: regs=r2,r7 stack= before 435: ...
mark_precise: frame3: regs=r2,r7 stack= before 434: (85) call bpf_trace_vprintk#177
verifier bug: backtracking call unexpected regs 84
The log complains that registers r2 and r7 are tracked as precise
while processing the bpf_trace_vprintk() call in precision backtracking.
This can't be right, as r2 is reset by the call and there is nothing
to backtrack it to. The precision propagation is triggered when
a checkpoint is hit at instruction 438, r2 is dead at that instruction.
This happens because of the following sequence of events:
- Instruction 438 is first reached with registers r2 and r7 having
the same id via a path that does not call bpf_trace_vprintk():
- Checkpoint is created at 438.
- The jump at 438 is predicted, hence r7 and registers linked to it
(r2) are propagated as precise, marking r2 and r7 precise in the
checkpoint.
- Instruction 438 is reached a second time with r2 undefined and via
a path that calls bpf_trace_vprintk():
- Checkpoint is hit.
- propagate_precision() picks registers r2 and r7 and propagates
precision marks for those up to the helper call.
The root cause is the fact that states_equal() and
propagate_precision() assume that the precision flag can't be set for a
dead register (as computed by compute_live_registers()).
However, this is not the case when linked registers are at play.
Fix this by accounting for live register flags in
collect_linked_regs().
---
====================
Link: https://patch.msgid.link/20260306-linked-regs-and-propagate-precision-v1-0-18e859be570d@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The 32MB initial kernel mapping can become too small when CONFIG_KALLSYMS
is used. Increase the mapping to 64 MB in this case.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v6.0+
According to JESD223F, the maximum number of queues (MAXQ) is 32. When MCQ
is enabled and ESI is disabled, nr_hw_queues=32 causes a shift overflow
problem.
Fix this by using 64-bit intermediate values to handle the nr_hw_queues=32
case safely.
Signed-off-by: wangshuaiwei <wangshuaiwei1@xiaomi.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260224063228.50112-1-wangshuaiwei1@xiaomi.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull perf events fixes from Ingo Molnar:
- Fix lock ordering bug found by lockdep in perf_event_wakeup()
- Fix uncore counter enumeration on Granite Rapids and Sierra Forest
- Fix perf_mmap() refcount bug found by Syzkaller
- Fix __perf_event_overflow() vs perf_remove_from_context() race
* tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix __perf_event_overflow() vs perf_remove_from_context() race
perf/core: Fix refcount bug and potential UAF in perf_mmap
perf/x86/intel/uncore: Add per-scheduler IMC CAS count events
perf/core: Fix invalid wait context in ctx_sched_in()
Before rseq became extensible, its original size was 32 bytes even
though the active rseq area was only 20 bytes. This had the following
impact in terms of userspace ecosystem evolution:
* The GNU libc between 2.35 and 2.39 expose a __rseq_size symbol set
to 32, even though the size of the active rseq area is really 20.
* The GNU libc 2.40 changes this __rseq_size to 20, thus making it
express the active rseq area.
* Starting from glibc 2.41, __rseq_size corresponds to the
AT_RSEQ_FEATURE_SIZE from getauxval(3).
This means that users of __rseq_size can always expect it to
correspond to the active rseq area, except for the value 32, for
which the active rseq area is 20 bytes.
Exposing a 32 bytes feature size would make life needlessly painful
for userspace. Therefore, add a reserved field at the end of the
rseq area to bump the feature size to 33 bytes. This reserved field
is expected to be replaced with whatever field will come next,
expecting that this field will be larger than 1 byte.
The effect of this change is to increase the size from 32 to 64 bytes
before we actually have fields using that memory.
Clarify the allocation size and alignment requirements in the struct
rseq uapi comment.
Change the value returned by getauxval(AT_RSEQ_ALIGN) to return the
value of the active rseq area size rounded up to next power of 2, which
guarantees that the rseq structure will always be aligned on the nearest
power of two large enough to contain it, even as it grows. Change the
alignment check in the rseq registration accordingly.
This will minimize the amount of ABI corner-cases we need to document
and require userspace to play games with. The rule stays simple when
__rseq_size != 32:
#define rseq_field_available(field) (__rseq_size >= offsetofend(struct rseq_abi, field))
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260220200642.1317826-3-mathieu.desnoyers@efficios.com
Pull powerpc updates for 7.0
- Implement masked user access
- Add bpf support for internal only per-CPU instructions and inline the
bpf_get_smp_processor_id() and bpf_get_current_task() functions
- Fix pSeries MSI-X allocation failure when quota is exceeded
- Fix recursive pci_lock_rescan_remove locking in EEH event handling
- Support tailcalls with subprogs & BPF exceptions on 64bit
- Extend "trusted" keys to support the PowerVM Key Wrapping Module
(PKWM)
Thanks to Abhishek Dubey, Christophe Leroy, Gaurav Batra, Guangshuo Li,
Jarkko Sakkinen, Mahesh Salgaonkar, Mimi Zohar, Miquel Sabaté Solà, Nam
Cao, Narayana Murty N, Nayna Jain, Nilay Shroff, Puranjay Mohan, Saket
Kumar Bhaskar, Sourabh Jain, Srish Srinivasan, and Venkat Rao Bagalkote.
* tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (27 commits)
powerpc/pseries: plpks: export plpks_wrapping_is_supported
docs: trusted-encryped: add PKWM as a new trust source
keys/trusted_keys: establish PKWM as a trusted source
pseries/plpks: add HCALLs for PowerVM Key Wrapping Module
pseries/plpks: expose PowerVM wrapping features via the sysfs
powerpc/pseries: move the PLPKS config inside its own sysfs directory
pseries/plpks: fix kernel-doc comment inconsistencies
powerpc/smp: Add check for kcalloc() failure in parse_thread_groups()
powerpc: kgdb: Remove OUTBUFMAX constant
powerpc64/bpf: Additional NVR handling for bpf_throw
powerpc64/bpf: Support exceptions
powerpc64/bpf: Add arch_bpf_stack_walk() for BPF JIT
powerpc64/bpf: Avoid tailcall restore from trampoline
powerpc64/bpf: Support tailcalls with subprogs
powerpc64/bpf: Moving tail_call_cnt to bottom of frame
powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling
powerpc/pseries: Fix MSI-X allocation failure when quota is exceeded
powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memory
powerpc64/bpf: Inline bpf_get_smp_processor_id() and bpf_get_current_task/_btf()
powerpc64/bpf: Support internal-only MOV instruction to resolve per-CPU addrs
...
The commit 5b472b6e5bd9 ("x86_64/bug: Implement __WARN_printf()")
implemented __WARN_printf(), which changed the mechanism to use UD1
instead of UD2. However, it only handles the trap in the runtime IDT
handler, while the early booting IDT handler lacks this handling. As a
result, the usage of WARN() before the runtime IDT setup can lead to
kernel crashes. Since KMSAN is enabled after the runtime IDT setup, it
is safe to use handle_bug() directly in early_fixup_exception() to
address this issue.
Fixes: 5b472b6e5bd9 ("x86_64/bug: Implement __WARN_printf()")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/c4fb3645f60d3a78629d9870e8fcc8535281c24f.1768016713.git.houwenlong.hwl@antgroup.com
Pull spi fixes from Mark Brown:
"One final batch of fixes for the Tegra SPI drivers, the main one is a
batch of fixes for races with the interrupts in the Tegra210 QSPI
driver that Breno has been working on for a while"
* tag 'spi-fix-v6.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: tegra114: Preserve SPI mode bits in def_command1_reg
spi: tegra: Fix a memory leak in tegra_slink_probe()
spi: tegra210-quad: Protect curr_xfer check in IRQ handler
spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
spi: tegra210-quad: Move curr_xfer read inside spinlock
spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
When a block read returns an invalid length, zero or >I2C_SMBUS_BLOCK_MAX,
the length handler sets the state to IMX_I2C_STATE_FAILED. However,
i2c_imx_master_isr() unconditionally overwrites this with
IMX_I2C_STATE_READ_CONTINUE, causing an endless read loop that overruns
buffers and crashes the system.
Guard the state transition to preserve error states set by the length
handler.
Fixes: 5f5c2d4579ca ("i2c: imx: prevent rescheduling in non dma mode")
Signed-off-by: LI Qingwu <Qing-wu.Li@leica-geosystems.com.cn>
Cc: <stable@vger.kernel.org> # v6.13+
Reviewed-by: Stefan Eichenberger <eichest@gmail.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260116111906.3413346-2-Qing-wu.Li@leica-geosystems.com.cn
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Pull fsverity fixes from Eric Biggers:
- Fix a build error on parisc
- Remove the non-large-folio-aware function fsverity_verify_page()
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: fix build error by adding fsverity_readahead() stub
fsverity: remove fsverity_verify_page()
f2fs: make f2fs_verify_cluster() partially large-folio-aware
f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
Since 3669ddd8fa8b5 ("KVM: arm64: Add a range to pkvm_mappings"),
pKVM tracks the memory that has been mapped into a guest in a
side data structure. Crucially, it uses it to find out whether
a page has already been mapped, and therefore refuses to map it
twice. So far, so good.
However, this very patch completely breaks non-4kB page support,
with guests being unable to boot. The most obvious symptom is that
we take the same fault repeatedly, and not making forward progress.
A quick investigation shows that this is because of the above
rejection code.
As it turns out, there are multiple issues at play:
- while the HPFAR_EL2 register gives you the faulting IPA minus
the bottom 12 bits, it will still give you the extra bits that
are part of the page offset for anything larger than 4kB,
even for a level-3 mapping
- pkvm_pgtable_stage2_map() assumes that the address passed as
a parameter is aligned to the size of the intended mapping
- the faulting address is only aligned for a non-page mapping
When the planets are suitably aligned (pun intended), the guest
faults on a page by accessing it past the bottom 4kB, and extra bits
get set in the HPFAR_EL2 register. If this results in a page mapping
(which is likely with large granule sizes), nothing aligns it further
down, and pkvm_mapping_iter_first() finds an intersection that
doesn't really exist. We assume this is a spurious fault and return
-EAGAIN. And again...
This doesn't hit outside of the protected code, as the page table
code always aligns the IPA down to a page boundary, hiding the issue
for everyone else.
Fix it by always forcing the alignment on vma_pagesize, irrespective
of the value of vma_pagesize.
Fixes: 3669ddd8fa8b5 ("KVM: arm64: Add a range to pkvm_mappings")
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://https://patch.msgid.link/20260222141000.3084258-1-maz@kernel.org
Cc: stable@vger.kernel.org
As part of the work to remove the dependency on calling into the decompressor
code (startup_64()) for a UEFI boot, a call to rmpadjust() was removed from
sev_enable() in favor of checking the value of the snp_vmpl variable.
When booting through a non-UEFI path and calling startup_64(), the call to
sev_enable() is performed before the BSS section is zeroed. With the removal
of the rmpadjust() call and the corresponding check of the return code, the
snp_vmpl variable is checked.
Since the kernel is running at VMPL0, the snp_vmpl variable will not have been
set and should be the default value of 0. However, since the call occurs
before the BSS is zeroed, the snp_vmpl variable may not actually be zero,
which will cause the guest boot to fail.
Since the decompressor relocates itself, the BSS would need to be cleared both
before and after the relocation, but this would, in effect, cause all of the
changes to BSS variables before relocation to be lost after relocation.
Instead, move the snp_vmpl variable into the .data section so that it is
initialized and the value made safe during relocation. As a pre-caution
against future changes, move other SEV-related decompressor variables into the
.data section, too.
Fixes: 68a501d7fd82 ("x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Tested-by: Kevin Hui <kevinhui@meta.com>
Tested-by: Changyuan Lyu <changyuanl@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/5648b7de5b0a5d0dfef3785f9582b718678c6448.1770217260.git.thomas.lendacky@amd.com
Pull x86 fixes from Ingo Molnar:
- Fix SEV guest boot failures in certain circumstances, due to
very early code relying on a BSS-zeroed variable that isn't
actually zeroed yet an may contain non-zero bootup values
Move the variable into the .data section go gain even earlier
zeroing
- Expose & allow the IBPB-on-Entry feature on SNP guests, which
was not properly exposed to guests due to initial implementational
caution
- Fix O= build failure when CONFIG_EFI_SBAT_FILE is using relative
file paths
- Fix the various SNC (Sub-NUMA Clustering) topology enumeration
bugs/artifacts (sched-domain build errors mostly).
SNC enumeration data got more complicated with Granite Rapids X
(GNR) and Clearwater Forest X (CWF), which exposed these bugs
and made their effects more serious
- Also use the now sane(r) SNC code to fix resctrl SNC detection bugs
- Work around a historic libgcc unwinder bug in the vdso32 sigreturn
code (again), which regressed during an overly aggressive recent
cleanup of DWARF annotations
* tag 'x86-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/vdso32: Work around libgcc unwinder bug
x86/resctrl: Fix SNC detection
x86/topo: Fix SNC topology mess
x86/topo: Replace x86_has_numa_in_package
x86/topo: Add topology_num_nodes_per_package()
x86/numa: Store extra copy of numa_nodes_parsed
x86/boot: Handle relative CONFIG_EFI_SBAT_FILE file paths
x86/sev: Allow IBPB-on-Entry feature for SNP guests
x86/boot/sev: Move SEV decompressor variables into the .data section
Pull timer fix from Ingo Molnar:
"Make clock_adjtime() syscall timex validation slightly more permissive
for auxiliary clocks, to not reject syscalls based on the status field
that do not try to modify the status field.
This makes the ABI behavior in clock_adjtime() consistent with
CLOCK_REALTIME"
* tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix timex status validation for auxiliary clocks
The unwinder code in libgcc has a long standing bug which causes it to
fail to pick up the signal frame CFI flag. This is a generic bug
across all platforms.
It affects the __kernel_sigreturn and __kernel_rt_sigreturn vdso entry
points on i386. The x86-64 kernel doesn't provide a sigreturn stub,
and so there is no kernel-provided code that is affected on x86-64.
libgcc does have a legacy fallback path which happens to work as long
as the bytes immediately before each of the sigreturn functions fall
outside any function. This patch adds a nop before the ALIGN to each
of the sigreturn stubs to ensure that this is, indeed, the case.
The rest of the patch is just a comment which documents the invariants
that need to be maintained for this legacy path to work correctly.
This is a manifest bug: in the current vdso, __kernel_vsyscall is a
multiple of 16 bytes long and thus __kernel_sigreturn does not have
any padding in front of it.
Closes: https://lore.kernel.org/lkml/f3412cc3e8f66d1853cc9d572c0f2fab076872b1.camel@xry111.site
Fixes: 884961618ee5 ("x86/entry/vdso32: Remove open-coded DWARF in sigreturn.S")
Reported-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124050
Link: https://patch.msgid.link/20260227010308.310342-1-hpa@zytor.com
Pull scheduler fix from Ingo Molnar:
"Fix a DL scheduler bug that may corrupt internal metrics during PI and
setscheduler() syscalls, resulting in kernel warnings and misbehavior.
Found during stress-testing"
* tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
The timekeeping_validate_timex() function validates the timex status
of an auxiliary system clock even when the status is not to be changed,
which causes unexpected errors for applications that make read-only
clock_adjtime() calls, or set some other timex fields, but without
clearing the status field.
Do the AUX-specific status validation only when the modes field contains
ADJ_STATUS, i.e. the application is actually trying to change the
status. This makes the AUX-specific clock_adjtime() behavior consistent
with CLOCK_REALTIME.
Fixes: 4eca49d0b621 ("timekeeping: Prepare do_adtimex() for auxiliary clocks")
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260225085231.276751-1-mlichvar@redhat.com
Now that the x86 topology code has a sensible nodes-per-package
measure, that does not depend on the online status of CPUs, use this
to divinate the SNC mode.
Note that when Cluster on Die (CoD) is configured on older systems this
will also show multiple NUMA nodes per package. Intel Resource Director
Technology is incomaptible with CoD. Print a warning and do not use the
fixup MSR_RMID_SNC_CONFIG.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Link: https://patch.msgid.link/aaCxbbgjL6OZ6VMd@agluck-desk3
Link: https://patch.msgid.link/20260303110100.367976706@infradead.org
Saves two function calls, and one stac/clac pair.
stac/clac is rather expensive on older cpus like Zen 2.
A synthetic network stress test gives a ~1.5% increase of pps
on AMD Zen 2.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running stress-ng --schedpolicy 0 on an RT kernel on a big machine
might lead to the following WARNINGs (edited).
sched: DL de-boosted task PID 22725: REPLENISH flag missing
WARNING: CPU: 93 PID: 0 at kernel/sched/deadline.c:239 dequeue_task_dl+0x15c/0x1f8
... (running_bw underflow)
Call trace:
dequeue_task_dl+0x15c/0x1f8 (P)
dequeue_task+0x80/0x168
deactivate_task+0x24/0x50
push_dl_task+0x264/0x2e0
dl_task_timer+0x1b0/0x228
__hrtimer_run_queues+0x188/0x378
hrtimer_interrupt+0xfc/0x260
...
The problem is that when a SCHED_DEADLINE task (lock holder) is
changed to a lower priority class via sched_setscheduler(), it may
fail to properly inherit the parameters of potential DEADLINE donors
if it didn't already inherit them in the past (shorter deadline than
donor's at that time). This might lead to bandwidth accounting
corruption, as enqueue_task_dl() won't recognize the lock holder as
boosted.
The scenario occurs when:
1. A DEADLINE task (donor) blocks on a PI mutex held by another
DEADLINE task (holder), but the holder doesn't inherit parameters
(e.g., it already has a shorter deadline)
2. sched_setscheduler() changes the holder from DEADLINE to a lower
class while still holding the mutex
3. The holder should now inherit DEADLINE parameters from the donor
and be enqueued with ENQUEUE_REPLENISH, but this doesn't happen
Fix the issue by introducing __setscheduler_dl_pi(), which detects when
a DEADLINE (proper or boosted) task gets setscheduled to a lower
priority class. In case, the function makes the task inherit DEADLINE
parameters of the donoer (pi_se) and sets ENQUEUE_REPLENISH flag to
ensure proper bandwidth accounting during the next enqueue operation.
Fixes: 2279f540ea7d ("sched/deadline: Fix priority inheritance with multiple scheduling classes")
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260302-upstream-fix-deadline-piboost-b4-v3-1-6ba32184a9e0@redhat.com
Per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in
SNC-3 mode"), the original crazy SNC-3 SLIT table was:
node distances:
node 0 1 2 3 4 5
0: 10 15 17 21 28 26
1: 15 10 15 23 26 23
2: 17 15 10 26 23 21
3: 21 28 26 10 15 17
4: 23 26 23 15 10 15
5: 26 23 21 17 15 10
And per:
https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/
The suggestion was to average the off-trace clusters to restore sanity.
However, 4d6dd05d07d0 implements this under various assumptions:
- anything GNR/CWF with numa_in_package;
- there will never be more than 2 packages;
- the off-trace cluster will have distance >20
And then HPE shows up with a machine that matches the
Vendor-Family-Model checks but looks like this:
Here's an 8 socket (2 chassis) HPE system with SNC enabled:
node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0: 10 12 16 16 16 16 18 18 40 40 40 40 40 40 40 40
1: 12 10 16 16 16 16 18 18 40 40 40 40 40 40 40 40
2: 16 16 10 12 18 18 16 16 40 40 40 40 40 40 40 40
3: 16 16 12 10 18 18 16 16 40 40 40 40 40 40 40 40
4: 16 16 18 18 10 12 16 16 40 40 40 40 40 40 40 40
5: 16 16 18 18 12 10 16 16 40 40 40 40 40 40 40 40
6: 18 18 16 16 16 16 10 12 40 40 40 40 40 40 40 40
7: 18 18 16 16 16 16 12 10 40 40 40 40 40 40 40 40
8: 40 40 40 40 40 40 40 40 10 12 16 16 16 16 18 18
9: 40 40 40 40 40 40 40 40 12 10 16 16 16 16 18 18
10: 40 40 40 40 40 40 40 40 16 16 10 12 18 18 16 16
11: 40 40 40 40 40 40 40 40 16 16 12 10 18 18 16 16
12: 40 40 40 40 40 40 40 40 16 16 18 18 10 12 16 16
13: 40 40 40 40 40 40 40 40 16 16 18 18 12 10 16 16
14: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 10 12
15: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 12 10
10 = Same chassis and socket
12 = Same chassis and socket (SNC)
16 = Same chassis and adjacent socket
18 = Same chassis and non-adjacent socket
40 = Different chassis
Turns out, the 'max 2 packages' thing is only relevant to the SNC-3 parts, the
smaller parts do 8 sockets (like usual). The above SLIT table is sane, but
violates the previous assumptions and trips a WARN.
Now that the topology code has a sensible measure of nodes-per-package, we can
use that to divinate the SNC mode at hand, and only fix up SNC-3 topologies.
There is a 'healthy' amount of paranoia code validating the assumptions on the
SLIT table, a simple pr_err(FW_BUG) print on failure and a fallback to using
the regular table. Lets see how long this lasts :-)
Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Reported-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110100.238361290@infradead.org
Pull SCSI fixes from James Bottomley:
"Two core changes and the rest in drivers, one core change to quirk the
behaviour of the Iomega Zip drive and one to fix a hang caused by tag
reallocation problems, which has mostly been seen by the iscsi client.
Note the latter fixes the problem but still has a slight sysfs memory
leak, so will be amended in the next pull request (once we've run the
fix for the fix through our testing)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: Fix recursive locking in __configfs_open_file()
scsi: devinfo: Add BLIST_SKIP_IO_HINTS for Iomega ZIP
scsi: mpi3mr: Clear reset history on ready and recheck state after timeout
scsi: core: Fix refcount leak for tagset_refcnt
Pull kvm fixes from Paolo Bonzini:
"Arm:
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature
set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered... [his words, not mine]
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Generic:
- Remove internal Kconfigs that are now set on all architectures
- Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all
architectures finally enable it in Linux 7.0"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: always define KVM_CAP_SYNC_MMU
KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: arm64: Deduplicate ASID retrieval code
irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag
KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs
KVM: arm64: Fix protected mode handling of pages larger than 4kB
KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type
KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state()
KVM: arm64: Fix ID register initialization for non-protected pKVM guests
KVM: arm64: Optimise away S1POE handling when not supported by host
KVM: arm64: Hide S1POE from guests when not supported by the host
.. with the brand spanking new topology_num_nodes_per_package().
Having the topology setup determine this value during MADT/SRAT parsing before
SMP bringup avoids having to detect this situation when building the SMP
topology masks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110100.123701837@infradead.org
In flush_write_buffer, &p->frag_sem is acquired and then the loaded store
function is called, which, here, is target_core_item_dbroot_store(). This
function called filp_open(), following which these functions were called
(in reverse order), according to the call trace:
down_read
__configfs_open_file
do_dentry_open
vfs_open
do_open
path_openat
do_filp_open
file_open_name
filp_open
target_core_item_dbroot_store
flush_write_buffer
configfs_write_iter
target_core_item_dbroot_store() tries to validate the new file path by
trying to open the file path provided to it; however, in this case, the bug
report shows:
db_root: not a directory: /sys/kernel/config/target/dbroot
indicating that the same configfs file was tried to be opened, on which it
is currently working on. Thus, it is trying to acquire frag_sem semaphore
of the same file of which it already holds the semaphore obtained in
flush_write_buffer(), leading to acquiring the semaphore in a nested manner
and a possibility of recursive locking.
Fix this by modifying target_core_item_dbroot_store() to use kern_path()
instead of filp_open() to avoid opening the file using filesystem-specific
function __configfs_open_file(), and further modifying it to make this fix
compatible.
Reported-by: syzbot+f6e8174215573a84b797@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f6e8174215573a84b797
Tested-by: syzbot+f6e8174215573a84b797@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com>
Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Link: https://patch.msgid.link/20260216062002.61937-1-activprithvi@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull debugobjects fix from Thomas Gleixner:
"A single fix for debugobjects.
The deferred page initialization prevents debug objects from
allocating slab pages until the initialization is complete. That
causes depletion of the pool and disabling of debugobjects.
The reason is that debugobjects uses __GFP_HIGH for allocations as it
might be invoked from arbitrary contexts. When PREEMPT_COUNT is
disabled there is no way to know whether the context is safe to set
__GFP_KSWAPD_RECLAIM.
This worked until v6.18. Since then allocations w/o a reclaim flag
cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(),
which returns early when deferred page initialization has not yet
completed.
Work around that when PREEMPT_COUNT is enabled as the preempt counter
allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when
the context is preemtible. When PREEMPT_COUNT is disabled the context
is unknown and the reclaim bit can't be set because the caller might
hold locks which might deadlock in the allocator.
That makes debugobjects depend on PREEMPT_COUNT ||
!DEFERRED_STRUCT_PAGE_INIT, which limits the coverage slightly, but
keeps it functional for most cases"
* tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobject: Make it work with deferred page initialization - again
KVM/arm64 fixes for 7.0, take #1
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered...
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Use the MADT and SRAT table data to compute __num_nodes_per_package.
Specifically, SRAT has already been parsed in x86_numa_init(), which is called
before acpi_boot_init() which parses MADT. So both are available in
topology_init_possible_cpus().
This number is useful to divinate the various Intel CoD/SNC and AMD NPS modes,
since the platforms are failing to provide this otherwise.
Doing it this way is independent of the number of online CPUs and
other such shenanigans.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110100.004091624@infradead.org
Pull parisc fixes from Helge Deller:
"While testing Sasha Levin's 'kallsyms: embed source file:line info in
kernel stack traces' patch series, which increases the typical kernel
image size, I found some issues with the parisc initial kernel mapping
which may prevent the kernel to boot.
The three small patches here fix this"
* tag 'parisc-for-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix initial page table creation for boot
parisc: Check kernel mapping earlier at bootup
parisc: Increase initial mapping to 64 MB with KALLSYMS
Fix an error reported by the kernel test robot:
au1100fb.c: error: implicit declaration of function 'KSEG1ADDR'; did you mean 'CKSEG1ADDR'?
arch/mips/include/asm/addrspace.h defines KSEG1ADDR only for 32 bit
configurations. So provide its compile-test stub also for 64bit mips builds.
Fixes: 6f366e86481a ("fbdev: au1100fb: Make driver compilable on non-mips platforms")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202603042127.PT6LuKqi-lkp@intel.com/
Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
The Iomega ZIP 100 (Z100P2) can't process IO Advice Hints Grouping mode
page query. It immediately switches to the status phase 0xb8 after
receiving the subpage code 0x05 of MODE_SENSE_10 command, which fails
imm_out() and turns into DID_ERROR of this command, which leads to unusable
device. This was tested with an Iomega ZIP 100 (Z100P2) connected with a
StarTech PEX1P2 AX99100 PCIe parallel port card.
Prior to this fix, Test Unit Ready fails and the drive can't be used:
IMM: returned SCSI status b8
sd 7:0:6:0: [sdh] Test Unit Ready failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Signed-off-by: Florian Fuchs <fuchsfl@gmail.com>
Link: https://patch.msgid.link/20260227181823.892932-1-fuchsfl@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull x86 fixes from Ingo Molnar:
- Fix speculative safety in fred_extint()
- Fix __WARN_printf() trap in early_fixup_exception()
- Fix clang-build boot bug for unusual alignments, triggered by
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y
- Replace the final few __ASSEMBLY__ stragglers that snuck in lately
into non-UAPI x86 headers and use __ASSEMBLER__ consistently (again)
* tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/headers: Replace __ASSEMBLY__ stragglers with __ASSEMBLER__
x86/cfi: Fix CFI rewrite for odd alignments
x86/bug: Handle __WARN_printf() trap in early_fixup_exception()
x86/fred: Correct speculative safety in fred_extint()
debugobjects uses __GFP_HIGH for allocations as it might be invoked
within locked regions. That worked perfectly fine until v6.18. It still
works correctly when deferred page initialization is disabled and works
by chance when no page allocation is required before deferred page
initialization has completed.
Since v6.18 allocations w/o a reclaim flag cause new_slab() to end up in
alloc_frozen_pages_nolock_noprof(), which returns early when deferred
page initialization has not yet completed. As the deferred page
initialization takes quite a while the debugobject pool is depleted and
debugobjects are disabled.
This can be worked around when PREEMPT_COUNT is enabled as that allows
debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when the context
is preemtible. When PREEMPT_COUNT is disabled the context is unknown and
the reclaim bit can't be set because the caller might hold locks which
might deadlock in the allocator.
In preemptible context the reclaim bit is harmless and not a performance
issue as that's usually invoked from slow path initialization context.
That makes debugobjects depend on PREEMPT_COUNT || !DEFERRED_STRUCT_PAGE_INIT.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/87pl6gznti.ffs@tglx
We currently have three versions of the ASID retrieval code, one
in the S1 walker, and two in the VNCR handling (although the last
two are limited to the EL2&0 translation regime).
Make this code common, and take this opportunity to also simplify
the code a bit while switching over to the TTBRx_EL1_ASID macro.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260225104718.14209-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
The topology setup code needs to know the total number of physical
nodes enumerated in SRAT; however NUMA_EMU can cause the existing
numa_nodes_parsed bitmap to be fictitious. Therefore, keep a copy of
the bitmap specifically to retain the physical node count.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110059.889884023@infradead.org
Pull bpf fixes from Alexei Starovoitov:
- Fix u32/s32 bounds when ranges cross min/max boundary (Eduard
Zingerman)
- Fix precision backtracking with linked registers (Eduard Zingerman)
- Fix linker flags detection for resolve_btfids (Ihor Solodrai)
- Fix race in update_ftrace_direct_add/del (Jiri Olsa)
- Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
resolve_btfids: Fix linker flags detection
selftests/bpf: add reproducer for spurious precision propagation through calls
bpf: collect only live registers in linked regs
Revert "selftests/bpf: Update reg_bound range refinement logic"
selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary
bpf: Fix u32/s32 bounds when ranges cross min/max boundary
bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim
ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
The KERNEL_INITIAL_ORDER value defines the initial size (usually 32 or
64 MB) of the page table during bootup. Up until now the whole area was
initialized with PTE entries, but there was no check if we filled too
many entries. Change the code to fill up with so many entries that the
"_end" symbol can be reached by the kernel, but not more entries than
actually fit into the initial PTE tables.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v6.0+
The driver retains reset history even after the IOC has successfully
reached the READY state. That leaves stale reset information active during
normal operation and can mislead recovery and diagnostics. In addition, if
the IOC becomes READY just as the ready timeout loop exits, the driver
still follows the failure path and may retry or report failure incorrectly.
Clear reset history once READY is confirmed so driver state matches actual
IOC status. After the timeout loop, recheck the IOC state and treat READY
as success instead of failing.
Signed-off-by: Ranjan Kumar <ranjan.kumar@broadcom.com>
Link: https://patch.msgid.link/20260225082622.82588-1-ranjan.kumar@broadcom.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull timer fix from Ingo Molnar:
"Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for
the common HZ=100, 250 or 1000 cases. Only use a function call for odd
HZ values like HZ=300 that generate more code.
The function call overhead showed up in performance tests of the TCP
code"
* tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
After converting the __ASSEMBLY__ statements to __ASSEMBLER__ in
commit 24a295e4ef1ca ("x86/headers: Replace __ASSEMBLY__ with
__ASSEMBLER__ in non-UAPI headers"), some new code has been
added that uses __ASSEMBLY__ again. Convert these stragglers, too.
This is a mechanical patch, done with a simple "sed -i" command.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251218182029.166993-1-thuth@redhat.com
It appears that a !! became ! during a cleanup, resulting in inverted
logic when detecting if a host GICv5 implementation is capable of
virtualization.
Re-add the missing !, fixing the behaviour.
Fixes: 3227c3a89d65f ("irqchip/gic-v5: Check if impl is virt capable")
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260225083130.3378490-1-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
CONFIG_EFI_SBAT_FILE can be a relative path. When compiling using a different
output directory (O=) the build currently fails because it can't find the
filename set in CONFIG_EFI_SBAT_FILE:
arch/x86/boot/compressed/sbat.S: Assembler messages:
arch/x86/boot/compressed/sbat.S:6: Error: file not found: kernel.sbat
Add $(srctree) as include dir for sbat.o.
[ bp: Massage commit message. ]
Fixes: 61b57d35396a ("x86/efi: Implement support for embedding SBAT data for x86")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/f4eda155b0cef91d4d316b4e92f5771cb0aa7187.1772047658.git.jstancek@redhat.com
Pull RCU selftest fixes from Boqun Feng:
"Fix a regression in RCU torture test pre-defined scenarios caused by
commit 7dadeaa6e851 ("sched: Further restrict the preemption modes")
which limits PREEMPT_NONE to architectures that do not support
preemption at all and PREEMPT_VOLUNTARY to those architectures that do
not yet have PREEMPT_LAZY support.
Since major architectures (e.g. x86 and arm64) no longer support
CONFIG_PREEMPT_NONE and CONFIG_PREEMPT_VOLUNTARY, using them in
rcutorture, rcuscale, refscale, and scftorture pre-defined scenarios
causes config checking errors.
Switch these kconfigs to PREEMPT_LAZY"
* tag 'rcu-fixes.v7.0-20260307a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
scftorture: Update due to x86 not supporting none/voluntary preemption
refscale: Update due to x86 not supporting none/voluntary preemption
rcuscale: Update due to x86 not supporting none/voluntary preemption
rcutorture: Update due to x86 not supporting none/voluntary preemption
The "|| echo -lzstd" default makes zstd an unconditional link
dependency of resolve_btfids. On systems where libzstd-dev is not
installed and pkg-config fails, the linker fails:
ld: cannot find -lzstd: No such file or directory
libzstd is a transitive dependency of libelf, so the -lzstd flag is
strictly necessary only for static builds [1].
Remove ZSTD_LIBS variable, and instead set LIBELF_LIBS depending on
whether the build is static or not. Use $(HOSTPKG_CONFIG) as primary
source of the flags list.
Also add a default value for HOSTPKG_CONFIG in case it's not built via
the toplevel Makefile. Pass it from selftests/bpf too.
[1] https://lore.kernel.org/bpf/4ff82800-2daa-4b9f-95a9-6f512859ee70@linux.dev/
Reported-by: BPF CI Bot (Claude Opus 4.6) <bot+bpf-ci@kernel.org>
Reported-by: Vitaly Chikunov <vt@altlinux.org>
Closes: https://lore.kernel.org/bpf/aaWqMcK-2AQw5dx8@altlinux.org/
Fixes: 4021848a903e ("selftests/bpf: Pass through build flags to bpftool and resolve_btfids")
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Reviewed-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260305014730.3123382-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The check if the initial mapping is sufficient needs to happen much
earlier during bootup. Move this test directly to the start_parisc()
function and use native PDC iodc functions to print the warning, because
panic() and printk() are not functional yet.
This fixes boot when enabling various KALLSYSMS options which need
much more space.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v6.0+
This leak will cause a hang when tearing down the SCSI host. For example,
iscsid hangs with the following call trace:
[130120.652718] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
PID: 2528 TASK: ffff9d0408974e00 CPU: 3 COMMAND: "iscsid"
#0 [ffffb5b9c134b9e0] __schedule at ffffffff860657d4
#1 [ffffb5b9c134ba28] schedule at ffffffff86065c6f
#2 [ffffb5b9c134ba40] schedule_timeout at ffffffff86069fb0
#3 [ffffb5b9c134bab0] __wait_for_common at ffffffff8606674f
#4 [ffffb5b9c134bb10] scsi_remove_host at ffffffff85bfe84b
#5 [ffffb5b9c134bb30] iscsi_sw_tcp_session_destroy at ffffffffc03031c4 [iscsi_tcp]
#6 [ffffb5b9c134bb48] iscsi_if_recv_msg at ffffffffc0292692 [scsi_transport_iscsi]
#7 [ffffb5b9c134bb98] iscsi_if_rx at ffffffffc02929c2 [scsi_transport_iscsi]
#8 [ffffb5b9c134bbf0] netlink_unicast at ffffffff85e551d6
#9 [ffffb5b9c134bc38] netlink_sendmsg at ffffffff85e554ef
Fixes: 8fe4ce5836e9 ("scsi: core: Fix a use-after-free")
Cc: stable@vger.kernel.org
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260223232728.93350-1-junxiao.bi@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull scheduler fixes from Ingo Molnar:
- Fix zero_vruntime tracking when there's a single task running
- Fix slice protection logic
- Fix the ->vprot logic for reniced tasks
- Fix lag clamping in mixed slice workloads
- Fix objtool uaccess warning (and bug) in the
!CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining,
which triggers with older compilers
- Fix a comment in the rseq registration rseq_size bound check code
- Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes
differently, which special size we now reached naturally and want to
avoid. The visible ugliness of the new reserved field will be avoided
the next time the RSEQ area is extended.
* tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: slice ext: Ensure rseq feature size differs from original rseq size
rseq: Clarify rseq registration rseq_size bound check comment
sched/core: Fix wakeup_preempt's next_class tracking
rseq: Mark rseq_arm_slice_extension_timer() __always_inline
sched/fair: Fix lag clamp
sched/eevdf: Update se->vprot in reweight_entity()
sched/fair: Only set slice protection at pick time
sched/fair: Fix zero_vruntime tracking
For common cases (HZ=100, 250 or 1000), these helpers are at most one
multiply, so there is no point calling a tiny function.
Keep them out of line for HZ=300 and others.
This saves cycles in TCP fast path, among other things.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/8 grow/shrink: 25/89 up/down: 530/-3474 (-2944)
...
nla_put_msecs 193 - -193
message_stats_print 2131 920 -1211
Total: Before=25365208, After=25362264, chg -0.01%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260210170226.57209-1-edumazet@google.com
Rustam reported his clang builds did not boot properly; turns out his
.config has: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y set.
Fix up the FineIBT code to deal with this unusual alignment.
Fixes: 931ab63664f0 ("x86/ibt: Implement FineIBT")
Reported-by: Rustam Kovhaev <rkovhaev@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Rustam Kovhaev <rkovhaev@gmail.com>
Commit 0c4762e26879 ("KVM: arm64: nv: Avoid NV stage-2 code when NV is
not supported") added an early return to several functions in
arch/arm64/kvm/nested.c to prevent a UBSAN shift-out-of-bounds error
when accessing the pgt union for non-nested VMs.
However, this early return was inadvertently applied to
kvm_arch_flush_shadow_all() as well, causing it to skip the call to
kvm_uninit_stage2_mmu(kvm) for all non-nested VMs.
For pKVM, skipping this teardown means the host never unshares the
guest's memory with the EL2 hypervisor. When the host kernel later
recycles these leaked pages for a new VM, it attempts to re-share them.
The hypervisor correctly rejects this with -EPERM, triggering a host
WARN_ON and hanging the guest.
Fix this by dropping the early return from kvm_arch_flush_shadow_all().
The for-loop guarding the nested MMU cleanup already bounds itself when
nested_mmus_size == 0, allowing execution to proceed to
kvm_uninit_stage2_mmu() as intended.
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/60916cb6-f460-4751-b910-f63c58700ad0@sirena.org.uk/
Fixes: 0c4762e26879 ("KVM: arm64: nv: Avoid NV stage-2 code when NV is not supported")
Signed-off-by: Fuad Tabba <tabba@google.com>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://patch.msgid.link/20260222083352.89503-1-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The SEV-SNP IBPB-on-Entry feature does not require a guest-side
implementation. It was added in Zen5 h/w, after the first SNP Zen
implementation, and thus was not accounted for when the initial set of SNP
features were added to the kernel.
In its abundant precaution, commit
8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support")
included SEV_STATUS' IBPB-on-Entry bit as a reserved bit, thereby masking
guests from using the feature.
Allow guests to make use of IBPB-on-Entry when supported by the hypervisor, as
the bit is now architecturally defined and safe to expose.
Fixes: 8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@kernel.org
Link: https://patch.msgid.link/20260203222405.4065706-2-kim.phillips@amd.com
Pull tracing fixes from Steven Rostedt:
- Fix possible NULL pointer dereference in trace_data_alloc()
On the trace_data_alloc() error path, it can call trigger_data_free()
with a NULL pointer. This used to be a kfree() but was changed to
trigger_data_free() to clean up any partial initialization. The issue
is that trigger_data_free() does not expect a NULL pointer. Have
trigger_data_free() return safely on NULL pointer.
- Fix multiple events on the command line and bootconfig
If multiple events are enabled on the command line separately and not
grouped, only the last event gets enabled. That is:
trace_event=sched_switch trace_event=sched_waking
will only enable sched_waking whereas:
trace_event=sched_switch,sched_waking
will enable both.
The bootconfig makes it even worse as the second way is the more
common method.
The issue is that a temporary buffer is used to store the events to
enable later in boot. Each time the cmdline callback is called, it
overwrites what was previously there.
Have the callback append the next value (delimited by a comma) if the
temporary buffer already has content.
- Fix command line trace_buffer_size if >= 2G
The logic to allocate the trace buffer uses "int" for the size
parameter in the command line code causing overflow issues if more
that 2G is specified.
* tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
tracing: Fix enabling multiple events on the kernel command line and bootconfig
tracing: Add NULL pointer check to trigger_data_free()
As of v7.0-rc1, architectures that support preemption, including x86 and
arm64, no longer support CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY.
Attempting to build kernels with these two Kconfig options results in
.config errors. This commit therefore switches such scftorture scenarios
to CONFIG_PREEMPT_LAZY.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Link: https://patch.msgid.link/20260303235903.1967409-4-paulmck@kernel.org
Eduard Zingerman says:
====================
bpf: Fix precision backtracking bug with linked registers
Emil Tsalapatis reported a verifier bug hit by the scx_lavd sched_ext
scheduler. The essential part of the verifier log looks as follows:
436: ...
// checkpoint hit for 438: (1d) if r7 == r8 goto ...
frame 3: propagating r2,r7,r8
frame 2: propagating r6
mark_precise: frame3: last_idx ...
mark_precise: frame3: regs=r2,r7,r8 stack= before 436: ...
mark_precise: frame3: regs=r2,r7 stack= before 435: ...
mark_precise: frame3: regs=r2,r7 stack= before 434: (85) call bpf_trace_vprintk#177
verifier bug: backtracking call unexpected regs 84
The log complains that registers r2 and r7 are tracked as precise
while processing the bpf_trace_vprintk() call in precision backtracking.
This can't be right, as r2 is reset by the call and there is nothing
to backtrack it to. The precision propagation is triggered when
a checkpoint is hit at instruction 438, r2 is dead at that instruction.
This happens because of the following sequence of events:
- Instruction 438 is first reached with registers r2 and r7 having
the same id via a path that does not call bpf_trace_vprintk():
- Checkpoint is created at 438.
- The jump at 438 is predicted, hence r7 and registers linked to it
(r2) are propagated as precise, marking r2 and r7 precise in the
checkpoint.
- Instruction 438 is reached a second time with r2 undefined and via
a path that calls bpf_trace_vprintk():
- Checkpoint is hit.
- propagate_precision() picks registers r2 and r7 and propagates
precision marks for those up to the helper call.
The root cause is the fact that states_equal() and
propagate_precision() assume that the precision flag can't be set for a
dead register (as computed by compute_live_registers()).
However, this is not the case when linked registers are at play.
Fix this by accounting for live register flags in
collect_linked_regs().
---
====================
Link: https://patch.msgid.link/20260306-linked-regs-and-propagate-precision-v1-0-18e859be570d@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
According to JESD223F, the maximum number of queues (MAXQ) is 32. When MCQ
is enabled and ESI is disabled, nr_hw_queues=32 causes a shift overflow
problem.
Fix this by using 64-bit intermediate values to handle the nr_hw_queues=32
case safely.
Signed-off-by: wangshuaiwei <wangshuaiwei1@xiaomi.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260224063228.50112-1-wangshuaiwei1@xiaomi.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull perf events fixes from Ingo Molnar:
- Fix lock ordering bug found by lockdep in perf_event_wakeup()
- Fix uncore counter enumeration on Granite Rapids and Sierra Forest
- Fix perf_mmap() refcount bug found by Syzkaller
- Fix __perf_event_overflow() vs perf_remove_from_context() race
* tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix __perf_event_overflow() vs perf_remove_from_context() race
perf/core: Fix refcount bug and potential UAF in perf_mmap
perf/x86/intel/uncore: Add per-scheduler IMC CAS count events
perf/core: Fix invalid wait context in ctx_sched_in()
Before rseq became extensible, its original size was 32 bytes even
though the active rseq area was only 20 bytes. This had the following
impact in terms of userspace ecosystem evolution:
* The GNU libc between 2.35 and 2.39 expose a __rseq_size symbol set
to 32, even though the size of the active rseq area is really 20.
* The GNU libc 2.40 changes this __rseq_size to 20, thus making it
express the active rseq area.
* Starting from glibc 2.41, __rseq_size corresponds to the
AT_RSEQ_FEATURE_SIZE from getauxval(3).
This means that users of __rseq_size can always expect it to
correspond to the active rseq area, except for the value 32, for
which the active rseq area is 20 bytes.
Exposing a 32 bytes feature size would make life needlessly painful
for userspace. Therefore, add a reserved field at the end of the
rseq area to bump the feature size to 33 bytes. This reserved field
is expected to be replaced with whatever field will come next,
expecting that this field will be larger than 1 byte.
The effect of this change is to increase the size from 32 to 64 bytes
before we actually have fields using that memory.
Clarify the allocation size and alignment requirements in the struct
rseq uapi comment.
Change the value returned by getauxval(AT_RSEQ_ALIGN) to return the
value of the active rseq area size rounded up to next power of 2, which
guarantees that the rseq structure will always be aligned on the nearest
power of two large enough to contain it, even as it grows. Change the
alignment check in the rseq registration accordingly.
This will minimize the amount of ABI corner-cases we need to document
and require userspace to play games with. The rule stays simple when
__rseq_size != 32:
#define rseq_field_available(field) (__rseq_size >= offsetofend(struct rseq_abi, field))
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260220200642.1317826-3-mathieu.desnoyers@efficios.com
Pull powerpc updates for 7.0
- Implement masked user access
- Add bpf support for internal only per-CPU instructions and inline the
bpf_get_smp_processor_id() and bpf_get_current_task() functions
- Fix pSeries MSI-X allocation failure when quota is exceeded
- Fix recursive pci_lock_rescan_remove locking in EEH event handling
- Support tailcalls with subprogs & BPF exceptions on 64bit
- Extend "trusted" keys to support the PowerVM Key Wrapping Module
(PKWM)
Thanks to Abhishek Dubey, Christophe Leroy, Gaurav Batra, Guangshuo Li,
Jarkko Sakkinen, Mahesh Salgaonkar, Mimi Zohar, Miquel Sabaté Solà, Nam
Cao, Narayana Murty N, Nayna Jain, Nilay Shroff, Puranjay Mohan, Saket
Kumar Bhaskar, Sourabh Jain, Srish Srinivasan, and Venkat Rao Bagalkote.
* tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (27 commits)
powerpc/pseries: plpks: export plpks_wrapping_is_supported
docs: trusted-encryped: add PKWM as a new trust source
keys/trusted_keys: establish PKWM as a trusted source
pseries/plpks: add HCALLs for PowerVM Key Wrapping Module
pseries/plpks: expose PowerVM wrapping features via the sysfs
powerpc/pseries: move the PLPKS config inside its own sysfs directory
pseries/plpks: fix kernel-doc comment inconsistencies
powerpc/smp: Add check for kcalloc() failure in parse_thread_groups()
powerpc: kgdb: Remove OUTBUFMAX constant
powerpc64/bpf: Additional NVR handling for bpf_throw
powerpc64/bpf: Support exceptions
powerpc64/bpf: Add arch_bpf_stack_walk() for BPF JIT
powerpc64/bpf: Avoid tailcall restore from trampoline
powerpc64/bpf: Support tailcalls with subprogs
powerpc64/bpf: Moving tail_call_cnt to bottom of frame
powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling
powerpc/pseries: Fix MSI-X allocation failure when quota is exceeded
powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memory
powerpc64/bpf: Inline bpf_get_smp_processor_id() and bpf_get_current_task/_btf()
powerpc64/bpf: Support internal-only MOV instruction to resolve per-CPU addrs
...
The commit 5b472b6e5bd9 ("x86_64/bug: Implement __WARN_printf()")
implemented __WARN_printf(), which changed the mechanism to use UD1
instead of UD2. However, it only handles the trap in the runtime IDT
handler, while the early booting IDT handler lacks this handling. As a
result, the usage of WARN() before the runtime IDT setup can lead to
kernel crashes. Since KMSAN is enabled after the runtime IDT setup, it
is safe to use handle_bug() directly in early_fixup_exception() to
address this issue.
Fixes: 5b472b6e5bd9 ("x86_64/bug: Implement __WARN_printf()")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/c4fb3645f60d3a78629d9870e8fcc8535281c24f.1768016713.git.houwenlong.hwl@antgroup.com
Pull spi fixes from Mark Brown:
"One final batch of fixes for the Tegra SPI drivers, the main one is a
batch of fixes for races with the interrupts in the Tegra210 QSPI
driver that Breno has been working on for a while"
* tag 'spi-fix-v6.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: tegra114: Preserve SPI mode bits in def_command1_reg
spi: tegra: Fix a memory leak in tegra_slink_probe()
spi: tegra210-quad: Protect curr_xfer check in IRQ handler
spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
spi: tegra210-quad: Move curr_xfer read inside spinlock
spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
When a block read returns an invalid length, zero or >I2C_SMBUS_BLOCK_MAX,
the length handler sets the state to IMX_I2C_STATE_FAILED. However,
i2c_imx_master_isr() unconditionally overwrites this with
IMX_I2C_STATE_READ_CONTINUE, causing an endless read loop that overruns
buffers and crashes the system.
Guard the state transition to preserve error states set by the length
handler.
Fixes: 5f5c2d4579ca ("i2c: imx: prevent rescheduling in non dma mode")
Signed-off-by: LI Qingwu <Qing-wu.Li@leica-geosystems.com.cn>
Cc: <stable@vger.kernel.org> # v6.13+
Reviewed-by: Stefan Eichenberger <eichest@gmail.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260116111906.3413346-2-Qing-wu.Li@leica-geosystems.com.cn
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Pull fsverity fixes from Eric Biggers:
- Fix a build error on parisc
- Remove the non-large-folio-aware function fsverity_verify_page()
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: fix build error by adding fsverity_readahead() stub
fsverity: remove fsverity_verify_page()
f2fs: make f2fs_verify_cluster() partially large-folio-aware
f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
Since 3669ddd8fa8b5 ("KVM: arm64: Add a range to pkvm_mappings"),
pKVM tracks the memory that has been mapped into a guest in a
side data structure. Crucially, it uses it to find out whether
a page has already been mapped, and therefore refuses to map it
twice. So far, so good.
However, this very patch completely breaks non-4kB page support,
with guests being unable to boot. The most obvious symptom is that
we take the same fault repeatedly, and not making forward progress.
A quick investigation shows that this is because of the above
rejection code.
As it turns out, there are multiple issues at play:
- while the HPFAR_EL2 register gives you the faulting IPA minus
the bottom 12 bits, it will still give you the extra bits that
are part of the page offset for anything larger than 4kB,
even for a level-3 mapping
- pkvm_pgtable_stage2_map() assumes that the address passed as
a parameter is aligned to the size of the intended mapping
- the faulting address is only aligned for a non-page mapping
When the planets are suitably aligned (pun intended), the guest
faults on a page by accessing it past the bottom 4kB, and extra bits
get set in the HPFAR_EL2 register. If this results in a page mapping
(which is likely with large granule sizes), nothing aligns it further
down, and pkvm_mapping_iter_first() finds an intersection that
doesn't really exist. We assume this is a spurious fault and return
-EAGAIN. And again...
This doesn't hit outside of the protected code, as the page table
code always aligns the IPA down to a page boundary, hiding the issue
for everyone else.
Fix it by always forcing the alignment on vma_pagesize, irrespective
of the value of vma_pagesize.
Fixes: 3669ddd8fa8b5 ("KVM: arm64: Add a range to pkvm_mappings")
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://https://patch.msgid.link/20260222141000.3084258-1-maz@kernel.org
Cc: stable@vger.kernel.org
As part of the work to remove the dependency on calling into the decompressor
code (startup_64()) for a UEFI boot, a call to rmpadjust() was removed from
sev_enable() in favor of checking the value of the snp_vmpl variable.
When booting through a non-UEFI path and calling startup_64(), the call to
sev_enable() is performed before the BSS section is zeroed. With the removal
of the rmpadjust() call and the corresponding check of the return code, the
snp_vmpl variable is checked.
Since the kernel is running at VMPL0, the snp_vmpl variable will not have been
set and should be the default value of 0. However, since the call occurs
before the BSS is zeroed, the snp_vmpl variable may not actually be zero,
which will cause the guest boot to fail.
Since the decompressor relocates itself, the BSS would need to be cleared both
before and after the relocation, but this would, in effect, cause all of the
changes to BSS variables before relocation to be lost after relocation.
Instead, move the snp_vmpl variable into the .data section so that it is
initialized and the value made safe during relocation. As a pre-caution
against future changes, move other SEV-related decompressor variables into the
.data section, too.
Fixes: 68a501d7fd82 ("x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Tested-by: Kevin Hui <kevinhui@meta.com>
Tested-by: Changyuan Lyu <changyuanl@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/5648b7de5b0a5d0dfef3785f9582b718678c6448.1770217260.git.thomas.lendacky@amd.com