Pull sched_ext fixes from Tejun Heo:
"The merge window pulled in the cgroup sub-scheduler infrastructure,
 and new AI reviews are accelerating bug reporting and fixing - hence
 the larger than usual fixes batch:

  - Use-after-frees during scheduler load/unload:

      - The disable path could free the BPF scheduler while deferred
        irq_work / kthread work was still in flight

      - cgroup setter callbacks read the active scheduler outside the
        rwsem that synchronizes against teardown

    Fix both, and reuse the disable drain in the enable error paths so
    the BPF JIT page can't be freed under live callbacks.

  - Several BPF op invocations didn't tell the framework which runqueue
    was already locked, so helper kfuncs that re-acquire the runqueue
    by CPU could deadlock on the held lock.

    Fix the affected callsites, including recursive parent-into-child
    dispatch.

  - The hardlockup notifier ran from NMI but eventually took a
    non-NMI-safe lock. Bounce it through irq_work.

  - A handful of bugs in the new sub-scheduler hierarchy:

      - helper kfuncs hard-coded the root instead of resolving the
        caller's scheduler

      - the enable error path tried to disable per-task state that had
        never been initialized, and leaked cpus_read_lock on the way
        out

      - a sysfs object was leaked on every load/unload

      - the dispatch fast-path used the root scheduler instead of the
        task's

      - a couple of CONFIG #ifdef guards were misclassified

  - Verifier-time hardening: BPF programs of unrelated struct_ops types
    (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic
    bug and, once sub-sched was enabled, a KASAN out-of-bounds read.
    Now rejected at load. Plus a few NULL and cross-task argument
    checks on sched_ext kfuncs, and a selftest covering the new deny.

  - rhashtable (Herbert): restore the insecure_elasticity toggle and
    bounce the deferred-resize kick through irq_work to break a
    lock-order cycle observable from raw-spinlock callers. sched_ext's
    scheduler-instance hash is the first user of both.

  - The bypass-mode load balancer used file-scope cpumasks; with
    multiple scheduler instances now possible, those raced. Move to
    per-instance cpumasks, plus a follow-up to skip tasks whose
    recorded CPU is stale relative to the new owning runqueue.

  - Smaller fixes:

      - a dispatch queue's first-task tracking misbehaved when a parked
        iterator cursor sat in the list

      - the runqueue's next-class wasn't promoted on local-queue
        enqueue, leaving an SCX task behind RT in edge cases

      - the reference qmap scheduler stopped erroring on legitimate
        cross-scheduler task-storage misses"
* tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits)
sched_ext: Fix scx_flush_disable_work() UAF race
sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
sched_ext: Refuse cross-task select_cpu_from_kfunc calls
sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
sched_ext: Make bypass LB cpumasks per-scheduler
sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
sched_ext: Unregister sub_kset on scheduler disable
sched_ext: Defer scx_hardlockup() out of NMI
sched_ext: sync disable_irq_work in bpf_scx_unreg()
sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
...
Pull xen fixes from Juergen Gross:
"XSA-485 and XSA-487 security patches"
* tag 'xsa48x-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: fix double free via VMA splitting
Buffer overflow in drivers/xen/sys-hypervisor.c
scx_flush_disable_work() calls irq_work_sync() followed by
kthread_flush_work() to ensure that the disable kthread work has
fully completed before bpf_scx_unreg() frees the SCX scheduler.
However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall)
creates a race window between scx_claim_exit() and irq_work_queue():
  CPU A (scx_vexit (watchdog))         CPU B (bpf_scx_unreg)
  ----------------------------         ---------------------
  scx_claim_exit()
    atomic_try_cmpxchg(NONE->kind)
    stack_trace_save()
    vscnprintf()
                                       scx_disable()
                                         scx_claim_exit() -> FAIL
                                       scx_flush_disable_work()
                                         irq_work_sync()      // no-op: not queued yet
                                         kthread_flush_work() // no-op: not queued yet
                                       kobject_put(&sch->kobj) -> free %sch
  irq_work_queue() -> UAF on %sch
    scx_disable_irq_workfn()
      kthread_queue_work() -> UAF
The root cause is that CPU B's scx_flush_disable_work() returns after
syncing an irq_work that has not yet been queued, while CPU A is still
executing the code between scx_claim_exit() and irq_work_queue().
Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining
disable_irq_work and disable_work in each pass. This ensures that any
work queued after the previous check is caught, while also correctly
handling cases where no disable was triggered (e.g., the
scx_sub_enable_workfn() abort path).
Fixes: 510a27055446 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()")
Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup fixes from Tejun Heo:
- Fix UAF race in psi pressure_write() against cgroup file release by
extending cgroup_mutex coverage and ordering of->priv access after
cgroup_kn_lock_live()
- Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX
by performing the increment in s64
- Fix asymmetric DL bandwidth accounting on cpuset attach rollback by
recording the CPU used by dl_bw_alloc() so cancel_attach() returns
the reservation to the same root domain
- Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after
rmdir by incrementing from kill_css() instead of offline_css()
- Typo fix in cgroup-v2 documentation
* tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup: fix typo 'protetion' -> 'protection'
cgroup: Increment nr_dying_subsys_* from rmdir context
cgroup/cpuset: record DL BW alloc CPU for attach rollback
cgroup/rdma: fix integer overflow in rdmacg_try_charge()
sched/psi: fix race between file release and pressure write
privcmd_vm_ops defines .close (privcmd_close), but neither .may_split
nor .open. When userspace does a partial munmap() on a privcmd mapping,
the kernel splits the VMA via __split_vma(). Since may_split is NULL,
the split is allowed. vm_area_dup() copies vm_private_data (a pages
array allocated in alloc_empty_pages()) into the new VMA without any
fixup, because there is no .open callback.
Both VMAs now point to the same pages array. When the unmapped portion
is closed, privcmd_close() calls:
- xen_unmap_domain_gfn_range()
- xen_free_unpopulated_pages()
- kvfree(pages)
The surviving VMA still holds the dangling pointer. When it is later
destroyed, the same sequence runs again, which leads to a double free.
Fix this issue by adding a .may_split callback denying the VMA split.
This is XSA-487 / CVE-2026-31787
Fixes: d71f513985c2 ("xen: privcmd: support autotranslated physmap guests.")
Reported-by: Atharva Vartak <atharva.a.vartak@gmail.com>
Suggested-by: Atharva Vartak <atharva.a.vartak@gmail.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
There are several edge cases (see linked thread) where an IMMED task
can be left lingering on a local DSQ if an RT task swoops in at the
wrong time. All of these edge cases are due to rq->next_class being idle
even after dispatching a task to rq's local DSQ. We should bump
rq->next_class to &ext_sched_class as soon as we've inserted a task into
the local DSQ.
To optimize the common case of rq->next_class == &ext_sched_class,
only call wakeup_preempt() if rq->next_class is below EXT. If next_class
is EXT or above, wakeup_preempt() is a no-op anyway.
This lets us also simplify the preempt_curr() logic a bit since
wakeup_preempt() will call preempt_curr() for us if next_class is
below EXT.
Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull isofs and udf fixes from Jan Kara:
"Several isofs and udf fixes"
* tag 'fs_for_v7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
docs: isofs: replace dead ECMA-119 FTP link
udf: reject descriptors with oversized CRC length
isofs: use QSTR_LEN() in isofs_cmp
isofs: validate block number from NFS file handle in isofs_export_iget
isofs: validate Rock Ridge CE continuation extent against volume size
Fix a small typo in the description of the memory_hugetlb_accounting
mount option.
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
The build id returned by HYPERVISOR_xen_version(XENVER_build_id) is
neither NUL terminated nor a string.
The first causes a buffer overflow as sprintf in buildid_show will
read and copy till it finds a NUL.
00000000 f4 91 51 f4 dd 38 9e 9d 65 47 52 eb 10 71 db 50 |..Q..8..eGR..q.P|
00000010 b9 a8 01 42 6f 2e 32 |...Bo.2|
00000017
So use a memcpy instead of sprintf to have the correct value:
00000000 f4 91 51 f4 dd 00 9e 9d 65 47 52 eb 10 71 db 50 |..Q.....eGR..q.P|
00000010 b9 a8 01 42 |...B|
00000014
(the above have a hack to embed a zero inside and check it's
returned correctly).
This is XSA-485 / CVE-2026-31786
Fixes: 84b7625728ea ("xen: add sysfs node for hypervisor build id")
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.
scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.
Drop cpus_read_lock() before goto err_disable.
v2: Correct Fixes: tag (Andrea Righi).
Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull fsnotify fixes from Jan Kara:
"Three fixes for fsnotify / fanotify"
* tag 'fsnotify_for_v7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: fix inode reference leak in fsnotify_recalc_mask()
fanotify: Fix spelling mistake "enforecement" -> "enforcement"
fanotify: fix false positive on permission events
The original link is no longer valid. Replace it with the official
PDF of the 2nd edition. The new link points to the exact 2nd edition
that the existing comment in isofs.rst refers to.
Signed-off-by: Ziran Zhang <zhangcoder@yeah.net>
Link: https://patch.msgid.link/20260425142943.6809-1-zhangcoder@yeah.net
Signed-off-by: Jan Kara <jack@suse.cz>
Incrementing nr_dying_subsys_* in offline_css(), which is executed by a
cgroup_offline_wq worker, leads to a race where the user can see the value
as 0 when reading cgroup.stat after calling rmdir but before the worker
executes. This makes the user wrongly expect resources released by the
removed cgroup to be available for a new assignment.
Increment nr_dying_subsys_* from kill_css(), which is called from the
cgroup_rmdir() context.
Fixes: ab0312526867 ("cgroup: Show # of subsystem CSSes in cgroup.stat")
Signed-off-by: Petr Malat <oss@malat.biz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull tracefs fixes from Steven Rostedt:
- Use list_add_tail_rcu() for walking eventfs children
The linked list of children is protected by SRCU and list walkers can
walk the list with only using SRCU. Using just list_add_tail() on
weakly ordered architectures can cause issues. Instead use
list_add_tail_rcu().
- Hold eventfs_mutex and SRCU for remount walk events
The trace_apply_options() walks the tracefs_inodes where some are
eventfs inodes and eventfs_remount() is called which in turn calls
eventfs_set_attr(). This walk only holds normal RCU read locks, but
the eventfs_mutex and SRCU should be held.
Add eventfs_remount_(un)lock() helpers to take the necessary locks
before iterating the list.
* tag 'tracefs-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Hold eventfs_mutex and SRCU when remount walks events
eventfs: Use list_add_tail_rcu() for SRCU-protected children list
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For
any non-scx task p->scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p->scx.slice /
p->scx.dsq_vtime on arbitrary tasks.
Reject !sch up front so the gate only admits callers with a resolved
scheduler.
Fixes: 245d09c594ea ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Pull btrfs fixes from David Sterba:
- space reservation fixes:
- correctly undo 'may_use' accounting for remap tree
- avoid double decrement of 'may_use' when submitting async io
- actually enable the shutdown ioctl callback (not just the superblock
ops)
- raid stripe tree fixes when deleting extents
- add missing error handling
- fix various incorrect values set
- fix transaction state when removing a directory, possibly leading to
EIO during log replay
- additional b-tree node key checks during metadata readahead
- error handling and transaction abort updates
* tag 'for-7.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix double-decrement of bytes_may_use in submit_one_async_extent()
btrfs: check return value of btrfs_partially_delete_raid_extent()
btrfs: handle -EAGAIN from btrfs_duplicate_item and refresh stale leaf pointer
btrfs: replace ASSERT with proper error handling in stripe lookup fallback
btrfs: fix wrong min_objectid in btrfs_previous_item() call
btrfs: fix raid stripe search missing entries at leaf boundaries
btrfs: copy devid in btrfs_partially_delete_raid_extent()
btrfs: handle unexpected free-space-tree key types
btrfs: fix missing last_unlink_trans update when removing a directory
btrfs: don't clobber errors in add_remap_tree_entries()
btrfs: enable shutdown ioctl for non-experimental builds
btrfs: apply first key check for readahead when possible
btrfs: abort transaction in do_remap_reloc_trans() on failure
btrfs: fix bytes_may_use leak in do_remap_reloc_trans()
btrfs: fix bytes_may_use leak in move_existing_remap()
fsnotify_recalc_mask() fails to handle the return value of
__fsnotify_recalc_mask(), which may return an inode pointer that needs
to be released via fsnotify_drop_object() when the connector's HAS_IREF
flag transitions from set to cleared.
This manifests as a hung task with the following call trace:
INFO: task umount:1234 blocked for more than 120 seconds.
Call Trace:
__schedule
schedule
fsnotify_sb_delete
generic_shutdown_super
kill_anon_super
cleanup_mnt
task_work_run
do_exit
do_group_exit
The race window that triggers the iref leak:
  Thread A (adding mark)                 Thread B (removing mark)
  ──────────────────────                 ────────────────────────
  fsnotify_add_mark_locked():
    fsnotify_add_mark_list():
      spin_lock(conn->lock)
      add mark_B(evictable) to list
      spin_unlock(conn->lock)
      return
  /* ---- gap: no lock held ---- */
                                         fsnotify_detach_mark(mark_A):
                                           spin_lock(mark_A->lock)
                                           clear ATTACHED flag on mark_A
                                           spin_unlock(mark_A->lock)
                                         fsnotify_put_mark(mark_A)
  fsnotify_recalc_mask():
    spin_lock(conn->lock)
    __fsnotify_recalc_mask():
      /* mark_A skipped: ATTACHED cleared */
      /* only mark_B(evictable) remains */
      want_iref = false
      has_iref = true /* not yet cleared */
      -> HAS_IREF transitions true -> false
      -> returns inode pointer
    spin_unlock(conn->lock)
    /* BUG: return value discarded!
     * iput() and fsnotify_put_sb_watched_objects()
     * are never called */
Fix this by deferring the transition true -> false of HAS_IREF flag from
fsnotify_recalc_mask() (Thread A) to fsnotify_put_mark() (thread B).
Fixes: c3638b5b1374 ("fsnotify: allow adding an inode mark without pinning inode")
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/CAOQ4uxiPsbHb0o5voUKyPFMvBsDkG914FYDcs4C5UpBMNm0Vcg@mail.gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
udf_read_tagged() skips CRC verification when descCRCLength +
sizeof(struct tag) exceeds the block size. A crafted UDF image can
set descCRCLength to an oversized value to bypass CRC validation
entirely; the descriptor is then accepted based solely on the 8-bit
tag checksum, which is trivially recomputable.
Reject such descriptors instead of silently accepting them. A
legitimate single-block descriptor should never have a CRC length that
exceeds the block.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-6
Assisted-by: Codex:gpt-5-4
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/20260413211240.853662-1-michael.bommarito@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
cpuset_can_attach() allocates DL bandwidth only when migrating
deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
rolls back based only on nr_migrate_dl_tasks. This makes the DL
bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
even when no dl_bw_alloc() was done.
Rollback also needs to undo the reservation against the same CPU/root
domain that was charged. Record the CPU used by dl_bw_alloc() and use
that state in cpuset_cancel_attach(). If no allocation happened,
dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
did happen, bandwidth is returned to the same CPU/root domain.
Successful attach paths are unchanged. This only fixes failed attach
rollback accounting.
Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull ktest updates from Steven Rostedt:
- Fix month in date timestamp used to create failure directories
On failure, a directory is created to store the logs and config file
to analyze the failure. The Perl function localtime is used to create
the date timestamp of the directory. The month passed back from that
function starts at 0 and not 1, but the timestamp used does not
account for that. Thus for April 20, 2026, the timestamp of 20260320
is used, instead of 20260420.
- Save the logfile to the failure directory
Just the test log was saved to the directory on failure, but there's
useful information in the full logfile that can be helpful to
analyzing the failure. Save the logfile as well.
* tag 'ktest-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
ktest: Add logfile to failure directory
ktest: Fix the month in the name of the failure directory
Commit 340f0c7067a9 ("eventfs: Update all the eventfs_inodes from the
events descriptor") had eventfs_set_attrs() recurse through ei->children
on remount. The walk only holds the rcu_read_lock() taken by
tracefs_apply_options() over tracefs_inodes, which is wrong:
- list_for_each_entry over ei->children races with the list_del_rcu()
in eventfs_remove_rec() -- LIST_POISON1 deref, same shape as
d2603279c7d6.
- eventfs_inodes are freed via call_srcu(&eventfs_srcu, ...).
rcu_read_lock() does not extend an SRCU grace period, so ti->private
can be reclaimed under the walk.
- The writes to ei->attr race with eventfs_set_attr(), which holds
eventfs_mutex.
Reproducer:
while :; do mount -o remount,uid=$((RANDOM%1000)) /sys/kernel/tracing; done &
while :; do
echo "p:kp submit_bio" > /sys/kernel/tracing/kprobe_events
echo > /sys/kernel/tracing/kprobe_events
done
Wrap the events portion of tracefs_apply_options() in
eventfs_remount_lock()/_unlock() that take eventfs_mutex and
srcu_read_lock(&eventfs_srcu). eventfs_set_attrs() doesn't sleep so the
nested rcu_read_lock() is fine; lockdep_assert_held() pins the contract.
Comment in tracefs_drop_inode() said "RCU cycle" -- it is SRCU.
Fixes: 340f0c7067a9 ("eventfs: Update all the eventfs_inodes from the events descriptor")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260418191737.10289-1-devnexen@gmail.com
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.
Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().
v2: Switch the in_select_cpu cross-task check from direct_dispatch_task
comparison to scx_kf_arg_task_ok(). The former spuriously rejects when
ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls
scx_bpf_select_cpu_*() on the same task. (Andrea Righi)
Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
Pull device mapper fix from Mikulas Patocka:
- fix metadata corruption in dm-thin
* tag 'for-7.1/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-thin: fix metadata refcount underflow
submit_one_async_extent() calls btrfs_reserve_extent(), which decrements
bytes_may_use. If the call to btrfs_create_io_em() fails, we jump to
out_free_reserve, which calls extent_clear_unlock_delalloc().
Because we're specifying EXTENT_DO_ACCOUNTING, i.e.
EXTENT_CLEAR_META_RESV | EXTENT_CLEAR_DATA_RESV, this decreases
bytes_may_use again. This can lead to problems later on, as an initial
write can appear to succeed only for the writeback to fail silently with
ENOSPC.
Fix this by replacing EXTENT_DO_ACCOUNTING with EXTENT_CLEAR_META_RESV.
This parallels a4fe134fc1d8eb ("btrfs: fix a double release on reserved
extents in cow_one_range()"), which is the same fix in cow_one_range().
Fixes: 151a41bc46df ("Btrfs: fix what bits we clear when erroring out from delalloc")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a spelling mistake in a comment. Fix it.
Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com>
Link: https://patch.msgid.link/20260418-fanotify-typo-v1-1-03ea48cb44ba@ethancedwards.com
Signed-off-by: Jan Kara <jack@suse.cz>
Use QSTR_LEN() and inline the code in isofs_cmp(). Remove the stale
function comment while at it.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260420102544.8924-3-thorsten.blum@linux.dev
Signed-off-by: Jan Kara <jack@suse.cz>
The expression `rpool->resources[index].usage + 1` is computed in int
arithmetic before being assigned to s64 variable `new`. When usage equals
INT_MAX (the default "max" value), the addition overflows to INT_MIN.
This negative value then passes the `new > max` check incorrectly,
allowing a charge that should be rejected and corrupting usage to
negative.
Fix by casting usage to s64 before the addition so the arithmetic is
done in 64-bit.
Fixes: 39d3e7584a68 ("rdmacg: Added rdma cgroup controller")
Signed-off-by: cuitao <cuitao@kylinos.cn>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull ring-buffer fix from Steven Rostedt:
- Make undefsyms_base.c into a real file
The file undefsyms_base.c is used to catch any symbols used by a
remote ring buffer that is made for use by a pKVM hypervisor. As it
doesn't share the same text as the rest of the kernel, referencing
any symbols within the kernel will make it fail to be built for the
standalone hypervisor.
A file was created by the Makefile that checked for any symbols that
could cause issues. There's no reason to have this file created by
the Makefile; just create it as a normal file instead.
* tag 'trace-ring-buffer-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Make undefsyms_base.c a first-class citizen
The logfile contains a lot of useful information about the tests being
run. Add it to the stored failure directory when the test fails.
Cc: John 'Warthog9' Hawley <warthog9@kernel.org>
Link: https://patch.msgid.link/20260420142315.7bbc3624@fedora
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit d2603279c7d6 ("eventfs: Use list_del_rcu() for SRCU protected
list variable") converted the removal side to pair with the
list_for_each_entry_srcu() walker in eventfs_iterate(). The insertion
in eventfs_create_dir() was left as a plain list_add_tail(), which on
weakly-ordered architectures can expose a new entry to the SRCU reader
before its list pointers and fields are observable.
Use list_add_tail_rcu() so the publication pairs with the existing
list_del_rcu() and list_for_each_entry_srcu().
Fixes: 43aa6f97c2d0 ("eventfs: Get rid of dentry pointers without refcounts")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260418152251.199343-1-devnexen@gmail.com
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:
- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
only (via sch->cgrp, stored only under SUB). GROUP-only would leak a
reference on every root-sched enable.
- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't
compile.
GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:
- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Pull mailbox updates from Jassi Brar:
- core: fix NULL message handling and add API to query TX queue slots
- test: resolve concurrency bugs, dangling IRQs, and memory leaks
- dt-bindings: qcom: add Eliza IPCC
- mtk: fix address calculation and pointer handling bugs
- cix: resolve SCMI suspend timeouts
- misc memory allocation optimizations and cleanups
* tag 'mailbox-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox:
mailbox: mailbox-test: make data_ready a per-instance variable
mailbox: mailbox-test: initialize struct earlier
mailbox: mailbox-test: don't free the reused channel
mailbox: mailbox-test: handle channel errors consistently
mailbox: update kdoc for struct mbox_controller
mailbox: add sanity check for channel array
mailbox: mailbox-test: free channels on probe error
mailbox: prefix new constants with MBOX_
dt-bindings: mailbox: qcom-ipcc: Document the Eliza Inter-Processor Communication Controller
mailbox: cix: Add IRQF_NO_SUSPEND to mailbox interrupt
mailbox: Fix NULL message support in mbox_send_message()
mailbox: remove superfluous internal header
mailbox: correct kdoc title for mbox_bind_client
mailbox: test: really ignore optional memory resources
mailbox: exynos: drop superfluous mbox setting per channel
mailbox: mtk-cmdq: Fix CURR and END addr for task insert case
mailbox: mtk-vcp-mailbox: Fix the return value in mtk_vcp_mbox_xlate()
mailbox: hi6220: kzalloc + kcalloc to kzalloc
mailbox: rockchip: kzalloc + kcalloc to kzalloc
mailbox: add API to query available TX queue slots
There's a bug in dm-thin in the function rebalance_children. If the
internal btree node has one entry, the code tries to copy all btree
entries from the node's child to the node itself and then decrement the
child's reference count.
If the child node is shared (it has reference count > 1), we won't free
it, so there would be two pointers to each of the grandchildren nodes.
But the reference counts of the grandchildren are not increased, thus the
reference counts don't match the number of pointers that point to the
grandchildren. This results in "device mapper: space map common: unable
to decrement block" errors.
Fix this bug by incrementing reference counts on the grandchildren if the
btree node is shared.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 3241b1d3e0aa ("dm: add persistent data library")
Cc: stable@vger.kernel.org
btrfs_partially_delete_raid_extent() returns an error code (e.g.
-ENOMEM from kzalloc(), or errors from btrfs_del_item/btrfs_insert_item()),
but all three call sites in btrfs_delete_raid_extent() discard the
return value, silently losing errors and potentially leaving the stripe
tree in an inconsistent state.
Fix by capturing the return value into ret at all three call sites and
breaking out of the loop on error where appropriate.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fsnotify_get_mark_safe() may return false for a mark on an unrelated group,
which results in bypassing the permission check.
Fix by skipping over detached marks that are not in the current group.
CC: stable@vger.kernel.org
Fixes: abc77577a669 ("fsnotify: Provide framework for dropping SRCU lock in ->handle_event")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260410144950.156160-1-mszeredi@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
isofs_fh_to_dentry() and isofs_fh_to_parent() pass an attacker-
controlled block number (ifid->block or ifid->parent_block) from
the NFS file handle to isofs_export_iget(), which only rejects
block == 0 before calling isofs_iget() and ultimately sb_bread().
A crafted file handle with fh_len sufficient to pass the check
added by commit 0405d4b63d08 ("isofs: Prevent the use of too small
fid") can still drive the server to read any in-range block on the
backing device as if it were an iso_directory_record. That earlier
fix was assigned CVE-2025-37780.
sb_bread() on an out-of-range block returns NULL cleanly via the
EIO path, so there is no memory-safety violation. For in-range
reads of adjacent-partition data on the same block device, the
unrelated bytes end up in iso_inode_info fields that reach the NFS
client as dentry metadata. The deployment surface (isofs exported
over NFS from loop-mounted images) is narrow and requires an
authenticated NFS peer, but the malformed-file-handle class is
reportable as hardening next to the existing CVE-2025-37780 fix.
Reject block >= ISOFS_SB(sb)->s_nzones in isofs_export_iget() so
the check covers both isofs_fh_to_dentry() and isofs_fh_to_parent()
call sites with a single line.
Fixes: 0405d4b63d08 ("isofs: Prevent the use of too small fid")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/20260419212155.2169382-3-michael.bommarito@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
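The shape of the added check can be modeled in a few lines of userspace C. This is an illustrative sketch only: the function name, parameter types, and the -ESTALE return are assumptions, not the actual isofs code, which performs the equivalent test on ISOFS_SB(sb)->s_nzones inside isofs_export_iget().

```c
#include <errno.h>

/* Illustrative model of the bounds check described above: reject any
 * block number outside [1, nzones) before it can reach sb_bread().
 * Name, types and the -ESTALE value are assumptions, not kernel code. */
static int isofs_block_valid(unsigned long block, unsigned long nzones)
{
	if (block == 0)		/* pre-existing rejection */
		return -ESTALE;
	if (block >= nzones)	/* new: block outside the mounted volume */
		return -ESTALE;
	return 0;
}
```

A single check at this level covers both the dentry and parent file-handle paths, since both funnel through the same helper.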
A potential race condition exists between pressure write and cgroup file
release regarding the priv member of struct kernfs_open_file, which
triggers the uaf reported in [1].
Consider the following scenario involving execution on two separate CPUs:
  CPU0                                CPU1
  ====                                ====
                                      vfs_rmdir()
                                        kernfs_iop_rmdir()
                                          cgroup_rmdir()
                                            cgroup_kn_lock_live()
                                            cgroup_destroy_locked()
                                              cgroup_addrm_files()
                                                cgroup_rm_file()
                                                  kernfs_remove_by_name()
                                                    kernfs_remove_by_name_ns()
  vfs_write()                                         __kernfs_remove()
    new_sync_write()                                    kernfs_drain()
      kernfs_fop_write_iter()                             kernfs_drain_open_files()
        cgroup_file_write()                                 kernfs_release_file()
          pressure_write()                                    cgroup_file_release()
            ctx = of->priv;                                     kfree(ctx);
                                                                of->priv = NULL;
                                                            cgroup_kn_unlock()
            cgroup_kn_lock_live()
              cgroup_get(cgrp)
            cgroup_kn_unlock()
            if (ctx->psi.trigger) // here, trigger uaf for ctx, that is of->priv
cgroup_rmdir() is protected by cgroup_mutex, which also safeguards the
deallocation of of->priv performed within cgroup_file_release().
However, the operations on of->priv executed within pressure_write()
are not entirely covered by cgroup_mutex. Consequently, if the code in
pressure_write() that handles the ctx variable executes after
cgroup_file_release() has completed, a use-after-free of of->priv is
triggered.
Therefore, the issue can be resolved by extending the scope of the
cgroup_mutex lock within pressure_write() to encompass all code paths
involving of->priv, thereby properly synchronizing the race condition
occurring between cgroup_file_release() and pressure_write().
Also, if the live kn lock can be successfully acquired while executing
the pressure write operation, it indicates that the cgroup deletion
process has not yet reached its final stage; consequently, the priv
pointer within the open file cannot be NULL. Therefore, the retrieval
of the ctx value must be moved to a point *after* the live kn lock has
been successfully acquired.
In another situation, specifically after entering cgroup_kn_lock_live()
but before acquiring cgroup_mutex, there exists a different class of
race condition:
  CPU0: write memory.pressure         CPU1: write cgroup.pressure=0
  ===========================         =============================
  kernfs_fop_write_iter()
    kernfs_get_active_of(of)
    pressure_write()
      cgroup_kn_lock_live(memory.pressure)
        cgroup_tryget(cgrp)
        kernfs_break_active_protection(kn)
        ... blocks on cgroup_mutex
                                      cgroup_pressure_write()
                                        cgroup_kn_lock_live(cgroup.pressure)
                                        cgroup_file_show(memory.pressure, false)
                                          kernfs_show(false)
                                            kernfs_drain_open_files()
                                              cgroup_file_release(of)
                                                kfree(ctx)
                                                of->priv = NULL
                                        cgroup_kn_unlock()
        ... acquires cgroup_mutex
        ctx = of->priv;       // may now be NULL
        if (ctx->psi.trigger) // NULL dereference
Consequently, there is a possibility that of->priv is NULL, so the
pressure write needs to check for this.
Now that the scope of cgroup_mutex has been expanded, the original
explicit cgroup_get/put operations are no longer necessary, because
acquiring/releasing the live kn lock inherently performs a cgroup
get/put operation.
[1]
BUG: KASAN: slab-use-after-free in pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
Call Trace:
pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
cgroup_file_write+0x36f/0x790 kernel/cgroup/cgroup.c:4311
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
Allocated by task 9352:
cgroup_file_open+0x90/0x3a0 kernel/cgroup/cgroup.c:4256
kernfs_fop_open+0x9eb/0xcb0 fs/kernfs/file.c:724
do_dentry_open+0x83d/0x13e0 fs/open.c:949
Freed by task 9353:
cgroup_file_release+0xd6/0x100 kernel/cgroup/cgroup.c:4283
kernfs_release_file fs/kernfs/file.c:764 [inline]
kernfs_drain_open_files+0x392/0x720 fs/kernfs/file.c:834
kernfs_drain+0x470/0x600 fs/kernfs/dir.c:525
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Reported-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=33e571025d88efd1312c
Tested-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
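The ordering the fix enforces (read of->priv only once the live lock is held, and tolerate a lost race) can be sketched as a small userspace model. All names, the lock stand-ins, and the -ENODEV/-EBUSY return values are illustrative assumptions, not the actual kernel code.

```c
#include <errno.h>

/* Simplified model of the fixed pressure_write() ordering: acquire the
 * live-lock stand-in first, only then read of->priv, and bail out if
 * release already freed it. Not kernfs/cgroup code; names are made up. */
struct demo_of {
	void *priv;		/* freed and set to NULL by file release */
};

static int demo_lock_ok(void) { return 1; }
static void demo_unlock(void) { }

static int demo_pressure_write(struct demo_of *of,
			       int (*lock_live)(void), void (*unlock)(void))
{
	void *ctx;

	if (!lock_live())	/* cgroup_kn_lock_live() stand-in */
		return -ENODEV;
	ctx = of->priv;		/* read only after the lock is held */
	if (!ctx) {		/* release won the race */
		unlock();
		return -EBUSY;
	}
	/* ... ctx (e.g. ctx->psi.trigger) is used safely here ... */
	unlock();
	return 0;
}
```

The key property is that the NULL check and every subsequent ctx access sit inside the locked region, so the free in the release path cannot interleave.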
Pull kgdb update from Daniel Thompson:
"Only a very small update for kgdb this cycle: a single patch from
Kexin Sun that fixes some outdated comments"
* tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kgdb: update outdated references to kgdb_wait()
Linus points out that dumping undefsyms_base.c from the Makefile
is rather ugly, and that a much better course of action would be
to have this file as a first-class citizen in the git tree.
This allows some extra cleanup in the Makefile, and the removal of
the .gitignore file in kernel/trace.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/CAHk-=wieqGd_XKpu8UxDoyADZx8TDe8CF3RmkUXt5N_9t5Pf_w@mail.gmail.com
Link: https://lore.kernel.org/all/20260421095446.2951646-1-maz@kernel.org/
Link: https://patch.msgid.link/20260421100455.324333-1-pbonzini@redhat.com
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The Perl localtime() function returns the month starting at 0, not 1.
This caused the date used to name the directory for saving files of a
failed run to have the month off by one.
machine-test-useconfig-fail-20260314073628
The above happened in April, not March. The correct name should have been:
machine-test-useconfig-fail-20260414073628
This was somewhat confusing.
Cc: stable@vger.kernel.org
Cc: John 'Warthog9' Hawley <warthog9@kernel.org>
Link: https://patch.msgid.link/20260420142426.33ad0293@fedora
Fixes: 7faafbd69639b ("ktest: Add open and close console and start stop monitor")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
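The off-by-one comes from the struct tm convention that Perl's localtime() shares with C: months are 0-based, years count from 1900. A minimal C sketch of the correct timestamp formatting (the helper name is illustrative, not ktest's Perl code):

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* struct tm months are 0-based (tm_mon == 3 means April), so building a
 * YYYYMMDD stamp must add 1 -- the bug described above forgot that.
 * Userspace illustration only. */
static void stamp(const struct tm *tm, char *buf, size_t len)
{
	snprintf(buf, len, "%04d%02d%02d",
		 tm->tm_year + 1900,	/* years since 1900 */
		 tm->tm_mon + 1,	/* 0-based month */
		 tm->tm_mday);		/* day is already 1-based */
}
```

Without the `+ 1`, an April run would be stamped as March, exactly the symptom in the commit message.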
Moving to guard() usage removed the need for the 'ret' variable, but it
wasn't removed. As it was only ever set to zero, the compiler in use
didn't warn (although some compilers do).
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260414110344.75c0663f@robin
Fixes: 4d9b262031f ("eventfs: Simplify code using guard()s")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604100111.AAlbQKmK-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.
Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().
Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
While not the default case, multiple tests can be run simultaneously.
In that case data_ready, being a global variable, will be overwritten,
and the per-instance lock will not help. Turn the global variable into
a per-instance one to avoid this problem.
Fixes: e339c80af95e ("mailbox: mailbox-test: don't rely on rx_buffer content to signal data ready")
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
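The change amounts to moving a flag from file scope into the per-device state. A minimal sketch under assumed names (demo_tdev/demo_rx are stand-ins, not the mailbox-test driver's actual structures):

```c
#include <stdbool.h>

/* Sketch of the change described above: data_ready moves from a
 * file-scope global into per-instance state, so two concurrently
 * running tests no longer clobber each other's flag. */
struct demo_tdev {
	bool data_ready;	/* was: static bool data_ready; */
};

static void demo_rx(struct demo_tdev *t)
{
	t->data_ready = true;	/* only this instance is marked ready */
}
```

With a global, an rx completion on one instance could spuriously wake a waiter on another; with per-instance state each device only observes its own signal.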
In the 'punch a hole' case of btrfs_delete_raid_extent(),
btrfs_duplicate_item() can return -EAGAIN when the leaf needs to be
split and the path becomes invalid. The old code treats any error as
fatal and breaks out of the loop.
Additionally, btrfs_duplicate_item() may trigger setup_leaf_for_split()
which can reallocate the leaf node. The code continues using the old
leaf pointer, leading to use-after-free or stale data access.
Fix both issues by:
- Handling -EAGAIN specifically: release the path and retry the loop.
- Refreshing leaf = path->nodes[0] after successful duplication.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
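The retry pattern can be sketched with a mock in place of btrfs_duplicate_item(). This is a structural illustration only; the mock, the missing release call, and the counter are assumptions, not btrfs code:

```c
#include <errno.h>

/* Sketch of the retry pattern described above: -EAGAIN means the path
 * was invalidated by a leaf split, so release it and retry; only after
 * success would the caller refresh leaf = path->nodes[0]. The mock
 * stands in for btrfs_duplicate_item() and fails with -EAGAIN once. */
static int attempts;

static int mock_duplicate_item(void)
{
	return ++attempts == 1 ? -EAGAIN : 0;
}

static int duplicate_with_retry(void)
{
	int ret;

	for (;;) {
		ret = mock_duplicate_item();
		if (ret == -EAGAIN)
			continue;	/* btrfs_release_path() would go here */
		return ret;		/* 0: refresh the leaf pointer; <0: fatal */
	}
}
```

The second half of the fix, refreshing the leaf pointer after success, matters because setup_leaf_for_split() can reallocate the node the stale pointer referenced.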
Pull jfs updates from Dave Kleikamp:
"More robust data integrity checking and some fixes"
* tag 'jfs-7.1' of github.com:kleikamp/linux-shaggy:
jfs: avoid -Wtautological-constant-out-of-range-compare warning again
JFS: always load filesystem UUID during mount
jfs: hold LOG_LOCK on umount to avoid null-ptr-deref
jfs: Set the lbmDone flag at the end of lbmIODone
jfs: fix corrupted list in dbUpdatePMap
jfs: add dmapctl integrity check to prevent invalid operations
jfs: add dtpage integrity check to prevent index/pointer overflows
jfs: add dtroot integrity check to prevent index out-of-bounds
rock_continue() reads rs->cont_extent verbatim from the Rock Ridge CE
record and passes it to sb_bread() without checking that the block
number is within the mounted ISO 9660 volume. commit e595447e177b
("[PATCH] rock.c: handle corrupted directories") added cont_offset
and cont_size rejection for the CE continuation but did not validate
the extent block number itself. commit f54e18f1b831 ("isofs: Fix
infinite looping over CE entries") later capped the CE chain length
at RR_MAX_CE_ENTRIES = 32 but again left the block number unchecked.
With a crafted ISO mounted via udisks2 (desktop optical auto-mount)
or via CAP_SYS_ADMIN mount, rs->cont_extent can therefore point at
an out-of-range block or at blocks belonging to an adjacent
filesystem on the same block device. sb_bread() on an out-of-range
block returns NULL cleanly via the block layer EIO path, so there
is no memory-safety violation. For in-range reads of adjacent-
filesystem data, the CE buffer is parsed as Rock Ridge records and
only the text of SL sub-records reaches userspace through
readlink(), which makes the info-leak channel narrow and difficult
to exploit; still, rejecting the malformed CE outright matches the
rejection shape already present in the same function for
cont_offset and cont_size.
Add an ISOFS_SB(sb)->s_nzones bounds check to rock_continue() next
to the existing offset/size rejection, printing the same
corrupted-directory-entry notice.
Fixes: f54e18f1b831 ("isofs: Fix infinite looping over CE entries")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/20260419212155.2169382-2-michael.bommarito@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Pull MIPS updates from Thomas Bogendoerfer:
- Support for Mobileye EyeQ6Lplus
- Cleanups and fixes
* tag 'mips_7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (30 commits)
MIPS/mtd: Handle READY GPIO in generic NAND platform data
MIPS/input: Move RB532 button to GPIO descriptors
MIPS: validate DT bootargs before appending them
MIPS: Alchemy: Remove unused forward declaration
MAINTAINERS: Mobileye: Add EyeQ6Lplus files
MIPS: config: add eyeq6lplus_defconfig
MIPS: Add Mobileye EyeQ6Lplus evaluation board dts
MIPS: Add Mobileye EyeQ6Lplus SoC dtsi
clk: eyeq: Add Mobileye EyeQ6Lplus OLB
clk: eyeq: Adjust PLL accuracy computation
clk: eyeq: Skip post-divisor when computing PLL frequency
pinctrl: eyeq5: Add Mobileye EyeQ6Lplus OLB
pinctrl: eyeq5: Use match data
reset: eyeq: Add Mobileye EyeQ6Lplus OLB
MIPS: Add Mobileye EyeQ6Lplus support
dt-bindings: soc: mobileye: Add EyeQ6Lplus OLB
dt-bindings: mips: Add Mobileye EyeQ6Lplus SoC
MIPS: dts: loongson64g-package: Switch to Loongson UART driver
mips: pci-mt7620: rework initialization procedure
mips: pci-mt7620: add more register init values
...
Pull tomoyo update from Tetsuo Handa:
"Handle 64-bit inode numbers"
* tag 'tomoyo-pr-20260422' of git://git.code.sf.net/p/tomoyo/tomoyo:
tomoyo: use u64 for holding inode->i_ino value
The function kgdb_wait() was folded into the static function
kgdb_cpu_enter() by commit 62fae312197a ("kgdb: eliminate
kgdb_wait(), all cpus enter the same way"). Update the four stale
references accordingly:
- include/linux/kgdb.h and arch/x86/kernel/kgdb.c: the
kgdb_roundup_cpus() kdoc describes what other CPUs are rounded up
to call. Because kgdb_cpu_enter() is static, the correct public
entry point is kgdb_handle_exception(); also fix a pre-existing
grammar error ("get them be" -> "get them into") and reflow the
text.
- kernel/debug/debug_core.c: replace with the generic description
"the debug trap handler", since the actual entry path is
architecture-specific.
- kernel/debug/gdbstub.c: kgdb_cpu_enter() is correct here (it
describes internal state, not a call target); add the missing
parentheses.
Suggested-by: Daniel Thompson <daniel@riscstar.com>
Assisted-by: unnamed:deepseek-v3.2 coccinelle
Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
This odd file was added to automatically figure out tool-generated
symbols.
Honestly, it *should* have been just a real honest-to-goodness regular
file in git, instead of having strange code to generate it in the
Makefile, but that is not how that silly thing works. So now we need to
ignore it explicitly.
Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
STORE_FAILURES was only saved from fail(), so paths that reached dodie()
could exit without preserving failure logs.
That includes fatal hook paths such as:
POST_BUILD_DIE = 1
and ordinary failures when:
DIE_ON_FAILURE = 1
Call save_logs("fail", ...) from dodie() too so fatal failures keep the
same STORE_FAILURES artifacts as non-fatal fail() paths.
Cc: John Hawley <warthog9@eaglescrag.net>
Link: https://patch.msgid.link/20260318-ktest-fixes-v1-1-9dd94d46d84c@suse.com
Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit e4d32142d1de ("tracing: Fix tracefs mount options") moved the
option application from tracefs_fill_super() to tracefs_reconfigure()
called from tracefs_get_tree(). This fixed mount options being ignored
on user-space mounts when the superblock already exists, but introduced
a regression for the initial kernel-internal mount.
On the first mount (via simple_pin_fs during init), sget_fc() transfers
fc->s_fs_info to sb->s_fs_info and sets fc->s_fs_info to NULL. When
tracefs_get_tree() then calls tracefs_reconfigure(), it sees a NULL
fc->s_fs_info and returns early without applying any options. The root
inode keeps mode 0755 from simple_fill_super() instead of the intended
TRACEFS_DEFAULT_MODE (0700).
Furthermore, even on subsequent user-space mounts without an explicit
mode= option, tracefs_apply_options(sb, true) gates the mode behind
fsi->opts & BIT(Opt_mode), which is unset for the defaults. So the
mode is never corrected unless the user explicitly passes mode=0700.
Restore the tracefs_apply_options(sb, false) call in tracefs_fill_super()
to apply default permissions on initial superblock creation, matching
what debugfs does in debugfs_fill_super().
Cc: stable@vger.kernel.org
Fixes: e4d32142d1de ("tracing: Fix tracefs mount options")
Link: https://patch.msgid.link/20260404134747.98867-1-devnexen@gmail.com
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.
Pass task_rq(a).
Fixes: 7b0888b7cc19 ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Pull clk fix from Stephen Boyd:
"One more fix for the merge window to avoid a boot hang on
Raspberry Pi 3B by marking the VEC clk critical so that it
doesn't get turned off and hang the bus"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: bcm: rpi: Mark VEC clock as CLK_IGNORE_UNUSED
The waitqueue must be initialized before the debugfs files are created
because, from that time, requests from userspace can already be made.
Similarly, drvdata and the spinlock need to be initialized before we
request the channel, otherwise dangling irqs might run into problems
like a NULL pointer exception.
Fixes: 8ea4484d0c2b ("mailbox: Add generic mechanism for testing Mailbox Controllers")
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
Pull EDAC fix from Borislav Petkov:
- Fix the error path ordering when the driver-private descriptor
allocation fails
* tag 'edac_urgent_for_7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/mc: Fix error path ordering in edac_mc_alloc()
After falling back to the previous item in btrfs_delete_raid_extent(),
the code uses ASSERT(found_start <= start) to verify the found extent
actually precedes our target range. If the B-tree state is unexpected
(e.g. no overlapping extent exists), this triggers a kernel BUG/panic
in debug builds, or silently continues with wrong data otherwise.
Replace the ASSERT with a proper bounds check that returns -ENOENT if
the found extent does not actually overlap with the start position.
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
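The shape of the change, turning a debug-only assertion into a recoverable error return, can be sketched as follows. The helper name, simplified types, and the found_end parameter are illustrative assumptions; the kernel code performs the equivalent overlap test on the found extent's key and length:

```c
#include <errno.h>

/* Sketch of the change described above: instead of ASSERTing that the
 * previous item overlaps the target (a panic on debug builds, silent
 * corruption otherwise), return -ENOENT when it does not. */
static int check_prev_extent(unsigned long long found_start,
			     unsigned long long found_end,
			     unsigned long long start)
{
	if (found_start > start || found_end <= start)
		return -ENOENT;	/* no extent overlapping 'start' */
	return 0;
}
```

Callers then propagate -ENOENT instead of continuing with an extent that never contained the target offset.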
Pull ext2, udf, quota updates from Jan Kara:
- A fix for a race in quota code that can expose ocfs2 to
use-after-free issues
- UDF fix to avoid memory corruption in face of corrupted format
- Couple of ext2 fixes for better handling of fs corruption
- Some more various code cleanups in UDF & ext2
* tag 'fs_for_v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: reject inodes with zero i_nlink and valid mode in ext2_iget()
ext2: use get_random_u32() where appropriate
quota: Fix race of dquot_scan_active() with quota deactivation
udf: fix partition descriptor append bookkeeping
ext2: avoid drop_nlink() during unlink of zero-nlink inode in ext2_unlink()
ext2: guard reservation window dump with EXT2FS_DEBUG
ext2: replace BUG_ON with WARN_ON_ONCE in ext2_get_blocks
ext2: remove stale TODO about kmap
fs: udf: avoid assignment in condition when selecting allocation goal
Pull sched_ext fixes from Tejun Heo:
"The merge window pulled in the cgroup sub-scheduler infrastructure,
and new AI reviews are accelerating bug reporting and fixing - hence
the larger than usual fixes batch:
- Use-after-frees during scheduler load/unload:
- The disable path could free the BPF scheduler while deferred
irq_work / kthread work was still in flight
- cgroup setter callbacks read the active scheduler outside the
rwsem that synchronizes against teardown
Fix both, and reuse the disable drain in the enable error paths so
the BPF JIT page can't be freed under live callbacks.
- Several BPF op invocations didn't tell the framework which runqueue
was already locked, so helper kfuncs that re-acquire the runqueue
by CPU could deadlock on the held lock
Fix the affected callsites, including recursive parent-into-child
dispatch.
- The hardlockup notifier ran from NMI but eventually took a
non-NMI-safe lock. Bounce it through irq_work.
- A handful of bugs in the new sub-scheduler hierarchy:
- helper kfuncs hard-coded the root instead of resolving the
caller's scheduler
- the enable error path tried to disable per-task state that had
never been initialized, and leaked cpus_read_lock on the way
out
- a sysfs object was leaked on every load/unload
- the dispatch fast-path used the root scheduler instead of the
task's
- a couple of CONFIG #ifdef guards were misclassified
- Verifier-time hardening: BPF programs of unrelated struct_ops types
(e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic
bug and, once sub-sched was enabled, a KASAN out-of-bounds read.
Now rejected at load. Plus a few NULL and cross-task argument
checks on sched_ext kfuncs, and a selftest covering the new deny.
- rhashtable (Herbert): restore the insecure_elasticity toggle and
bounce the deferred-resize kick through irq_work to break a
lock-order cycle observable from raw-spinlock callers. sched_ext's
scheduler-instance hash is the first user of both.
- The bypass-mode load balancer used file-scope cpumasks; with
multiple scheduler instances now possible, those raced. Move to
per-instance cpumasks, plus a follow-up to skip tasks whose
recorded CPU is stale relative to the new owning runqueue.
- Smaller fixes:
- a dispatch queue's first-task tracking misbehaved when a parked
iterator cursor sat in the list
- the runqueue's next-class wasn't promoted on local-queue
enqueue, leaving an SCX task behind RT in edge cases
- the reference qmap scheduler stopped erroring on legitimate
cross-scheduler task-storage misses"
* tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits)
sched_ext: Fix scx_flush_disable_work() UAF race
sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
sched_ext: Refuse cross-task select_cpu_from_kfunc calls
sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
sched_ext: Make bypass LB cpumasks per-scheduler
sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
sched_ext: Unregister sub_kset on scheduler disable
sched_ext: Defer scx_hardlockup() out of NMI
sched_ext: sync disable_irq_work in bpf_scx_unreg()
sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
...
scx_flush_disable_work() calls irq_work_sync() followed by
kthread_flush_work() to ensure that the disable kthread work has
fully completed before bpf_scx_unreg() frees the SCX scheduler.
However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall)
creates a race window between scx_claim_exit() and irq_work_queue():
  CPU A (scx_vexit (watchdog))        CPU B (bpf_scx_unreg)
  ----                                ----
  scx_claim_exit()
    atomic_try_cmpxchg(NONE->kind)
    stack_trace_save()
    vscnprintf()
                                      scx_disable()
                                        scx_claim_exit() -> FAIL
                                      scx_flush_disable_work()
                                        irq_work_sync()      // no-op: not queued yet
                                        kthread_flush_work() // no-op: not queued yet
                                      kobject_put(&sch->kobj) -> free %sch
  irq_work_queue() -> UAF on %sch
    scx_disable_irq_workfn()
      kthread_queue_work() -> UAF
The root cause is that CPU B's scx_flush_disable_work() returns after
syncing an irq_work that has not yet been queued, while CPU A is still
executing the code between scx_claim_exit() and irq_work_queue().
Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining
disable_irq_work and disable_work in each pass. This ensures that any
work queued after the previous check is caught, while also correctly
handling cases where no disable was triggered (e.g., the
scx_sub_enable_workfn() abort path).
Fixes: 510a27055446 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()")
Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
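The drain loop's structure can be modeled in userspace: flush, then re-read the state, and repeat until it has settled, so work queued between a check and the flush is still caught on the next pass. Everything below (state values, mock flush that only completes on the second call) is an illustrative assumption, not sched_ext code:

```c
/* Model of the drain loop described above: keep flushing until
 * exit_kind settles at DONE or NONE. The mock flush models the race
 * window: the deferred work only becomes flushable on the second pass. */
enum { EXIT_NONE, EXIT_CLAIMED, EXIT_DONE };

static int demo_state = EXIT_CLAIMED;
static int flush_calls;

/* Stand-in for irq_work_sync() + kthread_flush_work(). */
static void demo_flush(void)
{
	if (++flush_calls == 2)
		demo_state = EXIT_DONE;
}

static int demo_read(void)
{
	return demo_state;
}

static int demo_drain(int (*read_exit_kind)(void), void (*flush_all)(void))
{
	int passes = 0;

	do {
		flush_all();
		passes++;
	} while (read_exit_kind() != EXIT_DONE &&
		 read_exit_kind() != EXIT_NONE);
	return passes;
}
```

A single flush-then-return (the buggy pattern) would have declared victory after pass one, while the exit path was still between claiming the exit and queueing its work.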
Pull cgroup fixes from Tejun Heo:
- Fix UAF race in psi pressure_write() against cgroup file release by
extending cgroup_mutex coverage and ordering of->priv access after
cgroup_kn_lock_live()
- Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX
by performing the increment in s64
- Fix asymmetric DL bandwidth accounting on cpuset attach rollback by
recording the CPU used by dl_bw_alloc() so cancel_attach() returns
the reservation to the same root domain
- Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after
rmdir by incrementing from kill_css() instead of offline_css()
- Typo fix in cgroup-v2 documentation
* tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup: fix typo 'protetion' -> 'protection'
cgroup: Increment nr_dying_subsys_* from rmdir context
cgroup/cpuset: record DL BW alloc CPU for attach rollback
cgroup/rdma: fix integer overflow in rdmacg_try_charge()
sched/psi: fix race between file release and pressure write
privcmd_vm_ops defines .close (privcmd_close), but neither .may_split
nor .open. When userspace does a partial munmap() on a privcmd mapping,
the kernel splits the VMA via __split_vma(). Since may_split is NULL,
the split is allowed. vm_area_dup() copies vm_private_data (a pages
array allocated in alloc_empty_pages()) into the new VMA without any
fixup, because there is no .open callback.
Both VMAs now point to the same pages array. When the unmapped portion
is closed, privcmd_close() calls:
- xen_unmap_domain_gfn_range()
- xen_free_unpopulated_pages()
- kvfree(pages)
The surviving VMA still holds the dangling pointer. When it is later
destroyed, the same sequence runs again, which leads to a double free.
Fix this issue by adding a .may_split callback denying the VMA split.
This is XSA-487 / CVE-2026-31787
Fixes: d71f513985c2 ("xen: privcmd: support autotranslated physmap guests.")
Reported-by: Atharva Vartak <atharva.a.vartak@gmail.com>
Suggested-by: Atharva Vartak <atharva.a.vartak@gmail.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
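The fix is structural: supplying a .may_split callback that always refuses means the VMA can never be duplicated by a partial munmap(), so vm_private_data stays owned by exactly one VMA. A simplified stand-in (the struct and names below are not the real vm_operations_struct):

```c
#include <errno.h>

/* Model of the fix described above: a .may_split callback that denies
 * splitting, so the private pages array can never end up shared by two
 * VMAs and double-freed. Simplified stand-in types, not mm/ code. */
struct demo_vm_ops {
	int (*may_split)(void);
};

static int demo_privcmd_may_split(void)
{
	return -EINVAL;		/* deny partial munmap() of the mapping */
}

static const struct demo_vm_ops demo_privcmd_vm_ops = {
	.may_split = demo_privcmd_may_split,
};
```

The alternative, adding an .open callback that deep-copies or reference-counts the pages array, would be more invasive; denying the split closes the hole with one callback.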
There are several edge cases (see linked thread) where an IMMED task
can be left lingering on a local DSQ if an RT task swoops in at the
wrong time. All of these edge cases are due to rq->next_class being idle
even after dispatching a task to rq's local DSQ. We should bump
rq->next_class to &ext_sched_class as soon as we've inserted a task into
the local DSQ.
To optimize the common case of rq->next_class == &ext_sched_class,
only call wakeup_preempt() if rq->next_class is below EXT. If next_class
is EXT or above, wakeup_preempt() is a no-op anyway.
This lets us also simplify the preempt_curr() logic a bit since
wakeup_preempt() will call preempt_curr() for us if next_class is
below EXT.
Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull isofs and udf fixes from Jan Kara:
"Several isofs and udf fixes"
* tag 'fs_for_v7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
docs: isofs: replace dead ECMA-119 FTP link
udf: reject descriptors with oversized CRC length
isofs: use QSTR_LEN() in isofs_cmp
isofs: validate block number from NFS file handle in isofs_export_iget
isofs: validate Rock Ridge CE continuation extent against volume size
The build id returned by HYPERVISOR_xen_version(XENVER_build_id) is
neither NUL terminated nor a string.
This causes a buffer overflow, as the sprintf() in buildid_show() will
read and copy until it finds a NUL.
00000000 f4 91 51 f4 dd 38 9e 9d 65 47 52 eb 10 71 db 50 |..Q..8..eGR..q.P|
00000010 b9 a8 01 42 6f 2e 32 |...Bo.2|
00000017
So use a memcpy instead of sprintf to have the correct value:
00000000 f4 91 51 f4 dd 00 9e 9d 65 47 52 eb 10 71 db 50 |..Q.....eGR..q.P|
00000010 b9 a8 01 42 |...B|
00000014
(the dumps above use a hack to embed a zero inside and check that it is
returned correctly).
This is XSA-485 / CVE-2026-31786
Fixes: 84b7625728ea ("xen: add sysfs node for hypervisor build id")
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
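The fix boils down to treating the build id as a length-delimited byte buffer rather than a C string. A minimal userspace sketch (helper name is illustrative, not the xen sysfs code):

```c
#include <string.h>

/* The build id is a byte buffer with an explicit length; it may contain
 * embedded zeros and has no terminator, so "%s" formatting would read
 * past the end. Copy exactly len bytes instead. */
static size_t copy_build_id(char *dst, const unsigned char *id, size_t len)
{
	memcpy(dst, id, len);	/* bounded by the reported length */
	return len;
}
```

An embedded 0x00 byte, as in the second hexdump above, survives the copy instead of truncating the output, and no bytes beyond the reported length are read.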
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.
scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.
Drop cpus_read_lock() before goto err_disable.
v2: Correct Fixes: tag (Andrea Righi).
Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull fsnotify fixes from Jan Kara:
"Three fixes for fsnotify / fanotify"
* tag 'fsnotify_for_v7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: fix inode reference leak in fsnotify_recalc_mask()
fanotify: Fix spelling mistake "enforecement" -> "enforcement"
fanotify: fix false positive on permission events
The original link is no longer valid. Replace it with the official
PDF of the 2nd edition. The new link points to the exact 2nd edition
that the existing comment in isofs.rst refers to.
Signed-off-by: Ziran Zhang <zhangcoder@yeah.net>
Link: https://patch.msgid.link/20260425142943.6809-1-zhangcoder@yeah.net
Signed-off-by: Jan Kara <jack@suse.cz>
Incrementing nr_dying_subsys_* in offline_css(), which is executed by a
cgroup_offline_wq worker, leads to a race where a user can see the
value as 0 when reading cgroup.stat after calling rmdir but before the
worker executes. This makes the user wrongly expect resources released
by the removed cgroup to be available for a new assignment.
Increment nr_dying_subsys_* from kill_css(), which is called from the
cgroup_rmdir() context.
Fixes: ab0312526867 ("cgroup: Show # of subsystem CSSes in cgroup.stat")
Signed-off-by: Petr Malat <oss@malat.biz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull tracefs fixes from Steven Rostedt:
- Use list_add_tail_rcu() for walking eventfs children
The linked list of children is protected by SRCU and list walkers can
walk the list using only SRCU. Plain list_add_tail() on
weakly ordered architectures can cause issues. Instead use
- Hold eventfs_mutex and SRCU when remount walks events
The trace_apply_options() walks the tracefs_inodes where some are
eventfs inodes and eventfs_remount() is called which in turn calls
eventfs_set_attr(). This walk only holds normal RCU read locks, but
the eventfs_mutex and SRCU should be held.
Add eventfs_remount_(un)lock() helpers to take the necessary locks
before iterating the list.
* tag 'tracefs-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Hold eventfs_mutex and SRCU when remount walks events
eventfs: Use list_add_tail_rcu() for SRCU-protected children list
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For
any non-scx task p->scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p->scx.slice /
p->scx.dsq_vtime on arbitrary tasks.
Reject !sch up front so the gate only admits callers with a resolved
scheduler.
Fixes: 245d09c594ea ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
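The NULL == NULL bypass can be sketched in plain userspace C (the struct
and function names here are illustrative stand-ins for the sched_ext
internals, not the real API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct scx_sched and the task's scheduler
 * pointer; this is a userspace sketch, not kernel code. */
struct sched { int id; };
struct task { struct sched *sched; };

/* The buggy gate: for a non-scx task p->sched is NULL, so a caller with
 * an unresolved scheduler (sch == NULL) passes the NULL == NULL test. */
static bool task_on_sched_buggy(struct sched *sch, struct task *p)
{
	return p->sched == sch;
}

/* The fix: reject an unresolved scheduler up front, so the gate only
 * admits callers with a resolved scheduler. */
static bool task_on_sched_fixed(struct sched *sch, struct task *p)
{
	if (!sch)
		return false;
	return p->sched == sch;
}
```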
Pull btrfs fixes from David Sterba:
- space reservation fixes:
- correctly undo 'may_use' accounting for remap tree
- avoid double decrement of 'may_use' when submitting async io
- actually enable the shutdown ioctl callback (not just the superblock
ops)
- raid stripe tree fixes when deleting extents
- add missing error handling
- fix various incorrect values set
- fix transaction state when removing a directory, possibly leading to
EIO during log replay
- additional b-tree node key checks during metadata readahead
- error handling and transaction abort updates
* tag 'for-7.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix double-decrement of bytes_may_use in submit_one_async_extent()
btrfs: check return value of btrfs_partially_delete_raid_extent()
btrfs: handle -EAGAIN from btrfs_duplicate_item and refresh stale leaf pointer
btrfs: replace ASSERT with proper error handling in stripe lookup fallback
btrfs: fix wrong min_objectid in btrfs_previous_item() call
btrfs: fix raid stripe search missing entries at leaf boundaries
btrfs: copy devid in btrfs_partially_delete_raid_extent()
btrfs: handle unexpected free-space-tree key types
btrfs: fix missing last_unlink_trans update when removing a directory
btrfs: don't clobber errors in add_remap_tree_entries()
btrfs: enable shutdown ioctl for non-experimental builds
btrfs: apply first key check for readahead when possible
btrfs: abort transaction in do_remap_reloc_trans() on failure
btrfs: fix bytes_may_use leak in do_remap_reloc_trans()
btrfs: fix bytes_may_use leak in move_existing_remap()
fsnotify_recalc_mask() fails to handle the return value of
__fsnotify_recalc_mask(), which may return an inode pointer that needs
to be released via fsnotify_drop_object() when the connector's HAS_IREF
flag transitions from set to cleared.
This manifests as a hung task with the following call trace:
INFO: task umount:1234 blocked for more than 120 seconds.
Call Trace:
__schedule
schedule
fsnotify_sb_delete
generic_shutdown_super
kill_anon_super
cleanup_mnt
task_work_run
do_exit
do_group_exit
The race window that triggers the iref leak:
Thread A (adding mark) Thread B (removing mark)
────────────────────── ────────────────────────
fsnotify_add_mark_locked():
fsnotify_add_mark_list():
spin_lock(conn->lock)
add mark_B(evictable) to list
spin_unlock(conn->lock)
return
/* ---- gap: no lock held ---- */
fsnotify_detach_mark(mark_A):
spin_lock(mark_A->lock)
clear ATTACHED flag on mark_A
spin_unlock(mark_A->lock)
fsnotify_put_mark(mark_A)
fsnotify_recalc_mask():
spin_lock(conn->lock)
__fsnotify_recalc_mask():
/* mark_A skipped: ATTACHED cleared */
/* only mark_B(evictable) remains */
want_iref = false
has_iref = true /* not yet cleared */
-> HAS_IREF transitions true -> false
-> returns inode pointer
spin_unlock(conn->lock)
/* BUG: return value discarded!
* iput() and fsnotify_put_sb_watched_objects()
* are never called */
Fix this by deferring the true -> false transition of the HAS_IREF flag
from fsnotify_recalc_mask() (Thread A) to fsnotify_put_mark() (Thread B).
Fixes: c3638b5b1374 ("fsnotify: allow adding an inode mark without pinning inode")
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/CAOQ4uxiPsbHb0o5voUKyPFMvBsDkG914FYDcs4C5UpBMNm0Vcg@mail.gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
udf_read_tagged() skips CRC verification when descCRCLength +
sizeof(struct tag) exceeds the block size. A crafted UDF image can
set descCRCLength to an oversized value to bypass CRC validation
entirely; the descriptor is then accepted based solely on the 8-bit
tag checksum, which is trivially recomputable.
Reject such descriptors instead of silently accepting them. A
legitimate single-block descriptor should never have a CRC length that
exceeds the block.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-6
Assisted-by: Codex:gpt-5-4
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/20260413211240.853662-1-michael.bommarito@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
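A minimal sketch of the rejected-descriptor check, assuming a simplified
tag struct (the real on-disk struct tag has more fields; the shape of the
comparison is what matters):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the on-disk descriptor tag. */
struct tag { uint16_t descCRCLength; };

/* Before the fix, an oversized descCRCLength skipped CRC verification
 * entirely; now such a descriptor is rejected. A legitimate single-block
 * descriptor never has a CRC region that exceeds the block. */
static bool crc_length_valid(uint16_t crc_len, unsigned int blocksize)
{
	return (size_t)crc_len + sizeof(struct tag) <= blocksize;
}
```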
cpuset_can_attach() allocates DL bandwidth only when migrating
deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
rolls back based only on nr_migrate_dl_tasks. This makes the DL
bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
even when no dl_bw_alloc() was done.
Rollback also needs to undo the reservation against the same CPU/root
domain that was charged. Record the CPU used by dl_bw_alloc() and use
that state in cpuset_cancel_attach(). If no allocation happened,
dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
did happen, bandwidth is returned to the same CPU/root domain.
Successful attach paths are unchanged. This only fixes failed attach
rollback accounting.
Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
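The symmetric rollback pattern can be sketched with stubbed alloc/free
helpers (names follow the commit, but these are userspace stand-ins, not
the kernel functions):

```c
#include <assert.h>

static int charged[4];	/* toy per-CPU "reserved DL bandwidth" */

/* dl_bw_cpu == -1 means no allocation happened, so rollback is a no-op. */
struct attach_ctx { int dl_bw_cpu; };

static void attach_init(struct attach_ctx *ctx) { ctx->dl_bw_cpu = -1; }

static void dl_bw_alloc(struct attach_ctx *ctx, int cpu)
{
	charged[cpu]++;
	ctx->dl_bw_cpu = cpu;	/* remember which CPU was charged */
}

static void cancel_attach(struct attach_ctx *ctx)
{
	/* Free only if an allocation was recorded, and return the
	 * bandwidth to the same CPU that was charged. */
	if (ctx->dl_bw_cpu >= 0)
		charged[ctx->dl_bw_cpu]--;
}
```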
Pull ktest updates from Steven Rostedt:
- Fix month in date timestamp used to create failure directories
On failure, a directory is created to store the logs and config file
to analyze the failure. The Perl function localtime is used to create
the date timestamp of the directory. The month passed back from that
function starts at 0 and not 1, but the timestamp used does not
account for that. Thus for April 20, 2026, the timestamp of 20260320
is used, instead of 20260420.
- Save the logfile to the failure directory
Just the test log was saved to the directory on failure, but there's
useful information in the full logfile that can be helpful in
analyzing the failure. Save the logfile as well.
* tag 'ktest-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
ktest: Add logfile to failure directory
ktest: Fix the month in the name of the failure directory
Commit 340f0c7067a9 ("eventfs: Update all the eventfs_inodes from the
events descriptor") had eventfs_set_attrs() recurse through ei->children
on remount. The walk only holds the rcu_read_lock() taken by
tracefs_apply_options() over tracefs_inodes, which is wrong:
- list_for_each_entry over ei->children races with the list_del_rcu()
in eventfs_remove_rec() -- LIST_POISON1 deref, same shape as
d2603279c7d6.
- eventfs_inodes are freed via call_srcu(&eventfs_srcu, ...).
rcu_read_lock() does not extend an SRCU grace period, so ti->private
can be reclaimed under the walk.
- The writes to ei->attr race with eventfs_set_attr(), which holds
eventfs_mutex.
Reproducer:
while :; do mount -o remount,uid=$((RANDOM%1000)) /sys/kernel/tracing; done &
while :; do
echo "p:kp submit_bio" > /sys/kernel/tracing/kprobe_events
echo > /sys/kernel/tracing/kprobe_events
done
Wrap the events portion of tracefs_apply_options() in
eventfs_remount_lock()/_unlock() that take eventfs_mutex and
srcu_read_lock(&eventfs_srcu). eventfs_set_attrs() doesn't sleep so the
nested rcu_read_lock() is fine; lockdep_assert_held() pins the contract.
Comment in tracefs_drop_inode() said "RCU cycle" -- it is SRCU.
Fixes: 340f0c7067a9 ("eventfs: Update all the eventfs_inodes from the events descriptor")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260418191737.10289-1-devnexen@gmail.com
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.
Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().
v2: Switch the in_select_cpu cross-task check from direct_dispatch_task
comparison to scx_kf_arg_task_ok(). The former spuriously rejects when
ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls
scx_bpf_select_cpu_*() on the same task. (Andrea Righi)
Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
Pull device mapper fix from Mikulas Patocka:
- fix metadata corruption in dm-thin
* tag 'for-7.1/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-thin: fix metadata refcount underflow
submit_one_async_extent() calls btrfs_reserve_extent(), which decrements
bytes_may_use. If the call to btrfs_create_io_em() fails, we jump to
out_free_reserve, which calls extent_clear_unlock_delalloc().
Because we're specifying EXTENT_DO_ACCOUNTING, i.e.
EXTENT_CLEAR_META_RESV | EXTENT_CLEAR_DATA_RESV, this decreases
bytes_may_use again. This can lead to problems later on, as an initial
write can succeed only for the writeback to silently fail with ENOSPC.
Fix this by replacing EXTENT_DO_ACCOUNTING with EXTENT_CLEAR_META_RESV.
This parallels a4fe134fc1d8eb ("btrfs: fix a double release on reserved
extents in cow_one_range()"), which is the same fix in cow_one_range().
Fixes: 151a41bc46df ("Btrfs: fix what bits we clear when erroring out from delalloc")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The expression `rpool->resources[index].usage + 1` is computed in int
arithmetic before being assigned to the s64 variable `new`. When usage equals
INT_MAX (the default "max" value), the addition overflows to INT_MIN.
This negative value then passes the `new > max` check incorrectly,
allowing a charge that should be rejected and corrupting usage to
negative.
Fix by casting usage to s64 before the addition so the arithmetic is
done in 64-bit.
Fixes: 39d3e7584a68 ("rdmacg: Added rdma cgroup controller")
Signed-off-by: cuitao <cuitao@kylinos.cn>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
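The overflow is easy to reproduce in a userspace sketch (the wrapping int
addition is modeled with unsigned arithmetic to keep the sketch free of
signed-overflow UB; the kernel builds with -fno-strict-overflow):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Buggy shape: the addition happens in int, wraps at INT_MAX, and the
 * wrapped negative value slips past the limit check. */
static int try_charge_buggy(int usage, int64_t max)
{
	int64_t new = (int)((unsigned)usage + 1u);	/* wraps to INT_MIN */
	return new > max ? -1 : 0;
}

/* Fixed shape: cast to 64-bit before adding so nothing wraps. */
static int try_charge_fixed(int usage, int64_t max)
{
	int64_t new = (int64_t)usage + 1;
	return new > max ? -1 : 0;
}
```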
Pull ring-buffer fix from Steven Rostedt:
- Make undefsyms_base.c into a real file
The file undefsyms_base.c is used to catch any symbols used by a
remote ring buffer that is made for use by a pKVM hypervisor. As it
doesn't share the same text as the rest of the kernel, referencing
any symbols within the kernel will make it fail to build for the
standalone hypervisor.
A file was created by the Makefile that checked for any symbols that
could cause issues. There's no reason to have this file created by
the Makefile; just create it as a normal file instead.
* tag 'trace-ring-buffer-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Make undefsyms_base.c a first-class citizen
Commit d2603279c7d6 ("eventfs: Use list_del_rcu() for SRCU protected
list variable") converted the removal side to pair with the
list_for_each_entry_srcu() walker in eventfs_iterate(). The insertion
in eventfs_create_dir() was left as a plain list_add_tail(), which on
weakly-ordered architectures can expose a new entry to the SRCU reader
before its list pointers and fields are observable.
Use list_add_tail_rcu() so the publication pairs with the existing
list_del_rcu() and list_for_each_entry_srcu().
Fixes: 43aa6f97c2d0 ("eventfs: Get rid of dentry pointers without refcounts")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260418152251.199343-1-devnexen@gmail.com
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
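The publication ordering that list_add_tail_rcu() provides can be
sketched with C11 atomics (an analogy, not the kernel list API: the
release store keeps a lockless reader from observing the entry before
its fields are initialized):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct entry { int data; };

static _Atomic(struct entry *) head;

static void publish_entry(struct entry *e, int data)
{
	e->data = data;	/* initialize fields first... */
	/* ...then publish with release semantics, as list_add_tail_rcu()
	 * does; a plain store (the list_add_tail() analog) carries no
	 * such ordering on weakly ordered architectures. */
	atomic_store_explicit(&head, e, memory_order_release);
}

static int read_entry_data(void)
{
	struct entry *e = atomic_load_explicit(&head, memory_order_acquire);
	return e ? e->data : -1;
}
```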
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:
- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
only (via sch->cgrp, stored only under SUB). GROUP-only would leak a
reference on every root-sched enable.
- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't
compile.
GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:
- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Pull mailbox updates from Jassi Brar:
- core: fix NULL message handling and add API to query TX queue slots
- test: resolve concurrency bugs, dangling IRQs, and memory leaks
- dt-bindings: qcom: add Eliza IPCC
- mtk: fix address calculation and pointer handling bugs
- cix: resolve SCMI suspend timeouts
- misc memory allocation optimizations and cleanups
* tag 'mailbox-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox:
mailbox: mailbox-test: make data_ready a per-instance variable
mailbox: mailbox-test: initialize struct earlier
mailbox: mailbox-test: don't free the reused channel
mailbox: mailbox-test: handle channel errors consistently
mailbox: update kdoc for struct mbox_controller
mailbox: add sanity check for channel array
mailbox: mailbox-test: free channels on probe error
mailbox: prefix new constants with MBOX_
dt-bindings: mailbox: qcom-ipcc: Document the Eliza Inter-Processor Communication Controller
mailbox: cix: Add IRQF_NO_SUSPEND to mailbox interrupt
mailbox: Fix NULL message support in mbox_send_message()
mailbox: remove superfluous internal header
mailbox: correct kdoc title for mbox_bind_client
mailbox: test: really ignore optional memory resources
mailbox: exynos: drop superfluous mbox setting per channel
mailbox: mtk-cmdq: Fix CURR and END addr for task insert case
mailbox: mtk-vcp-mailbox: Fix the return value in mtk_vcp_mbox_xlate()
mailbox: hi6220: kzalloc + kcalloc to kzalloc
mailbox: rockchip: kzalloc + kcalloc to kzalloc
mailbox: add API to query available TX queue slots
There's a bug in dm-thin in the function rebalance_children. If the
internal btree node has one entry, the code tries to copy all btree
entries from the node's child to the node itself and then decrement the
child's reference count.
If the child node is shared (it has reference count > 1), we won't free
it, so there would be two pointers to each of the grandchildren nodes.
But the reference counts of the grandchildren are not increased, thus the
reference count doesn't match the number of pointers that point to the
grandchildren. This results in "device mapper: space map common: unable
to decrement block" errors.
Fix this bug by incrementing reference counts on the grandchildren if the
child node is shared.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 3241b1d3e0aa ("dm: add persistent data library")
Cc: stable@vger.kernel.org
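A toy model of the refcount invariant (illustrative node layout, not the
actual persistent-data btree code): every pointer to a node must be
matched by a reference, so copying a shared child's entries into the
parent must bump the grandchildren.

```c
#include <assert.h>
#include <stddef.h>

#define NCHILD 3

struct node { int ref; struct node *child[NCHILD]; };

static void rebalance_into_parent(struct node *parent, struct node *child)
{
	for (int i = 0; i < NCHILD; i++) {
		parent->child[i] = child->child[i];
		/* If the child survives (shared, ref > 1), a second
		 * pointer to each grandchild now exists; take a
		 * reference so counts match pointers. */
		if (child->ref > 1 && parent->child[i])
			parent->child[i]->ref++;
	}
	child->ref--;	/* drop the parent's reference on the child */
}
```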
btrfs_partially_delete_raid_extent() returns an error code (e.g.
-ENOMEM from kzalloc(), or errors from btrfs_del_item/btrfs_insert_item()),
but all three call sites in btrfs_delete_raid_extent() discard the
return value, silently losing errors and potentially leaving the stripe
tree in an inconsistent state.
Fix by capturing the return value into ret at all three call sites and
breaking out of the loop on error where appropriate.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
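The fix's shape, sketched with a stub standing in for
btrfs_partially_delete_raid_extent():

```c
#include <assert.h>

static int fail_at = -1;	/* iteration at which the stub "fails" */

static int partially_delete_stub(int i)
{
	return i == fail_at ? -12 /* -ENOMEM */ : 0;
}

/* Capture the return value at every call site and stop the loop on
 * error, instead of silently discarding it. */
static int delete_loop(int n)
{
	int ret = 0;

	for (int i = 0; i < n; i++) {
		ret = partially_delete_stub(i);
		if (ret)
			break;
	}
	return ret;
}
```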
fsnotify_get_mark_safe() may return false for a mark on an unrelated group,
which results in bypassing the permission check.
Fix by skipping over detached marks that are not in the current group.
CC: stable@vger.kernel.org
Fixes: abc77577a669 ("fsnotify: Provide framework for dropping SRCU lock in ->handle_event")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260410144950.156160-1-mszeredi@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
isofs_fh_to_dentry() and isofs_fh_to_parent() pass an attacker-
controlled block number (ifid->block or ifid->parent_block) from
the NFS file handle to isofs_export_iget(), which only rejects
block == 0 before calling isofs_iget() and ultimately sb_bread().
A crafted file handle with fh_len sufficient to pass the check
added by commit 0405d4b63d08 ("isofs: Prevent the use of too small
fid") can still drive the server to read any in-range block on the
backing device as if it were an iso_directory_record. That earlier
fix was assigned CVE-2025-37780.
sb_bread() on an out-of-range block returns NULL cleanly via the
EIO path, so there is no memory-safety violation. For in-range
reads of adjacent-partition data on the same block device, the
unrelated bytes end up in iso_inode_info fields that reach the NFS
client as dentry metadata. The deployment surface (isofs exported
over NFS from loop-mounted images) is narrow and requires an
authenticated NFS peer, but the malformed-file-handle class is
reportable as hardening next to the existing CVE-2025-37780 fix.
Reject block >= ISOFS_SB(sb)->s_nzones in isofs_export_iget() so
the check covers both isofs_fh_to_dentry() and isofs_fh_to_parent()
call sites with a single line.
Fixes: 0405d4b63d08 ("isofs: Prevent the use of too small fid")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/20260419212155.2169382-3-michael.bommarito@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
A potential race condition exists between pressure write and cgroup file
release regarding the priv member of struct kernfs_open_file, which
triggers the uaf reported in [1].
Consider the following scenario involving execution on two separate CPUs:
CPU0 CPU1
==== ====
vfs_rmdir()
kernfs_iop_rmdir()
cgroup_rmdir()
cgroup_kn_lock_live()
cgroup_destroy_locked()
cgroup_addrm_files()
cgroup_rm_file()
kernfs_remove_by_name()
kernfs_remove_by_name_ns()
vfs_write() __kernfs_remove()
new_sync_write() kernfs_drain()
kernfs_fop_write_iter() kernfs_drain_open_files()
cgroup_file_write() kernfs_release_file()
pressure_write() cgroup_file_release()
ctx = of->priv;
kfree(ctx);
of->priv = NULL;
cgroup_kn_unlock()
cgroup_kn_lock_live()
cgroup_get(cgrp)
cgroup_kn_unlock()
if (ctx->psi.trigger) // here, trigger uaf for ctx, that is of->priv
The cgroup_rmdir() is protected by the cgroup_mutex, which also safeguards
the memory deallocation of of->priv performed within cgroup_file_release().
However, the operations involving of->priv executed within pressure_write()
are not entirely covered by the protection of cgroup_mutex. Consequently,
if the code in pressure_write(), specifically the section handling the
ctx variable, executes after cgroup_file_release() has completed, a uaf
vulnerability involving of->priv is triggered.
Therefore, the issue can be resolved by extending the scope of the
cgroup_mutex lock within pressure_write() to encompass all code paths
involving of->priv, thereby properly synchronizing the race condition
occurring between cgroup_file_release() and pressure_write().
And, if a live kn lock can be successfully acquired while executing
the pressure write operation, it indicates that the cgroup deletion
process has not yet reached its final stage; consequently, the priv
pointer within open_file cannot be NULL. Therefore, the operation to
retrieve the ctx value must be moved to a point *after* the live kn
lock has been successfully acquired.
In another situation, specifically after entering cgroup_kn_lock_live()
but before acquiring cgroup_mutex, there exists a different class of
race condition:
CPU0: write memory.pressure CPU1: write cgroup.pressure=0
=========================== =============================
kernfs_fop_write_iter()
kernfs_get_active_of(of)
pressure_write()
cgroup_kn_lock_live(memory.pressure)
cgroup_tryget(cgrp)
kernfs_break_active_protection(kn)
... blocks on cgroup_mutex
cgroup_pressure_write()
cgroup_kn_lock_live(cgroup.pressure)
cgroup_file_show(memory.pressure, false)
kernfs_show(false)
kernfs_drain_open_files()
cgroup_file_release(of)
kfree(ctx)
of->priv = NULL
cgroup_kn_unlock()
... acquires cgroup_mutex
ctx = of->priv; // may now be NULL
if (ctx->psi.trigger) // NULL dereference
Consequently, there is a possibility that of->priv is NULL, and the
pressure write needs to check for this.
Now that the scope of the cgroup_mutex has been expanded, the original
explicit cgroup_get/put operations are no longer necessary, because
acquiring/releasing the live kn lock inherently executes a cgroup
get/put operation.
[1]
BUG: KASAN: slab-use-after-free in pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
Call Trace:
pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
cgroup_file_write+0x36f/0x790 kernel/cgroup/cgroup.c:4311
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
Allocated by task 9352:
cgroup_file_open+0x90/0x3a0 kernel/cgroup/cgroup.c:4256
kernfs_fop_open+0x9eb/0xcb0 fs/kernfs/file.c:724
do_dentry_open+0x83d/0x13e0 fs/open.c:949
Freed by task 9353:
cgroup_file_release+0xd6/0x100 kernel/cgroup/cgroup.c:4283
kernfs_release_file fs/kernfs/file.c:764 [inline]
kernfs_drain_open_files+0x392/0x720 fs/kernfs/file.c:834
kernfs_drain+0x470/0x600 fs/kernfs/dir.c:525
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Reported-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=33e571025d88efd1312c
Tested-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Linus points out that dumping undefsyms_base.c from the Makefile
is rather ugly, and that a much better course of action would be
to have this file as a first-class citizen in the git tree.
This allows some extra cleanup in the Makefile, and the removal of
the .gitignore file in kernel/trace.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/CAHk-=wieqGd_XKpu8UxDoyADZx8TDe8CF3RmkUXt5N_9t5Pf_w@mail.gmail.com
Link: https://lore.kernel.org/all/20260421095446.2951646-1-maz@kernel.org/
Link: https://patch.msgid.link/20260421100455.324333-1-pbonzini@redhat.com
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The Perl localtime() function returns the month starting at 0, not 1. This
caused the date stamp used to name the directory for saving files of a
failed run to have the month off by one.
machine-test-useconfig-fail-20260314073628
The above happened in April, not March. The correct name should have been:
machine-test-useconfig-fail-20260414073628
This was somewhat confusing.
Cc: stable@vger.kernel.org
Cc: John 'Warthog9' Hawley <warthog9@kernel.org>
Link: https://patch.msgid.link/20260420142426.33ad0293@fedora
Fixes: 7faafbd69639b ("ktest: Add open and close console and start stop monitor")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
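The same pitfall exists in C's localtime(): struct tm's tm_mon runs 0-11
and tm_year counts years since 1900, so a correct stamp has to adjust
both fields:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Build a YYYYMMDD stamp from a broken-down time. Forgetting the
 * tm_mon + 1 adjustment reproduces exactly the off-by-one month
 * described above. */
static void format_stamp(const struct tm *tm, char *buf, size_t len)
{
	snprintf(buf, len, "%04d%02d%02d",
		 tm->tm_year + 1900,	/* tm_year: years since 1900 */
		 tm->tm_mon + 1,	/* tm_mon: 0-based, add 1 */
		 tm->tm_mday);
}
```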
Moving to guard() usage removed the need for the 'ret' variable but
it wasn't removed. As it was set to zero, the compiler in use didn't warn
(although some compilers do).
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260414110344.75c0663f@robin
Fixes: 4d9b262031f ("eventfs: Simplify code using guard()s")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604100111.AAlbQKmK-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.
Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().
Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
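The shape of the fix, with the cpumask modeled as a plain bitmask
(illustrative names, not the scx internals): state moves from a
file-scope static into the per-instance struct, so concurrent instances
stop read-modify-writing shared scratch state.

```c
#include <assert.h>
#include <stdint.h>

/* Was: a file-scope static bitmask shared by every scheduler instance.
 * Now: each instance carries its own scratch mask. */
struct sched_instance {
	uint64_t donee_mask;
};

static void lb_pick_donees(struct sched_instance *sch, uint64_t cpus)
{
	sch->donee_mask = 0;	/* private scratch state, no sharing */
	sch->donee_mask |= cpus;
}
```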
While not the default case, multiple tests can be run simultaneously.
Then data_ready, being a global variable, will be overwritten and the
per-instance lock will not help. Turn the global variable into a
per-instance one to avoid this problem.
Fixes: e339c80af95e ("mailbox: mailbox-test: don't rely on rx_buffer content to signal data ready")
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
In the 'punch a hole' case of btrfs_delete_raid_extent(),
btrfs_duplicate_item() can return -EAGAIN when the leaf needs to be
split and the path becomes invalid. The old code treats any error as
fatal and breaks out of the loop.
Additionally, btrfs_duplicate_item() may trigger setup_leaf_for_split()
which can reallocate the leaf node. The code continues using the old
leaf pointer, leading to use-after-free or stale data access.
Fix both issues by:
- Handling -EAGAIN specifically: release the path and retry the loop.
- Refreshing leaf = path->nodes[0] after successful duplication.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull jfs updates from Dave Kleikamp:
"More robust data integrity checking and some fixes"
* tag 'jfs-7.1' of github.com:kleikamp/linux-shaggy:
jfs: avoid -Wtautological-constant-out-of-range-compare warning again
JFS: always load filesystem UUID during mount
jfs: hold LOG_LOCK on umount to avoid null-ptr-deref
jfs: Set the lbmDone flag at the end of lbmIODone
jfs: fix corrupted list in dbUpdatePMap
jfs: add dmapctl integrity check to prevent invalid operations
jfs: add dtpage integrity check to prevent index/pointer overflows
jfs: add dtroot integrity check to prevent index out-of-bounds
rock_continue() reads rs->cont_extent verbatim from the Rock Ridge CE
record and passes it to sb_bread() without checking that the block
number is within the mounted ISO 9660 volume. commit e595447e177b
("[PATCH] rock.c: handle corrupted directories") added cont_offset
and cont_size rejection for the CE continuation but did not validate
the extent block number itself. commit f54e18f1b831 ("isofs: Fix
infinite looping over CE entries") later capped the CE chain length
at RR_MAX_CE_ENTRIES = 32 but again left the block number unchecked.
With a crafted ISO mounted via udisks2 (desktop optical auto-mount)
or via CAP_SYS_ADMIN mount, rs->cont_extent can therefore point at
an out-of-range block or at blocks belonging to an adjacent
filesystem on the same block device. sb_bread() on an out-of-range
block returns NULL cleanly via the block layer EIO path, so there
is no memory-safety violation. For in-range reads of adjacent-
filesystem data, the CE buffer is parsed as Rock Ridge records and
only the text of SL sub-records reaches userspace through
readlink(), which makes the info-leak channel narrow and difficult
to exploit; still, rejecting the malformed CE outright matches the
rejection shape already present in the same function for
cont_offset and cont_size.
Add an ISOFS_SB(sb)->s_nzones bounds check to rock_continue() next
to the existing offset/size rejection, printing the same
corrupted-directory-entry notice.
Fixes: f54e18f1b831 ("isofs: Fix infinite looping over CE entries")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/20260419212155.2169382-2-michael.bommarito@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Pull MIPS updates from Thomas Bogendoerfer:
- Support for Mobileye EyeQ6Lplus
- Cleanups and fixes
* tag 'mips_7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (30 commits)
MIPS/mtd: Handle READY GPIO in generic NAND platform data
MIPS/input: Move RB532 button to GPIO descriptors
MIPS: validate DT bootargs before appending them
MIPS: Alchemy: Remove unused forward declaration
MAINTAINERS: Mobileye: Add EyeQ6Lplus files
MIPS: config: add eyeq6lplus_defconfig
MIPS: Add Mobileye EyeQ6Lplus evaluation board dts
MIPS: Add Mobileye EyeQ6Lplus SoC dtsi
clk: eyeq: Add Mobileye EyeQ6Lplus OLB
clk: eyeq: Adjust PLL accuracy computation
clk: eyeq: Skip post-divisor when computing PLL frequency
pinctrl: eyeq5: Add Mobileye EyeQ6Lplus OLB
pinctrl: eyeq5: Use match data
reset: eyeq: Add Mobileye EyeQ6Lplus OLB
MIPS: Add Mobileye EyeQ6Lplus support
dt-bindings: soc: mobileye: Add EyeQ6Lplus OLB
dt-bindings: mips: Add Mobileye EyeQ6Lplus SoC
MIPS: dts: loongson64g-package: Switch to Loongson UART driver
mips: pci-mt7620: rework initialization procedure
mips: pci-mt7620: add more register init values
...
The function kgdb_wait() was folded into the static function
kgdb_cpu_enter() by commit 62fae312197a ("kgdb: eliminate
kgdb_wait(), all cpus enter the same way"). Update the four stale
references accordingly:
- include/linux/kgdb.h and arch/x86/kernel/kgdb.c: the
kgdb_roundup_cpus() kdoc describes what other CPUs are rounded up
to call. Because kgdb_cpu_enter() is static, the correct public
entry point is kgdb_handle_exception(); also fix a pre-existing
grammar error ("get them be" -> "get them into") and reflow the
text.
- kernel/debug/debug_core.c: replace with the generic description
"the debug trap handler", since the actual entry path is
architecture-specific.
- kernel/debug/gdbstub.c: kgdb_cpu_enter() is correct here (it
describes internal state, not a call target); add the missing
parentheses.
Suggested-by: Daniel Thompson <daniel@riscstar.com>
Assisted-by: unnamed:deepseek-v3.2 coccinelle
Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
This odd file was added to automatically figure out tool-generated
symbols.
Honestly, it *should* have been just a real honest-to-goodness regular
file in git, instead of having strange code to generate it in the
Makefile, but that is not how that silly thing works. So now we need to
ignore it explicitly.
Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
STORE_FAILURES logs were only saved from fail(), so paths that reached
dodie() could exit without preserving failure logs.
That includes fatal hook paths such as:
POST_BUILD_DIE = 1
and ordinary failures when:
DIE_ON_FAILURE = 1
Call save_logs("fail", ...) from dodie() too so fatal failures keep the
same STORE_FAILURES artifacts as non-fatal fail() paths.
Cc: John Hawley <warthog9@eaglescrag.net>
Link: https://patch.msgid.link/20260318-ktest-fixes-v1-1-9dd94d46d84c@suse.com
Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit e4d32142d1de ("tracing: Fix tracefs mount options") moved the
option application from tracefs_fill_super() to tracefs_reconfigure()
called from tracefs_get_tree(). This fixed mount options being ignored
on user-space mounts when the superblock already exists, but introduced
a regression for the initial kernel-internal mount.
On the first mount (via simple_pin_fs during init), sget_fc() transfers
fc->s_fs_info to sb->s_fs_info and sets fc->s_fs_info to NULL. When
tracefs_get_tree() then calls tracefs_reconfigure(), it sees a NULL
fc->s_fs_info and returns early without applying any options. The root
inode keeps mode 0755 from simple_fill_super() instead of the intended
TRACEFS_DEFAULT_MODE (0700).
Furthermore, even on subsequent user-space mounts without an explicit
mode= option, tracefs_apply_options(sb, true) gates the mode behind
fsi->opts & BIT(Opt_mode), which is unset for the defaults. So the
mode is never corrected unless the user explicitly passes mode=0700.
Restore the tracefs_apply_options(sb, false) call in tracefs_fill_super()
to apply default permissions on initial superblock creation, matching
what debugfs does in debugfs_fill_super().
Cc: stable@vger.kernel.org
Fixes: e4d32142d1de ("tracing: Fix tracefs mount options")
Link: https://patch.msgid.link/20260404134747.98867-1-devnexen@gmail.com
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.
Pass task_rq(a).
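The deadlock pattern can be sketched outside the kernel: if the op
invocation does not record which runqueue is already held, a helper that
locks by CPU tries to take the held lock again. All names below
(helper_lock_cpu, invoke_callback, etc.) are illustrative, not the real
sched_ext API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for a per-CPU runqueue and its lock state. */
struct rq { int cpu; int locked; };

static struct rq rqs[2] = { { .cpu = 0 }, { .cpu = 1 } };
static struct rq *locked_rq;    /* analogous to scx_locked_rq_state */

/* Helper that re-acquires the rq for a CPU unless it is the one
 * recorded as already held -- the check the fix makes reachable. */
static int helper_lock_cpu(int cpu)
{
    struct rq *rq = &rqs[cpu];

    if (rq == locked_rq)
        return 0;               /* already held: don't lock again */
    if (rq->locked)
        return -1;              /* would deadlock on the held lock */
    rq->locked = 1;
    return 1;                   /* freshly acquired */
}

/* Caller analogous to scx_prio_less(): @held is locked on entry.
 * @pass is what the op invocation records -- NULL (the bug) or the
 * held rq (the fix). */
static int invoke_callback(struct rq *held, struct rq *pass)
{
    int ret;

    held->locked = 1;
    locked_rq = pass;
    ret = helper_lock_cpu(held->cpu);
    locked_rq = NULL;
    held->locked = 0;
    return ret;
}
```

With NULL recorded, the helper sees a locked rq it does not know it
holds; with the held rq recorded, it correctly skips the re-acquire.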
Fixes: 7b0888b7cc19 ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Pull clk fix from Stephen Boyd:
"One more fix for the merge window to avoid a boot hang on
Raspberry Pi 3B by marking the VEC clk critical so that it
doesn't get turned off and hang the bus"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: bcm: rpi: Mark VEC clock as CLK_IGNORE_UNUSED
The waitqueue must be initialized before the debugfs files are created
because from that point on, requests from userspace can already arrive.
Similarly, the drvdata and spinlock need to be initialized before we
request the channel; otherwise dangling irqs might run into problems
like a NULL pointer dereference.
Fixes: 8ea4484d0c2b ("mailbox: Add generic mechanism for testing Mailbox Controllers")
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
After falling back to the previous item in btrfs_delete_raid_extent(),
the code uses ASSERT(found_start <= start) to verify the found extent
actually precedes our target range. If the B-tree state is unexpected
(e.g. no overlapping extent exists), this triggers a kernel BUG/panic
in debug builds, or silently continues with wrong data otherwise.
Replace the ASSERT with a proper bounds check that returns -ENOENT if
the found extent does not actually overlap with the start position.
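The general pattern, a debug-only assertion that panics in debug builds
and silently passes otherwise versus an explicit check that fails
gracefully in all builds, can be sketched generically (the function and
its parameters are illustrative, not the btrfs code):

```c
#include <assert.h>
#include <errno.h>

/* Illustrative: after stepping back to the previous item, verify the
 * found extent actually starts at or before the target offset. An
 * ASSERT() here would panic in debug builds and be compiled out
 * otherwise; an explicit bounds check returns an error the caller can
 * handle in every build. */
static int check_prev_extent(unsigned long long found_start,
                             unsigned long long start)
{
    if (found_start > start)
        return -ENOENT;         /* no overlapping extent: report it */
    return 0;
}
```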
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull ext2, udf, quota updates from Jan Kara:
- A fix for a race in quota code that can expose ocfs2 to
use-after-free issues
- UDF fix to avoid memory corruption in face of corrupted format
- Couple of ext2 fixes for better handling of fs corruption
- Some more various code cleanups in UDF & ext2
* tag 'fs_for_v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: reject inodes with zero i_nlink and valid mode in ext2_iget()
ext2: use get_random_u32() where appropriate
quota: Fix race of dquot_scan_active() with quota deactivation
udf: fix partition descriptor append bookkeeping
ext2: avoid drop_nlink() during unlink of zero-nlink inode in ext2_unlink()
ext2: guard reservation window dump with EXT2FS_DEBUG
ext2: replace BUG_ON with WARN_ON_ONCE in ext2_get_blocks
ext2: remove stale TODO about kmap
fs: udf: avoid assignment in condition when selecting allocation goal