commits
In unplug_oldest_pwq(), the first inactive work item on the
pool_workqueue is activated correctly. However, if multiple inactive
works exist on the same pool_workqueue, subsequent works fail to
activate because wq_node_nr_active.pending_pwqs is empty — the list
insertion is skipped when the pool_workqueue is plugged.
Fix this by checking for additional inactive works in
unplug_oldest_pwq() and updating wq_node_nr_active.pending_pwqs
accordingly.
Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues")
Cc: stable@vger.kernel.org
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Ryan Neph <ryanneph@google.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Try to be more explicit why the workqueue watchdog does not take
pool->lock by default. Spin locks are full memory barriers which
delay anything. Obviously, they would primary delay operations
on the related worker pools.
Explain why it is enough to prevent the false positive by re-checking
the timestamp under the pool->lock.
Finally, make it clear what would be the alternative solution in
__queue_work() which is a hotter path.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
On weakly ordered architectures (e.g., arm64), the lockless check in
wq_watchdog_timer_fn() can observe a reordering between the worklist
insertion and the last_progress_ts update. Specifically, the watchdog
can see a non-empty worklist (from a list_add) while reading a stale
last_progress_ts value, causing a false positive stall report.
This was confirmed by reading pool->last_progress_ts again after holding
pool->lock in wq_watchdog_timer_fn():
workqueue watchdog: pool 7 false positive detected!
lockless_ts=4784580465 locked_ts=4785033728
diff=453263ms worklist_empty=0
To avoid slowing down the hot path (queue_work, etc.), recheck
last_progress_ts with pool->lock held. This will eliminate the false
positive with minimal overhead.
Remove two extra empty lines in wq_watchdog_timer_fn() as we are on it.
Fixes: 82607adcf9cd ("workqueue: implement lockup detector")
Cc: stable@vger.kernel.org # v4.5+
Assisted-by: claude-code:claude-opus-4-6
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
show_cpu_pool_hog() and show_cpu_pools_hogs() no longer only dump CPU
hogs — since commit 8823eaef45da ("workqueue: Show all busy workers in
stall diagnostics"), they dump every in-flight worker in the pool's
busy_hash.
Rename them to show_cpu_pool_busy_workers() and
show_cpu_pools_busy_workers() to accurately describe what they do.
Also fix the pr_info() message to say "stalled worker pools" instead of
"stalled CPU-bound worker pools", since sleeping/blocked workers are now
included.
No functional change.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a sample module under samples/workqueue/stall_detector/ that
reproduces a workqueue stall caused by PF_WQ_WORKER misuse. The
module queues two work items on the same per-CPU pool, then clears
PF_WQ_WORKER and sleeps in wait_event_idle(), hiding from the
concurrency manager and stalling the second work item indefinitely.
This is useful for testing the workqueue watchdog stall diagnostics.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
show_cpu_pool_hog() only prints workers whose task is currently running
on the CPU (task_is_running()). This misses workers that are busy
processing a work item but are sleeping or blocked — for example, a
worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a
worker still occupies a pool slot and prevents progress, yet produces
an empty backtrace section in the watchdog output.
This is happening on real arm64 systems, where
toggle_allocation_gate() IPIs every single CPU in the machine (which
lacks NMI), causing workqueue stalls that show empty backtraces because
toggle_allocation_gate() is sleeping in wait_event_idle().
Remove the task_is_running() filter so every in-flight worker in the
pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
which is already held.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
When diagnosing workqueue stalls, knowing how long each in-flight work
item has been executing is valuable. Add a current_start timestamp
(jiffies) to struct worker, set it when a work item begins execution in
process_one_work(), and print the elapsed wall-clock time in show_pwq().
Unlike current_at (which tracks CPU runtime and resets on wakeup for
CPU-intensive detection), current_start is never reset because the
diagnostic cares about total wall-clock time including sleeps.
Before: in-flight: 165:stall_work_fn [wq_stall]
After: in-flight: 165:stall_work_fn [wq_stall] for 100s
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The watchdog_ts name doesn't convey what the timestamp actually tracks.
This field tracks the last time a workqueue got progress.
Rename it to last_progress_ts to make it clear that it records when the
pool last made forward progress (started processing new work items).
No functional change.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
pr_cont_worker_id() checks pool->flags against WQ_BH, which is a
workqueue-level flag (defined in workqueue.h). Pool flags use a
separate namespace with POOL_* constants (defined in workqueue.c).
The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined
as (1 << 0) so this has no behavioral impact, but it is semantically
wrong and inconsistent with every other pool-level BH check in the
file.
Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull module fixes from Sami Tolvanen:
- Fix a potential kernel panic in the module loader by adding a bounds
check for the ELF section index. This prevents crashes if attempting
to load a module that uses SHN_XINDEX or is corrupted.
- Fix the Kconfig menu layout for module versioning, signing, and
compression options so they correctly appear as submenus in
menuconfig.
- Remove a redundant lockdep_free_key_range() call in the load_module()
error path. This is already handled by module_deallocate() calling
free_mod_mem() since the module_memory rework.
* tag 'modules-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
module: Fix kernel panic when a symbol st_shndx is out of bounds
module: Fix the modversions and signing submenus
module: Remove duplicate freeing of lockdep classes
Pull vfs fixes from Christian Brauner:
- kthread: consolidate kthread exit paths to prevent use-after-free
- iomap:
- don't mark folio uptodate if read IO has bytes pending
- don't report direct-io retries to fserror
- reject delalloc mappings during writeback
- ns: tighten visibility checks
- netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict
sequence
* tag 'vfs-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: reject delalloc mappings during writeback
iomap: don't mark folio uptodate if read IO has bytes pending
selftests: fix mntns iteration selftests
nstree: tighten permission checks for listing
nsfs: tighten permission checks for handle opening
nsfs: tighten permission checks for ns iteration ioctls
netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence
kthread: consolidate kthread exit paths to prevent use-after-free
iomap: don't report direct-io retries to fserror
The module loader doesn't check for bounds of the ELF section index in
simplify_symbols():
for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
const char *name = info->strtab + sym[i].st_name;
switch (sym[i].st_shndx) {
case SHN_COMMON:
[...]
default:
/* Divert to percpu allocation if a percpu var. */
if (sym[i].st_shndx == info->index.pcpu)
secbase = (unsigned long)mod_percpu(mod);
else
/** HERE --> **/ secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
sym[i].st_value += secbase;
break;
}
}
A symbol with an out-of-bounds st_shndx value, for example 0xffff
(known as SHN_XINDEX or SHN_HIRESERVE), may cause a kernel panic:
BUG: unable to handle page fault for address: ...
RIP: 0010:simplify_symbols+0x2b2/0x480
...
Kernel panic - not syncing: Fatal exception
This can happen when module ELF is legitimately using SHN_XINDEX or
when it is corrupted.
Add a bounds check in simplify_symbols() to validate that st_shndx is
within the valid range before using it.
This issue was discovered due to a bug in llvm-objcopy, see relevant
discussion for details [1].
[1] https://lore.kernel.org/linux-modules/20251224005752.201911-1-ihor.solodrai@linux.dev/
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Pull sysctl fix from Joel Granados:
- Fix error when reporting jiffies converted values back to user space
Return the converted value instead of "Invalid argument" error
* tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZ
Filesystems should never provide a delayed allocation mapping to
writeback; they're supposed to allocate the space before replying.
This can lead to weird IO errors and crashes in the block layer if the
filesystem is being malicious, or if it hadn't set iomap->dev because
it's a delalloc mapping.
Fix this by failing writeback on delalloc mappings. Currently no
filesystems actually misbehave in this manner, but we ought to be
stricter about things like that.
Cc: stable@vger.kernel.org # v5.5
Fixes: 598ecfbaa742ac ("iomap: lift the xfs writeback code to iomap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20260302173002.GL13829@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The module Kconfig file contains a set of options related to "Module
versioning support" (depends on MODVERSIONS) and "Module signature
verification" (depends on MODULE_SIG). The Kconfig tool automatically
creates submenus when an entry for a symbol is followed by consecutive
items that all depend on the symbol. However, this functionality doesn't
work for the mentioned module options. The MODVERSIONS options are
interleaved with ASM_MODVERSIONS, which has no 'depends on MODVERSIONS' but
instead uses 'default HAVE_ASM_MODVERSIONS && MODVERSIONS'. Similarly, the
MODULE_SIG options are interleaved by a comment warning not to forget
signing modules with scripts/sign-file, which uses the condition 'depends
on MODULE_SIG_FORCE && !MODULE_SIG_ALL'.
The result is that the options are confusingly shown when using
a menuconfig tool, as follows:
[*] Module versioning support
Module versioning implementation (genksyms (from source code)) --->
[ ] Extended Module Versioning Support
[*] Basic Module Versioning Support
[*] Source checksum for all modules
[*] Module signature verification
[ ] Require modules to be validly signed
[ ] Automatically sign all modules
Hash algorithm to sign modules (SHA-256) --->
Fix the issue by using if/endif to group related options together in
kernel/module/Kconfig, similarly to how the MODULE_DEBUG options are
already grouped. Note that the signing-related options depend on
'MODULE_SIG || IMA_APPRAISE_MODSIG', with the exception of
MODULE_SIG_FORCE, which is valid only for MODULE_SIG and is therefore kept
separately. For consistency, do the same for the MODULE_COMPRESS entries.
The options are then properly placed into submenus, as follows:
[*] Module versioning support
Module versioning implementation (genksyms (from source code)) --->
[ ] Extended Module Versioning Support
[*] Basic Module Versioning Support
[*] Source checksum for all modules
[*] Module signature verification
[ ] Require modules to be validly signed
[ ] Automatically sign all modules
Hash algorithm to sign modules (SHA-256) --->
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Pull media fix from Mauro Carvalho Chehab:
"Fix for MPEG-TS decoder in dvb-net"
* tag 'media/v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: dvb-net: fix OOB access in ULE extension header tables
Commit 2dc164a48e6fd ("sysctl: Create converter functions with two new
macros") incorrectly returns error to user space when jiffies sysctl
converter is used. The old overflow check got replaced with an
unconditional one:
+ if (USER_HZ < HZ)
+ return -EINVAL;
which will always be true on configurations with "USER_HZ < HZ".
Remove the check; it is no longer needed as clock_t_to_jiffies() returns
ULONG_MAX for the overflow case and proc_int_u2k_conv_uop() checks for
"> INT_MAX" after conversion
Fixes: 2dc164a48e6fd ("sysctl: Create converter functions with two new macros")
Reported-by: Colm Harrington <colm.harrington@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Joanne Koong <joannelkoong@gmail.com> says:
This is a fix for this scenario:
->read_folio() gets called on a folio size that is 16k while the file is 4k:
a) ifs->read_bytes_pending gets initialized to 16k
b) ->read_folio_range() is called for the 4k read
c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
0 to 4k range is marked uptodate
d) the post-eof blocks are zeroed and marked uptodate in the call to
iomap_set_range_uptodate()
e) iomap_set_range_uptodate() sees all the ranges are marked
uptodate and it marks the folio uptodate
f) iomap_read_end() gets called to subtract the 12k from
ifs->read_bytes_pending. it too sees all the ranges are marked
uptodate and marks the folio uptodate using XOR
g) the XOR call clears the uptodate flag on the folio
The same situation can occur if the last range read for the folio is done as
an inline read and all the previous ranges have already completed by the time
the inline read completes.
For more context, the full discussion can be found in [1]. There was a
discussion about alternative approaches in that thread, but they had more
complications.
There is another discussion in v1 [2] about consolidating the read paths.
Until that is resolved, this patch fixes the issue.
[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z9za5w4FoJqTGx50zR2haHHaoot1KJViQyEHJQq4=34w@mail.gmail.com/#t
[2] https://lore.kernel.org/linux-fsdevel/20260219003911.344478-1-joannelkoong@gmail.com/T/#u
* patches from https://patch.msgid.link/20260303233420.874231-1-joannelkoong@gmail.com:
iomap: don't mark folio uptodate if read IO has bytes pending
Link: https://patch.msgid.link/20260303233420.874231-1-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the error path of load_module(), under the free_module label, the
code calls lockdep_free_key_range() to release lock classes associated
with the MOD_DATA, MOD_RODATA and MOD_RO_AFTER_INIT module regions, and
subsequently invokes module_deallocate().
Since commit ac3b43283923 ("module: replace module_layout with
module_memory"), the module_deallocate() function calls free_mod_mem(),
which releases the lock classes as well and considers all module
regions.
Attempting to free these classes twice is unnecessary. Remove the
redundant code in load_module().
Fixes: ac3b43283923 ("module: replace module_layout with module_memory")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Pull pin control fixes from Linus Walleij:
"All of these are driver fixes except a memory leak in the
pinconf_generic_parse_dt_config() helper which is the most
important fix.
- Rename and fix up the Intel Equilibrium immutable interrupt chip
- Handle the Qualcomm QCS615 dual edge GPIO IRQ by adding the right
flag
- Fix a memory leak in the widely used pinconf_generic_parse_dt_config()
and a more local leak in aml_dt_node_to_map_pinmux()
- Fix double put in the Cirrus cs42l43_pin_probe()
- Staticize amdisp_pinctrl_ops, Qualcomm SDM660 groups and functions
- Unexport CIX sky1_pinctrl_pm_ops
- Fix configuration of deferred pin in the Rockchip driver
- Implement .get_direction() in the Sunxi driver squelching a dmesg
warning message
- Fix a readout of the last bank of registers in the Cypress CY8C95x0
driver"
* tag 'pinctrl-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: cy8c95x0: Don't miss reading the last bank registers
pinctrl: sunxi: Implement gpiochip::get_direction()
pinctrl: rockchip: Fix configuring a deferred pin
pinctrl: cirrus: cs42l43: Fix double-put in cs42l43_pin_probe()
pinctrl: meson: amlogic-a4: Fix device node reference leak in aml_dt_node_to_map_pinmux()
pinctrl: qcom: sdm660-lpass-lpi: Make groups and functions variables static
pinctrl: cix: sky1: Unexport sky1_pinctrl_pm_ops
pinctrl: amdisp: Make amdisp_pinctrl_ops variable static
pinctrl: pinconf-generic: Fix memory leak in pinconf_generic_parse_dt_config()
pinctrl: qcom: qcs615: Add missing dual edge GPIO IRQ errata flag
pinctrl: equilibrium: fix warning trace on load
pinctrl: equilibrium: rename irq_chip function callbacks
The ule_mandatory_ext_handlers[] and ule_optional_ext_handlers[] tables
in handle_one_ule_extension() are declared with 255 elements (valid
indices 0-254), but the index htype is derived from network-controlled
data as (ule_sndu_type & 0x00FF), giving a range of 0-255. When
htype equals 255, an out-of-bounds read occurs on the function pointer
table, and the OOB value may be called as a function pointer.
Add a bounds check on htype against the array size before either table
is accessed. Out-of-range values now cause the SNDU to be discarded.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Ariel Silver <arielsilver77@gmail.com>
Signed-off-by: Ariel Silver <arielsilver77@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Christian Brauner <brauner@kernel.org> says:
Listing various namespaces is currently only scoped on owning namespace.
We can make this more fine-grained so that we scope visibility even
tighter. To make it possible to change behavior restrict visibility for
now. This shouldn't be a big deal as there aren't actual large users out
there and paves the way to make this even cleaner in the future.
* patches from https://patch.msgid.link/20260226-work-visibility-fixes-v1-0-d2c2853313bd@kernel.org:
selftests: fix mntns iteration selftests
nstree: tighten permission checks for listing
nsfs: tighten permission checks for handle opening
nsfs: tighten permission checks for ns iteration ioctls
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-0-d2c2853313bd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
If a folio has ifs metadata attached to it and the folio is partially
read in through an async IO helper with the rest of it then being read
in through post-EOF zeroing or as inline data, and the helper
successfully finishes the read first, then post-EOF zeroing / reading
inline will mark the folio as uptodate in iomap_set_range_uptodate().
This is a problem because when the read completion path later calls
iomap_read_end(), it will call folio_end_read(), which sets the uptodate
bit using XOR semantics. Calling folio_end_read() on a folio that was
already marked uptodate clears the uptodate bit.
Fix this by not marking the folio as uptodate if the read IO has bytes
pending. The folio uptodate state will be set in the read completion
path through iomap_end_read() -> folio_end_read().
Reported-by: Wei Gao <wegao@suse.com>
Suggested-by: Sasha Levin <sashal@kernel.org>
Tested-by: Wei Gao <wegao@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: stable@vger.kernel.org # v6.19
Link: https://lore.kernel.org/linux-fsdevel/aYbmy8JdgXwsGaPP@autotest-wegao.qe.prg2.suse.org/
Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260303233420.874231-2-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull cgroup fixes from Tejun Heo:
- Fix circular locking dependency in cpuset partition code by
deferring housekeeping_update() calls to a workqueue instead
of calling them directly under cpus_read_lock
- Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when
generate_sched_domains() returns NULL due to kmalloc failure
- Fix incorrect cpuset behavior for effective_xcpus in
partition_xcpus_del() and cpuset_update_tasks_cpumask()
in update_cpumasks_hier()
- Fix race between task migration and cgroup iteration
* tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked
cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
cgroup: fix race between task migration and iteration
When code had been changed to use for_each_set_clump8(), it mistakenly
switched from chip->nport to chip->tpin since the cy8c9540 and cy8c9560
have a 4-pin gap. This, in particular, led to the missed read of
the last bank interrupt status register and hence missing interrupts
on those pins. Restore the upper limit in for_each_set_clump8() to take
into consideration that gap.
Fixes: 83e29a7a1fdf ("pinctrl: cy8c95x0; Switch to use for_each_set_clump8()")
Cc: stable@vger.kernel.org
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull kvm fixes from Paolo Bonzini:
"Arm:
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature
set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered... [his words, not mine]
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Generic:
- Remove internal Kconfigs that are now set on all architectures
- Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all
architectures finally enable it in Linux 7.0"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: always define KVM_CAP_SYNC_MMU
KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: arm64: Deduplicate ASID retrieval code
irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag
KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs
KVM: arm64: Fix protected mode handling of pages larger than 4kB
KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type
KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state()
KVM: arm64: Fix ID register initialization for non-protected pKVM guests
KVM: arm64: Optimise away S1POE handling when not supported by host
KVM: arm64: Hide S1POE from guests when not supported by the host
Fix netfslib such that when it's making an unbuffered or DIO write, to make
sure that it sends each subrequest strictly sequentially, waiting till the
previous one is 'committed' before sending the next so that we don't have
pieces landing out of order and potentially leaving a hole if an error
occurs (ENOSPC for example).
This is done by copying in just those bits of issuing, collecting and
retrying subrequests that are necessary to do one subrequest at a time.
Retrying, in particular, is simpler because if the current subrequest needs
retrying, the source iterator can just be copied again and the subrequest
prepped and issued again without needing to be concerned about whether it
needs merging with the previous or next in the sequence.
Note that the issuing loop waits for a subrequest to complete right after
issuing it, but this wait could be moved elsewhere allowing preparatory
steps to be performed whilst the subrequest is in progress. In particular,
once content encryption is available in netfslib, that could be done whilst
waiting, as could cleanup of buffers that have been completed.
Fixes: 153a9961b551 ("netfs: Implement unbuffered/DIO write support")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/58526.1772112753@warthog.procyon.org.uk
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that we changed permission checking make sure that we reflect that
in the selftests.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-4-d2c2853313bd@kernel.org
Fixes: 9d87b1067382 ("selftests: add tests for mntns iteration")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.14+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull fsverity fixes from Eric Biggers:
- Fix a build error on parisc
- Remove the non-large-folio-aware function fsverity_verify_page()
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: fix build error by adding fsverity_readahead() stub
fsverity: remove fsverity_verify_page()
f2fs: make f2fs_verify_cluster() partially large-folio-aware
f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
Pull sched_ext fixes from Tejun Heo:
- Fix starvation of scx_enable() under fair-class saturation by
offloading the enable path to an RT kthread
- Fix out-of-bounds access in idle mask initialization on systems with
non-contiguous NUMA node IDs
- Fix a preemption window during scheduler exit and a refcount
underflow in cgroup init error path
- Fix SCX_EFLAG_INITIALIZED being a no-op flag
- Add READ_ONCE() annotations for KCSAN-clean lockless accesses and
replace naked scx_root dereferences with container_of() in kobject
callbacks
- Tooling and selftest fixes: compilation issues with clang 17,
strtoul() misuse, unused options cleanup, and Kconfig sync
* tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix starvation of scx_enable() under fair-class saturation
sched_ext: Remove redundant css_put() in scx_cgroup_init()
selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17
selftests/sched_ext: Add -fms-extensions to bpf build flags
tools/sched_ext: Add -fms-extensions to bpf build flags
sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout
sched_ext: Replace naked scx_root dereferences in kobject callbacks
sched_ext: Use READ_ONCE() for the read side of dsq->nr update
tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq()
sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag
sched_ext: Fix out-of-bounds access in scx_idle_init_masks()
sched_ext: Disable preemption between scx_claim_exit() and kicking helper work
tools/sched_ext: Add Kconfig to sync with upstream
tools/sched_ext: Sync README.md Kconfig with upstream scx
selftests/sched_ext: Remove duplicated unistd.h include in rt_stall.c
tools/sched_ext: scx_sdt: Remove unused '-f' option
tools/sched_ext: scx_central: Remove unused '-p' option
selftests/sched_ext: Fix unused-result warning for read()
selftests/sched_ext: Abort test loop on signal
A null-pointer-dereference bug was reported by syzbot:
Oops: general protection fault, probably for address 0xdffffc0000000000:
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:bitmap_subset include/linux/bitmap.h:433 [inline]
RIP: 0010:cpumask_subset include/linux/cpumask.h:836 [inline]
RIP: 0010:rebuild_sched_domains_locked kernel/cgroup/cpuset.c:967
RSP: 0018:ffffc90003ecfbc0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000020
RDX: ffff888028de0000 RSI: ffffffff8200f003 RDI: ffffffff8df14f28
RBP: 0000000000000000 R08: 0000000000000cc0 R09: 00000000ffffffff
R10: ffffffff8e7d95b3 R11: 0000000000000001 R12: 0000000000000000
R13: 00000000000f4240 R14: dffffc0000000000 R15: 0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2f463fff CR3: 000000003704c000 CR4: 00000000003526f0
Call Trace:
<TASK>
rebuild_sched_domains_cpuslocked kernel/cgroup/cpuset.c:983 [inline]
rebuild_sched_domains+0x21/0x40 kernel/cgroup/cpuset.c:990
sched_rt_handler+0xb5/0xe0 kernel/sched/rt.c:2911
proc_sys_call_handler+0x47f/0x5a0 fs/proc/proc_sysctl.c:600
new_sync_write fs/read_write.c:595 [inline]
vfs_write+0x6ac/0x1070 fs/read_write.c:688
ksys_write+0x12a/0x250 fs/read_write.c:740
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The issue occurs when generate_sched_domains() returns ndoms = 1 and
doms = NULL due to a kmalloc failure. This leads to a null-pointer
dereference when accessing doms in rebuild_sched_domains_locked().
Fix this by adding a NULL check for doms before accessing it.
Fixes: 6ee43047e8ad ("cpuset: Remove unnecessary checks in rebuild_sched_domains_locked")
Reported-by: syzbot+460792609a79c085f79f@syzkaller.appspotmail.com
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After commit 471e998c0e31 ("gpiolib: remove redundant callback check"),
a warning will be printed if the gpio driver does not implement this
callback. The warning was added in commit e623c4303ed1 ("gpiolib:
sanitize the return value of gpio_chip::get_direction()"), but was
masked by the "redundant" check.
The warning can be triggered by any action that calls the callback,
such as dumping the GPIO state from /sys/kernel/debug/gpio.
Implement it for the sunxi driver. This is simply a matter of reading
out the mux value from the registers, then checking if it is one of
the GPIO functions and which direction it is.
Signed-off-by: Chen-Yu Tsai <wens@kernel.org>
Reviewed-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull debugobjects fix from Thomas Gleixner:
"A single fix for debugobjects.
The deferred page initialization prevents debug objects from
allocating slab pages until the initialization is complete. That
causes depletion of the pool and disabling of debugobjects.
The reason is that debugobjects uses __GFP_HIGH for allocations as it
might be invoked from arbitrary contexts. When PREEMPT_COUNT is
disabled there is no way to know whether the context is safe to set
__GFP_KSWAPD_RECLAIM.
This worked until v6.18. Since then allocations w/o a reclaim flag
cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(),
which returns early when deferred page initialization has not yet
completed.
Work around that when PREEMPT_COUNT is enabled as the preempt counter
allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when
the context is preemtible. When PREEMPT_COUNT is disabled the context
is unknown and the reclaim bit can't be set because the caller might
hold locks which might deadlock in the allocator.
That makes debugobjects depend on PREEMPT_COUNT ||
!DEFERRED_STRUCT_PAGE_INIT, which limits the coverage slightly, but
keeps it functional for most cases"
* tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobject: Make it work with deferred page initialization - again
KVM/arm64 fixes for 7.0, take #1
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered...
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Guillaume reported crashes via corrupted RCU callback function pointers
during KUnit testing. The crash was traced back to the pidfs rhashtable
conversion which replaced the 24-byte rb_node with an 8-byte rhash_head
in struct pid, shrinking it from 160 to 144 bytes.
struct kthread (without CONFIG_BLK_CGROUP) is also 144 bytes. With
CONFIG_SLAB_MERGE_DEFAULT and SLAB_HWCACHE_ALIGN both round up to
192 bytes and share the same slab cache. struct pid.rcu.func and
struct kthread.affinity_node both sit at offset 0x78.
When a kthread exits via make_task_dead() it bypasses kthread_exit() and
misses the affinity_node cleanup. free_kthread_struct() frees the memory
while the node is still linked into the global kthread_affinity_list. A
subsequent list_del() by another kthread writes through dangling list
pointers into the freed and reused memory, corrupting the pid's
rcu.func pointer.
Instead of patching free_kthread_struct() to handle the missed cleanup,
consolidate all kthread exit paths. Turn kthread_exit() into a macro
that calls do_exit() and add kthread_do_exit() which is called from
do_exit() for any task with PF_KTHREAD set. This guarantees that
kthread-specific cleanup always happens regardless of the exit path -
make_task_dead(), direct do_exit(), or kthread_exit().
Replace __to_kthread() with a new tsk_is_kthread() accessor in the
public header. Export do_exit() since module code using the
kthread_exit() macro now needs it directly.
Reported-by: Guillaume Tucker <gtucker@gtucker.io>
Tested-by: Guillaume Tucker <gtucker@gtucker.io>
Tested-by: Mark Brown <broonie@kernel.org>
Tested-by: David Gow <davidgow@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20260224-mittlerweile-besessen-2738831ae7f6@brauner
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 4d13f4304fa4 ("kthread: Implement preferred affinity")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-3-d2c2853313bd@kernel.org
Fixes: 76b6f5dfb3fd ("nstree: add listns()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.19+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull crypto library fix from Eric Biggers:
"Fix a big endian specific issue in the PPC64-optimized AES code"
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: powerpc/aes: Fix rndkey_from_vsx() on big endian CPUs
hppa-linux-gcc 9.5.0 generates a call to fsverity_readahead() in
f2fs_readahead() when CONFIG_FS_VERITY=n, because it fails to do the
expected dead code elimination based on vi always being NULL. Fix the
build error by adding an inline stub for fsverity_readahead(). Since
it's just for opportunistic readahead, just make it a no-op.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602180838.pwICdY2r-lkp@intel.com/
Fixes: 45dcb3ac9832 ("f2fs: consolidate fsverity_info lookup")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218012244.18536-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Pull btrfs fixes from David Sterba:
"One-liner or short fixes for minor/moderate problems reported recently:
- fixes or level adjustments of error messages
- fix leaked transaction handles after aborted transactions, when
using the remap tree feature
- fix a few leaked chunk maps after errors
- fix leaked page array in io_uring encoded read if an error occurs
and the 'finished' is not called
- fix double release of reserved extents when doing a range COW
- don't commit super block when the filesystem is in shutdown state
- fix squota accounting condition when checking members vs parent
usage
- other error handling fixes"
* tag 'for-7.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: check block group lookup in remove_range_from_remap_tree()
btrfs: fix transaction handle leaks in btrfs_last_identity_remap_gone()
btrfs: fix chunk map leak in btrfs_map_block() after btrfs_translate_remap()
btrfs: fix chunk map leak in btrfs_map_block() after btrfs_chunk_map_num_copies()
btrfs: fix compat mask in error messages in btrfs_check_features()
btrfs: print correct subvol num if active swapfile prevents deletion
btrfs: fix warning in scrub_verify_one_metadata()
btrfs: fix objectid value in error message in check_extent_data_ref()
btrfs: fix incorrect key offset in error message in check_dev_extent_item()
btrfs: fix error message order of parameters in btrfs_delete_delayed_dir_index()
btrfs: don't commit the super block when unmounting a shutdown filesystem
btrfs: free pages on error in btrfs_uring_read_extent()
btrfs: fix referenced/exclusive check in squota_check_parent_usage()
btrfs: remove pointless WARN_ON() in cache_save_setup()
btrfs: convert log messages to error level in btrfs_replay_log()
btrfs: remove btrfs_handle_fs_error() after failure to recover log trees
btrfs: remove redundant warning message in btrfs_check_uuid_tree()
btrfs: change warning messages to error level in open_ctree()
btrfs: fix a double release on reserved extents in cow_one_range()
btrfs: handle discard errors in in btrfs_finish_extent_commit()
During scx_enable(), the READY -> ENABLED task switching loop changes the
calling thread's sched_class from fair to ext. Since fair has higher
priority than ext, saturating fair-class workloads can indefinitely starve
the enable thread, hanging the system. This was introduced when the enable
path switched from preempt_disable() to scx_bypass() which doesn't protect
against fair-class starvation. Note that the original preempt_disable()
protection wasn't complete either - in partial switch modes, the calling
thread could still be starved after preempt_enable() as it may have been
switched to ext class.
Fix it by offloading the enable body to a dedicated system-wide RT
(SCHED_FIFO) kthread which cannot be starved by either fair or ext class
tasks. scx_enable() lazily creates the kthread on first use and passes the
ops pointer through a struct scx_enable_cmd containing the kthread_work,
then synchronously waits for completion.
The workfn runs on a different kthread from sch->helper (which runs
disable_work), so it can safely flush disable_work on the error path
without deadlock.
Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo <tj@kernel.org>
The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.
The housekeeping cpumask update requires flushing a number of different
workqueues which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquiring locks
which have locking dependency with cpus_read_lock() down the chain. Below
is an example of such circular locking problem.
======================================================
WARNING: possible circular locking dependency detected
6.18.0-test+ #2 Tainted: G S
------------------------------------------------------
test_cpuset_prs/10971 is trying to acquire lock:
ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
but task is already holding lock:
ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (cpuset_mutex){+.+.}-{4:4}:
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
-> #2 (rtnl_mutex){+.+.}-{4:4}:
-> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
-> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
Chain exists of:
(wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
5 locks held by test_cpuset_prs/10971:
#0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
#1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
#2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
#3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
#4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
Call Trace:
<TASK>
:
touch_wq_lockdep_map+0x93/0x180
__flush_workqueue+0x111/0x10b0
housekeeping_update+0x12d/0x2d0
update_parent_effective_cpumask+0x595/0x2440
update_prstate+0x89d/0xce0
cpuset_partition_write+0xc5/0x130
cgroup_file_write+0x1a5/0x680
kernfs_fop_write_iter+0x3df/0x5f0
vfs_write+0x525/0xfd0
ksys_write+0xf9/0x1d0
do_syscall_64+0x95/0x520
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.
So housekeeping_update() is now called after releasing cpus_read_lock
and cpuset_mutex at the end of a cpuset operation. These two locks are
then re-acquired later before calling rebuild_sched_domains_locked().
To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in production environment.
As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.
The lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.
Fixes: 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit e2c58cbe3aff ("pinctrl: rockchip: Simplify locking with
scoped_guard()") added a scoped_guard() over existing code containing a
"break" instruction. That "break" was for the outer (existing)
for-loop, which now exits inner, scoped_guard() loop. If GPIO driver
did not probe, then driver will not bail out, but instead continue to
configure the pin.
Fix the issue by simplifying the code - the break in original code was
leading directly to end of the function returning 0, thus we can simply
return here rockchip_pinconf_defer_pin status.
Reported-by: David Lechner <dlechner@baylibre.com>
Closes: https://lore.kernel.org/r/f5b38942-a584-4e78-a893-de4a219070b2@baylibre.com/
Fixes: e2c58cbe3aff ("pinctrl: rockchip: Simplify locking with scoped_guard()")
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull x86 fixes from Ingo Molnar:
- Fix speculative safety in fred_extint()
- Fix __WARN_printf() trap in early_fixup_exception()
- Fix clang-build boot bug for unusual alignments, triggered by
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y
- Replace the final few __ASSEMBLY__ stragglers that snuck in lately
into non-UAPI x86 headers and use __ASSEMBLER__ consistently (again)
* tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/headers: Replace __ASSEMBLY__ stragglers with __ASSEMBLER__
x86/cfi: Fix CFI rewrite for odd alignments
x86/bug: Handle __WARN_printf() trap in early_fixup_exception()
x86/fred: Correct speculative safety in fred_extint()
debugobjects uses __GFP_HIGH for allocations as it might be invoked
within locked regions. That worked perfectly fine until v6.18. It still
works correctly when deferred page initialization is disabled and works
by chance when no page allocation is required before deferred page
initialization has completed.
Since v6.18 allocations w/o a reclaim flag cause new_slab() to end up in
alloc_frozen_pages_nolock_noprof(), which returns early when deferred
page initialization has not yet completed. As the deferred page
initialization takes quite a while the debugobject pool is depleted and
debugobjects are disabled.
This can be worked around when PREEMPT_COUNT is enabled as that allows
debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when the context
is preemtible. When PREEMPT_COUNT is disabled the context is unknown and
the reclaim bit can't be set because the caller might hold locks which
might deadlock in the allocator.
In preemptible context the reclaim bit is harmless and not a performance
issue as that's usually invoked from slow path initialization context.
That makes debugobjects depend on PREEMPT_COUNT || !DEFERRED_STRUCT_PAGE_INIT.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/87pl6gznti.ffs@tglx
KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always
available. Move the definition from individual architectures to common
code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We currently have three versions of the ASID retrieval code, one
in the S1 walker, and two in the VNCR handling (although the last
two are limited to the EL2&0 translation regime).
Make this code common, and take this opportunity to also simplify
the code a bit while switching over to the TTBRx_EL1_ASID macro.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260225104718.14209-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
iomap's directio implementation has two magic errno codes that it uses
to signal callers -- ENOTBLK tells the filesystem that it should retry
a write with the pagecache; and EAGAIN tells the caller that pagecache
flushing or invalidation failed and that it should try again.
Neither of these indicate data loss, so let's not report them.
Fixes: a9d573ee88af98 ("iomap: report file I/O errors to the VFS")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20260224154637.GD2390381@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-2-d2c2853313bd@kernel.org
Fixes: 5222470b2fbb ("nsfs: support file handles")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.18+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stephen retired and stepped back from -next maintainership, update his
entry in CREDITS to recognise his 18 years of hard work making it what
it is today and all the impact it's had on our development process.
Also update to his current GnuPG key while we're here.
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I finally got a big endian PPC64 kernel to boot in QEMU. The PPC64 VSX
optimized AES library code does work in that case, with the exception of
rndkey_from_vsx() which doesn't take into account that the order in
which the VSX code stores the round key words depends on the endianness.
So fix rndkey_from_vsx() to do the right thing on big endian CPUs.
Fixes: 7cf2082e74ce ("lib/crypto: powerpc/aes: Migrate POWER8 optimized code into library")
Link: https://lore.kernel.org/r/20260216022104.332991-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Now that fsverity_verify_page() has no callers, remove it.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218010630.7407-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
If a 'const struct foo __user *ptr' is used for the address passed to
scoped_user_read_access() then you get a warning/error
uaccess.h:691:1: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
for the
void __user *_tmpptr = __scoped_user_access_begin(mode, uptr, size, elbl)
assignment.
Fix by using 'auto' for both _tmpptr and the redeclaration of uptr.
Replace the CLASS() with explicit __cleanup() functions on uptr.
Fixes: e497310b4ffb ("uaccess: Provide scoped user access regions")
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-and-tested-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a check in remove_range_from_remap_tree() after we call
btrfs_lookup_block_group(), to check if it is NULL. This shouldn't
happen, but if it does we at least get an error rather than a segfault.
Reported-by: Chris Mason <clm@fb.com>
Link: https://lore.kernel.org/linux-btrfs/20260125125129.2245240-1-clm@meta.com/
Fixes: 979e1dc3d69e ("btrfs: handle deletions from remapped block group")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The iterator css_for_each_descendant_pre() walks the cgroup hierarchy
under cgroup_lock(). It does not increment the reference counts on
yielded css structs.
According to the cgroup documentation, css_put() should only be used
to release a reference obtained via css_get() or css_tryget_online().
Since the iterator does not use either of these to acquire a reference,
calling css_put() in the error path of scx_cgroup_init() causes a
refcount underflow.
Remove the unbalanced css_put() to prevent a potential Use-After-Free
(UAF) vulnerability.
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
for instance, when an isolated partition is invalidated because its
last active CPU has been put offline.
As we are going to enable dynamic update to the nozh_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause deadlock. So we
have to defer any call to housekeeping_update() after the CPU hotplug
operation has finished. This is now done via the workqueue where
the update_hk_sched_domains() function will be invoked via the
hk_sd_workfn().
An concurrent cpuset control file write may have executed the required
update_hk_sched_domains() function before the work function is called. So
the work function call may become a no-op when it is invoked.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
devm_add_action_or_reset() already invokes the action on failure,
so the explicit put causes a double-put.
Fixes: 9b07cdf86a0b ("pinctrl: cirrus: Fix fwnode leak in cs42l43_pin_probe()")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull timer fix from Ingo Molnar:
"Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for
the common HZ=100, 250 or 1000 cases. Only use a function call for odd
HZ values like HZ=300 that generate more code.
The function call overhead showed up in performance tests of the TCP
code"
* tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
After converting the __ASSEMBLY__ statements to __ASSEMBLER__ in
commit 24a295e4ef1ca ("x86/headers: Replace __ASSEMBLY__ with
__ASSEMBLER__ in non-UAPI headers"), some new code has been
added that uses __ASSEMBLY__ again. Convert these stragglers, too.
This is a mechanical patch, done with a simple "sed -i" command.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251218182029.166993-1-thuth@redhat.com
In unplug_oldest_pwq(), the first inactive work item on the
pool_workqueue is activated correctly. However, if multiple inactive
works exist on the same pool_workqueue, subsequent works fail to
activate because wq_node_nr_active.pending_pwqs is empty — the list
insertion is skipped when the pool_workqueue is plugged.
Fix this by checking for additional inactive works in
unplug_oldest_pwq() and updating wq_node_nr_active.pending_pwqs
accordingly.
Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues")
Cc: stable@vger.kernel.org
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Ryan Neph <ryanneph@google.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Try to be more explicit why the workqueue watchdog does not take
pool->lock by default. Spin locks are full memory barriers which
delay anything. Obviously, they would primary delay operations
on the related worker pools.
Explain why it is enough to prevent the false positive by re-checking
the timestamp under the pool->lock.
Finally, make it clear what would be the alternative solution in
__queue_work() which is a hotter path.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
On weakly ordered architectures (e.g., arm64), the lockless check in
wq_watchdog_timer_fn() can observe a reordering between the worklist
insertion and the last_progress_ts update. Specifically, the watchdog
can see a non-empty worklist (from a list_add) while reading a stale
last_progress_ts value, causing a false positive stall report.
This was confirmed by reading pool->last_progress_ts again after holding
pool->lock in wq_watchdog_timer_fn():
workqueue watchdog: pool 7 false positive detected!
lockless_ts=4784580465 locked_ts=4785033728
diff=453263ms worklist_empty=0
To avoid slowing down the hot path (queue_work, etc.), recheck
last_progress_ts with pool->lock held. This will eliminate the false
positive with minimal overhead.
Remove two extra empty lines in wq_watchdog_timer_fn() as we are on it.
Fixes: 82607adcf9cd ("workqueue: implement lockup detector")
Cc: stable@vger.kernel.org # v4.5+
Assisted-by: claude-code:claude-opus-4-6
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
show_cpu_pool_hog() and show_cpu_pools_hogs() no longer only dump CPU
hogs — since commit 8823eaef45da ("workqueue: Show all busy workers in
stall diagnostics"), they dump every in-flight worker in the pool's
busy_hash.
Rename them to show_cpu_pool_busy_workers() and
show_cpu_pools_busy_workers() to accurately describe what they do.
Also fix the pr_info() message to say "stalled worker pools" instead of
"stalled CPU-bound worker pools", since sleeping/blocked workers are now
included.
No functional change.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a sample module under samples/workqueue/stall_detector/ that
reproduces a workqueue stall caused by PF_WQ_WORKER misuse. The
module queues two work items on the same per-CPU pool, then clears
PF_WQ_WORKER and sleeps in wait_event_idle(), hiding from the
concurrency manager and stalling the second work item indefinitely.
This is useful for testing the workqueue watchdog stall diagnostics.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
show_cpu_pool_hog() only prints workers whose task is currently running
on the CPU (task_is_running()). This misses workers that are busy
processing a work item but are sleeping or blocked — for example, a
worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a
worker still occupies a pool slot and prevents progress, yet produces
an empty backtrace section in the watchdog output.
This is happening on real arm64 systems, where
toggle_allocation_gate() IPIs every single CPU in the machine (which
lacks NMI), causing workqueue stalls that show empty backtraces because
toggle_allocation_gate() is sleeping in wait_event_idle().
Remove the task_is_running() filter so every in-flight worker in the
pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
which is already held.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
When diagnosing workqueue stalls, knowing how long each in-flight work
item has been executing is valuable. Add a current_start timestamp
(jiffies) to struct worker, set it when a work item begins execution in
process_one_work(), and print the elapsed wall-clock time in show_pwq().
Unlike current_at (which tracks CPU runtime and resets on wakeup for
CPU-intensive detection), current_start is never reset because the
diagnostic cares about total wall-clock time including sleeps.
Before: in-flight: 165:stall_work_fn [wq_stall]
After: in-flight: 165:stall_work_fn [wq_stall] for 100s
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The watchdog_ts name doesn't convey what the timestamp actually tracks.
This field tracks the last time a workqueue got progress.
Rename it to last_progress_ts to make it clear that it records when the
pool last made forward progress (started processing new work items).
No functional change.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
pr_cont_worker_id() checks pool->flags against WQ_BH, which is a
workqueue-level flag (defined in workqueue.h). Pool flags use a
separate namespace with POOL_* constants (defined in workqueue.c).
The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined
as (1 << 0) so this has no behavioral impact, but it is semantically
wrong and inconsistent with every other pool-level BH check in the
file.
Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull module fixes from Sami Tolvanen:
- Fix a potential kernel panic in the module loader by adding a bounds
check for the ELF section index. This prevents crashes if attempting
to load a module that uses SHN_XINDEX or is corrupted.
- Fix the Kconfig menu layout for module versioning, signing, and
compression options so they correctly appear as submenus in
menuconfig.
- Remove a redundant lockdep_free_key_range() call in the load_module()
error path. This is already handled by module_deallocate() calling
free_mod_mem() since the module_memory rework.
* tag 'modules-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
module: Fix kernel panic when a symbol st_shndx is out of bounds
module: Fix the modversions and signing submenus
module: Remove duplicate freeing of lockdep classes
Pull vfs fixes from Christian Brauner:
- kthread: consolidate kthread exit paths to prevent use-after-free
- iomap:
- don't mark folio uptodate if read IO has bytes pending
- don't report direct-io retries to fserror
- reject delalloc mappings during writeback
- ns: tighten visibility checks
- netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict
sequence
* tag 'vfs-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: reject delalloc mappings during writeback
iomap: don't mark folio uptodate if read IO has bytes pending
selftests: fix mntns iteration selftests
nstree: tighten permission checks for listing
nsfs: tighten permission checks for handle opening
nsfs: tighten permission checks for ns iteration ioctls
netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence
kthread: consolidate kthread exit paths to prevent use-after-free
iomap: don't report direct-io retries to fserror
The module loader doesn't check for bounds of the ELF section index in
simplify_symbols():
for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
const char *name = info->strtab + sym[i].st_name;
switch (sym[i].st_shndx) {
case SHN_COMMON:
[...]
default:
/* Divert to percpu allocation if a percpu var. */
if (sym[i].st_shndx == info->index.pcpu)
secbase = (unsigned long)mod_percpu(mod);
else
/** HERE --> **/ secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
sym[i].st_value += secbase;
break;
}
}
A symbol with an out-of-bounds st_shndx value, for example 0xffff
(known as SHN_XINDEX or SHN_HIRESERVE), may cause a kernel panic:
BUG: unable to handle page fault for address: ...
RIP: 0010:simplify_symbols+0x2b2/0x480
...
Kernel panic - not syncing: Fatal exception
This can happen when module ELF is legitimately using SHN_XINDEX or
when it is corrupted.
Add a bounds check in simplify_symbols() to validate that st_shndx is
within the valid range before using it.
This issue was discovered due to a bug in llvm-objcopy, see relevant
discussion for details [1].
[1] https://lore.kernel.org/linux-modules/20251224005752.201911-1-ihor.solodrai@linux.dev/
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Pull sysctl fix from Joel Granados:
- Fix error when reporting jiffies converted values back to user space
Return the converted value instead of "Invalid argument" error
* tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZ
Filesystems should never provide a delayed allocation mapping to
writeback; they're supposed to allocate the space before replying.
This can lead to weird IO errors and crashes in the block layer if the
filesystem is being malicious, or if it hadn't set iomap->dev because
it's a delalloc mapping.
Fix this by failing writeback on delalloc mappings. Currently no
filesystems actually misbehave in this manner, but we ought to be
stricter about things like that.
Cc: stable@vger.kernel.org # v5.5
Fixes: 598ecfbaa742ac ("iomap: lift the xfs writeback code to iomap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20260302173002.GL13829@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The module Kconfig file contains a set of options related to "Module
versioning support" (depends on MODVERSIONS) and "Module signature
verification" (depends on MODULE_SIG). The Kconfig tool automatically
creates submenus when an entry for a symbol is followed by consecutive
items that all depend on the symbol. However, this functionality doesn't
work for the mentioned module options. The MODVERSIONS options are
interleaved with ASM_MODVERSIONS, which has no 'depends on MODVERSIONS' but
instead uses 'default HAVE_ASM_MODVERSIONS && MODVERSIONS'. Similarly, the
MODULE_SIG options are interleaved by a comment warning not to forget
signing modules with scripts/sign-file, which uses the condition 'depends
on MODULE_SIG_FORCE && !MODULE_SIG_ALL'.
The result is that the options are confusingly shown when using
a menuconfig tool, as follows:
[*] Module versioning support
Module versioning implementation (genksyms (from source code)) --->
[ ] Extended Module Versioning Support
[*] Basic Module Versioning Support
[*] Source checksum for all modules
[*] Module signature verification
[ ] Require modules to be validly signed
[ ] Automatically sign all modules
Hash algorithm to sign modules (SHA-256) --->
Fix the issue by using if/endif to group related options together in
kernel/module/Kconfig, similarly to how the MODULE_DEBUG options are
already grouped. Note that the signing-related options depend on
'MODULE_SIG || IMA_APPRAISE_MODSIG', with the exception of
MODULE_SIG_FORCE, which is valid only for MODULE_SIG and is therefore kept
separately. For consistency, do the same for the MODULE_COMPRESS entries.
The options are then properly placed into submenus, as follows:
[*] Module versioning support
Module versioning implementation (genksyms (from source code)) --->
[ ] Extended Module Versioning Support
[*] Basic Module Versioning Support
[*] Source checksum for all modules
[*] Module signature verification
[ ] Require modules to be validly signed
[ ] Automatically sign all modules
Hash algorithm to sign modules (SHA-256) --->
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Commit 2dc164a48e6fd ("sysctl: Create converter functions with two new
macros") incorrectly returns error to user space when jiffies sysctl
converter is used. The old overflow check got replaced with an
unconditional one:
+ if (USER_HZ < HZ)
+ return -EINVAL;
which will always be true on configurations with "USER_HZ < HZ".
Remove the check; it is no longer needed as clock_t_to_jiffies() returns
ULONG_MAX for the overflow case and proc_int_u2k_conv_uop() checks for
"> INT_MAX" after conversion
Fixes: 2dc164a48e6fd ("sysctl: Create converter functions with two new macros")
Reported-by: Colm Harrington <colm.harrington@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Joanne Koong <joannelkoong@gmail.com> says:
This is a fix for this scenario:
->read_folio() gets called on a folio size that is 16k while the file is 4k:
a) ifs->read_bytes_pending gets initialized to 16k
b) ->read_folio_range() is called for the 4k read
c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
0 to 4k range is marked uptodate
d) the post-eof blocks are zeroed and marked uptodate in the call to
iomap_set_range_uptodate()
e) iomap_set_range_uptodate() sees all the ranges are marked
uptodate and it marks the folio uptodate
f) iomap_read_end() gets called to subtract the 12k from
ifs->read_bytes_pending. it too sees all the ranges are marked
uptodate and marks the folio uptodate using XOR
g) the XOR call clears the uptodate flag on the folio
The same situation can occur if the last range read for the folio is done as
an inline read and all the previous ranges have already completed by the time
the inline read completes.
For more context, the full discussion can be found in [1]. There was a
discussion about alternative approaches in that thread, but they had more
complications.
There is another discussion in v1 [2] about consolidating the read paths.
Until that is resolved, this patch fixes the issue.
[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1Z9za5w4FoJqTGx50zR2haHHaoot1KJViQyEHJQq4=34w@mail.gmail.com/#t
[2] https://lore.kernel.org/linux-fsdevel/20260219003911.344478-1-joannelkoong@gmail.com/T/#u
* patches from https://patch.msgid.link/20260303233420.874231-1-joannelkoong@gmail.com:
iomap: don't mark folio uptodate if read IO has bytes pending
Link: https://patch.msgid.link/20260303233420.874231-1-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the error path of load_module(), under the free_module label, the
code calls lockdep_free_key_range() to release lock classes associated
with the MOD_DATA, MOD_RODATA and MOD_RO_AFTER_INIT module regions, and
subsequently invokes module_deallocate().
Since commit ac3b43283923 ("module: replace module_layout with
module_memory"), the module_deallocate() function calls free_mod_mem(),
which releases the lock classes as well and considers all module
regions.
Attempting to free these classes twice is unnecessary. Remove the
redundant code in load_module().
Fixes: ac3b43283923 ("module: replace module_layout with module_memory")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Pull pin control fixes from Linus Walleij:
"All of these are driver fixes except a memory leak in the
pinconf_generic_parse_dt_config() helper which is the most
important fix.
- Rename and fix up the Intel Equilibrium immutable interrupt chip
- Handle the Qualcomm QCS615 dual edge GPIO IRQ by adding the right
flag
- Fix a memory leak in the widely used pinconf_generic_parse_dt_config()
and a more local leak in aml_dt_node_to_map_pinmux()
- Fix double put in the Cirrus cs42l43_pin_probe()
- Staticize amdisp_pinctrl_ops, Qualcomm SDM660 groups and functions
- Unexport CIX sky1_pinctrl_pm_ops
- Fix configuration of deferred pin in the Rockchip driver
- Implement .get_direction() in the Sunxi driver squelching a dmesg
warning message
- Fix a readout of the last bank of registers in the Cypress CY8C95x0
driver"
* tag 'pinctrl-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: cy8c95x0: Don't miss reading the last bank registers
pinctrl: sunxi: Implement gpiochip::get_direction()
pinctrl: rockchip: Fix configuring a deferred pin
pinctrl: cirrus: cs42l43: Fix double-put in cs42l43_pin_probe()
pinctrl: meson: amlogic-a4: Fix device node reference leak in aml_dt_node_to_map_pinmux()
pinctrl: qcom: sdm660-lpass-lpi: Make groups and functions variables static
pinctrl: cix: sky1: Unexport sky1_pinctrl_pm_ops
pinctrl: amdisp: Make amdisp_pinctrl_ops variable static
pinctrl: pinconf-generic: Fix memory leak in pinconf_generic_parse_dt_config()
pinctrl: qcom: qcs615: Add missing dual edge GPIO IRQ errata flag
pinctrl: equilibrium: fix warning trace on load
pinctrl: equilibrium: rename irq_chip function callbacks
The ule_mandatory_ext_handlers[] and ule_optional_ext_handlers[] tables
in handle_one_ule_extension() are declared with 255 elements (valid
indices 0-254), but the index htype is derived from network-controlled
data as (ule_sndu_type & 0x00FF), giving a range of 0-255. When
htype equals 255, an out-of-bounds read occurs on the function pointer
table, and the OOB value may be called as a function pointer.
Add a bounds check on htype against the array size before either table
is accessed. Out-of-range values now cause the SNDU to be discarded.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Ariel Silver <arielsilver77@gmail.com>
Signed-off-by: Ariel Silver <arielsilver77@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Christian Brauner <brauner@kernel.org> says:
Listing various namespaces is currently only scoped on owning namespace.
We can make this more fine-grained so that we scope visibility even
tighter. To make it possible to change behavior restrict visibility for
now. This shouldn't be a big deal as there aren't actual large users out
there and paves the way to make this even cleaner in the future.
* patches from https://patch.msgid.link/20260226-work-visibility-fixes-v1-0-d2c2853313bd@kernel.org:
selftests: fix mntns iteration selftests
nstree: tighten permission checks for listing
nsfs: tighten permission checks for handle opening
nsfs: tighten permission checks for ns iteration ioctls
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-0-d2c2853313bd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
If a folio has ifs metadata attached to it and the folio is partially
read in through an async IO helper with the rest of it then being read
in through post-EOF zeroing or as inline data, and the helper
successfully finishes the read first, then post-EOF zeroing / reading
inline will mark the folio as uptodate in iomap_set_range_uptodate().
This is a problem because when the read completion path later calls
iomap_read_end(), it will call folio_end_read(), which sets the uptodate
bit using XOR semantics. Calling folio_end_read() on a folio that was
already marked uptodate clears the uptodate bit.
Fix this by not marking the folio as uptodate if the read IO has bytes
pending. The folio uptodate state will be set in the read completion
path through iomap_end_read() -> folio_end_read().
Reported-by: Wei Gao <wegao@suse.com>
Suggested-by: Sasha Levin <sashal@kernel.org>
Tested-by: Wei Gao <wegao@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: stable@vger.kernel.org # v6.19
Link: https://lore.kernel.org/linux-fsdevel/aYbmy8JdgXwsGaPP@autotest-wegao.qe.prg2.suse.org/
Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260303233420.874231-2-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull cgroup fixes from Tejun Heo:
- Fix circular locking dependency in cpuset partition code by
deferring housekeeping_update() calls to a workqueue instead
of calling them directly under cpus_read_lock
- Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when
generate_sched_domains() returns NULL due to kmalloc failure
- Fix incorrect cpuset behavior for effective_xcpus in
partition_xcpus_del() and cpuset_update_tasks_cpumask()
in update_cpumasks_hier()
- Fix race between task migration and cgroup iteration
* tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked
cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
cgroup: fix race between task migration and iteration
When code had been changed to use for_each_set_clump8(), it mistakenly
switched from chip->nport to chip->tpin since the cy8c9540 and cy8c9560
have a 4-pin gap. This, in particular, led to the missed read of
the last bank interrupt status register and hence missing interrupts
on those pins. Restore the upper limit in for_each_set_clump8() to take
into consideration that gap.
Fixes: 83e29a7a1fdf ("pinctrl: cy8c95x0; Switch to use for_each_set_clump8()")
Cc: stable@vger.kernel.org
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull kvm fixes from Paolo Bonzini:
"Arm:
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature
set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered... [his words, not mine]
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Generic:
- Remove internal Kconfigs that are now set on all architectures
- Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all
architectures finally enable it in Linux 7.0"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: always define KVM_CAP_SYNC_MMU
KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: arm64: Deduplicate ASID retrieval code
irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag
KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs
KVM: arm64: Fix protected mode handling of pages larger than 4kB
KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type
KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state()
KVM: arm64: Fix ID register initialization for non-protected pKVM guests
KVM: arm64: Optimise away S1POE handling when not supported by host
KVM: arm64: Hide S1POE from guests when not supported by the host
Fix netfslib such that when it's making an unbuffered or DIO write, to make
sure that it sends each subrequest strictly sequentially, waiting till the
previous one is 'committed' before sending the next so that we don't have
pieces landing out of order and potentially leaving a hole if an error
occurs (ENOSPC for example).
This is done by copying in just those bits of issuing, collecting and
retrying subrequests that are necessary to do one subrequest at a time.
Retrying, in particular, is simpler because if the current subrequest needs
retrying, the source iterator can just be copied again and the subrequest
prepped and issued again without needing to be concerned about whether it
needs merging with the previous or next in the sequence.
Note that the issuing loop waits for a subrequest to complete right after
issuing it, but this wait could be moved elsewhere allowing preparatory
steps to be performed whilst the subrequest is in progress. In particular,
once content encryption is available in netfslib, that could be done whilst
waiting, as could cleanup of buffers that have been completed.
Fixes: 153a9961b551 ("netfs: Implement unbuffered/DIO write support")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/58526.1772112753@warthog.procyon.org.uk
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that we changed permission checking make sure that we reflect that
in the selftests.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-4-d2c2853313bd@kernel.org
Fixes: 9d87b1067382 ("selftests: add tests for mntns iteration")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.14+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull fsverity fixes from Eric Biggers:
- Fix a build error on parisc
- Remove the non-large-folio-aware function fsverity_verify_page()
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: fix build error by adding fsverity_readahead() stub
fsverity: remove fsverity_verify_page()
f2fs: make f2fs_verify_cluster() partially large-folio-aware
f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
Pull sched_ext fixes from Tejun Heo:
- Fix starvation of scx_enable() under fair-class saturation by
offloading the enable path to an RT kthread
- Fix out-of-bounds access in idle mask initialization on systems with
non-contiguous NUMA node IDs
- Fix a preemption window during scheduler exit and a refcount
underflow in cgroup init error path
- Fix SCX_EFLAG_INITIALIZED being a no-op flag
- Add READ_ONCE() annotations for KCSAN-clean lockless accesses and
replace naked scx_root dereferences with container_of() in kobject
callbacks
- Tooling and selftest fixes: compilation issues with clang 17,
strtoul() misuse, unused options cleanup, and Kconfig sync
* tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix starvation of scx_enable() under fair-class saturation
sched_ext: Remove redundant css_put() in scx_cgroup_init()
selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17
selftests/sched_ext: Add -fms-extensions to bpf build flags
tools/sched_ext: Add -fms-extensions to bpf build flags
sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout
sched_ext: Replace naked scx_root dereferences in kobject callbacks
sched_ext: Use READ_ONCE() for the read side of dsq->nr update
tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq()
sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag
sched_ext: Fix out-of-bounds access in scx_idle_init_masks()
sched_ext: Disable preemption between scx_claim_exit() and kicking helper work
tools/sched_ext: Add Kconfig to sync with upstream
tools/sched_ext: Sync README.md Kconfig with upstream scx
selftests/sched_ext: Remove duplicated unistd.h include in rt_stall.c
tools/sched_ext: scx_sdt: Remove unused '-f' option
tools/sched_ext: scx_central: Remove unused '-p' option
selftests/sched_ext: Fix unused-result warning for read()
selftests/sched_ext: Abort test loop on signal
A null-pointer-dereference bug was reported by syzbot:
Oops: general protection fault, probably for address 0xdffffc0000000000:
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:bitmap_subset include/linux/bitmap.h:433 [inline]
RIP: 0010:cpumask_subset include/linux/cpumask.h:836 [inline]
RIP: 0010:rebuild_sched_domains_locked kernel/cgroup/cpuset.c:967
RSP: 0018:ffffc90003ecfbc0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000020
RDX: ffff888028de0000 RSI: ffffffff8200f003 RDI: ffffffff8df14f28
RBP: 0000000000000000 R08: 0000000000000cc0 R09: 00000000ffffffff
R10: ffffffff8e7d95b3 R11: 0000000000000001 R12: 0000000000000000
R13: 00000000000f4240 R14: dffffc0000000000 R15: 0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2f463fff CR3: 000000003704c000 CR4: 00000000003526f0
Call Trace:
<TASK>
rebuild_sched_domains_cpuslocked kernel/cgroup/cpuset.c:983 [inline]
rebuild_sched_domains+0x21/0x40 kernel/cgroup/cpuset.c:990
sched_rt_handler+0xb5/0xe0 kernel/sched/rt.c:2911
proc_sys_call_handler+0x47f/0x5a0 fs/proc/proc_sysctl.c:600
new_sync_write fs/read_write.c:595 [inline]
vfs_write+0x6ac/0x1070 fs/read_write.c:688
ksys_write+0x12a/0x250 fs/read_write.c:740
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The issue occurs when generate_sched_domains() returns ndoms = 1 and
doms = NULL due to a kmalloc failure. This leads to a null-pointer
dereference when accessing doms in rebuild_sched_domains_locked().
Fix this by adding a NULL check for doms before accessing it.
Fixes: 6ee43047e8ad ("cpuset: Remove unnecessary checks in rebuild_sched_domains_locked")
Reported-by: syzbot+460792609a79c085f79f@syzkaller.appspotmail.com
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After commit 471e998c0e31 ("gpiolib: remove redundant callback check"),
a warning will be printed if the gpio driver does not implement this
callback. The warning was added in commit e623c4303ed1 ("gpiolib:
sanitize the return value of gpio_chip::get_direction()"), but was
masked by the "redundant" check.
The warning can be triggered by any action that calls the callback,
such as dumping the GPIO state from /sys/kernel/debug/gpio.
Implement it for the sunxi driver. This is simply a matter of reading
out the mux value from the registers, then checking if it is one of
the GPIO functions and which direction it is.
Signed-off-by: Chen-Yu Tsai <wens@kernel.org>
Reviewed-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull debugobjects fix from Thomas Gleixner:
"A single fix for debugobjects.
The deferred page initialization prevents debug objects from
allocating slab pages until the initialization is complete. That
causes depletion of the pool and disabling of debugobjects.
The reason is that debugobjects uses __GFP_HIGH for allocations as it
might be invoked from arbitrary contexts. When PREEMPT_COUNT is
disabled there is no way to know whether the context is safe to set
__GFP_KSWAPD_RECLAIM.
This worked until v6.18. Since then allocations w/o a reclaim flag
cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(),
which returns early when deferred page initialization has not yet
completed.
Work around that when PREEMPT_COUNT is enabled as the preempt counter
allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when
the context is preemtible. When PREEMPT_COUNT is disabled the context
is unknown and the reclaim bit can't be set because the caller might
hold locks which might deadlock in the allocator.
That makes debugobjects depend on PREEMPT_COUNT ||
!DEFERRED_STRUCT_PAGE_INIT, which limits the coverage slightly, but
keeps it functional for most cases"
* tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobject: Make it work with deferred page initialization - again
KVM/arm64 fixes for 7.0, take #1
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered...
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Guillaume reported crashes via corrupted RCU callback function pointers
during KUnit testing. The crash was traced back to the pidfs rhashtable
conversion which replaced the 24-byte rb_node with an 8-byte rhash_head
in struct pid, shrinking it from 160 to 144 bytes.
struct kthread (without CONFIG_BLK_CGROUP) is also 144 bytes. With
CONFIG_SLAB_MERGE_DEFAULT and SLAB_HWCACHE_ALIGN both round up to
192 bytes and share the same slab cache. struct pid.rcu.func and
struct kthread.affinity_node both sit at offset 0x78.
When a kthread exits via make_task_dead() it bypasses kthread_exit() and
misses the affinity_node cleanup. free_kthread_struct() frees the memory
while the node is still linked into the global kthread_affinity_list. A
subsequent list_del() by another kthread writes through dangling list
pointers into the freed and reused memory, corrupting the pid's
rcu.func pointer.
Instead of patching free_kthread_struct() to handle the missed cleanup,
consolidate all kthread exit paths. Turn kthread_exit() into a macro
that calls do_exit() and add kthread_do_exit() which is called from
do_exit() for any task with PF_KTHREAD set. This guarantees that
kthread-specific cleanup always happens regardless of the exit path -
make_task_dead(), direct do_exit(), or kthread_exit().
Replace __to_kthread() with a new tsk_is_kthread() accessor in the
public header. Export do_exit() since module code using the
kthread_exit() macro now needs it directly.
Reported-by: Guillaume Tucker <gtucker@gtucker.io>
Tested-by: Guillaume Tucker <gtucker@gtucker.io>
Tested-by: Mark Brown <broonie@kernel.org>
Tested-by: David Gow <davidgow@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20260224-mittlerweile-besessen-2738831ae7f6@brauner
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 4d13f4304fa4 ("kthread: Implement preferred affinity")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-3-d2c2853313bd@kernel.org
Fixes: 76b6f5dfb3fd ("nstree: add listns()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.19+
Signed-off-by: Christian Brauner <brauner@kernel.org>
hppa-linux-gcc 9.5.0 generates a call to fsverity_readahead() in
f2fs_readahead() when CONFIG_FS_VERITY=n, because it fails to do the
expected dead code elimination based on vi always being NULL. Fix the
build error by adding an inline stub for fsverity_readahead(). Since
it's just for opportunistic readahead, just make it a no-op.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602180838.pwICdY2r-lkp@intel.com/
Fixes: 45dcb3ac9832 ("f2fs: consolidate fsverity_info lookup")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218012244.18536-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Pull btrfs fixes from David Sterba:
"One-liner or short fixes for minor/moderate problems reported recently:
- fixes or level adjustments of error messages
- fix leaked transaction handles after aborted transactions, when
using the remap tree feature
- fix a few leaked chunk maps after errors
- fix leaked page array in io_uring encoded read if an error occurs
and the 'finished' is not called
- fix double release of reserved extents when doing a range COW
- don't commit super block when the filesystem is in shutdown state
- fix squota accounting condition when checking members vs parent
usage
- other error handling fixes"
* tag 'for-7.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: check block group lookup in remove_range_from_remap_tree()
btrfs: fix transaction handle leaks in btrfs_last_identity_remap_gone()
btrfs: fix chunk map leak in btrfs_map_block() after btrfs_translate_remap()
btrfs: fix chunk map leak in btrfs_map_block() after btrfs_chunk_map_num_copies()
btrfs: fix compat mask in error messages in btrfs_check_features()
btrfs: print correct subvol num if active swapfile prevents deletion
btrfs: fix warning in scrub_verify_one_metadata()
btrfs: fix objectid value in error message in check_extent_data_ref()
btrfs: fix incorrect key offset in error message in check_dev_extent_item()
btrfs: fix error message order of parameters in btrfs_delete_delayed_dir_index()
btrfs: don't commit the super block when unmounting a shutdown filesystem
btrfs: free pages on error in btrfs_uring_read_extent()
btrfs: fix referenced/exclusive check in squota_check_parent_usage()
btrfs: remove pointless WARN_ON() in cache_save_setup()
btrfs: convert log messages to error level in btrfs_replay_log()
btrfs: remove btrfs_handle_fs_error() after failure to recover log trees
btrfs: remove redundant warning message in btrfs_check_uuid_tree()
btrfs: change warning messages to error level in open_ctree()
btrfs: fix a double release on reserved extents in cow_one_range()
btrfs: handle discard errors in in btrfs_finish_extent_commit()
During scx_enable(), the READY -> ENABLED task switching loop changes the
calling thread's sched_class from fair to ext. Since fair has higher
priority than ext, saturating fair-class workloads can indefinitely starve
the enable thread, hanging the system. This was introduced when the enable
path switched from preempt_disable() to scx_bypass() which doesn't protect
against fair-class starvation. Note that the original preempt_disable()
protection wasn't complete either - in partial switch modes, the calling
thread could still be starved after preempt_enable() as it may have been
switched to ext class.
Fix it by offloading the enable body to a dedicated system-wide RT
(SCHED_FIFO) kthread which cannot be starved by either fair or ext class
tasks. scx_enable() lazily creates the kthread on first use and passes the
ops pointer through a struct scx_enable_cmd containing the kthread_work,
then synchronously waits for completion.
The workfn runs on a different kthread from sch->helper (which runs
disable_work), so it can safely flush disable_work on the error path
without deadlock.
Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo <tj@kernel.org>
The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.
The housekeeping cpumask update requires flushing a number of different
workqueues which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquiring locks
which have locking dependency with cpus_read_lock() down the chain. Below
is an example of such circular locking problem.
======================================================
WARNING: possible circular locking dependency detected
6.18.0-test+ #2 Tainted: G S
------------------------------------------------------
test_cpuset_prs/10971 is trying to acquire lock:
ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
but task is already holding lock:
ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (cpuset_mutex){+.+.}-{4:4}:
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
-> #2 (rtnl_mutex){+.+.}-{4:4}:
-> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
-> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
Chain exists of:
(wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
5 locks held by test_cpuset_prs/10971:
#0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
#1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
#2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
#3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
#4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
Call Trace:
<TASK>
:
touch_wq_lockdep_map+0x93/0x180
__flush_workqueue+0x111/0x10b0
housekeeping_update+0x12d/0x2d0
update_parent_effective_cpumask+0x595/0x2440
update_prstate+0x89d/0xce0
cpuset_partition_write+0xc5/0x130
cgroup_file_write+0x1a5/0x680
kernfs_fop_write_iter+0x3df/0x5f0
vfs_write+0x525/0xfd0
ksys_write+0xf9/0x1d0
do_syscall_64+0x95/0x520
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.
So housekeeping_update() is now called after releasing cpus_read_lock
and cpuset_mutex at the end of a cpuset operation. These two locks are
then re-acquired later before calling rebuild_sched_domains_locked().
To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in production environment.
As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.
The lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.
Fixes: 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit e2c58cbe3aff ("pinctrl: rockchip: Simplify locking with
scoped_guard()") added a scoped_guard() over existing code containing a
"break" instruction. That "break" was for the outer (existing)
for-loop, which now exits inner, scoped_guard() loop. If GPIO driver
did not probe, then driver will not bail out, but instead continue to
configure the pin.
Fix the issue by simplifying the code - the break in original code was
leading directly to end of the function returning 0, thus we can simply
return here rockchip_pinconf_defer_pin status.
Reported-by: David Lechner <dlechner@baylibre.com>
Closes: https://lore.kernel.org/r/f5b38942-a584-4e78-a893-de4a219070b2@baylibre.com/
Fixes: e2c58cbe3aff ("pinctrl: rockchip: Simplify locking with scoped_guard()")
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull x86 fixes from Ingo Molnar:
- Fix speculative safety in fred_extint()
- Fix __WARN_printf() trap in early_fixup_exception()
- Fix clang-build boot bug for unusual alignments, triggered by
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y
- Replace the final few __ASSEMBLY__ stragglers that snuck in lately
into non-UAPI x86 headers and use __ASSEMBLER__ consistently (again)
* tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/headers: Replace __ASSEMBLY__ stragglers with __ASSEMBLER__
x86/cfi: Fix CFI rewrite for odd alignments
x86/bug: Handle __WARN_printf() trap in early_fixup_exception()
x86/fred: Correct speculative safety in fred_extint()
debugobjects uses __GFP_HIGH for allocations as it might be invoked
within locked regions. That worked perfectly fine until v6.18. It still
works correctly when deferred page initialization is disabled and works
by chance when no page allocation is required before deferred page
initialization has completed.
Since v6.18 allocations w/o a reclaim flag cause new_slab() to end up in
alloc_frozen_pages_nolock_noprof(), which returns early when deferred
page initialization has not yet completed. As the deferred page
initialization takes quite a while the debugobject pool is depleted and
debugobjects are disabled.
This can be worked around when PREEMPT_COUNT is enabled as that allows
debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when the context
is preemtible. When PREEMPT_COUNT is disabled the context is unknown and
the reclaim bit can't be set because the caller might hold locks which
might deadlock in the allocator.
In preemptible context the reclaim bit is harmless and not a performance
issue as that's usually invoked from slow path initialization context.
That makes debugobjects depend on PREEMPT_COUNT || !DEFERRED_STRUCT_PAGE_INIT.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/87pl6gznti.ffs@tglx
We currently have three versions of the ASID retrieval code, one
in the S1 walker, and two in the VNCR handling (although the last
two are limited to the EL2&0 translation regime).
Make this code common, and take this opportunity to also simplify
the code a bit while switching over to the TTBRx_EL1_ASID macro.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260225104718.14209-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
iomap's directio implementation has two magic errno codes that it uses
to signal callers -- ENOTBLK tells the filesystem that it should retry
a write with the pagecache; and EAGAIN tells the caller that pagecache
flushing or invalidation failed and that it should try again.
Neither of these indicate data loss, so let's not report them.
Fixes: a9d573ee88af98 ("iomap: report file I/O errors to the VFS")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20260224154637.GD2390381@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-2-d2c2853313bd@kernel.org
Fixes: 5222470b2fbb ("nsfs: support file handles")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.18+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stephen retired and stepped back from -next maintainership, update his
entry in CREDITS to recognise his 18 years of hard work making it what
it is today and all the impact it's had on our development process.
Also update to his current GnuPG key while we're here.
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I finally got a big endian PPC64 kernel to boot in QEMU. The PPC64 VSX
optimized AES library code does work in that case, with the exception of
rndkey_from_vsx() which doesn't take into account that the order in
which the VSX code stores the round key words depends on the endianness.
So fix rndkey_from_vsx() to do the right thing on big endian CPUs.
Fixes: 7cf2082e74ce ("lib/crypto: powerpc/aes: Migrate POWER8 optimized code into library")
Link: https://lore.kernel.org/r/20260216022104.332991-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
If a 'const struct foo __user *ptr' is used for the address passed to
scoped_user_read_access() then you get a warning/error
uaccess.h:691:1: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
for the
void __user *_tmpptr = __scoped_user_access_begin(mode, uptr, size, elbl)
assignment.
Fix by using 'auto' for both _tmpptr and the redeclaration of uptr.
Replace the CLASS() with explicit __cleanup() functions on uptr.
Fixes: e497310b4ffb ("uaccess: Provide scoped user access regions")
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-and-tested-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a check in remove_range_from_remap_tree() after we call
btrfs_lookup_block_group(), to check if it is NULL. This shouldn't
happen, but if it does we at least get an error rather than a segfault.
Reported-by: Chris Mason <clm@fb.com>
Link: https://lore.kernel.org/linux-btrfs/20260125125129.2245240-1-clm@meta.com/
Fixes: 979e1dc3d69e ("btrfs: handle deletions from remapped block group")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The iterator css_for_each_descendant_pre() walks the cgroup hierarchy
under cgroup_lock(). It does not increment the reference counts on
yielded css structs.
According to the cgroup documentation, css_put() should only be used
to release a reference obtained via css_get() or css_tryget_online().
Since the iterator does not use either of these to acquire a reference,
calling css_put() in the error path of scx_cgroup_init() causes a
refcount underflow.
Remove the unbalanced css_put() to prevent a potential Use-After-Free
(UAF) vulnerability.
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
for instance, when an isolated partition is invalidated because its
last active CPU has been put offline.
As we are going to enable dynamic update to the nozh_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause deadlock. So we
have to defer any call to housekeeping_update() after the CPU hotplug
operation has finished. This is now done via the workqueue where
the update_hk_sched_domains() function will be invoked via the
hk_sd_workfn().
An concurrent cpuset control file write may have executed the required
update_hk_sched_domains() function before the work function is called. So
the work function call may become a no-op when it is invoked.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
devm_add_action_or_reset() already invokes the action on failure,
so the explicit put causes a double-put.
Fixes: 9b07cdf86a0b ("pinctrl: cirrus: Fix fwnode leak in cs42l43_pin_probe()")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull timer fix from Ingo Molnar:
"Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for
the common HZ=100, 250 or 1000 cases. Only use a function call for odd
HZ values like HZ=300 that generate more code.
The function call overhead showed up in performance tests of the TCP
code"
* tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
After converting the __ASSEMBLY__ statements to __ASSEMBLER__ in
commit 24a295e4ef1ca ("x86/headers: Replace __ASSEMBLY__ with
__ASSEMBLER__ in non-UAPI headers"), some new code has been
added that uses __ASSEMBLY__ again. Convert these stragglers, too.
This is a mechanical patch, done with a simple "sed -i" command.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251218182029.166993-1-thuth@redhat.com