Linux kernel ============ The Linux kernel is the core of any Linux operating system. It manages hardware, system resources, and provides the fundamental services for all other software. Quick Start ----------- * Report a bug: See Documentation/admin-guide/reporting-issues.rst * Get the latest kernel: https://kernel.org * Build the kernel: See Documentation/admin-guide/quickly-build-trimmed-linux.rst * Join the community: https://lore.kernel.org/ Essential Documentation ----------------------- All users should be familiar with: * Building requirements: Documentation/process/changes.rst * Code of Conduct: Documentation/process/code-of-conduct.rst * License: See COPYING Documentation can be built with make htmldocs or viewed online at: https://www.kernel.org/doc/html/latest/ Who Are You? ============ Find your role below: * New Kernel Developer - Getting started with kernel development * Academic Researcher - Studying kernel internals and architecture * Security Expert - Hardening and vulnerability analysis * Backport/Maintenance Engineer - Maintaining stable kernels * System Administrator - Configuring and troubleshooting * Maintainer - Leading subsystems and reviewing patches * Hardware Vendor - Writing drivers for new hardware * Distribution Maintainer - Packaging kernels for distros * AI Coding Assistant - LLMs and AI-powered development tools For Specific Users ================== New Kernel Developer -------------------- Welcome! Start your kernel development journey here: * Getting Started: Documentation/process/development-process.rst * Your First Patch: Documentation/process/submitting-patches.rst * Coding Style: Documentation/process/coding-style.rst * Build System: Documentation/kbuild/index.rst * Development Tools: Documentation/dev-tools/index.rst * Kernel Hacking Guide: Documentation/kernel-hacking/hacking.rst * Core APIs: Documentation/core-api/index.rst Academic Researcher ------------------- Explore the kernel's architecture and internals: * Researcher Guidelines: Documentation/process/researcher-guidelines.rst * Memory Management: Documentation/mm/index.rst * Scheduler: Documentation/scheduler/index.rst * Networking Stack: Documentation/networking/index.rst * Filesystems: Documentation/filesystems/index.rst * RCU (Read-Copy Update): Documentation/RCU/index.rst * Locking Primitives: Documentation/locking/index.rst * Power Management: Documentation/power/index.rst Security Expert --------------- Security documentation and hardening guides: * Security Documentation: Documentation/security/index.rst * LSM Development: Documentation/security/lsm-development.rst * Self Protection: Documentation/security/self-protection.rst * Reporting Vulnerabilities: Documentation/process/security-bugs.rst * CVE Procedures: Documentation/process/cve.rst * Embargoed Hardware Issues: Documentation/process/embargoed-hardware-issues.rst * Security Features: Documentation/userspace-api/seccomp_filter.rst Backport/Maintenance Engineer ----------------------------- Maintain and stabilize kernel versions: * Stable Kernel Rules: Documentation/process/stable-kernel-rules.rst * Backporting Guide: Documentation/process/backporting.rst * Applying Patches: Documentation/process/applying-patches.rst * Subsystem Profile: Documentation/maintainer/maintainer-entry-profile.rst * Git for Maintainers: Documentation/maintainer/configure-git.rst System Administrator -------------------- Configure, tune, and troubleshoot Linux systems: * Admin Guide: Documentation/admin-guide/index.rst * Kernel Parameters: Documentation/admin-guide/kernel-parameters.rst * Sysctl Tuning: Documentation/admin-guide/sysctl/index.rst * Tracing/Debugging: Documentation/trace/index.rst * Performance Security: Documentation/admin-guide/perf-security.rst * Hardware Monitoring: Documentation/hwmon/index.rst Maintainer ---------- Lead kernel subsystems and manage contributions: * Maintainer Handbook: Documentation/maintainer/index.rst * Pull Requests: Documentation/maintainer/pull-requests.rst * Managing Patches: Documentation/maintainer/modifying-patches.rst * Rebasing and Merging: Documentation/maintainer/rebasing-and-merging.rst * Development Process: Documentation/process/maintainer-handbooks.rst * Maintainer Entry Profile: Documentation/maintainer/maintainer-entry-profile.rst * Git Configuration: Documentation/maintainer/configure-git.rst Hardware Vendor --------------- Write drivers and support new hardware: * Driver API Guide: Documentation/driver-api/index.rst * Driver Model: Documentation/driver-api/driver-model/driver.rst * Device Drivers: Documentation/driver-api/infrastructure.rst * Bus Types: Documentation/driver-api/driver-model/bus.rst * Device Tree Bindings: Documentation/devicetree/bindings/ * Power Management: Documentation/driver-api/pm/index.rst * DMA API: Documentation/core-api/dma-api.rst Distribution Maintainer ----------------------- Package and distribute the kernel: * Stable Kernel Rules: Documentation/process/stable-kernel-rules.rst * ABI Documentation: Documentation/ABI/README * Kernel Configuration: Documentation/kbuild/kconfig.rst * Module Signing: Documentation/admin-guide/module-signing.rst * Kernel Parameters: Documentation/admin-guide/kernel-parameters.rst * Tainted Kernels: Documentation/admin-guide/tainted-kernels.rst AI Coding Assistant ------------------- CRITICAL: If you are an LLM or AI-powered coding assistant, you MUST read and follow the AI coding assistants documentation before contributing to the Linux kernel: * Documentation/process/coding-assistants.rst This documentation contains essential requirements about licensing, attribution, and the Developer Certificate of Origin that all AI tools must comply with. Communication and Support ========================= * Mailing Lists: https://lore.kernel.org/ * IRC: #kernelnewbies on irc.oftc.net * Bugzilla: https://bugzilla.kernel.org/ * MAINTAINERS file: Lists subsystem maintainers and mailing lists * Email Clients: Documentation/process/email-clients.rst
Clone this repository
For self-hosted knots, clone URLs may differ based on your setup.
Download tar.gz
If a folio has ifs metadata attached to it and the folio is partially
read in through an async IO helper with the rest of it then being read
in through post-EOF zeroing or as inline data, and the helper
successfully finishes the read first, then post-EOF zeroing / reading
inline will mark the folio as uptodate in iomap_set_range_uptodate().
This is a problem because when the read completion path later calls
iomap_read_end(), it will call folio_end_read(), which sets the uptodate
bit using XOR semantics. Calling folio_end_read() on a folio that was
already marked uptodate clears the uptodate bit.
Fix this by not marking the folio as uptodate if the read IO has bytes
pending. The folio uptodate state will be set in the read completion
path through iomap_end_read() -> folio_end_read().
Reported-by: Wei Gao <wegao@suse.com>
Suggested-by: Sasha Levin <sashal@kernel.org>
Tested-by: Wei Gao <wegao@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: stable@vger.kernel.org # v6.19
Link: https://lore.kernel.org/linux-fsdevel/aYbmy8JdgXwsGaPP@autotest-wegao.qe.prg2.suse.org/
Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260303233420.874231-2-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
Listing various namespaces is currently only scoped on owning namespace.
We can make this more fine-grained so that we scope visibility even
tighter. To make it possible to change behavior restrict visibility for
now. This shouldn't be a big deal as there aren't actual large users out
there and paves the way to make this even cleaner in the future.
* patches from https://patch.msgid.link/20260226-work-visibility-fixes-v1-0-d2c2853313bd@kernel.org:
selftests: fix mntns iteration selftests
nstree: tighten permission checks for listing
nsfs: tighten permission checks for handle opening
nsfs: tighten permission checks for ns iteration ioctls
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-0-d2c2853313bd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Fix netfslib such that when it's making an unbuffered or DIO write, to make
sure that it sends each subrequest strictly sequentially, waiting till the
previous one is 'committed' before sending the next so that we don't have
pieces landing out of order and potentially leaving a hole if an error
occurs (ENOSPC for example).
This is done by copying in just those bits of issuing, collecting and
retrying subrequests that are necessary to do one subrequest at a time.
Retrying, in particular, is simpler because if the current subrequest needs
retrying, the source iterator can just be copied again and the subrequest
prepped and issued again without needing to be concerned about whether it
needs merging with the previous or next in the sequence.
Note that the issuing loop waits for a subrequest to complete right after
issuing it, but this wait could be moved elsewhere allowing preparatory
steps to be performed whilst the subrequest is in progress. In particular,
once content encryption is available in netfslib, that could be done whilst
waiting, as could cleanup of buffers that have been completed.
Fixes: 153a9961b551 ("netfs: Implement unbuffered/DIO write support")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/58526.1772112753@warthog.procyon.org.uk
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that we changed permission checking make sure that we reflect that
in the selftests.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-4-d2c2853313bd@kernel.org
Fixes: 9d87b1067382 ("selftests: add tests for mntns iteration")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.14+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Guillaume reported crashes via corrupted RCU callback function pointers
during KUnit testing. The crash was traced back to the pidfs rhashtable
conversion which replaced the 24-byte rb_node with an 8-byte rhash_head
in struct pid, shrinking it from 160 to 144 bytes.
struct kthread (without CONFIG_BLK_CGROUP) is also 144 bytes. With
CONFIG_SLAB_MERGE_DEFAULT and SLAB_HWCACHE_ALIGN both round up to
192 bytes and share the same slab cache. struct pid.rcu.func and
struct kthread.affinity_node both sit at offset 0x78.
When a kthread exits via make_task_dead() it bypasses kthread_exit() and
misses the affinity_node cleanup. free_kthread_struct() frees the memory
while the node is still linked into the global kthread_affinity_list. A
subsequent list_del() by another kthread writes through dangling list
pointers into the freed and reused memory, corrupting the pid's
rcu.func pointer.
Instead of patching free_kthread_struct() to handle the missed cleanup,
consolidate all kthread exit paths. Turn kthread_exit() into a macro
that calls do_exit() and add kthread_do_exit() which is called from
do_exit() for any task with PF_KTHREAD set. This guarantees that
kthread-specific cleanup always happens regardless of the exit path -
make_task_dead(), direct do_exit(), or kthread_exit().
Replace __to_kthread() with a new tsk_is_kthread() accessor in the
public header. Export do_exit() since module code using the
kthread_exit() macro now needs it directly.
Reported-by: Guillaume Tucker <gtucker@gtucker.io>
Tested-by: Guillaume Tucker <gtucker@gtucker.io>
Tested-by: Mark Brown <broonie@kernel.org>
Tested-by: David Gow <davidgow@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20260224-mittlerweile-besessen-2738831ae7f6@brauner
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 4d13f4304fa4 ("kthread: Implement preferred affinity")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-3-d2c2853313bd@kernel.org
Fixes: 76b6f5dfb3fd ("nstree: add listns()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.19+
Signed-off-by: Christian Brauner <brauner@kernel.org>
iomap's directio implementation has two magic errno codes that it uses
to signal callers -- ENOTBLK tells the filesystem that it should retry
a write with the pagecache; and EAGAIN tells the caller that pagecache
flushing or invalidation failed and that it should try again.
Neither of these indicate data loss, so let's not report them.
Fixes: a9d573ee88af98 ("iomap: report file I/O errors to the VFS")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20260224154637.GD2390381@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-2-d2c2853313bd@kernel.org
Fixes: 5222470b2fbb ("nsfs: support file handles")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.18+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull ata fixes from Niklas Cassel:
- The newly introduced feature that issues a deferred (non-NCQ) command
from a workqueue, forgot to consider the case where the deferred QC
times out. Fix the code to take timeouts into consideration, which
avoids a use after free (Damien)
- The newly introduced feature that issues a deferred (non-NCQ) command
from a workqueue, when unloading the module, calls cancel_work_sync(),
a function that can sleep, while holding a spin lock. Move the function
call outside the lock (Damien)
* tag 'ata-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: fix cancellation of a port deferred qc work
ata: libata-eh: correctly handle deferred qc timeouts
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-1-d2c2853313bd@kernel.org
Fixes: a1d220d9dafa ("nsfs: iterate through mount namespaces")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.12+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull vfs fixes from Christian Brauner:
- Fix an uninitialized variable in file_getattr().
The flags_valid field wasn't initialized before calling
vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse
- Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is
not enabled.
sysctl_hung_task_timeout_secs is 0 in that case causing spurious
"waiting for writeback completion for more than 1 seconds" warnings
- Fix a null-ptr-deref in do_statmount() when the mount is internal
- Add missing kernel-doc description for the @private parameter in
iomap_readahead()
- Fix mount namespace creation to hold namespace_sem across the mount
copy in create_new_namespace().
The previous drop-and-reacquire pattern was fragile and failed to
clean up mount propagation links if the real rootfs was a shared or
dependent mount
- Fix /proc mount iteration where m->index wasn't updated when
m->show() overflows, causing a restart to repeatedly show the same
mount entry in a rapidly expanding mount table
- Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the
inode number is out of range
- Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared.
copy_mnt_ns() received the live fs_struct so if a subsequent
namespace creation failed the rollback would leave pwd and root
pointing to detached mounts. Always allocate a new fs_struct when
CLONE_NEWNS is requested
- fserror bug fixes:
- Remove the unused fsnotify_sb_error() helper now that all callers
have been converted to fserror_report_metadata
- Fix a lockdep splat in fserror_report() where igrab() takes
inode::i_lock which can be held in IRQ context.
Replace igrab() with a direct i_count bump since filesystems
should not report inodes that are about to be freed or not yet
exposed
- Handle error pointer in procfs for try_lookup_noperm()
- Fix an integer overflow in ep_loop_check_proc() where recursive calls
returning INT_MAX would overflow when +1 is added, breaking the
recursion depth check
- Fix a misleading break in pidfs
* tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: avoid misleading break
eventpoll: Fix integer overflow in ep_loop_check_proc()
proc: Fix pointer error dereference
fserror: fix lockdep complaint when igrabbing inode
fsnotify: drop unused helper
unshare: fix unshare_fs() handling
minix: Correct errno in minix_new_inode
namespace: fix proc mount iteration
mount: hold namespace_sem across copy in create_new_namespace()
iomap: Describe @private in iomap_readahead()
statmount: Fix the null-ptr-deref in do_statmount()
writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK
fs: init flags_valid before calling vfs_fileattr_get
cancel_work_sync() is a sleeping function so it cannot be called with
the spin lock of a port being held. Move the call to this function in
ata_port_detach() after EH completes, with the port lock released,
together with other work cancellation calls.
Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
dvb_dvr_open() calls dvb_ringbuffer_init() when a new reader opens the
DVR device. dvb_ringbuffer_init() calls init_waitqueue_head(), which
reinitializes the waitqueue list head to empty.
Since dmxdev->dvr_buffer.queue is a shared waitqueue (all opens of the
same DVR device share it), this orphans any existing waitqueue entries
from io_uring poll or epoll, leaving them with stale prev/next pointers
while the list head is reset to {self, self}.
The waitqueue and spinlock in dvr_buffer are already properly
initialized once in dvb_dmxdev_init(). The open path only needs to
reset the buffer data pointer, size, and read/write positions.
Replace the dvb_ringbuffer_init() call in dvb_dvr_open() with direct
assignment of data/size and a call to dvb_ringbuffer_reset(), which
properly resets pread, pwrite, and error with correct memory ordering
without touching the waitqueue or spinlock.
Cc: stable@vger.kernel.org
Fixes: 34731df288a5f ("V4L/DVB (3501): Dmxdev: use dvb_ringbuffer")
Reported-by: syzbot+ab12f0c08dd7ab8d057c@syzkaller.appspotmail.com
Tested-by: syzbot+ab12f0c08dd7ab8d057c@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/698a26d3.050a0220.3b3015.007d.GAE@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The break would only break out of the scoped_guard() loop, not the
switch statement. It still works correct as is ofc but let's avoid the
confusion.
Reported-by: David Lechner <dlechner@baylibre.com>
Link:: https://lore.kernel.org/cd2153f1-098b-463c-bbc1-5c6ca9ef1f12@baylibre.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
A deferred qc may timeout while waiting for the device queue to drain
to be submitted. In such case, since the qc is not active,
ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
which frees the qc. But as the port deferred_qc field still references
this finished/freed qc, the deferred qc work may eventually attempt to
call ata_qc_issue() against this invalid qc, leading to errors such as
reported by UBSAN (syzbot run):
UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
__ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
process_scheduled_works kernel/workqueue.c:3358 [inline]
worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
kthread+0x370/0x450 kernel/kthread.c:467
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Fix this by checking if the qc of a timed out SCSI command is a deferred
one, and in such case, clear the port deferred_qc field and finish the
SCSI command with DID_TIME_OUT.
Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
This config option goes way back - it used to be an internal debug
option to random.c (at that point called DEBUG_RANDOM_BOOT), then was
renamed and exposed as a config option as CONFIG_WARN_UNSEEDED_RANDOM,
and then further renamed to the current CONFIG_WARN_ALL_UNSEEDED_RANDOM.
It was all done with the best of intentions: the more limited
rate-limited reports were reporting some cases, but if you wanted to see
all the gory details, you'd enable this "ALL" option.
However, it turns out - perhaps not surprisingly - that when people
don't care about and fix the first rate-limited cases, they most
certainly don't care about any others either, and so warning about all
of them isn't actually helping anything.
And the non-ratelimited reporting causes problems, where well-meaning
people enable debug options, but the excessive flood of messages that
nobody cares about will hide actual real information when things go
wrong.
I just got a kernel bug report (which had nothing to do with randomness)
where two thirds of the the truncated dmesg was just variations of
random: get_random_u32 called from __get_random_u32_below+0x10/0x70 with crng_init=0
and in the process early boot messages had been lost (in addition to
making the messages that _hadn't_ been lost harder to read).
The proper way to find these things for the hypothetical developer that
cares - if such a person exists - is almost certainly with boot time
tracing. That gives you the option to get call graphs etc too, which is
likely a requirement for fixing any problems anyway.
See Documentation/trace/boottime-trace.rst for that option.
And if we for some reason do want to re-introduce actual printing of
these things, it will need to have some uniqueness filtering rather than
this "just print it all" model.
Fixes: cc1e127bfa95 ("random: remove ratelimiting for in-kernel unseeded randomness")
Acked-by: Jason Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>