Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

slab: update overview comments

The changes related to sheaves made the description of locking and other
details outdated. Update it to reflect the current state.

Also add a new copyright line due to the major changes.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

+68 -75
mm/slub.c
@@ -1,13 +1,15 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * SLUB: A slab allocator that limits cache line use instead of queuing
- * objects in per cpu and per node lists.
+ * SLUB: A slab allocator with low overhead percpu array caches and mostly
+ * lockless freeing of objects to slabs in the slowpath.
  *
- * The allocator synchronizes using per slab locks or atomic operations
- * and only uses a centralized lock to manage a pool of partial slabs.
+ * The allocator synchronizes using spin_trylock for percpu arrays in the
+ * fastpath, and cmpxchg_double (or bit spinlock) for slowpath freeing.
+ * Uses a centralized lock to manage a pool of partial slabs.
  *
  * (C) 2007 SGI, Christoph Lameter
  * (C) 2011 Linux Foundation, Christoph Lameter
+ * (C) 2025 SUSE, Vlastimil Babka
  */
 
 #include <linux/mm.h>
@@ -55,11 +53,13 @@
 
 /*
  * Lock order:
- *   1. slab_mutex (Global Mutex)
- *   2. node->list_lock (Spinlock)
- *   3. kmem_cache->cpu_slab->lock (Local lock)
- *   4. slab_lock(slab) (Only on some arches)
- *   5. object_map_lock (Only for debugging)
+ *   0. cpu_hotplug_lock
+ *   1. slab_mutex (Global Mutex)
+ *   2a. kmem_cache->cpu_sheaves->lock (Local trylock)
+ *   2b. node->barn->lock (Spinlock)
+ *   2c. node->list_lock (Spinlock)
+ *   3. slab_lock(slab) (Only on some arches)
+ *   4. object_map_lock (Only for debugging)
  *
  *   slab_mutex
  *
@@ -82,31 +78,38 @@
  * C. slab->objects	-> Number of objects in slab
  * D. slab->frozen	-> frozen state
  *
+ * SL_partial slabs
+ *
+ * Slabs on node partial list have at least one free object. A limited number
+ * of slabs on the list can be fully free (slab->inuse == 0), until we start
+ * discarding them. These slabs are marked with SL_partial, and the flag is
+ * cleared while removing them, usually to grab their freelist afterwards.
+ * This clearing also exempts them from list management. Please see
+ * __slab_free() for more details.
+ *
+ * Full slabs
+ *
+ * For caches without debugging enabled, full slabs (slab->inuse ==
+ * slab->objects and slab->freelist == NULL) are not placed on any list.
+ * The __slab_free() freeing the first object from such a slab will place
+ * it on the partial list. Caches with debugging enabled place such slab
+ * on the full list and use different allocation and freeing paths.
+ *
  * Frozen slabs
  *
- * If a slab is frozen then it is exempt from list management. It is
- * the cpu slab which is actively allocated from by the processor that
- * froze it and it is not on any list. The processor that froze the
- * slab is the one who can perform list operations on the slab. Other
- * processors may put objects onto the freelist but the processor that
- * froze the slab is the only one that can retrieve the objects from the
- * slab's freelist.
- *
- * CPU partial slabs
- *
- * The partially empty slabs cached on the CPU partial list are used
- * for performance reasons, which speeds up the allocation process.
- * These slabs are not frozen, but are also exempt from list management,
- * by clearing the SL_partial flag when moving out of the node
- * partial list. Please see __slab_free() for more details.
+ * If a slab is frozen then it is exempt from list management. It is used to
+ * indicate a slab that has failed consistency checks and thus cannot be
+ * allocated from anymore - it is also marked as full. Any previously
+ * allocated objects will be simply leaked upon freeing instead of attempting
+ * to modify the potentially corrupted freelist and metadata.
  *
  * To sum up, the current scheme is:
- * - node partial slab: SL_partial && !frozen
- * - cpu partial slab: !SL_partial && !frozen
- * - cpu slab: !SL_partial && frozen
- * - full slab: !SL_partial && !frozen
+ * - node partial slab: SL_partial && !full && !frozen
+ * - taken off partial list: !SL_partial && !full && !frozen
+ * - full slab, not on any list: !SL_partial && full && !frozen
+ * - frozen due to inconsistency: !SL_partial && full && frozen
  *
- * list_lock
+ * node->list_lock (spinlock)
  *
  * The list_lock protects the partial and full list on each node and
  * the partial slab counter. If taken then no new slabs may be added or
@@ -123,47 +112,46 @@
  *
  * The list_lock is a centralized lock and thus we avoid taking it as
  * much as possible. As long as SLUB does not have to handle partial
- * slabs, operations can continue without any centralized lock. F.e.
- * allocating a long series of objects that fill up slabs does not require
- * the list lock.
+ * slabs, operations can continue without any centralized lock.
  *
  * For debug caches, all allocations are forced to go through a list_lock
  * protected region to serialize against concurrent validation.
  *
- * cpu_slab->lock local lock
+ * cpu_sheaves->lock (local_trylock)
  *
- * This locks protect slowpath manipulation of all kmem_cache_cpu fields
- * except the stat counters. This is a percpu structure manipulated only by
- * the local cpu, so the lock protects against being preempted or interrupted
- * by an irq. Fast path operations rely on lockless operations instead.
+ * This lock protects fastpath operations on the percpu sheaves. On !RT it
+ * only disables preemption and does no atomic operations. As long as the main
+ * or spare sheaf can handle the allocation or free, there is no other
+ * overhead.
  *
- * On PREEMPT_RT, the local lock neither disables interrupts nor preemption
- * which means the lockless fastpath cannot be used as it might interfere with
- * an in-progress slow path operations. In this case the local lock is always
- * taken but it still utilizes the freelist for the common operations.
+ * node->barn->lock (spinlock)
  *
- * lockless fastpaths
+ * This lock protects the operations on per-NUMA-node barn. It can quickly
+ * serve an empty or full sheaf if available, and avoid more expensive refill
+ * or flush operation.
  *
- * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
- * are fully lockless when satisfied from the percpu slab (and when
- * cmpxchg_double is possible to use, otherwise slab_lock is taken).
- * They also don't disable preemption or migration or irqs. They rely on
- * the transaction id (tid) field to detect being preempted or moved to
- * another cpu.
+ * Lockless freeing
+ *
+ * Objects may have to be freed to their slabs when they are from a remote
+ * node (where we want to avoid filling local sheaves with remote objects)
+ * or when there are too many full sheaves. On architectures supporting
+ * cmpxchg_double this is done by a lockless update of slab's freelist and
+ * counters, otherwise slab_lock is taken. This only needs to take the
+ * list_lock if it's a first free to a full slab, or when a slab becomes empty
+ * after the free.
  *
  * irq, preemption, migration considerations
  *
- * Interrupts are disabled as part of list_lock or local_lock operations, or
+ * Interrupts are disabled as part of list_lock or barn lock operations, or
  * around the slab_lock operation, in order to make the slab allocator safe
  * to use in the context of an irq.
+ * Preemption is disabled as part of local_trylock operations.
+ * kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
+ * their limitations.
  *
- * In addition, preemption (or migration on PREEMPT_RT) is disabled in the
- * allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
- * local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
- * doesn't have to be revalidated in each section protected by the local lock.
- *
- * SLUB assigns one slab for allocation to each processor.
- * Allocations only occur from these slabs called cpu slabs.
+ * SLUB assigns two object arrays called sheaves for caching allocations and
+ * frees on each cpu, with a NUMA node shared barn for balancing between cpus.
+ * Allocations and frees are primarily served from these sheaves.
  *
  * Slabs with free elements are kept on a partial list and during regular
  * operations no list for full slabs is used. If an object in a full slab is
@@ -170,25 +160,8 @@
  * We track full slabs for debugging purposes though because otherwise we
  * cannot scan all objects.
  *
- * Slabs are freed when they become empty. Teardown and setup is
- * minimal so we rely on the page allocators per cpu caches for
- * fast frees and allocs.
- *
- * slab->frozen		The slab is frozen and exempt from list processing.
- *			This means that the slab is dedicated to a purpose
- *			such as satisfying allocations for a specific
- *			processor. Objects may be freed in the slab while
- *			it is frozen but slab_free will then skip the usual
- *			list operations. It is up to the processor holding
- *			the slab to integrate the slab into the slab lists
- *			when the slab is no longer needed.
- *
- *			One use of this flag is to mark slabs that are
- *			used for allocations. Then such a slab becomes a cpu
- *			slab. The cpu slab may be equipped with an additional
- *			freelist that allows lockless access to
- *			free objects in addition to the regular freelist
- *			that requires the slab lock.
+ * Slabs are freed when they become empty. Teardown and setup is minimal so we
+ * rely on the page allocators per cpu caches for fast frees and allocs.
  *
  * SLAB_DEBUG_FLAGS	Slab requires special handling due to debug
  *			options set. This moves slab handling out of
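
To make the four-state summary in the updated comment concrete, the scheme can
be read as predicates over three bits. The struct and helpers below are a
minimal illustrative sketch, not mm/slub.c's actual representation (the kernel
derives these states from slab flags and the inuse/objects counters):

#include <stdbool.h>

/* Hypothetical encoding of the slab states summarized in the comment. */
struct toy_slab_state {
	bool sl_partial;	/* on the node partial list */
	bool full;		/* inuse == objects && freelist == NULL */
	bool frozen;		/* failed consistency checks */
};

/* node partial slab: has free objects, available for taking off the list */
static bool toy_on_partial_list(struct toy_slab_state s)
{
	return s.sl_partial && !s.full && !s.frozen;
}

/* full slab: not on any list until the first object is freed back to it */
static bool toy_unlisted_full(struct toy_slab_state s)
{
	return !s.sl_partial && s.full && !s.frozen;
}

/* frozen slab: failed consistency checks, frees to it are simply leaked */
static bool toy_frozen_inconsistent(struct toy_slab_state s)
{
	return !s.sl_partial && s.full && s.frozen;
}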
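
The cpu_sheaves->lock paragraph describes a trylock-or-fallback pattern: serve
from the local percpu array only when its lock is uncontended, otherwise take
a slower path. Below is a userspace sketch of that shape, using pthreads in
place of the kernel's local_trylock_t; all names here (toy_sheaf,
toy_sheaf_alloc, toy_slow_alloc) are made up for illustration:

#include <pthread.h>
#include <stdlib.h>

struct toy_sheaf {
	pthread_mutex_t lock;
	size_t size;		/* number of cached objects */
	void *objects[32];
};

/* Stand-in for the slower barn/slab refill path. */
static void *toy_slow_alloc(void)
{
	return malloc(64);
}

static void *toy_sheaf_alloc(struct toy_sheaf *s)
{
	void *obj = NULL;

	/* Fastpath: proceed only if the lock is free right now. */
	if (pthread_mutex_trylock(&s->lock) != 0)
		return toy_slow_alloc();

	if (s->size)
		obj = s->objects[--s->size];
	pthread_mutex_unlock(&s->lock);

	/* Sheaf was empty: fall back to the slower path. */
	return obj ? obj : toy_slow_alloc();
}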
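
The "Lockless freeing" section boils down to a compare-and-swap push onto the
slab's freelist. The following is a simplified userspace illustration with
C11 atomics; it is not the kernel's actual __slab_free(), which uses
cmpxchg_double to update the freelist and the counters together (or takes
slab_lock where that isn't available):

#include <stdatomic.h>

struct toy_object {
	struct toy_object *next;	/* freelist link inside a free object */
};

struct toy_slab {
	_Atomic(struct toy_object *) freelist;
};

static void toy_slab_free(struct toy_slab *slab, struct toy_object *obj)
{
	struct toy_object *head = atomic_load(&slab->freelist);

	do {
		/* Link the freed object in front of the current head. */
		obj->next = head;
		/*
		 * Retry if another cpu changed the freelist meanwhile;
		 * on failure, head is reloaded automatically.
		 */
	} while (!atomic_compare_exchange_weak(&slab->freelist, &head, obj));
}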