Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm, swap: protect si->swap_file properly and use as a mount indicator

Patch series "mm, swap: swap table phase III: remove swap_map", v3.

This series removes the static swap_map and uses the swap table for the
swap count directly. This saves about 30% of the memory used by the static
swap metadata. For example, this saves 256MB of memory when mounting a
1TB swap device. Performance is slightly better too, since the double
update of the swap table and swap_map is now gone.
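
The 256MB figure can be checked with simple arithmetic. The sketch below assumes 4 KiB pages and one byte of swap_map per swap slot (swap_map is an array of unsigned char); the helper name static_array_bytes() is hypothetical and not part of the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes consumed by a static per-slot array that
 * covers a whole swap device, with one swap slot per page.
 */
static uint64_t static_array_bytes(uint64_t device_bytes, uint64_t page_size,
				   uint64_t bytes_per_slot)
{
	assert(page_size != 0);
	return (device_bytes / page_size) * bytes_per_slot;
}
```

For a 1 TiB device, static_array_bytes(1ULL << 40, 4096, 1) gives 2^28 bytes, which is 256 MiB.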

Test results:

Mounting a swap device:
=======================
Mount a 1TB brd device as swap, just to verify the memory saving:

`free -m` before:
               total        used        free      shared  buff/cache   available
Mem:            1465        1051         417           1          61         413
Swap:        1054435           0     1054435

`free -m` after:
               total        used        free      shared  buff/cache   available
Mem:            1465         795         672           1          62         670
Swap:        1054435           0     1054435

Idle memory usage is reduced by ~256MB, just as expected. Following
this design, we should be able to save another ~512MB in a later phase.

Build kernel test:
==================
Test using ZSWAP with NVMe swap, make -j48, defconfig, in an x86_64 VM
with 5G RAM, under global pressure, average of 32 test runs:

              Before      After
System time:  1038.97s    1013.75s (-2.4%)

Test using ZRAM as swap, make -j12, tinyconfig, in an ARM64 VM with 1.5G
RAM, under global pressure, average of 32 test runs:

              Before      After
System time:  67.75s      66.65s (-1.6%)

The result is slightly better.

Redis / Valkey benchmark:
=========================
Test using ZRAM as swap, in an ARM64 VM with 1.5G RAM, under global pressure,
average of 64 test runs:

Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

         no persistence          with BGSAVE
Before:  472705.71 RPS           369451.68 RPS
After:   481197.93 RPS (+1.8%)   374922.32 RPS (+1.5%)

In conclusion, performance is better in all cases, and memory usage is
much lower.
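
The percentages quoted in the test results are plain relative changes between the "before" and "after" columns. As a minimal sketch for re-deriving them, the hypothetical helper rel_change_pct() below (not part of the series) computes that change:

```c
#include <assert.h>

/* Hypothetical helper: relative change from 'before' to 'after', in percent. */
static double rel_change_pct(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For example, rel_change_pct(1038.97, 1013.75) is roughly -2.4 for the x86_64 build test, and rel_change_pct(472705.71, 481197.93) is roughly +1.8 for the Valkey run without persistence.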

The swap cgroup array will also be merged into the swap table in a later
phase, saving the other ~60% of the static swap metadata and making
all the swap metadata dynamic. The improved API for swap operations also
reduces lock contention and makes more batched operations possible.


This patch (of 12):

/proc/swaps uses si->swap_map as the indicator to check if the swap
device is mounted. swap_map will be removed soon, so change it to use
si->swap_file instead because:

- si->swap_file is the only dynamic content that /proc/swaps is
  interested in. Previously, it checked si->swap_map just to ensure
  that si->swap_file was available. si->swap_map is set under mutex
  protection, and only after si->swap_file is set, so having
  si->swap_map set guarantees that si->swap_file is set.

- Checking si->flags doesn't work here. SWP_WRITEOK is cleared during
  swapoff, but /proc/swaps is supposed to show a device under swapoff
  too, to report the swapoff progress. And SWP_USED is set even if the
  device hasn't been properly set up.
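
The two points above can be illustrated with a small userspace model. This is only a sketch: struct toy_swap_info, the three device states, and the flag bit values are hypothetical simplifications, and only the flag names are taken from the kernel. It shows that SWP_USED is visible too early, SWP_WRITEOK disappears too early, and the swap_file pointer matches what /proc/swaps should list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values; only the flag names come from the kernel. */
#define SWP_USED	(1u << 0)
#define SWP_WRITEOK	(1u << 1)

/* Hypothetical, simplified stand-in for struct swap_info_struct. */
struct toy_swap_info {
	unsigned int flags;
	const char *swap_file;	/* NULL until the device is truly enabled */
};

/* Slot allocated by swapon, setup still in progress: must be hidden. */
static const struct toy_swap_info setting_up = { SWP_USED, NULL };
/* Fully enabled device: must be listed. */
static const struct toy_swap_info enabled = { SWP_USED | SWP_WRITEOK, "/dev/toy0" };
/* swapoff in progress: must still be listed to report progress. */
static const struct toy_swap_info swapping_off = { SWP_USED, "/dev/toy0" };

static bool listed_by_used(const struct toy_swap_info *si)
{
	return (si->flags & SWP_USED) != 0;
}

static bool listed_by_writeok(const struct toy_swap_info *si)
{
	return (si->flags & SWP_WRITEOK) != 0;
}

static bool listed_by_file(const struct toy_swap_info *si)
{
	return si->swap_file != NULL;
}
```

In this model, listed_by_used() wrongly shows the half-initialized device, listed_by_writeok() wrongly hides the device under swapoff, and listed_by_file() gives the expected answer in all three states.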

We could add another flag, but the easier way is to just check
si->swap_file directly. So protect the setting of si->swap_file with the
mutex, and set si->swap_file only when the swap device is truly enabled.

/proc/swaps is only interested in si->swap_file and in reading a few
static fields. Only si->swap_file needs protection; reading the other
static fields is always fine.
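
The locking rule described here can be sketched in userspace with a pthread mutex standing in for swapon_mutex. Everything prefixed with toy_ is hypothetical and heavily simplified; the real code in mm/swapfile.c does far more work around these steps. The point is only that the pointer is published last, under the mutex, and that the reader tests it under the same mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct swap_info_struct. */
struct toy_swap_info {
	const char *swap_file;	/* mount indicator, protected by the mutex */
	unsigned long pages;	/* static data, set before publication */
};

/* Stands in for swapon_mutex. */
static pthread_mutex_t toy_swapon_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Set up the static fields first, then publish swap_file under the mutex. */
static void toy_enable(struct toy_swap_info *si, const char *file,
		       unsigned long pages)
{
	si->pages = pages;
	pthread_mutex_lock(&toy_swapon_mutex);
	si->swap_file = file;
	pthread_mutex_unlock(&toy_swapon_mutex);
}

/* Clear the indicator under the same mutex once the device is gone. */
static void toy_disable(struct toy_swap_info *si)
{
	pthread_mutex_lock(&toy_swapon_mutex);
	si->swap_file = NULL;
	pthread_mutex_unlock(&toy_swapon_mutex);
}

/* Reader side: returns 1 and reports the file name only if mounted. */
static int toy_show(const struct toy_swap_info *si, const char **name)
{
	int mounted;

	assert(name != NULL);
	pthread_mutex_lock(&toy_swapon_mutex);
	mounted = si->swap_file != NULL;
	if (mounted)
		*name = si->swap_file;
	pthread_mutex_unlock(&toy_swapon_mutex);
	return mounted;
}
```

A caller would start with swap_file set to NULL, call toy_enable() once setup has fully succeeded, and use toy_show() wherever the device list is printed; a device that failed setup never becomes visible because its pointer is never published.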

Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-0-f4e34be021a7@tencent.com
Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-1-f4e34be021a7@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Kairui Song and committed by Andrew Morton
eca4d01b e623b4eb

+13 -12
mm/swapfile.c
···

 static struct kmem_cache *swap_table_cachep;

+/* Protects si->swap_file for /proc/swaps usage */
 static DEFINE_MUTEX(swapon_mutex);

 static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
···
 /*
  * Free all of a swapdev's extent information
  */
-static void destroy_swap_extents(struct swap_info_struct *sis)
+static void destroy_swap_extents(struct swap_info_struct *sis,
+				 struct file *swap_file)
 {
 	while (!RB_EMPTY_ROOT(&sis->swap_extent_root)) {
 		struct rb_node *rb = sis->swap_extent_root.rb_node;
···
 	}

 	if (sis->flags & SWP_ACTIVATED) {
-		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;

 		sis->flags &= ~SWP_ACTIVATED;
···
  * Typically it is in the 1-4 megabyte range.  So we can have hundreds of
  * extents in the rbtree. - akpm.
  */
-static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
+static int setup_swap_extents(struct swap_info_struct *sis,
+			      struct file *swap_file, sector_t *span)
 {
-	struct file *swap_file = sis->swap_file;
 	struct address_space *mapping = swap_file->f_mapping;
 	struct inode *inode = mapping->host;
 	int ret;
···
 		sis->flags |= SWP_ACTIVATED;
 		if ((sis->flags & SWP_FS_OPS) &&
 		    sio_pool_init() != 0) {
-			destroy_swap_extents(sis);
+			destroy_swap_extents(sis, swap_file);
 			return -ENOMEM;
 		}
 		return ret;
···
 	flush_work(&p->reclaim_work);
 	flush_percpu_swap_cluster(p);

-	destroy_swap_extents(p);
+	destroy_swap_extents(p, p->swap_file);
 	if (p->flags & SWP_CONTINUED)
 		free_swap_count_continuations(p);

···
 		return SEQ_START_TOKEN;

 	for (type = 0; (si = swap_type_to_info(type)); type++) {
-		if (!(si->flags & SWP_USED) || !si->swap_map)
+		if (!(si->swap_file))
 			continue;
 		if (!--l)
 			return si;
···

 	++(*pos);
 	for (; (si = swap_type_to_info(type)); type++) {
-		if (!(si->flags & SWP_USED) || !si->swap_map)
+		if (!(si->swap_file))
 			continue;
 		return si;
 	}
···
 		goto bad_swap;
 	}

-	si->swap_file = swap_file;
 	mapping = swap_file->f_mapping;
 	dentry = swap_file->f_path.dentry;
 	inode = mapping->host;
···

 	si->max = maxpages;
 	si->pages = maxpages - 1;
-	nr_extents = setup_swap_extents(si, &span);
+	nr_extents = setup_swap_extents(si, swap_file, &span);
 	if (nr_extents < 0) {
 		error = nr_extents;
 		goto bad_swap_unlock_inode;
···
 	prio = DEF_SWAP_PRIO;
 	if (swap_flags & SWAP_FLAG_PREFER)
 		prio = swap_flags & SWAP_FLAG_PRIO_MASK;
+
+	si->swap_file = swap_file;
 	enable_swap_info(si, prio, swap_map, cluster_info, zeromap);

 	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s\n",
···
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-	destroy_swap_extents(si);
+	destroy_swap_extents(si, swap_file);
 	swap_cgroup_swapoff(si->type);
 	spin_lock(&swap_lock);
-	si->swap_file = NULL;
 	si->flags = 0;
 	spin_unlock(&swap_lock);
 	vfree(swap_map);