mm/swap: select swap device with default priority round robin

Swap devices are assumed to have similar access speeds if no priority is
specified at swapon time. It is unfair, and makes no sense, for one swap
device to get a higher priority than another merely because it was
swapped on first.
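
As an illustration of the first-come ordering described above, here is a minimal userspace sketch of the auto-assignment rule that the diff removes. The counter name least_priority comes from the diff; the function name old_assign_prio and the input sanity check are assumptions made for this model, not kernel code.

```c
#include <assert.h>

/* Userspace stand-in for the file-scope counter this patch removes. */
static int least_priority;

/*
 * Model of the pre-patch rule: an explicit priority (>= 0) is kept,
 * while a device without one gets the next lower negative number.
 * old_assign_prio() is a hypothetical name used only in this sketch.
 */
static int old_assign_prio(int requested)
{
	/* Model-only sanity check: -1 means "no priority specified". */
	assert(requested >= -1 && requested <= 0x7fff);

	if (requested >= 0)
		return requested;
	return --least_priority;
}
```

Calling old_assign_prio(-1) four times in a row yields -1, -2, -3 and -4, which is the PRIO column seen in the first swapon output of this changelog.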

So set the priority of all swap devices to '-1' by default. With this
change, swap devices with default priority are selected round robin
when swapping out. This can improve swapping efficiency considerably
across multiple swap devices with default priority.
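
The effect of equal priorities can be sketched with a simplified userspace model. This is not how the kernel implements selection (the kernel keeps devices on priority-ordered plists); struct swap_dev, pick_dev() and the cursor are hypothetical names, and the model allocates one page at a time purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a swap device for this model. */
struct swap_dev {
	int prio;		/* higher value is preferred */
	long free_pages;	/* remaining capacity */
	long used_pages;	/* pages handed out so far */
};

/*
 * Pick a device for one page. The highest-priority device that still has
 * space wins. The search starts just after the previous pick, so devices
 * of equal priority take turns. Returns the device index, or -1 when
 * every device is full.
 */
static int pick_dev(struct swap_dev *devs, size_t n, size_t *cursor)
{
	int best = -1;

	assert(devs && cursor && n > 0);

	for (size_t i = 0; i < n; i++) {
		size_t idx = (*cursor + i) % n;

		if (devs[idx].free_pages <= 0)
			continue;
		if (best < 0 || devs[idx].prio > devs[best].prio)
			best = (int)idx;
	}
	if (best < 0)
		return -1;

	devs[best].free_pages--;
	devs[best].used_pages++;
	*cursor = ((size_t)best + 1) % n;
	return best;
}
```

With four devices all at priority -1, successive calls return 0, 1, 2, 3, 0, ... and usage stays even. With priorities -1 and -2, the model drains the first device completely before touching the second, which is the one-by-one pattern.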

Below is the swapon output captured while a high-pressure
vm-scalability test was running:

1) This is the behaviour before commit a2468cc9bfdf: swap devices are
selected one by one, from high priority to low, each one being used
until it is exhausted:
------------------------------------
[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE   USED PRIO
/dev/zram0 partition  16G    16G   -1
/dev/zram1 partition  16G 966.2M   -2
/dev/zram2 partition  16G     0B   -3
/dev/zram3 partition  16G     0B   -4

2) This is the behaviour with commit a2468cc9bfdf: on a node that has
swap devices sharing its node id, those devices are selected first
until exhausted; on a node with no swap device sharing its node id,
the device with the highest priority is selected until exhausted:
------------------------------------
[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 15.7G   -2
/dev/zram1 partition  16G  3.4G   -3
/dev/zram2 partition  16G  3.4G   -4
/dev/zram3 partition  16G  2.6G   -5

3) After this patch is applied, swap devices with default priority are
selected round robin:
------------------------------------
[root@hp-dl385g10-03 block]# swapon
NAME       TYPE      SIZE USED PRIO
/dev/zram0 partition  16G 6.6G   -1
/dev/zram1 partition  16G 6.6G   -1
/dev/zram2 partition  16G 6.6G   -1
/dev/zram3 partition  16G 6.6G   -1

With this change, efficiency improves by about 18% relative to the
node-based approach, as shown below. (The behaviour before commit
a2468cc9bfdf is clearly the worst of the three.)

vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
                            one by one:       node based:       round robin:
System time:                1087.38 s         637.92 s          526.74 s        (lower is better)
Sum Throughput:             2036.55 MB/s      3546.56 MB/s      4207.56 MB/s    (higher is better)
Single process Throughput:  65.69 MB/s        114.40 MB/s       135.72 MB/s     (higher is better)
free latency:               15769409.48 us    10138455.99 us    6810119.01 us   (lower is better)
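
The "about 18%" figure can be checked against the table with a small helper. This is only a sketch of the arithmetic; rel_change_pct() is a hypothetical name, and the inputs are the node-based and round-robin columns above.

```c
#include <assert.h>

/*
 * Relative change of new_val against old_val, in percent.
 * Positive means new_val is larger, negative means it is smaller.
 */
static double rel_change_pct(double old_val, double new_val)
{
	assert(old_val != 0.0);
	return (new_val - old_val) / old_val * 100.0;
}
```

For Sum Throughput, rel_change_pct(3546.56, 4207.56) is roughly +18.6; for System time, rel_change_pct(637.92, 526.74) is roughly -17.4. Both are in line with the stated improvement over the node-based way.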

Link: https://lkml.kernel.org/r/20251028034308.929550-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Baoquan He and committed by Andrew Morton 6 months ago (52f37efc 8e689f8e)

+4 -26

1 changed file

mm/swapfile.c

···
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-static int least_priority;
+#define DEF_SWAP_PRIO  -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
···
 			    struct swap_cluster_info *cluster_info,
 			    unsigned long *zeromap)
 {
-	if (prio >= 0)
-		si->prio = prio;
-	else
-		si->prio = --least_priority;
+	si->prio = prio;
 	/*
 	 * the plist prio is negated because plist ordering is
 	 * low-to-high, while swap ordering is high-to-low
···
 	total_swap_pages += si->pages;

 	assert_spin_locked(&swap_lock);
-	/*
-	 * both lists are plists, and thus priority ordered.
-	 * swap_active_head needs to be priority ordered for swapoff(),
-	 * which on removal of any swap_info_struct with an auto-assigned
-	 * (i.e. negative) priority increments the auto-assigned priority
-	 * of any lower-priority swap_info_structs.
-	 * swap_avail_head needs to be priority ordered for folio_alloc_swap(),
-	 * which allocates swap pages from the highest available priority
-	 * swap_info_struct.
-	 */
+
 	plist_add(&si->list, &swap_active_head);

 	/* Add back to available list */
···
 	}
 	spin_lock(&p->lock);
 	del_from_avail_list(p, true);
-	if (p->prio < 0) {
-		struct swap_info_struct *si = p;
-
-		plist_for_each_entry_continue(si, &swap_active_head, list) {
-			si->prio++;
-			si->list.prio--;
-			si->avail_list.prio--;
-		}
-		least_priority++;
-	}
 	plist_del(&p->list, &swap_active_head);
 	atomic_long_sub(p->pages, &nr_swap_pages);
 	total_swap_pages -= p->pages;
···
 	}

 	mutex_lock(&swapon_mutex);
-	prio = -1;
+	prio = DEF_SWAP_PRIO;
 	if (swap_flags & SWAP_FLAG_PREFER)
 		prio = swap_flags & SWAP_FLAG_PRIO_MASK;
 	enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
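
For readers following the last hunk, the post-patch priority resolution can be modelled in userspace as below. This is a sketch, not kernel code: the MODEL_ prefixed constants are local to the example (their values are assumed to match the kernel's swapon(2) flag layout), and resolve_prio() is a hypothetical name.

```c
#include <assert.h>

/* Local copies for this sketch; values assumed to match the kernel's. */
#define MODEL_SWAP_FLAG_PREFER		0x8000
#define MODEL_SWAP_FLAG_PRIO_MASK	0x7fff
#define MODEL_DEF_SWAP_PRIO		(-1)

/*
 * Model of the swapon path after this patch: every device starts at the
 * shared default priority, and only an explicit "prefer" flag overrides
 * it with the priority encoded in the low bits of swap_flags.
 */
static int resolve_prio(int swap_flags)
{
	int prio = MODEL_DEF_SWAP_PRIO;

	if (swap_flags & MODEL_SWAP_FLAG_PREFER)
		prio = swap_flags & MODEL_SWAP_FLAG_PRIO_MASK;

	/* The result is either the default or a non-negative priority. */
	assert(prio >= MODEL_DEF_SWAP_PRIO);
	return prio;
}
```

Because no per-device counter is involved any more, every device enabled without the prefer flag ends up at the same priority, and the swapoff path no longer has to renumber the remaining devices.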
