mm: zswap: tie per-CPU acomp_ctx lifetime to the pool

Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU
hotplug, and destroyed on pool destruction or CPU hotunplug. This
complicates the lifetime management to save memory while a CPU is
offlined, which is not very common.

Simplify lifetime management by allocating per-CPU acomp_ctx once on pool
creation (or CPU hotplug for CPUs onlined later), and keeping them
allocated until the pool is destroyed.

Refactor cleanup code from zswap_cpu_comp_dead() into acomp_ctx_free() to
be used elsewhere.

The main benefit of using the CPU hotplug multi state instance startup
callback to allocate the acomp_ctx resources is that it prevents the cores
from being offlined until the multi state instance addition call returns.

From Documentation/core-api/cpu_hotplug.rst:

"The node list add/remove operations and the callback invocations are
serialized against CPU hotplug operations."

Furthermore, zswap_[de]compress() cannot contend with
zswap_cpu_comp_prepare() because:

- During pool creation/deletion, the pool is not in the zswap_pools
list.

- During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
out. zswap_cpu_comp_prepare() will be run on a control CPU,
since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum
cpuhp_state".

In both these cases, any recursions into zswap reclaim from
zswap_cpu_comp_prepare() will be handled by the old pool.

The above two observations enable the following simplifications:

1) zswap_cpu_comp_prepare():

a) acomp_ctx mutex locking:

If the process gets migrated while zswap_cpu_comp_prepare() is
running, it will complete on the new CPU. In case of failures, we
pass the acomp_ctx pointer obtained at the start of
zswap_cpu_comp_prepare() to acomp_ctx_free(), which again, can
only undergo migration. There appear to be no contention
scenarios that might cause inconsistent values of acomp_ctx's
members. Hence, it seems there is no need for
mutex_lock(&acomp_ctx->mutex) in zswap_cpu_comp_prepare().

b) acomp_ctx mutex initialization:

Since the pool is not yet on zswap_pools list, we don't need to
initialize the per-CPU acomp_ctx mutex in
zswap_pool_create(). This has been restored to occur in
zswap_cpu_comp_prepare().

c) Subsequent CPU offline-online transitions:

zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is
valid. If so, it returns success. This should handle any CPU
hotplug online-offline transitions after pool creation is done.

2) CPU offline vis-a-vis zswap ops:

Let's suppose the process is migrated to another CPU before the
current CPU is dysfunctional. If zswap_[de]compress() holds the
acomp_ctx->mutex lock of the offlined CPU, that mutex will be
released once it completes on the new CPU. Since there is no
teardown callback, there is no possibility of UAF.

3) Pool creation/deletion and process migration to another CPU:

During pool creation/deletion, the pool is not in the zswap_pools
list. Hence it cannot contend with zswap ops on that CPU. However,
the process can get migrated.

a) Pool creation --> zswap_cpu_comp_prepare()
--> process migrated:
* Old CPU offline: no-op.
* zswap_cpu_comp_prepare() continues
to run on the new CPU to finish
allocating acomp_ctx resources for
the offlined CPU.

b) Pool deletion --> acomp_ctx_free()
--> process migrated:
* Old CPU offline: no-op.
* acomp_ctx_free() continues
to run on the new CPU to finish
de-allocating acomp_ctx resources
for the offlined CPU.

4) Pool deletion vis-a-vis CPU onlining:

The call to cpuhp_state_remove_instance() cannot race with
zswap_cpu_comp_prepare() because of hotplug synchronization.

The current acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock() are deleted.
Instead, zswap_[de]compress() directly call
mutex_[un]lock(&acomp_ctx->mutex).

The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU
offlining, and only deleting them when the pool is destroyed, is 8.28 KB
on x86_64. This cost is only paid when a CPU is offlined, until it is
onlined again.

Link: https://lore.kernel.org/20260331183351.29844-3-kanchanapsridhar2026@gmail.com
Co-developed-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com>
Signed-off-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Acked-by: Yosry Ahmed <yosry@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by

Kanchana P. Sridhar and committed by

Andrew Morton 2 months ago ef3c0f6c 1556478e

+80 -100

1 changed file

expand all

zswap.c

+80 -100

mm/zswap.c

··· 242 242 **********************************/ 243 243 static void __zswap_pool_empty(struct percpu_ref *ref); 244 244 245 + static void acomp_ctx_free(struct crypto_acomp_ctx *acomp_ctx) 246 + { 247 + if (!acomp_ctx) 248 + return; 249 + 250 + /* 251 + * If there was an error in allocating @acomp_ctx->req, it 252 + * would be set to NULL. 253 + */ 254 + if (acomp_ctx->req) 255 + acomp_request_free(acomp_ctx->req); 256 + 257 + acomp_ctx->req = NULL; 258 + 259 + /* 260 + * We have to handle both cases here: an error pointer return from 261 + * crypto_alloc_acomp_node(); and a) NULL initialization by zswap, or 262 + * b) NULL assignment done in a previous call to acomp_ctx_free(). 263 + */ 264 + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) 265 + crypto_free_acomp(acomp_ctx->acomp); 266 + 267 + acomp_ctx->acomp = NULL; 268 + 269 + kfree(acomp_ctx->buffer); 270 + acomp_ctx->buffer = NULL; 271 + } 272 + 245 273 static struct zswap_pool *zswap_pool_create(char *compressor) 246 274 { 247 275 struct zswap_pool *pool; ··· 291 263 292 264 strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name)); 293 265 294 - pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx); 266 + /* Many things rely on the zero-initialization. */ 267 + pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx, 268 + GFP_KERNEL | __GFP_ZERO); 295 269 if (!pool->acomp_ctx) { 296 270 pr_err("percpu alloc failed\n"); 297 271 goto error; 298 272 } 299 273 300 - for_each_possible_cpu(cpu) 301 - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); 302 - 274 + /* 275 + * This is serialized against CPU hotplug operations. Hence, cores 276 + * cannot be offlined until this finishes. 277 + */ 303 278 ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, 304 279 &pool->node); 280 + 281 + /* 282 + * cpuhp_state_add_instance() will not cleanup on failure since 283 + * we don't register a hotunplug callback. 284 + */ 305 285 if (ret) 306 - goto error; 286 + goto cpuhp_add_fail; 307 287 308 288 /* being the current pool takes 1 ref; this func expects the 309 289 * caller to always add the new pool as the current pool ··· 328 292 329 293 ref_fail: 330 294 cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); 295 + 296 + cpuhp_add_fail: 297 + for_each_possible_cpu(cpu) 298 + acomp_ctx_free(per_cpu_ptr(pool->acomp_ctx, cpu)); 331 299 error: 332 300 if (pool->acomp_ctx) 333 301 free_percpu(pool->acomp_ctx); ··· 362 322 363 323 static void zswap_pool_destroy(struct zswap_pool *pool) 364 324 { 325 + int cpu; 326 + 365 327 zswap_pool_debug("destroying", pool); 366 328 367 329 cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); 330 + 331 + for_each_possible_cpu(cpu) 332 + acomp_ctx_free(per_cpu_ptr(pool->acomp_ctx, cpu)); 333 + 368 334 free_percpu(pool->acomp_ctx); 369 335 370 336 zs_destroy_pool(pool->zs_pool); ··· 784 738 { 785 739 struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); 786 740 struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); 787 - struct crypto_acomp *acomp = NULL; 788 - struct acomp_req *req = NULL; 789 - u8 *buffer = NULL; 790 - int ret; 741 + int ret = -ENOMEM; 791 742 792 - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu)); 793 - if (!buffer) { 794 - ret = -ENOMEM; 795 - goto fail; 743 + /* 744 + * To handle cases where the CPU goes through online-offline-online 745 + * transitions, we return if the acomp_ctx has already been initialized. 746 + */ 747 + if (acomp_ctx->acomp) { 748 + WARN_ON_ONCE(IS_ERR(acomp_ctx->acomp)); 749 + return 0; 796 750 } 751 + 752 + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu)); 753 + if (!acomp_ctx->buffer) 754 + return ret; 797 755 798 756 /* 799 757 * In case of an error, crypto_alloc_acomp_node() returns an 800 758 * error pointer, never NULL. 801 759 */ 802 - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); 803 - if (IS_ERR(acomp)) { 760 + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); 761 + if (IS_ERR(acomp_ctx->acomp)) { 804 762 pr_err("could not alloc crypto acomp %s : %pe\n", 805 - pool->tfm_name, acomp); 806 - ret = PTR_ERR(acomp); 763 + pool->tfm_name, acomp_ctx->acomp); 764 + ret = PTR_ERR(acomp_ctx->acomp); 807 765 goto fail; 808 766 } 809 767 810 768 /* acomp_request_alloc() returns NULL in case of an error. */ 811 - req = acomp_request_alloc(acomp); 812 - if (!req) { 769 + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp); 770 + if (!acomp_ctx->req) { 813 771 pr_err("could not alloc crypto acomp_request %s\n", 814 772 pool->tfm_name); 815 - ret = -ENOMEM; 816 773 goto fail; 817 774 } 818 775 819 - /* 820 - * Only hold the mutex after completing allocations, otherwise we may 821 - * recurse into zswap through reclaim and attempt to hold the mutex 822 - * again resulting in a deadlock. 823 - */ 824 - mutex_lock(&acomp_ctx->mutex); 825 776 crypto_init_wait(&acomp_ctx->wait); 826 777 827 778 /* ··· 826 783 * crypto_wait_req(); if the backend of acomp is scomp, the callback 827 784 * won't be called, crypto_wait_req() will return without blocking. 828 785 */ 829 - acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, 786 + acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG, 830 787 crypto_req_done, &acomp_ctx->wait); 831 788 832 - acomp_ctx->buffer = buffer; 833 - acomp_ctx->acomp = acomp; 834 - acomp_ctx->req = req; 835 - mutex_unlock(&acomp_ctx->mutex); 789 + mutex_init(&acomp_ctx->mutex); 836 790 return 0; 837 791 838 792 fail: 839 - if (!IS_ERR_OR_NULL(acomp)) 840 - crypto_free_acomp(acomp); 841 - kfree(buffer); 793 + acomp_ctx_free(acomp_ctx); 842 794 return ret; 843 - } 844 - 845 - static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) 846 - { 847 - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); 848 - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); 849 - struct acomp_req *req; 850 - struct crypto_acomp *acomp; 851 - u8 *buffer; 852 - 853 - if (!acomp_ctx) 854 - return 0; 855 - 856 - mutex_lock(&acomp_ctx->mutex); 857 - req = acomp_ctx->req; 858 - acomp = acomp_ctx->acomp; 859 - buffer = acomp_ctx->buffer; 860 - acomp_ctx->req = NULL; 861 - acomp_ctx->acomp = NULL; 862 - acomp_ctx->buffer = NULL; 863 - mutex_unlock(&acomp_ctx->mutex); 864 - 865 - /* 866 - * Do the actual freeing after releasing the mutex to avoid subtle 867 - * locking dependencies causing deadlocks. 868 - * 869 - * If there was an error in allocating @acomp_ctx->req, it 870 - * would be set to NULL. 871 - */ 872 - if (req) 873 - acomp_request_free(req); 874 - if (!IS_ERR_OR_NULL(acomp)) 875 - crypto_free_acomp(acomp); 876 - kfree(buffer); 877 - 878 - return 0; 879 - } 880 - 881 - static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) 882 - { 883 - struct crypto_acomp_ctx *acomp_ctx; 884 - 885 - for (;;) { 886 - acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); 887 - mutex_lock(&acomp_ctx->mutex); 888 - if (likely(acomp_ctx->req)) 889 - return acomp_ctx; 890 - /* 891 - * It is possible that we were migrated to a different CPU after 892 - * getting the per-CPU ctx but before the mutex was acquired. If 893 - * the old CPU got offlined, zswap_cpu_comp_dead() could have 894 - * already freed ctx->req (among other things) and set it to 895 - * NULL. Just try again on the new CPU that we ended up on. 896 - */ 897 - mutex_unlock(&acomp_ctx->mutex); 898 - } 899 - } 900 - 901 - static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx) 902 - { 903 - mutex_unlock(&acomp_ctx->mutex); 904 795 } 905 796 906 797 static bool zswap_compress(struct page *page, struct zswap_entry *entry, ··· 849 872 u8 *dst; 850 873 bool mapped = false; 851 874 852 - acomp_ctx = acomp_ctx_get_cpu_lock(pool); 875 + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); 876 + mutex_lock(&acomp_ctx->mutex); 877 + 853 878 dst = acomp_ctx->buffer; 854 879 sg_init_table(&input, 1); 855 880 sg_set_page(&input, page, PAGE_SIZE, 0); ··· 917 938 else if (alloc_ret) 918 939 zswap_reject_alloc_fail++; 919 940 920 - acomp_ctx_put_unlock(acomp_ctx); 941 + mutex_unlock(&acomp_ctx->mutex); 921 942 return comp_ret == 0 && alloc_ret == 0; 922 943 } 923 944 ··· 929 950 struct crypto_acomp_ctx *acomp_ctx; 930 951 int ret = 0, dlen; 931 952 932 - acomp_ctx = acomp_ctx_get_cpu_lock(pool); 953 + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); 954 + mutex_lock(&acomp_ctx->mutex); 933 955 zs_obj_read_sg_begin(pool->zs_pool, entry->handle, input, entry->length); 934 956 935 957 /* zswap entries of length PAGE_SIZE are not compressed. */ ··· 955 975 } 956 976 957 977 zs_obj_read_sg_end(pool->zs_pool, entry->handle); 958 - acomp_ctx_put_unlock(acomp_ctx); 978 + mutex_unlock(&acomp_ctx->mutex); 959 979 960 980 if (!ret && dlen == PAGE_SIZE) 961 981 return true; ··· 1775 1795 ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE, 1776 1796 "mm/zswap_pool:prepare", 1777 1797 zswap_cpu_comp_prepare, 1778 - zswap_cpu_comp_dead); 1798 + NULL); 1779 1799 if (ret) 1780 1800 goto hp_fail; 1781 1801

Configure Feed

Configure Feed