From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"nphamcs@gmail.com" <nphamcs@gmail.com>,
	"chengming.zhou@linux.dev" <chengming.zhou@linux.dev>,
	"usamaarif642@gmail.com" <usamaarif642@gmail.com>,
	"ryan.roberts@arm.com" <ryan.roberts@arm.com>,
	"21cnbao@gmail.com" <21cnbao@gmail.com>,
	"ying.huang@linux.alibaba.com" <ying.huang@linux.alibaba.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	"herbert@gondor.apana.org.au" <herbert@gondor.apana.org.au>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"clabbe@baylibre.com" <clabbe@baylibre.com>,
	"ardb@kernel.org" <ardb@kernel.org>,
	"ebiggers@google.com" <ebiggers@google.com>,
	"surenb@google.com" <surenb@google.com>,
	"Accardi, Kristen C" <kristen.c.accardi@intel.com>,
	"Feghali, Wajdi K" <wajdi.k.feghali@intel.com>,
	"Gopal, Vinodh" <vinodh.gopal@intel.com>
Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
Date: Mon, 10 Mar 2025 17:31:02 +0000
Message-ID: <Z88h1mPkYNM6yOGE@google.com>
In-Reply-To: <PH8SPRMB00447B066A769C76F57F8800C9D42@PH8SPRMB0044.namprd11.prod.outlook.com>

On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote:
> 
[..]
> > > > >  	u8 *buffer;
> > > > > +	u8 nr_reqs;
> > > > > +	struct crypto_wait wait;
> > > > >  	struct mutex mutex;
> > > > >  	bool is_sleepable;
> > > > > +	bool __online;
> > > >
> > > > I don't believe we need this.
> > > >
> > > > If we are not freeing resources during CPU offlining, then we do not
> > > > need a CPU offline callback and acomp_ctx->__online serves no purpose.
> > > >
> > > > The whole point of synchronizing between offlining and
> > > > compress/decompress operations is to avoid UAF. If offlining does not
> > > > free resources, then we can hold the mutex directly in the
> > > > compress/decompress path and drop the hotunplug callback completely.
> > > >
> > > > I also believe nr_reqs can be dropped from this patch, as it seems like
> > > > it's only used to know when to set __online.
> > >
> > > All great points! In fact, that was the original solution I had implemented
> > > (not having an offline callback). But then, I spent some time understanding
> > > the v6.13 hotfix for synchronizing freeing of resources, and this comment
> > > in zswap_cpu_comp_prepare():
> > >
> > > 	/*
> > > 	 * Only hold the mutex after completing allocations, otherwise we may
> > > 	 * recurse into zswap through reclaim and attempt to hold the mutex
> > > 	 * again resulting in a deadlock.
> > > 	 */
> > >
> > > Hence, I figured the constraint of "recurse into zswap through reclaim" was
> > > something to account for in the simplification (even though I had a tough
> > > time imagining how this could happen).
> > 
> > The constraint here is about zswap_cpu_comp_prepare() holding the mutex,
> > making an allocation which internally triggers reclaim, then recursing
> > into zswap and trying to hold the same mutex again causing a deadlock.
> > 
> > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> > with, the constraint naturally goes away.
> 
> Actually, if it is possible for the allocations in zswap_cpu_comp_prepare()
> to trigger reclaim, then I believe we need some way for reclaim to know if
> the acomp_ctx resources are available. Hence, this seems like a potential
> deadlock regardless of the mutex.

I took a closer look and I believe my hotfix was actually unnecessary. I
sent it out in response to a syzbot report, but on further inspection it
seems it was not an actual problem. Sorry if my patch confused you.

Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like
CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment above
says:

 * PREPARE: The callbacks are invoked on a control CPU before the
 * hotplugged CPU is started up or after the hotplugged CPU has died.

So even if we go into reclaim during zswap_cpu_comp_prepare(), it will
never be on the CPU that we are allocating resources for.
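
For reference, the zswap hotplug callbacks are registered as a multi-instance
state in that PREPARE section. A paraphrased sketch of the registration (the
wrapper function name below is made up purely for illustration; the real call
sits in zswap's setup code and exact arguments may differ across versions):

	/*
	 * Callbacks registered for a PREPARE-section state run on a control
	 * CPU, not on the CPU being brought up or torn down.
	 */
	static int zswap_setup_hotplug(void)
	{
		return cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
					       "mm/zswap_pool:prepare",
					       zswap_cpu_comp_prepare,
					       zswap_cpu_comp_dead);
	}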

The other case where zswap_cpu_comp_prepare() could race with
compression/decompression is when a pool is being created. In this case,
reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the same
CPU AFAICT. However, because the pool is still under creation, it will
not be used (i.e. zswap_pool_current_get() won't find it).

So I think we don't need to worry about zswap_cpu_comp_prepare() racing
with compression or decompression for the same pool and CPU.
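
To make that concrete, here is a minimal sketch of what the lock helper could
reduce to if __online and the retry loop are dropped (illustrative only, not
actual patch code; it assumes the per-CPU resources live for the whole
lifetime of the pool):

	/* Minimal sketch: just pin to one CPU's ctx by taking its mutex. */
	static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
	{
		struct crypto_acomp_ctx *acomp_ctx;

		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
		mutex_lock(&acomp_ctx->mutex);
		return acomp_ctx;
	}

	static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
	{
		mutex_unlock(&acomp_ctx->mutex);
	}

Even if we are migrated after raw_cpu_ptr(), we simply keep using (and later
unlock) the ctx we locked, which stays valid because the pool holds on to its
resources until it is destroyed.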

> 
> I verified that all the zswap_cpu_comp_prepare() allocations are done with
> GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a
> risk for deadlock between zswap_compress() and zswap_cpu_comp_prepare()
> in general, i.e., aside from this patchset.
> 
> I can think of the following options to resolve this, and would welcome
> other suggestions:
> 
> 1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check
>     if acomp_ctx->__online is true, and if so, return the mutex. If
>     acomp_ctx->__online is false, then it returns NULL. In other words, we
>     don't have the for loop.
>     - This will cause recursions into zswap from direct reclaim during
>        zswap_cpu_comp_prepare() to fail, and cpuhotplug to fail. However,
>        there is no deadlock (option 1 is sketched below, after option 2).
>         - zswap_compress() will need to detect NULL returned by
>           acomp_ctx_get_cpu_lock(), and return an error.
>         - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling
>           acomp_ctx_get_cpu_lock().
>     - We won't be migrated to a different CPU because we hold the mutex, hence
>       zswap_cpu_comp_dead() will wait on the mutex.
> 
> 2) More intrusive: We would need to use a gfp_t that prevents direct reclaim
>     and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h,
>     but for non-THP allocations. If we decide to adopt this approach, we would
>     need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c
>     to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this
>     new gfp_t, in lieu of GFP_KERNEL.
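
To make option (1) concrete, here is a rough, hypothetical sketch of what the
helper could look like under that proposal (not actual patch code):

	/* Hypothetical sketch of option (1): fail instead of looping. */
	static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
	{
		struct crypto_acomp_ctx *acomp_ctx;

		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
		mutex_lock(&acomp_ctx->mutex);
		if (!acomp_ctx->__online) {
			mutex_unlock(&acomp_ctx->mutex);
			return NULL;	/* caller fails the store / BUG()s on load */
		}
		return acomp_ctx;
	}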
> 
> > 
> > >
> > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare()
> > > does not acquire the mutex lock while allocating resources. We have already
> > > initialized the mutex, so in theory, it is possible for compress/decompress
> > > to acquire the mutex lock. The __online acts as a way to indicate whether
> > > compress/decompress can proceed reliably to use the resources.
> > 
> > For compress/decompress to acquire the mutex they need to run on that
> > CPU, and I don't think that's possible before onlining completes, so
> > zswap_cpu_comp_prepare() must have already completed before
> > compress/decompress can use that CPU IIUC.
> 
> If we can make this assumption, that would be great! However, I am not
> totally sure because of the GFP_KERNEL allocations in
> zswap_cpu_comp_prepare().

As I mentioned above, when zswap_cpu_comp_prepare() is run we are in one
of two situations:
- The pool is under creation, so we cannot race with stores/loads from
  that same pool.
- The CPU is being onlined, in which case zswap_cpu_comp_prepare() is
  called from a control CPU before tasks start running on the CPU being
  onlined.

Please correct me if I am wrong.

[..]
> > > > > @@ -285,13 +403,21 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> > > > >  		goto error;
> > > > >  	}
> > > > >
> > > > > -	for_each_possible_cpu(cpu)
> > > > > -		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > > > > +
> > > > > +		acomp_ctx->acomp = NULL;
> > > > > +		acomp_ctx->req = NULL;
> > > > > +		acomp_ctx->buffer = NULL;
> > > > > +		acomp_ctx->__online = false;
> > > > > +		acomp_ctx->nr_reqs = 0;
> > > >
> > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them
> > > > right away?
> > >
> > > Yes, I figured this is needed for two reasons:
> > >
> > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into
> > >     zswap_cpu_comp_dealloc() to be handled by the common procedure
> > >     "acomp_ctx_dealloc()" unambiguously.
> > 
> > This makes sense. When you move the refactoring to create
> > acomp_ctx_dealloc() to a separate patch, please include this change in
> > it and call it out explicitly in the commit message.
> 
> Sure.
> 
> > 
> > > 2) The second scenario I thought of that would need this: let's say
> > >      the zswap compressor is switched again immediately after being
> > >      set. Some cores have executed the onlining code and
> > >      some haven't. Because there are no pool refs held,
> > >      zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured
> > >      it would help to initialize these acomp_ctx members before the
> > >      hand-off to "cpuhp_state_add_instance()" in zswap_pool_create().
> > 
> > I believe cpuhp_state_add_instance() calls the onlining function
> > synchronously on all present CPUs, so I don't think it's possible to end
> > up in a state where the pool is being destroyed and some CPU executed
> > zswap_cpu_comp_prepare() while others haven't.
> 
> I looked at the cpuhotplug code some more. The startup callback is
> invoked sequentially for_each_present_cpu(). If an error occurs for any
> one of them, it calls the teardown callback only on the earlier cores that
> have already finished running the startup callback. However, 
> zswap_cpu_comp_dealloc() will be called for all cores, even the ones
> for which the startup callback was not run. Hence, I believe the
> zero initialization is useful, albeit via allocating the acomp_ctx with
> alloc_percpu_gfp() and __GFP_ZERO.

Yeah this is point (1) above IIUC, and I agree about zero initialization
for that.

> 
> > 
> > That being said, this made me think of a different problem. If pool
> > destruction races with CPU onlining, there could be a race between
> > zswap_cpu_comp_prepare() allocating resources and
> > zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them.
> > 
> > I believe we must always call cpuhp_state_remove_instance() *before*
> > freeing the resources to prevent this race from happening. This needs to
> > be documented with a comment.
> 
> Yes, this race condition is possible, thanks for catching this! The problem with
> calling cpuhp_state_remove_instance() before freeing the resources is that
> cpuhp_state_add_instance() and cpuhp_state_remove_instance() both
> acquire a "mutex_lock(&cpuhp_state_mutex);" at the beginning, and hence
> are serialized.
> 
> For the reasons motivating why acomp_ctx->__online is set to false in
> zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance()
> before calling acomp_ctx_dealloc() because the latter could wait for
> acomp_ctx->__online to become true before deleting the resources. I will
> think about this some more.

I believe this problem goes away with acomp_ctx->__online going away,
right?
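
To make the ordering constraint concrete, here is a sketch of the destroy
path, under the assumption that acomp_ctx_dealloc() becomes the common
teardown helper discussed earlier (illustrative only; the pool's hotplug
hlist_node is assumed to be named 'node'):

	static void zswap_pool_destroy(struct zswap_pool *pool)
	{
		int cpu;

		/*
		 * Remove the hotplug instance first so zswap_cpu_comp_prepare()
		 * can no longer run for this pool, then free the per-CPU
		 * resources.
		 */
		cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);

		for_each_possible_cpu(cpu)
			acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));

		free_percpu(pool->acomp_ctx);
		/* ... rest of pool destruction ... */
	}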

> 
> Another possibility is to not rely on cpuhotplug in zswap, and instead
> manage the per-cpu acomp_ctx resource allocation entirely in
> zswap_pool_create(), and deletion entirely in zswap_pool_destroy(),
> along with the necessary error handling. Let me think about this some
> more as well.
> 
> > 
> > Let me know if I missed something.
> > 
> > >
> > > Please let me know if these are valid considerations.
> > >
> > > >
> > > > If it is in fact needed we should probably just use __GFP_ZERO.
> > >
> > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of
> > "alloc_percpu()"
> > > for the acomp_ctx?
> > 
> > Yeah if we need to initialize all/most fields to 0 let's use
> > alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO.
> 
> Sounds good.
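
For example, a minimal sketch of the allocation in zswap_pool_create() (error
handling elided):

	/*
	 * Zero-init every per-CPU acomp_ctx so teardown can tell "never
	 * allocated" from "allocated" unambiguously.
	 */
	pool->acomp_ctx = alloc_percpu_gfp(struct crypto_acomp_ctx,
					   GFP_KERNEL | __GFP_ZERO);
	if (!pool->acomp_ctx)
		goto error;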
> 
> > 
> > [..]
> > > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> > > > >
> > > > >  	for (;;) {
> > > > >  		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > > > > -		mutex_lock(&acomp_ctx->mutex);
> > > > > -		if (likely(acomp_ctx->req))
> > > > > -			return acomp_ctx;
> > > > >  		/*
> > > > > -		 * It is possible that we were migrated to a different CPU after
> > > > > -		 * getting the per-CPU ctx but before the mutex was acquired. If
> > > > > -		 * the old CPU got offlined, zswap_cpu_comp_dead() could have
> > > > > -		 * already freed ctx->req (among other things) and set it to
> > > > > -		 * NULL. Just try again on the new CPU that we ended up on.
> > > > > +		 * If the CPU onlining code successfully allocates acomp_ctx resources,
> > > > > +		 * it sets acomp_ctx->__online to true. Until this happens, we have
> > > > > +		 * two options:
> > > > > +		 *
> > > > > +		 * 1. Return NULL and fail all stores on this CPU.
> > > > > +		 * 2. Retry, until onlining has finished allocating resources.
> > > > > +		 *
> > > > > +		 * In theory, option 1 could be more appropriate, because it
> > > > > +		 * allows the calling procedure to decide how it wants to handle
> > > > > +		 * reclaim racing with CPU hotplug. For instance, it might be Ok
> > > > > +		 * for compress to return an error for the backing swap device
> > > > > +		 * to store the folio. Decompress could wait until we get a
> > > > > +		 * valid and locked mutex after onlining has completed. For now,
> > > > > +		 * we go with option 2 because adding a do-while in
> > > > > +		 * zswap_decompress() adds latency for software compressors.
> > > > > +		 *
> > > > > +		 * Once initialized, the resources will be de-allocated only
> > > > > +		 * when the pool is destroyed. The acomp_ctx will hold on to the
> > > > > +		 * resources through CPU offlining/onlining at any time until
> > > > > +		 * the pool is destroyed.
> > > > > +		 *
> > > > > +		 * This prevents races/deadlocks between reclaim and CPU acomp_ctx
> > > > > +		 * resource allocation that are a dependency for reclaim.
> > > > > +		 * It further simplifies the interaction with CPU onlining and
> > > > > +		 * offlining:
> > > > > +		 *
> > > > > +		 * - CPU onlining does not take the mutex. It only allocates
> > > > > +		 *   resources and sets __online to true.
> > > > > +		 * - CPU offlining acquires the mutex before setting
> > > > > +		 *   __online to false. If reclaim has acquired the mutex,
> > > > > +		 *   offlining will have to wait for reclaim to complete before
> > > > > +		 *   hotunplug can proceed. Further, hotplug merely sets
> > > > > +		 *   __online to false. It does not delete the acomp_ctx
> > > > > +		 *   resources.
> > > > > +		 *
> > > > > +		 * Option 1 is better than potentially not exiting the earlier
> > > > > +		 * for (;;) loop because the system is running low on memory
> > > > > +		 * and/or CPUs are getting offlined for whatever reason. At
> > > > > +		 * least failing this store will prevent data loss by failing
> > > > > +		 * zswap_store(), and saving the data in the backing swap device.
> > > > >  		 */
> > > >
> > > > I believe this can be dropped. I don't think we can have any store/load
> > > > operations on a CPU before it's fully onlined, and we should always have
> > > > a reference on the pool here, so the resources cannot go away.
> > > >
> > > > So unless I missed something we can drop this completely now and just
> > > > hold the mutex directly in the load/store paths.
> > >
> > > Based on the above explanations, please let me know if it is a good idea
> > > to keep the __online, or if you think further simplification is possible.
> > 
> > I still think it's not needed. Let me know if I missed anything.
> 
> Let me think some more about whether it is feasible to not have cpuhotplug
> manage the acomp_ctx resource allocation, and instead have this be done
> through the pool creation/deletion routines.

I don't think this is necessary, see my comments above.
