Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"nphamcs@gmail.com" <nphamcs@gmail.com>,
	"chengming.zhou@linux.dev" <chengming.zhou@linux.dev>,
	"usamaarif642@gmail.com" <usamaarif642@gmail.com>,
	"ryan.roberts@arm.com" <ryan.roberts@arm.com>,
	"21cnbao@gmail.com" <21cnbao@gmail.com>,
	"ying.huang@linux.alibaba.com" <ying.huang@linux.alibaba.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	"herbert@gondor.apana.org.au" <herbert@gondor.apana.org.au>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"clabbe@baylibre.com" <clabbe@baylibre.com>,
	"ardb@kernel.org" <ardb@kernel.org>,
	"ebiggers@google.com" <ebiggers@google.com>,
	"surenb@google.com" <surenb@google.com>,
	"Accardi, Kristen C" <kristen.c.accardi@intel.com>,
	"Feghali, Wajdi K" <wajdi.k.feghali@intel.com>,
	"Gopal, Vinodh" <vinodh.gopal@intel.com>
Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
Date: Mon, 10 Mar 2025 17:31:02 +0000	[thread overview]
Message-ID: <Z88h1mPkYNM6yOGE@google.com> (raw)
In-Reply-To: <PH8SPRMB00447B066A769C76F57F8800C9D42@PH8SPRMB0044.namprd11.prod.outlook.com>

On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote:
> 
[..]
> > > > >  	u8 *buffer;
> > > > > +	u8 nr_reqs;
> > > > > +	struct crypto_wait wait;
> > > > >  	struct mutex mutex;
> > > > >  	bool is_sleepable;
> > > > > +	bool __online;
> > > >
> > > > I don't believe we need this.
> > > >
> > > > If we are not freeing resources during CPU offlining, then we do not
> > > > need a CPU offline callback and acomp_ctx->__online serves no purpose.
> > > >
> > > > The whole point of synchronizing between offlining and
> > > > compress/decompress operations is to avoid UAF. If offlining does not
> > > > free resources, then we can hold the mutex directly in the
> > > > compress/decompress path and drop the hotunplug callback completely.
> > > >
> > > > I also believe nr_reqs can be dropped from this patch, as it seems like
> > > > it's only used know when to set __online.
> > >
> > > All great points! In fact, that was the original solution I had implemented
> > > (not having an offline callback). But then, I spent some time understanding
> > > the v6.13 hotfix for synchronizing freeing of resources, and this comment
> > > in zswap_cpu_comp_prepare():
> > >
> > > 	/*
> > > 	 * Only hold the mutex after completing allocations, otherwise we
> > may
> > > 	 * recurse into zswap through reclaim and attempt to hold the mutex
> > > 	 * again resulting in a deadlock.
> > > 	 */
> > >
> > > Hence, I figured the constraint of "recurse into zswap through reclaim" was
> > > something to comprehend in the simplification (even though I had a tough
> > > time imagining how this could happen).
> > 
> > The constraint here is about zswap_cpu_comp_prepare() holding the mutex,
> > making an allocation which internally triggers reclaim, then recursing
> > into zswap and trying to hold the same mutex again causing a deadlock.
> > 
> > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> > with, the constraint naturally goes away.
> 
> Actually, if it is possible for the allocations in zswap_cpu_comp_prepare()
> to trigger reclaim, then I believe we need some way for reclaim to know if
> the acomp_ctx resources are available. Hence, this seems like a potential
> for deadlock regardless of the mutex.

I took a closer look and I believe my hotfix was actually unnecessary. I
sent it out in response to a syzbot report, but upon closer look it
seems like it was not an actual problem. Sorry if my patch confused you.

Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like
CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment above
says:

 * PREPARE: The callbacks are invoked on a control CPU before the
 * hotplugged CPU is started up or after the hotplugged CPU has died.

So even if we go into reclaim during zswap_cpu_comp_prepare(), it will
never be on the CPU that we are allocating resources for.

The other case where zswap_cpu_comp_prepare() could race with
compression/decompression is when a pool is being created. In this case,
reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the same
CPU AFAICT. However, because the pool is still under creation, it will
not be used (i.e. zswap_pool_current_get() won't find it).

So I think we don't need to worry about zswap_cpu_comp_prepare() racing
with compression or decompression for the same pool and CPU.

> 
> I verified that all the zswap_cpu_comp_prepare() allocations are done with
> GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a
> risk for deadlock between zswap_compress() and zswap_cpu_comp_prepare()
> in general, i.e., aside from this patchset.
> 
> I can think of the following options to resolve this, and would welcome
> other suggestions:
> 
> 1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check
>     if acomp_ctx->__online is true, and if so, return the mutex. If
>     acomp_ctx->__online is false, then it returns NULL. In other words, we
>     don't have the for loop.
>     - This will cause recursions into direct reclaim from zswap_cpu_comp_prepare()
>        to fail, cpuhotplug to fail. However, there is no deadlock.
>         - zswap_compress() will need to detect NULL returned by
>           acomp_ctx_get_cpu_lock(), and return an error.
>         - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling
>           acomp_ctx_get_cpu_lock().
>     - We won't be migrated to a different CPU because we hold the mutex, hence
>       zswap_cpu_comp_dead() will wait on the mutex.
> 
> 2) More intrusive: We would need to use a gfp_t that prevents direct reclaim
>     and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h,
>     but for non-THP allocations. If we decide to adopt this approach, we would
>     need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c
>     to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this
>     new gfp_t, in lieu of GFP_KERNEL.
> 
> > 
> > >
> > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare()
> > > does not acquire the mutex lock while allocating resources. We have
> > already
> > > initialized the mutex, so in theory, it is possible for compress/decompress
> > > to acquire the mutex lock. The __online acts as a way to indicate whether
> > > compress/decompress can proceed reliably to use the resources.
> > 
> > For compress/decompress to acquire the mutex they need to run on that
> > CPU, and I don't think that's possible before onlining completes, so
> > zswap_cpu_comp_prepare() must have already completed before
> > compress/decompress can use that CPU IIUC.
> 
> If we can make this assumption, that would be great! However, I am not
> totally sure because of the GFP_KERNEL allocations in
> zswap_cpu_comp_prepare().

As I mentioned above, when zswap_cpu_comp_prepare() is run we are in one
of two situations:
- The pool is under creation, so we cannot race with stores/loads from
  that same pool.
- The CPU is being onlined, in which case zswap_cpu_comp_prepare() is
  called from a control CPU before tasks start running on the CPU being
  onlined.

Please correct me if I am wrong.

[..]
> > > > > @@ -285,13 +403,21 @@ static struct zswap_pool
> > > > *zswap_pool_create(char *type, char *compressor)
> > > > >  		goto error;
> > > > >  	}
> > > > >
> > > > > -	for_each_possible_cpu(cpu)
> > > > > -		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > > > >acomp_ctx, cpu);
> > > > > +
> > > > > +		acomp_ctx->acomp = NULL;
> > > > > +		acomp_ctx->req = NULL;
> > > > > +		acomp_ctx->buffer = NULL;
> > > > > +		acomp_ctx->__online = false;
> > > > > +		acomp_ctx->nr_reqs = 0;
> > > >
> > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them
> > > > right away?
> > >
> > > Yes, I figured this is needed for two reasons:
> > >
> > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into
> > >     zswap_cpu_comp_dealloc() to be handled by the common procedure
> > >     "acomp_ctx_dealloc()" unambiguously.
> > 
> > This makes sense. When you move the refactoring to create
> > acomp_ctx_dealloc() to a separate patch, please include this change in
> > it and call it out explicitly in the commit message.
> 
> Sure.
> 
> > 
> > > 2) The second scenario I thought of that would need this, is let's say
> > >      the zswap compressor is switched immediately after setting the
> > >      compressor. Some cores have executed the onlining code and
> > >      some haven't. Because there are no pool refs held,
> > >      zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured
> > >      it would help to initialize these acomp_ctx members before the
> > >      hand-off to "cpuhp_state_add_instance()" in zswap_pool_create().
> > 
> > I believe cpuhp_state_add_instance() calls the onlining function
> > synchronously on all present CPUs, so I don't think it's possible to end
> > up in a state where the pool is being destroyed and some CPU executed
> > zswap_cpu_comp_prepare() while others haven't.
> 
> I looked at the cpuhotplug code some more. The startup callback is
> invoked sequentially for_each_present_cpu(). If an error occurs for any
> one of them, it calls the teardown callback only on the earlier cores that
> have already finished running the startup callback. However, 
> zswap_cpu_comp_dealloc() will be called for all cores, even the ones
> for which the startup callback was not run. Hence, I believe the
> zero initialization is useful, albeit using alloc_percpu_gfp(__GFP_ZERO)
> to allocate the acomp_ctx.

Yeah this is point (1) above IIUC, and I agree about zero initialization
for that.

> 
> > 
> > That being said, this made me think of a different problem. If pool
> > destruction races with CPU onlining, there could be a race between
> > zswap_cpu_comp_prepare() allocating resources and
> > zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them.
> > 
> > I believe we must always call cpuhp_state_remove_instance() *before*
> > freeing the resources to prevent this race from happening. This needs to
> > be documented with a comment.
> 
> Yes, this race condition is possible, thanks for catching this! The problem with
> calling cpuhp_state_remove_instance() before freeing the resources is that
> cpuhp_state_add_instance() and cpuhp_state_remove_instance() both
> acquire a "mutex_lock(&cpuhp_state_mutex);" at the beginning; and hence
> are serialized.
> 
> For the reasons motivating why acomp_ctx->__online is set to false in
> zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance()
> before calling acomp_ctx_dealloc() because the latter could wait until
> acomp_ctx->__online to be true before deleting the resources. I will
> think about this some more.

I believe this problem goes away with acomp_ctx->__online going away,
right?

> 
> Another possibility is to not rely on cpuhotplug in zswap, and instead
> manage the per-cpu acomp_ctx resource allocation entirely in
> zswap_pool_create(), and deletion entirely in zswap_pool_destroy(),
> along with the necessary error handling. Let me think about this some
> more as well.
> 
> > 
> > Let me know if I missed something.
> > 
> > >
> > > Please let me know if these are valid considerations.
> > >
> > > >
> > > > If it is in fact needed we should probably just use __GFP_ZERO.
> > >
> > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of
> > "alloc_percpu()"
> > > for the acomp_ctx?
> > 
> > Yeah if we need to initialize all/most fields to 0 let's use
> > alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO.
> 
> Sounds good.
> 
> > 
> > [..]
> > > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx
> > > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> > > > >
> > > > >  	for (;;) {
> > > > >  		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > > > > -		mutex_lock(&acomp_ctx->mutex);
> > > > > -		if (likely(acomp_ctx->req))
> > > > > -			return acomp_ctx;
> > > > >  		/*
> > > > > -		 * It is possible that we were migrated to a different CPU
> > > > after
> > > > > -		 * getting the per-CPU ctx but before the mutex was
> > > > acquired. If
> > > > > -		 * the old CPU got offlined, zswap_cpu_comp_dead() could
> > > > have
> > > > > -		 * already freed ctx->req (among other things) and set it to
> > > > > -		 * NULL. Just try again on the new CPU that we ended up on.
> > > > > +		 * If the CPU onlining code successfully allocates acomp_ctx
> > > > resources,
> > > > > +		 * it sets acomp_ctx->__online to true. Until this happens, we
> > > > have
> > > > > +		 * two options:
> > > > > +		 *
> > > > > +		 * 1. Return NULL and fail all stores on this CPU.
> > > > > +		 * 2. Retry, until onlining has finished allocating resources.
> > > > > +		 *
> > > > > +		 * In theory, option 1 could be more appropriate, because it
> > > > > +		 * allows the calling procedure to decide how it wants to
> > > > handle
> > > > > +		 * reclaim racing with CPU hotplug. For instance, it might be
> > > > Ok
> > > > > +		 * for compress to return an error for the backing swap device
> > > > > +		 * to store the folio. Decompress could wait until we get a
> > > > > +		 * valid and locked mutex after onlining has completed. For
> > > > now,
> > > > > +		 * we go with option 2 because adding a do-while in
> > > > > +		 * zswap_decompress() adds latency for software
> > > > compressors.
> > > > > +		 *
> > > > > +		 * Once initialized, the resources will be de-allocated only
> > > > > +		 * when the pool is destroyed. The acomp_ctx will hold on to
> > > > the
> > > > > +		 * resources through CPU offlining/onlining at any time until
> > > > > +		 * the pool is destroyed.
> > > > > +		 *
> > > > > +		 * This prevents races/deadlocks between reclaim and CPU
> > > > acomp_ctx
> > > > > +		 * resource allocation that are a dependency for reclaim.
> > > > > +		 * It further simplifies the interaction with CPU onlining and
> > > > > +		 * offlining:
> > > > > +		 *
> > > > > +		 * - CPU onlining does not take the mutex. It only allocates
> > > > > +		 *   resources and sets __online to true.
> > > > > +		 * - CPU offlining acquires the mutex before setting
> > > > > +		 *   __online to false. If reclaim has acquired the mutex,
> > > > > +		 *   offlining will have to wait for reclaim to complete before
> > > > > +		 *   hotunplug can proceed. Further, hotplug merely sets
> > > > > +		 *   __online to false. It does not delete the acomp_ctx
> > > > > +		 *   resources.
> > > > > +		 *
> > > > > +		 * Option 1 is better than potentially not exiting the earlier
> > > > > +		 * for (;;) loop because the system is running low on memory
> > > > > +		 * and/or CPUs are getting offlined for whatever reason. At
> > > > > +		 * least failing this store will prevent data loss by failing
> > > > > +		 * zswap_store(), and saving the data in the backing swap
> > > > device.
> > > > >  		 */
> > > >
> > > > I believe we can dropped. I don't think we can have any store/load
> > > > operations on a CPU before it's fully onlined, and we should always have
> > > > a reference on the pool here, so the resources cannot go away.
> > > >
> > > > So unless I missed something we can drop this completely now and just
> > > > hold the mutex directly in the load/store paths.
> > >
> > > Based on the above explanations, please let me know if it is a good idea
> > > to keep the __online, or if you think further simplification is possible.
> > 
> > I still think it's not needed. Let me know if I missed anything.
> 
> Let me think some more about whether it is feasible to not have cpuhotplug
> manage the acomp_ctx resource allocation, and instead have this be done
> through the pool creation/deletion routines.

I don't think this is necessary, see my comments above.

next prev parent reply	other threads:[~2025-03-10 17:31 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-03  8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 01/14] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 02/14] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers Kanchana P Sridhar
2025-03-03  8:47   ` Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 03/14] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 04/14] crypto: iaa - Implement batch compression/decompression with request chaining Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 05/14] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 06/14] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 07/14] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 08/14] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 09/14] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 10/14] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 11/14] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks Kanchana P Sridhar
2025-03-03  8:47 ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Kanchana P Sridhar
2025-03-06 19:35   ` Yosry Ahmed
2025-03-06 19:46     ` Yosry Ahmed
2025-03-07  0:01     ` Sridhar, Kanchana P
2025-03-07 19:30       ` Yosry Ahmed
2025-03-08  2:47         ` Sridhar, Kanchana P
2025-03-10 17:31           ` Yosry Ahmed [this message]
2025-03-17 21:15             ` Sridhar, Kanchana P
2025-03-17 23:25               ` Herbert Xu
2025-03-18 14:23               ` Yosry Ahmed
2025-03-18 17:38                 ` Sridhar, Kanchana P
2025-03-18 19:05                   ` Yosry Ahmed
2025-03-18 21:09                     ` Sridhar, Kanchana P
2025-03-18 23:14                       ` Yosry Ahmed
2025-03-18 23:37                         ` Sridhar, Kanchana P
2025-04-20 21:01                           ` Andrew Morton
2025-04-20 23:35                             ` Nhat Pham
2025-04-21  3:31                               ` [PATCH] crypto: scomp - Fix off-by-one bug when calculating last page Herbert Xu
2025-04-21 19:02                                 ` Marius Dinu
2025-03-06 19:42   ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Yosry Ahmed
2025-03-03  8:47 ` [PATCH v8 13/14] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
2025-03-06 20:00   ` Yosry Ahmed
2025-04-30 21:15     ` Sridhar, Kanchana P
2025-03-03  8:47 ` [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios Kanchana P Sridhar
2025-03-03 11:07   ` kernel test robot
2025-03-03 18:21     ` Nhat Pham
2025-03-03 21:34       ` Sridhar, Kanchana P
2025-03-06 21:20         ` Yosry Ahmed
2025-04-30 21:07           ` Sridhar, Kanchana P
2025-03-03 11:07   ` kernel test robot
2025-03-06 21:17   ` Yosry Ahmed
2025-04-30 21:05     ` Sridhar, Kanchana P

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z88h1mPkYNM6yOGE@google.com \
    --to=yosry.ahmed@linux.dev \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=ardb@kernel.org \
    --cc=chengming.zhou@linux.dev \
    --cc=clabbe@baylibre.com \
    --cc=davem@davemloft.net \
    --cc=ebiggers@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=herbert@gondor.apana.org.au \
    --cc=kanchana.p.sridhar@intel.com \
    --cc=kristen.c.accardi@intel.com \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=ryan.roberts@arm.com \
    --cc=surenb@google.com \
    --cc=usamaarif642@gmail.com \
    --cc=vinodh.gopal@intel.com \
    --cc=wajdi.k.feghali@intel.com \
    --cc=ying.huang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.