From: Yosry Ahmed <yosry@kernel.org>
To: Wenchao Hao <haowenchao22@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Barry Song <21cnbao@gmail.com>,
Chengming Zhou <chengming.zhou@linux.dev>,
Jens Axboe <axboe@kernel.dk>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, Minchan Kim <minchan@kernel.org>,
Nhat Pham <nphamcs@gmail.com>,
Sergey Senozhatsky <senozhatsky@chromium.org>,
Wenchao Hao <haowenchao@xiaomi.com>
Subject: Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
Date: Tue, 12 May 2026 00:01:18 +0000
Message-ID: <agJslV2eNj9FLFqI@google.com>
In-Reply-To: <CAOptpSPY3YL5VFJW9KKP99Yb17+_rdXKsKj93FdEn3_Zb350ow@mail.gmail.com>
On Sat, May 09, 2026 at 04:32:04PM +0800, Wenchao Hao wrote:
> On Sat, May 9, 2026 at 4:13 AM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > Swap freeing can be expensive when unmapping a VMA containing many swap
> > > entries. This has been reported to significantly delay memory reclamation
> > > during Android's low-memory killing, especially when multiple processes
> > > are terminated to free memory, with slot_free() accounting for more than
> > > 80% of the total cost of freeing swap entries.
> > >
> > > This series introduces a callback-based deferred free framework in
> > > zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> > > define what gets buffered and how it gets drained. The entire free
> > > path including caller-side bookkeeping (slot_free, zswap_entry_free)
> > > is deferred to a background worker.
> >
> > How much of the speedup comes from avoiding the per-class lock,
> > free_zspage(), other work in zswap, etc.
>
> This series doesn't avoid the per-class lock. The pool->lock part
> has been split out and posted as a separate series, so this series
> focuses purely on the defer scheme:
>
> https://lore.kernel.org/linux-mm/20260508061910.3882831-1-haowenchao@xiaomi.com/
>
> >
> > I ask because I think the design here is still fairly complex. I don't
> > like how zswap and zram are registering callbacks into zsmalloc to do
> > their own freeing work, and they fill the buffers on behalf of
> > zsmalloc which seems like a layering violation.
>
> The callback design was motivated by code reuse -- deferring only
> zs_free() inside zsmalloc gave less speedup, and the machinery
> needed to defer caller-side bookkeeping turns out to be the same
> on both sides (per-cpu page buffer, drain worker, fallback). So I
> folded the common parts into zsmalloc.
>
> I agree it's not clean from a layering standpoint, and I'm happy to
> revisit if the reuse isn't worth the cost.
>
> >
> > I wonder how much of the speedup we get by just deferring
> > free_zspage()?
>
> Below is the perf breakdown, sampled only during munmap() of a
> 256MB zram-filled VMA on a Raspberry Pi 4B.
>
> Base kernel:
>
> # Samples: 491 of event 'cycles'
> # Event count (approx.): 214056923
> #
> # Children Self Symbol
> # ........ ........ ..........................................
> 99.55% 0.41% [k] __zap_vma_range
> 97.27% 2.91% [k] swap_put_entries_cluster
> 94.37% 1.65% [k] __swap_cluster_free_entries
> 88.99% 8.91% [k] zram_slot_free_notify
> 79.87% 10.78% [k] slot_free
> 56.27% 5.99% [k] zs_free
> 47.61% 4.35% [k] free_zspage
Seems like most of the zsmalloc overhead comes from free_zspage(),
right? I think we can significantly simplify things if we only defer
that part. Instead of having a page pool and buffers where we store
the handles for async free, we can just remove the zspage from the
fullness list and put it on a deferred freeing list.

We can probably even explore not doing per-CPU and just using a single
global worker with a single lockless list (llist); the worker can then
do llist_del_all() to atomically empty the list and process it
locally. If that turns out to be expensive we can do per-CPU lists.

WDYT? I think this can simplify things significantly.
> 36.85% 4.96% [k] __free_zspage
> 19.27% 0.21% [k] __folio_put
> 12.64% 2.91% [k] __free_frozen_pages
> 9.50% 6.40% [k] kmem_cache_free
> 8.28% 8.28% [k] _raw_spin_unlock_irqrestore
> 6.83% 1.85% [k] dec_zone_page_state
> 5.18% 5.18% [k] _raw_spin_unlock
> 5.18% 5.18% [k] folio_unlock
> 4.98% 4.98% [k] mod_zone_state
> 4.12% 4.12% [k] _raw_spin_lock
> 3.30% 3.30% [k] __swap_cgroup_id_xchg
>
> My first attempt for this RFC was exactly that -- defer only the
> handle free inside zsmalloc, keeping zram/zswap caller-side
> bookkeeping synchronous. (I will post that version after this thread.)
>
> Perf of the zsmalloc-only variant (same 256MB zram workload):
[..]