From: Yosry Ahmed <yosry@kernel.org>
To: Wenchao Hao <haowenchao22@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Barry Song <21cnbao@gmail.com>,
Chengming Zhou <chengming.zhou@linux.dev>,
Jens Axboe <axboe@kernel.dk>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, Minchan Kim <minchan@kernel.org>,
Nhat Pham <nphamcs@gmail.com>,
Sergey Senozhatsky <senozhatsky@chromium.org>,
Wenchao Hao <haowenchao@xiaomi.com>
Subject: Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
Date: Tue, 12 May 2026 00:01:18 +0000
Message-ID: <agJslV2eNj9FLFqI@google.com>
In-Reply-To: <CAOptpSPY3YL5VFJW9KKP99Yb17+_rdXKsKj93FdEn3_Zb350ow@mail.gmail.com>
On Sat, May 09, 2026 at 04:32:04PM +0800, Wenchao Hao wrote:
> On Sat, May 9, 2026 at 4:13 AM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > Swap freeing can be expensive when unmapping a VMA containing many swap
> > > entries. This has been reported to significantly delay memory reclamation
> > > during Android's low-memory killing, especially when multiple processes
> > > are terminated to free memory, with slot_free() accounting for more than
> > > 80% of the total cost of freeing swap entries.
> > >
> > > This series introduces a callback-based deferred free framework in
> > > zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> > > define what gets buffered and how it gets drained. The entire free
> > > path including caller-side bookkeeping (slot_free, zswap_entry_free)
> > > is deferred to a background worker.
> >
> > How much of the speedup comes from avoiding the per-class lock,
> > free_zspage(), other work in zswap, etc.
>
> This series doesn't avoid the per-class lock. The pool->lock part
> has been split out and posted as a separate series, so this series
> focuses purely on the defer scheme:
>
> https://lore.kernel.org/linux-mm/20260508061910.3882831-1-haowenchao@xiaomi.com/
>
> >
> > I ask because I think the design here is still fairly complex. I don't
> > like how zswap and zram are registering callbacks into zsmalloc to do
> > their own freeing work, and they fill the buffers on behalf of
> > zsmalloc which seems like a layering violation.
>
> The callback design was motivated by code reuse -- deferring only
> zs_free() inside zsmalloc gave less speedup, and the machinery
> needed to defer caller-side bookkeeping turns out to be the same
> on both sides (per-cpu page buffer, drain worker, fallback). So I
> folded the common parts into zsmalloc.
>
> I agree it's not clean from a layering standpoint, and I'm happy to
> revisit if the reuse isn't worth the cost.
>
> >
> > I wonder how much of the speedup we get by just deferring
> > free_zspage()?
>
> Below is the perf breakdown, sampled only during munmap() of a
> 256MB zram-filled VMA on a Raspberry Pi 4B.
>
> Base kernel:
>
> # Samples: 491 of event 'cycles'
> # Event count (approx.): 214056923
> #
> # Children Self Symbol
> # ........ ........ ..........................................
> 99.55% 0.41% [k] __zap_vma_range
> 97.27% 2.91% [k] swap_put_entries_cluster
> 94.37% 1.65% [k] __swap_cluster_free_entries
> 88.99% 8.91% [k] zram_slot_free_notify
> 79.87% 10.78% [k] slot_free
> 56.27% 5.99% [k] zs_free
> 47.61% 4.35% [k] free_zspage
Seems like most of the zsmalloc overhead comes from free_zspage(),
right? I think we can significantly simplify things if we only defer
that part. Instead of having a page pool and buffers where we store
the handles for async free, we can just remove the zspage from the
fullness list and put it on a deferred freeing list.

We can probably even explore not doing per-CPU and just using a single
global worker with a single lockless list (llist); the worker can then
do llist_del_all() to atomically empty the list and process it
locally. If that turns out to be expensive we can do per-CPU lists.

WDYT? I think this can simplify things significantly.
> 36.85% 4.96% [k] __free_zspage
> 19.27% 0.21% [k] __folio_put
> 12.64% 2.91% [k] __free_frozen_pages
> 9.50% 6.40% [k] kmem_cache_free
> 8.28% 8.28% [k] _raw_spin_unlock_irqrestore
> 6.83% 1.85% [k] dec_zone_page_state
> 5.18% 5.18% [k] _raw_spin_unlock
> 5.18% 5.18% [k] folio_unlock
> 4.98% 4.98% [k] mod_zone_state
> 4.12% 4.12% [k] _raw_spin_lock
> 3.30% 3.30% [k] __swap_cgroup_id_xchg
>
> My first attempt for this RFC was exactly that -- defer only the
> handle free inside zsmalloc, keeping zram/zswap caller-side
> bookkeeping synchronous. (I will post that version after this thread.)
>
> Perf of the zsmalloc-only variant (same 256MB zram workload):
[..]