Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release

From: Yosry Ahmed <yosry@kernel.org>
To: Wenchao Hao <haowenchao22@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 Barry Song <21cnbao@gmail.com>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	 Jens Axboe <axboe@kernel.dk>,
	Johannes Weiner <hannes@cmpxchg.org>,
	 linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org,  Minchan Kim <minchan@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	 Sergey Senozhatsky <senozhatsky@chromium.org>,
	Wenchao Hao <haowenchao@xiaomi.com>
Subject: Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
Date: Tue, 12 May 2026 00:01:18 +0000
Message-ID: <agJslV2eNj9FLFqI@google.com>
In-Reply-To: <CAOptpSPY3YL5VFJW9KKP99Yb17+_rdXKsKj93FdEn3_Zb350ow@mail.gmail.com>

On Sat, May 09, 2026 at 04:32:04PM +0800, Wenchao Hao wrote:
> On Sat, May 9, 2026 at 4:13 AM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > Swap freeing can be expensive when unmapping a VMA containing many swap
> > > entries. This has been reported to significantly delay memory reclamation
> > > during Android's low-memory killing, especially when multiple processes
> > > are terminated to free memory, with slot_free() accounting for more than
> > > 80% of the total cost of freeing swap entries.
> > >
> > > This series introduces a callback-based deferred free framework in
> > > zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> > > define what gets buffered and how it gets drained. The entire free
> > > path including caller-side bookkeeping (slot_free, zswap_entry_free)
> > > is deferred to a background worker.
> >
> > How much of the speedup comes from avoiding the per-class lock,
> > free_zspage(), other work in zswap, etc.?
> 
> This series doesn't avoid the per-class lock. The pool->lock part
> has been split out and posted as a separate series, so this series
> focuses purely on the defer scheme:
> 
> https://lore.kernel.org/linux-mm/20260508061910.3882831-1-haowenchao@xiaomi.com/
> 
> >
> > I ask because I think the design here is still fairly complex. I don't
> > like how zswap and zram are registering callbacks into zsmalloc to do
> > their own freeing work, and they fill the buffers on behalf of
> > zsmalloc which seems like a layering violation.
> 
> The callback design was motivated by code reuse -- deferring only
> zs_free() inside zsmalloc gave less speedup, and the machinery
> needed to defer caller-side bookkeeping turns out to be the same
> on both sides (per-cpu page buffer, drain worker, fallback). So I
> folded the common parts into zsmalloc.
> 
> I agree it's not clean from a layering standpoint, and I'm happy to
> revisit if the reuse isn't worth the cost.
> 
> >
> > I wonder how much of the speedup we get by just deferring
> > free_zspage()?
> 
> Below is the perf breakdown, sampled only during munmap() of a
> 256MB zram-filled VMA on a Raspberry Pi 4B.
> 
> Base kernel:
> 
>   # Samples: 491  of event 'cycles'
>   # Event count (approx.): 214056923
>   #
>   # Children      Self  Symbol
>   # ........  ........  ..........................................
>       99.55%     0.41%  [k] __zap_vma_range
>       97.27%     2.91%  [k] swap_put_entries_cluster
>       94.37%     1.65%  [k] __swap_cluster_free_entries
>       88.99%     8.91%  [k] zram_slot_free_notify
>       79.87%    10.78%  [k] slot_free
>       56.27%     5.99%  [k] zs_free
>       47.61%     4.35%  [k] free_zspage

Seems like most of the zsmalloc overhead comes from free_zspage(),
right? I think we can significantly simplify things if we only defer that
part. Instead of having a page pool and buffers where we store the
handles for async free, we can just remove the zspage from the
fullness list and put it on a deferred freeing list.

We can probably even explore skipping per-CPU lists at first and just
using a single global worker with a single lockless list (llist); the
worker can then do llist_del_all() to atomically empty the list and
process it locally. If that turns out to be expensive we can do per-CPU
lists.

WDYT? I think this can simplify things significantly.
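
To make the suggestion above concrete, here is a minimal userspace sketch
of the pattern, assuming C11 atomics as a stand-in for the kernel's llist.
It is not zsmalloc code: struct zspage_model, defer_free() and
drain_deferred() are hypothetical names for illustration. The push mirrors
what llist_add() does (a cmpxchg loop on the head) and the drain mirrors
llist_del_all() (a single exchange that detaches the whole list).

```c
/* Userspace model of a lockless deferred-free list (NOT kernel code). */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a zspage; only the list linkage matters. */
struct zspage_model {
	struct zspage_model *next;	/* linkage while on the deferred list */
	int id;
};

/* Single global list head, playing the role of a struct llist_head. */
static _Atomic(struct zspage_model *) deferred_head = NULL;

/*
 * Producer side: push one node without taking a lock.
 * Returns 1 if the list was empty before the push, else 0.
 */
static int defer_free(struct zspage_model *zp)
{
	struct zspage_model *first;

	assert(zp != NULL);
	first = atomic_load_explicit(&deferred_head, memory_order_relaxed);
	do {
		/* On CAS failure 'first' is reloaded, so relink and retry. */
		zp->next = first;
	} while (!atomic_compare_exchange_weak_explicit(&deferred_head,
			&first, zp,
			memory_order_release, memory_order_relaxed));
	return first == NULL;
}

/*
 * Worker side: detach the whole list in one atomic exchange, then
 * free the nodes locally. Returns the number of nodes freed.
 */
static int drain_deferred(void)
{
	struct zspage_model *zp = atomic_exchange_explicit(&deferred_head,
			(struct zspage_model *)NULL, memory_order_acquire);
	int freed = 0;

	while (zp) {
		struct zspage_model *next = zp->next;

		free(zp);	/* the expensive free would happen here */
		freed++;
		zp = next;
	}
	return freed;
}
```

In the kernel, the "list was empty" return value of llist_add() is commonly
used to decide whether the worker needs to be scheduled; the sketch returns
the same information from defer_free() but leaves scheduling out.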

>       36.85%     4.96%  [k] __free_zspage
>       19.27%     0.21%  [k] __folio_put
>       12.64%     2.91%  [k] __free_frozen_pages
>        9.50%     6.40%  [k] kmem_cache_free
>        8.28%     8.28%  [k] _raw_spin_unlock_irqrestore
>        6.83%     1.85%  [k] dec_zone_page_state
>        5.18%     5.18%  [k] _raw_spin_unlock
>        5.18%     5.18%  [k] folio_unlock
>        4.98%     4.98%  [k] mod_zone_state
>        4.12%     4.12%  [k] _raw_spin_lock
>        3.30%     3.30%  [k] __swap_cgroup_id_xchg
> 
> Perf of the zsmalloc-only variant (same 256MB zram workload):
> 
> My first attempt for this RFC was exactly that -- defer only the
> handle free inside zsmalloc, keep zram/zswap caller-side bookkeeping
> synchronous. (I will post this version after this thread.)
[..]

Thread overview: 13+ messages
2026-05-08  6:07 [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops Wenchao Hao
2026-05-09  0:29   ` Nhat Pham
2026-05-09  8:47     ` Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 2/4] mm/zswap: use zsmalloc deferred free callback for async invalidate Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 3/4] zram: use zsmalloc deferred free callback for async slot free Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 4/4] zram: batch clear flags in slot_free with single write Wenchao Hao
2026-05-08 20:12 ` [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Yosry Ahmed
2026-05-09  8:32   ` Wenchao Hao
2026-05-09  8:38     ` Wenchao Hao
2026-05-12  0:01     ` Yosry Ahmed [this message]
2026-05-09  0:08 ` Nhat Pham
2026-05-09  8:45   ` Wenchao Hao
