Linux-mm Archive on lore.kernel.org
From: Wenchao Hao <haowenchao22@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	Barry Song <21cnbao@gmail.com>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	Jens Axboe <axboe@kernel.dk>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Minchan Kim <minchan@kernel.org>,
	Nhat Pham <nphamcs@gmail.com>,
	Sergey Senozhatsky <senozhatsky@chromium.org>,
	Yosry Ahmed <yosry@kernel.org>
Cc: Wenchao Hao <haowenchao22@gmail.com>,
	Wenchao Hao <haowenchao@xiaomi.com>
Subject: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
Date: Fri,  8 May 2026 14:07:20 +0800	[thread overview]
Message-ID: <20260508060724.3810904-1-haowenchao@xiaomi.com> (raw)

Swap freeing can be expensive when unmapping a VMA containing many swap
entries. This has been reported to significantly delay memory reclamation
during Android's low-memory killing, especially when multiple processes
are terminated to free memory, with slot_free() accounting for more than
80% of the total cost of freeing swap entries.

Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
to asynchronously collect and free swap entries [1][2], but both designs
were fairly complex.

When anon folios and swap entries are mixed within a process, reclaiming
anon folios from killed processes helps return memory to the system as
quickly as possible, so that newly launched applications can satisfy
their memory demands. It is not ideal for swap freeing to block anon
folio freeing. On the other hand, swap freeing can still return memory
to the system, although at a slower rate due to memory compression.

This series introduces a callback-based deferred free framework in
zsmalloc. Callers (zram, zswap) register push/drain callbacks to
define what gets buffered and how it gets drained. The entire free
path including caller-side bookkeeping (slot_free, zswap_entry_free)
is deferred to a background worker.

Implementation:
  - Each CPU owns a single-page buffer. The hot path writes a value
    via the push callback with preemption disabled (no locks).
  - When the buffer fills, it is swapped with a fresh page from a
    pre-allocated page pool. The full page is queued to a WQ_UNBOUND
    worker for drain.
  - The drain callback performs the actual expensive work (zs_free,
    slot_free, zswap_entry_free, etc.) in batch, off the hot path.
  - If no free page is available, the caller falls back to synchronous
    processing.

The speedup comes from moving expensive swap slot freeing off the
munmap hot path into a background worker, so that intact anonymous
folios are released back to the system without blocking. The worker
drains at a slower rate since compressed objects are small and freeing
a single handle may not release an entire page until the zspage is
fully empty.

Performance results (Raspberry Pi 4B, ARM64, 8GB RAM):

Test 1: munmap latency for 256MB swap-filled VMA (zram backend)

  mode        Base       Patched     Speedup
  single      61.82ms    8.62ms      7.17x
  multi 2p    94.75ms    54.11ms     1.75x
  multi 3p    154.64ms   104.83ms    1.48x

Test 2: munmap latency for different sizes (zram, single process)

  Size       Base         Patched     Speedup
  64MB       14.11ms      2.18ms      6.47x
  128MB      29.45ms      4.48ms      6.57x
  192MB      43.85ms      6.62ms      6.62x
  256MB      57.01ms      9.08ms      6.28x
  512MB      115.13ms     55.58ms     2.07x
  1024MB     229.66ms     153.28ms    1.50x

Test 3: munmap latency for 256MB swap-filled VMA (zswap backend)

  mode        Base       Patched     Speedup
  single      152.14ms   51.26ms     2.97x
  multi 2p    186.56ms   105.42ms    1.77x
  multi 3p    205.83ms   153.32ms    1.34x

Test 4: munmap latency for different sizes (zswap, single process)

  Size       Base         Patched     Speedup
  64MB       37.83ms      13.26ms     2.85x
  128MB      75.11ms      26.73ms     2.81x
  256MB      150.78ms     52.97ms     2.85x
  512MB      303.04ms     130.38ms    2.32x
  1024MB     599.95ms     287.10ms    2.09x

[1] https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/
[2] https://lore.kernel.org/all/20250909065349.574894-1-liulei.rjpt@vivo.com/
[3] https://lore.kernel.org/linux-mm/20260412060450.15813-1-baohua@kernel.org/

Changes since v2:
- Use per-cpu single-page buffers instead of a global list; the hot
  path only writes into the local CPU's buffer with preemption disabled
- Add a page pool for buffer rotation: when the current buffer is full,
  swap it with a free page from the pool and queue the full page for
  drain
- Introduce push/drain callback ops so that zram and zswap can each
  define their own element size and drain logic (zram stores u32 slot
  indices, zswap stores unsigned long handles)
- Drop the lock optimization patches; they will be submitted separately
  as part of a dedicated zsmalloc lock contention series
- Link to v2: https://lore.kernel.org/all/20260421121616.3298845-1-haowenchao@xiaomi.com/

Barry Song (1):
  zram: use zsmalloc deferred free callback for async slot free

Wenchao Hao (3):
  mm/zsmalloc: introduce deferred free framework with callback ops
  mm/zswap: use zsmalloc deferred free callback for async invalidate
  zram: batch clear flags in slot_free with single write

 drivers/block/zram/zram_drv.c |  44 ++++++-
 drivers/block/zram/zram_drv.h |   6 +
 include/linux/zsmalloc.h      |  16 +++
 mm/zsmalloc.c                 | 208 +++++++++++++++++++++++++++++++++-
 mm/zswap.c                    |  38 ++++++-
 5 files changed, 306 insertions(+), 6 deletions(-)

--
2.34.1




Thread overview: 12+ messages
2026-05-08  6:07 Wenchao Hao [this message]
2026-05-08  6:07 ` [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops Wenchao Hao
2026-05-09  0:29   ` Nhat Pham
2026-05-09  8:47     ` Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 2/4] mm/zswap: use zsmalloc deferred free callback for async invalidate Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 3/4] zram: use zsmalloc deferred free callback for async slot free Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 4/4] zram: batch clear flags in slot_free with single write Wenchao Hao
2026-05-08 20:12 ` [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Yosry Ahmed
2026-05-09  8:32   ` Wenchao Hao
2026-05-09  8:38     ` Wenchao Hao
2026-05-09  0:08 ` Nhat Pham
2026-05-09  8:45   ` Wenchao Hao
