Linux-mm Archive on lore.kernel.org
* [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
@ 2026-05-08  6:07 Wenchao Hao
  2026-05-08  6:07 ` [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops Wenchao Hao
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Wenchao Hao @ 2026-05-08  6:07 UTC (permalink / raw)
  To: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Nhat Pham, Sergey Senozhatsky, Yosry Ahmed
  Cc: Wenchao Hao, Wenchao Hao

Swap freeing can be expensive when unmapping a VMA containing many swap
entries. This has been reported to significantly delay memory reclamation
during Android's low-memory killing, especially when multiple processes
are terminated to free memory, with slot_free() accounting for more than
80% of the total cost of freeing swap entries.

Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
to asynchronously collect and free swap entries [1][2], but the design
itself is fairly complex.

When anon folios and swap entries are mixed within a process, reclaiming
anon folios from killed processes helps return memory to the system as
quickly as possible, so that newly launched applications can satisfy
their memory demands. It is not ideal for swap freeing to block anon
folio freeing. On the other hand, swap freeing can still return memory
to the system, although at a slower rate due to memory compression.

This series introduces a callback-based deferred free framework in
zsmalloc. Callers (zram, zswap) register push/drain callbacks to
define what gets buffered and how it gets drained. The entire free
path including caller-side bookkeeping (slot_free, zswap_entry_free)
is deferred to a background worker.

Implementation:
  - Each CPU owns a single-page buffer. The hot path writes a value
    via the push callback with preemption disabled (no locks).
  - When the buffer fills, it is swapped with a fresh page from a
    pre-allocated page pool. The full page is queued to a WQ_UNBOUND
    worker for drain.
  - The drain callback performs the actual expensive work (zs_free,
    slot_free, zswap_entry_free, etc.) in batch, off the hot path.
  - If no free page is available, the caller falls back to synchronous
    processing.
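
For reference, the caller-side wiring (condensed from patches 2 and 3;
"my_ops" and "ctx" here are placeholders for the caller's ops struct
and private data):

	/* setup: register push/drain callbacks once per pool */
	zs_pool_enable_deferred_free(pool, &my_ops, ctx);

	/* hot path: buffer the value; zs_free_deferred() returns false
	 * when deferral is unavailable and the caller must free inline */
	if (!zs_free_deferred(pool, value))
		free_synchronously(ctx, value);	/* caller-defined fallback */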

The speedup comes from moving expensive swap slot freeing off the
munmap hot path into a background worker, so that intact anonymous
folios are released back to the system without blocking. The worker
drains at a slower rate since compressed objects are small and freeing
a single handle may not release an entire page until the zspage is
fully empty.

Performance results (Raspberry Pi 4B, ARM64, 8GB RAM):

Test 1: munmap latency for 256MB swap-filled VMA (zram backend)

  mode        Base       Patched     Speedup
  single      61.82ms    8.62ms      7.17x
  multi 2p    94.75ms    54.11ms     1.75x
  multi 3p    154.64ms   104.83ms    1.48x

Test 2: munmap latency for different sizes (zram, single process)

  Size       Base         Patched     Speedup
  64MB       14.11ms      2.18ms      6.47x
  128MB      29.45ms      4.48ms      6.57x
  192MB      43.85ms      6.62ms      6.62x
  256MB      57.01ms      9.08ms      6.28x
  512MB      115.13ms     55.58ms     2.07x
  1024MB     229.66ms     153.28ms    1.50x

Test 3: munmap latency for 256MB swap-filled VMA (zswap backend)

  mode        Base       Patched     Speedup
  single      152.14ms   51.26ms     2.97x
  multi 2p    186.56ms   105.42ms    1.77x
  multi 3p    205.83ms   153.32ms    1.34x

Test 4: munmap latency for different sizes (zswap, single process)

  Size       Base         Patched     Speedup
  64MB       37.83ms      13.26ms     2.85x
  128MB      75.11ms      26.73ms     2.81x
  256MB      150.78ms     52.97ms     2.85x
  512MB      303.04ms     130.38ms    2.32x
  1024MB     599.95ms     287.10ms    2.09x

[1] https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/
[2] https://lore.kernel.org/all/20250909065349.574894-1-liulei.rjpt@vivo.com/
[3] https://lore.kernel.org/linux-mm/20260412060450.15813-1-baohua@kernel.org/

Changes since v2:
- Use per-cpu single-page buffers instead of a global list; the hot
  path only writes into the local CPU's buffer with preemption disabled
- Add a page pool for buffer rotation: when the current buffer is full,
  swap it with a free page from the pool and queue the full page for
  drain
- Introduce push/drain callback ops so that zram and zswap can each
  define their own element size and drain logic (zram stores u32 slot
  indices, zswap stores unsigned long handles)
- Drop the lock optimization patches; they will be submitted
  separately as part of a dedicated zsmalloc lock contention series
- Link to v2: https://lore.kernel.org/all/20260421121616.3298845-1-haowenchao@xiaomi.com/

Barry Song (1):
  zram: use zsmalloc deferred free callback for async slot free

Wenchao Hao (3):
  mm/zsmalloc: introduce deferred free framework with callback ops
  mm/zswap: use zsmalloc deferred free callback for async invalidate
  zram: batch clear flags in slot_free with single write

 drivers/block/zram/zram_drv.c |  44 ++++++-
 drivers/block/zram/zram_drv.h |   6 +
 include/linux/zsmalloc.h      |  16 +++
 mm/zsmalloc.c                 | 208 +++++++++++++++++++++++++++++++++-
 mm/zswap.c                    |  38 ++++++-
 5 files changed, 306 insertions(+), 6 deletions(-)

--
2.34.1




* [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops
  2026-05-08  6:07 [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Wenchao Hao
@ 2026-05-08  6:07 ` Wenchao Hao
  2026-05-09  0:29   ` Nhat Pham
  2026-05-08  6:07 ` [RFC PATCH v3 2/4] mm/zswap: use zsmalloc deferred free callback for async invalidate Wenchao Hao
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Wenchao Hao @ 2026-05-08  6:07 UTC (permalink / raw)
  To: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Nhat Pham, Sergey Senozhatsky, Yosry Ahmed
  Cc: Wenchao Hao, Wenchao Hao

Add a per-cpu deferred free mechanism to zsmalloc with a callback
interface that lets callers (zram, zswap) customize push and drain
behavior.

Each CPU owns a single-page buffer. The hot path (zs_free_deferred)
writes a value into the current CPU's buffer via the push callback
with preemption disabled — no locks, no atomics. When the buffer
fills, it is swapped with a fresh page from a pre-allocated page
pool and the full page is queued to a WQ_UNBOUND worker for drain.

The drain worker invokes the drain callback which performs the actual
expensive work (zs_free, slot_free, etc.) in batch, away from the
original hot path.

Page pool management:
  - Pool is pre-allocated at enable time (ZS_DEFERRED_POOL_SIZE pages)
  - Full buffers are drained and returned to the pool
  - If no free page is available when buffer is full, the push falls
    back to synchronous processing by the caller

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 include/linux/zsmalloc.h |  16 +++
 mm/zsmalloc.c            | 208 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 223 insertions(+), 1 deletion(-)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 478410c880b1..8d6c675b10dc 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -24,12 +24,28 @@ struct zs_pool_stats {
 struct zs_pool;
 struct scatterlist;
 
+enum zs_push_ret {
+	ZS_PUSH_OK = 0,
+	ZS_PUSH_FULL,
+	ZS_PUSH_FULL_QUEUED,
+};
+
+struct zs_deferred_ops {
+	enum zs_push_ret (*push)(void *buf, unsigned int count,
+					  unsigned long value);
+	void (*drain)(void *private, void *buf, unsigned int count);
+};
+
 struct zs_pool *zs_create_pool(const char *name);
 void zs_destroy_pool(struct zs_pool *pool);
 
 unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags,
 			const int nid);
 void zs_free(struct zs_pool *pool, unsigned long obj);
+int zs_pool_enable_deferred_free(struct zs_pool *pool,
+				 const struct zs_deferred_ops *ops,
+				 void *private);
+bool zs_free_deferred(struct zs_pool *pool, unsigned long value);
 
 size_t zs_huge_class_size(struct zs_pool *pool);
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 63128ddb7959..d8220a8753a7 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -196,6 +196,13 @@ struct link_free {
 static struct kmem_cache *handle_cachep;
 static struct kmem_cache *zspage_cachep;
 
+#define ZS_DEFERRED_POOL_SIZE	(256 * 1024 / PAGE_SIZE)
+
+struct zs_deferred_percpu {
+	unsigned int count;
+	void *buf;
+};
+
 struct zs_pool {
 	const char *name;
 
@@ -217,6 +224,18 @@ struct zs_pool {
 	/* protect zspage migration/compaction */
 	rwlock_t lock;
 	atomic_t compaction_in_progress;
+
+	/* per-cpu deferred free */
+	const struct zs_deferred_ops *deferred_ops;
+	void *deferred_private;
+	struct zs_deferred_percpu __percpu *deferred;
+	struct work_struct deferred_work;
+	struct workqueue_struct *deferred_wq;
+	struct list_head deferred_pool;
+	unsigned int deferred_pool_count;
+	spinlock_t deferred_pool_lock;
+	struct list_head deferred_drain_list;
+	spinlock_t deferred_drain_lock;
 };
 
 static inline void zpdesc_set_first(struct zpdesc *zpdesc)
@@ -1416,6 +1435,171 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 }
 EXPORT_SYMBOL_GPL(zs_free);
 
+static struct page *deferred_pool_get(struct zs_pool *pool)
+{
+	struct page *page = NULL;
+
+	spin_lock(&pool->deferred_pool_lock);
+	if (!list_empty(&pool->deferred_pool)) {
+		page = list_first_entry(&pool->deferred_pool, struct page, lru);
+		list_del(&page->lru);
+		pool->deferred_pool_count--;
+	}
+	spin_unlock(&pool->deferred_pool_lock);
+	return page;
+}
+
+static void deferred_pool_put(struct zs_pool *pool, struct page *page)
+{
+	spin_lock(&pool->deferred_pool_lock);
+	list_add_tail(&page->lru, &pool->deferred_pool);
+	pool->deferred_pool_count++;
+	spin_unlock(&pool->deferred_pool_lock);
+}
+
+static void zs_deferred_work_fn(struct work_struct *work)
+{
+	struct zs_pool *pool = container_of(work, struct zs_pool, deferred_work);
+	struct page *page;
+
+	while (true) {
+		unsigned int count;
+
+		spin_lock(&pool->deferred_drain_lock);
+		if (list_empty(&pool->deferred_drain_list)) {
+			spin_unlock(&pool->deferred_drain_lock);
+			break;
+		}
+		page = list_first_entry(&pool->deferred_drain_list,
+					struct page, lru);
+		list_del(&page->lru);
+		count = page_private(page);
+		spin_unlock(&pool->deferred_drain_lock);
+
+		pool->deferred_ops->drain(pool->deferred_private,
+					  page_address(page), count);
+		deferred_pool_put(pool, page);
+		cond_resched();
+	}
+}
+
+bool zs_free_deferred(struct zs_pool *pool, unsigned long value)
+{
+	struct zs_deferred_percpu *def;
+	struct page *new_page, *full_page;
+	enum zs_push_ret ret;
+
+	if (!pool->deferred)
+		return false;
+
+	def = get_cpu_ptr(pool->deferred);
+
+	ret = pool->deferred_ops->push(def->buf, def->count, value);
+	if (ret == ZS_PUSH_OK) {
+		def->count++;
+		put_cpu_ptr(pool->deferred);
+		return true;
+	}
+
+	if (ret == ZS_PUSH_FULL_QUEUED)
+		def->count++;
+
+	new_page = deferred_pool_get(pool);
+	if (new_page) {
+		full_page = virt_to_page(def->buf);
+		set_page_private(full_page, def->count);
+		def->buf = page_address(new_page);
+		def->count = 0;
+
+		if (ret == ZS_PUSH_FULL) {
+			pool->deferred_ops->push(def->buf, 0, value);
+			def->count = 1;
+		}
+		put_cpu_ptr(pool->deferred);
+
+		spin_lock(&pool->deferred_drain_lock);
+		list_add_tail(&full_page->lru, &pool->deferred_drain_list);
+		spin_unlock(&pool->deferred_drain_lock);
+		queue_work(pool->deferred_wq, &pool->deferred_work);
+		return true;
+	}
+	put_cpu_ptr(pool->deferred);
+
+	/* ret==2: value already queued, will be drained eventually */
+	if (ret == 2)
+		return true;
+
+	/* ret==1: value not queued, caller must fallback */
+	return false;
+}
+EXPORT_SYMBOL_GPL(zs_free_deferred);
+
+int zs_pool_enable_deferred_free(struct zs_pool *pool,
+				 const struct zs_deferred_ops *ops,
+				 void *private)
+{
+	int cpu;
+	unsigned int pg_idx;
+	struct page *page, *tmp;
+
+	pool->deferred_ops = ops;
+	pool->deferred_private = private;
+
+	INIT_WORK(&pool->deferred_work, zs_deferred_work_fn);
+	pool->deferred_wq = alloc_workqueue("zs_drain", WQ_UNBOUND, 0);
+	if (!pool->deferred_wq)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&pool->deferred_pool);
+	spin_lock_init(&pool->deferred_pool_lock);
+	pool->deferred_pool_count = 0;
+	INIT_LIST_HEAD(&pool->deferred_drain_list);
+	spin_lock_init(&pool->deferred_drain_lock);
+
+	for (pg_idx = 0; pg_idx < ZS_DEFERRED_POOL_SIZE; pg_idx++) {
+		page = alloc_page(GFP_KERNEL);
+		if (!page)
+			goto err_pages;
+		list_add_tail(&page->lru, &pool->deferred_pool);
+		pool->deferred_pool_count++;
+	}
+
+	pool->deferred = alloc_percpu(struct zs_deferred_percpu);
+	if (!pool->deferred)
+		goto err_pages;
+
+	for_each_possible_cpu(cpu) {
+		struct zs_deferred_percpu *def = per_cpu_ptr(pool->deferred, cpu);
+
+		page = deferred_pool_get(pool);
+		if (!page)
+			goto err_percpu;
+		def->buf = page_address(page);
+		def->count = 0;
+	}
+
+	return 0;
+
+err_percpu:
+	for_each_possible_cpu(cpu) {
+		struct zs_deferred_percpu *def = per_cpu_ptr(pool->deferred, cpu);
+
+		if (def->buf)
+			deferred_pool_put(pool, virt_to_page(def->buf));
+	}
+	free_percpu(pool->deferred);
+	pool->deferred = NULL;
+err_pages:
+	list_for_each_entry_safe(page, tmp, &pool->deferred_pool, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+	}
+	destroy_workqueue(pool->deferred_wq);
+	pool->deferred_wq = NULL;
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(zs_pool_enable_deferred_free);
+
 static void zs_object_copy(struct size_class *class, unsigned long dst,
 				unsigned long src)
 {
@@ -2182,9 +2366,31 @@ EXPORT_SYMBOL_GPL(zs_create_pool);
 
 void zs_destroy_pool(struct zs_pool *pool)
 {
-	int i;
+	int i, cpu;
+	struct page *page, *tmp;
 
 	zs_unregister_shrinker(pool);
+
+	if (pool->deferred) {
+		flush_work(&pool->deferred_work);
+		for_each_possible_cpu(cpu) {
+			struct zs_deferred_percpu *def =
+				per_cpu_ptr(pool->deferred, cpu);
+
+			if (def->buf && def->count)
+				pool->deferred_ops->drain(pool->deferred_private,
+							  def->buf, def->count);
+			if (def->buf)
+				deferred_pool_put(pool, virt_to_page(def->buf));
+		}
+		free_percpu(pool->deferred);
+		list_for_each_entry_safe(page, tmp, &pool->deferred_pool, lru) {
+			list_del(&page->lru);
+			__free_page(page);
+		}
+		destroy_workqueue(pool->deferred_wq);
+	}
+
 	zs_flush_migration(pool);
 	zs_pool_stat_destroy(pool);
 
-- 
2.34.1




* [RFC PATCH v3 2/4] mm/zswap: use zsmalloc deferred free callback for async invalidate
  2026-05-08  6:07 [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Wenchao Hao
  2026-05-08  6:07 ` [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops Wenchao Hao
@ 2026-05-08  6:07 ` Wenchao Hao
  2026-05-08  6:07 ` [RFC PATCH v3 3/4] zram: use zsmalloc deferred free callback for async slot free Wenchao Hao
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Wenchao Hao @ 2026-05-08  6:07 UTC (permalink / raw)
  To: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Nhat Pham, Sergey Senozhatsky, Yosry Ahmed
  Cc: Wenchao Hao, Wenchao Hao

Register zswap_deferred_ops to defer the entire zswap_entry_free()
to the WQ_UNBOUND worker. The invalidate hot path only stores the
entry pointer into the per-cpu buffer (512 entries/page).

The drain callback performs the full entry teardown: lru_del, zs_free,
memcg uncharge, cache_free, and stats update. On deferred failure,
fallback to synchronous zswap_entry_free().

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 mm/zswap.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 4b5149173b0e..3f23ddbe525c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -270,6 +270,8 @@ static void acomp_ctx_free(struct crypto_acomp_ctx *acomp_ctx)
 	acomp_ctx->buffer = NULL;
 }
 
+static const struct zs_deferred_ops zswap_deferred_ops;
+
 static struct zswap_pool *zswap_pool_create(char *compressor)
 {
 	struct zswap_pool *pool;
@@ -289,6 +291,8 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
 	if (!pool->zs_pool)
 		goto error;
 
+	zs_pool_enable_deferred_free(pool->zs_pool, &zswap_deferred_ops, pool);
+
 	strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
 
 	/* Many things rely on the zero-initialization. */
@@ -777,6 +781,36 @@ static void zswap_entry_free(struct zswap_entry *entry)
 	atomic_long_dec(&zswap_stored_pages);
 }
 
+static enum zs_push_ret zswap_deferred_push(void *buf,
+		unsigned int count, unsigned long value)
+{
+	unsigned long *entries = buf;
+
+	if (count >= PAGE_SIZE / sizeof(unsigned long))
+		return ZS_PUSH_FULL;
+	entries[count] = value;
+	if (count + 1 >= PAGE_SIZE / sizeof(unsigned long))
+		return ZS_PUSH_FULL_QUEUED;
+	return ZS_PUSH_OK;
+}
+
+static void zswap_deferred_drain(void *private, void *buf, unsigned int count)
+{
+	unsigned long *entries = buf;
+	unsigned int i;
+
+	for (i = 0; i < count; i++) {
+		struct zswap_entry *entry = (struct zswap_entry *)entries[i];
+
+		zswap_entry_free(entry);
+	}
+}
+
+static const struct zs_deferred_ops zswap_deferred_ops = {
+	.push = zswap_deferred_push,
+	.drain = zswap_deferred_drain,
+};
+
 /*********************************
 * compressed storage functions
 **********************************/
@@ -1647,7 +1681,9 @@ void zswap_invalidate(swp_entry_t swp)
 		return;
 
 	entry = xa_erase(tree, offset);
-	if (entry)
+	if (!entry)
+		return;
+	if (!zs_free_deferred(entry->pool->zs_pool, (unsigned long)entry))
 		zswap_entry_free(entry);
 }
 
-- 
2.34.1




* [RFC PATCH v3 3/4] zram: use zsmalloc deferred free callback for async slot free
  2026-05-08  6:07 [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Wenchao Hao
  2026-05-08  6:07 ` [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops Wenchao Hao
  2026-05-08  6:07 ` [RFC PATCH v3 2/4] mm/zswap: use zsmalloc deferred free callback for async invalidate Wenchao Hao
@ 2026-05-08  6:07 ` Wenchao Hao
  2026-05-08  6:07 ` [RFC PATCH v3 4/4] zram: batch clear flags in slot_free with single write Wenchao Hao
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Wenchao Hao @ 2026-05-08  6:07 UTC (permalink / raw)
  To: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Nhat Pham, Sergey Senozhatsky, Yosry Ahmed
  Cc: Wenchao Hao, Barry Song, Wenchao Hao

From: Barry Song <baohua@kernel.org>

Register zram_deferred_ops with zs_pool_enable_deferred_free() to
defer slot freeing to a WQ_UNBOUND worker. The notify hot path only
stores a u32 slot index into the per-cpu buffer (1024 entries/page).

The drain callback does slot_lock + slot_free + slot_unlock for each
index. On deferred failure (no free page), fallback to synchronous
slot_lock + slot_free + slot_unlock.

Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 drivers/block/zram/zram_drv.c | 39 +++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index aebc710f0d6a..0d07f0901e55 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -56,6 +56,7 @@ static size_t huge_class_size;
 static const struct block_device_operations zram_devops;
 
 static void slot_free(struct zram *zram, u32 index);
+static const struct zs_deferred_ops zram_deferred_ops;
 #define slot_dep_map(zram, index) (&(zram)->table[(index)].dep_map)
 
 static void slot_lock_init(struct zram *zram, u32 index)
@@ -1994,6 +1995,8 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 	if (!huge_class_size)
 		huge_class_size = zs_huge_class_size(zram->mem_pool);
 
+	zs_pool_enable_deferred_free(zram->mem_pool, &zram_deferred_ops, zram);
+
 	for (index = 0; index < num_pages; index++)
 		slot_lock_init(zram, index);
 
@@ -2784,6 +2787,39 @@ static void zram_submit_bio(struct bio *bio)
 	}
 }
 
+static enum zs_push_ret zram_deferred_push(void *buf,
+		unsigned int count, unsigned long value)
+{
+	u32 *indices = buf;
+
+	if (count >= PAGE_SIZE / sizeof(u32))
+		return ZS_PUSH_FULL;
+	indices[count] = (u32)value;
+	if (count + 1 >= PAGE_SIZE / sizeof(u32))
+		return ZS_PUSH_FULL_QUEUED;
+	return ZS_PUSH_OK;
+}
+
+static void zram_deferred_drain(void *private, void *buf, unsigned int count)
+{
+	struct zram *zram = private;
+	u32 *indices = buf;
+	unsigned int i;
+
+	for (i = 0; i < count; i++) {
+		u32 index = indices[i];
+
+		slot_lock(zram, index);
+		slot_free(zram, index);
+		slot_unlock(zram, index);
+	}
+}
+
+static const struct zs_deferred_ops zram_deferred_ops = {
+	.push = zram_deferred_push,
+	.drain = zram_deferred_drain,
+};
+
 static void zram_slot_free_notify(struct block_device *bdev,
 				unsigned long index)
 {
@@ -2792,6 +2828,9 @@ static void zram_slot_free_notify(struct block_device *bdev,
 	zram = bdev->bd_disk->private_data;
 
 	atomic64_inc(&zram->stats.notify_free);
+	if (zs_free_deferred(zram->mem_pool, (unsigned long)index))
+		return;
+
 	if (!slot_trylock(zram, index)) {
 		atomic64_inc(&zram->stats.miss_free);
 		return;
-- 
2.34.1




* [RFC PATCH v3 4/4] zram: batch clear flags in slot_free with single write
  2026-05-08  6:07 [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Wenchao Hao
                   ` (2 preceding siblings ...)
  2026-05-08  6:07 ` [RFC PATCH v3 3/4] zram: use zsmalloc deferred free callback for async slot free Wenchao Hao
@ 2026-05-08  6:07 ` Wenchao Hao
  2026-05-08 20:12 ` [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Yosry Ahmed
  2026-05-09  0:08 ` Nhat Pham
  5 siblings, 0 replies; 12+ messages in thread
From: Wenchao Hao @ 2026-05-08  6:07 UTC (permalink / raw)
  To: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Nhat Pham, Sergey Senozhatsky, Yosry Ahmed
  Cc: Wenchao Hao, Wenchao Hao

Replace four separate flag clear operations in slot_free() with a
single mask write. This reduces redundant read-modify-write cycles
on the same flags word.

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 drivers/block/zram/zram_drv.c | 5 +----
 drivers/block/zram/zram_drv.h | 6 ++++++
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0d07f0901e55..b1a565d35567 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2011,10 +2011,7 @@ static void slot_free(struct zram *zram, u32 index)
 	zram->table[index].attr.ac_time = 0;
 #endif
 
-	clear_slot_flag(zram, index, ZRAM_IDLE);
-	clear_slot_flag(zram, index, ZRAM_INCOMPRESSIBLE);
-	clear_slot_flag(zram, index, ZRAM_PP_SLOT);
-	set_slot_comp_priority(zram, index, 0);
+	zram->table[index].attr.flags &= ~ZRAM_SLOT_FREE_CLEAR_MASK;
 
 	if (test_slot_flag(zram, index, ZRAM_HUGE)) {
 		/*
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 08d1774c15db..89a7e39a2f4b 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -57,6 +57,12 @@ enum zram_pageflags {
 	__NR_ZRAM_PAGEFLAGS,
 };
 
+#define ZRAM_SLOT_FREE_CLEAR_MASK	(BIT(ZRAM_IDLE) | \
+					 BIT(ZRAM_INCOMPRESSIBLE) | \
+					 BIT(ZRAM_PP_SLOT) | \
+					 (ZRAM_COMP_PRIORITY_MASK << \
+					  ZRAM_COMP_PRIORITY_BIT1))
+
 /*
  * Allocated for each disk page.  We use bit-lock (ZRAM_ENTRY_LOCK bit
  * of flags) to save memory.  There can be plenty of entries and standard
-- 
2.34.1




* Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
  2026-05-08  6:07 [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Wenchao Hao
                   ` (3 preceding siblings ...)
  2026-05-08  6:07 ` [RFC PATCH v3 4/4] zram: batch clear flags in slot_free with single write Wenchao Hao
@ 2026-05-08 20:12 ` Yosry Ahmed
  2026-05-09  8:32   ` Wenchao Hao
  2026-05-09  0:08 ` Nhat Pham
  5 siblings, 1 reply; 12+ messages in thread
From: Yosry Ahmed @ 2026-05-08 20:12 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Nhat Pham, Sergey Senozhatsky, Wenchao Hao


On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing many swap
> entries. This has been reported to significantly delay memory reclamation
> during Android's low-memory killing, especially when multiple processes
> are terminated to free memory, with slot_free() accounting for more than
> 80% of the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the design
> itself is fairly complex.
>
> When anon folios and swap entries are mixed within a process, reclaiming
> anon folios from killed processes helps return memory to the system as
> quickly as possible, so that newly launched applications can satisfy
> their memory demands. It is not ideal for swap freeing to block anon
> folio freeing. On the other hand, swap freeing can still return memory
> to the system, although at a slower rate due to memory compression.
>
> This series introduces a callback-based deferred free framework in
> zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> define what gets buffered and how it gets drained. The entire free
> path including caller-side bookkeeping (slot_free, zswap_entry_free)
> is deferred to a background worker.

How much of the speedup comes from avoiding the per-class lock,
free_zspage(), other work in zswap, etc.?

I ask because I think the design here is still fairly complex. I don't
like how zswap and zram are registering callbacks into zsmalloc to do
their own freeing work, and they fill the buffers on behalf of
zsmalloc which seems like a layering violation.

I wonder how much of the speedup we get by just deferring
free_zspage()? That part can be done much more simply by just putting
the pages on a per-class list and having an async worker or a kthread
consume them and batch-free them. If the rest of zs_free() is also
expensive, we can do the deferred freeing on that level although it
would be more complicated as we need to have a fixed size buffer to
store them and handle running out of space.
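
Roughly something like this (untested sketch; the list/lock fields and
the work item are hypothetical):

	/* on the free path, instead of calling __free_zspage() inline: */
	static void defer_free_zspage(struct size_class *class,
				      struct zspage *zspage)
	{
		spin_lock(&class->free_lock);
		list_add_tail(&zspage->list, &class->free_list);
		spin_unlock(&class->free_lock);
		schedule_work(&zs_free_work);
	}

with the worker walking the classes, splicing each list under the
lock, and batch-freeing the zspages via __free_zspage().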

A breakdown of where the slowdown is coming from would be helpful to
understand what to focus on.

> [...]

* Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
  2026-05-08  6:07 [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Wenchao Hao
                   ` (4 preceding siblings ...)
  2026-05-08 20:12 ` [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Yosry Ahmed
@ 2026-05-09  0:08 ` Nhat Pham
  2026-05-09  8:45   ` Wenchao Hao
  5 siblings, 1 reply; 12+ messages in thread
From: Nhat Pham @ 2026-05-09  0:08 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Sergey Senozhatsky, Yosry Ahmed, Wenchao Hao

On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing many swap
> entries. This has been reported to significantly delay memory reclamation
> during Android's low-memory killing, especially when multiple processes
> are terminated to free memory, with slot_free() accounting for more than
> 80% of the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the design
> itself is fairly complex.
>
> When anon folios and swap entries are mixed within a process, reclaiming
> anon folios from killed processes helps return memory to the system as
> quickly as possible, so that newly launched applications can satisfy
> their memory demands. It is not ideal for swap freeing to block anon
> folio freeing. On the other hand, swap freeing can still return memory
> to the system, although at a slower rate due to memory compression.
>
> This series introduces a callback-based deferred free framework in
> zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> define what gets buffered and how it gets drained. The entire free
> path including caller-side bookkeeping (slot_free, zswap_entry_free)
> is deferred to a background worker.
>
> Implementation:
>   - Each CPU owns a single-page buffer. The hot path writes a value
>     via the push callback with preemption disabled (no locks).
>   - When the buffer fills, it is swapped with a fresh page from a
>     pre-allocated page pool. The full page is queued to a WQ_UNBOUND
>     worker for drain.
>   - The drain callback performs the actual expensive work (zs_free,
>     slot_free, zswap_entry_free, etc.) in batch, off the hot path.
>   - If no free page is available, the caller falls back to synchronous
>     processing.
>
> The speedup comes from moving expensive swap slot freeing off the
> munmap hot path into a background worker, so that intact anonymous
> folios are released back to the system without blocking. The worker
> drains at a slower rate since compressed objects are small and freeing
> a single handle may not release an entire page until the zspage is
> fully empty.
>
> Performance results (Raspberry Pi 4B, ARM64, 8GB RAM):
>
> Test 1: munmap latency for 256MB swap-filled VMA (zram backend)
>
>   mode        Base       Patched     Speedup
>   single      61.82ms    8.62ms      7.17x
>   multi 2p    94.75ms    54.11ms     1.75x
>   multi 3p    154.64ms   104.83ms    1.48x
>
> Test 2: munmap latency for different sizes (zram, single process)
>
>   Size       Base         Patched     Speedup
>   64MB       14.11ms      2.18ms      6.47x
>   128MB      29.45ms      4.48ms      6.57x
>   192MB      43.85ms      6.62ms      6.62x
>   256MB      57.01ms      9.08ms      6.28x
>   512MB      115.13ms     55.58ms     2.07x
>   1024MB     229.66ms     153.28ms    1.50x
>
> Test 3: munmap latency for 256MB swap-filled VMA (zswap backend)
>
>   mode        Base       Patched     Speedup
>   single      152.14ms   51.26ms     2.97x
>   multi 2p    186.56ms   105.42ms    1.77x
>   multi 3p    205.83ms   153.32ms    1.34x
>
> Test 4: munmap latency for different sizes (zswap, single process)
>
>   Size       Base         Patched     Speedup
>   64MB       37.83ms      13.26ms     2.85x
>   128MB      75.11ms      26.73ms     2.81x
>   256MB      150.78ms     52.97ms     2.85x
>   512MB      303.04ms     130.38ms    2.32x
>   1024MB     599.95ms     287.10ms    2.09x
>

Hmmm, why are we batching at the zswap/zsmalloc level like this? I
agree with Yosry that this seems like somewhat of an unnecessary
layering violation. For example, do we observe a lot more performance
wins by doing this instead of just simply:

static void zswap_entry_free(swp_entry_t swp, bool deferred)
{
    ...
    if (!deferred || !zs_deferred_free(entry->pool->zs_pool, entry->handle))
        zs_free(entry->pool->zs_pool, entry->handle);
}

(basically what you had in the last version).

One weird effect of doing deferred zswap entry freeing like what you
are proposing here, is that the zswap LRU will be littered with stale
zswap entries. Seems like you removed them from the zswap xarray, but
they're still linked into the zswap LRU? At writeback time, that will
throw off the statistics used in the heuristics, and will make
writeback go through a bunch of stale entries, wasting more cycles :)
Seems a bit inelegant, no?



* Re: [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops
  2026-05-08  6:07 ` [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops Wenchao Hao
@ 2026-05-09  0:29   ` Nhat Pham
  2026-05-09  8:47     ` Wenchao Hao
  0 siblings, 1 reply; 12+ messages in thread
From: Nhat Pham @ 2026-05-09  0:29 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Sergey Senozhatsky, Yosry Ahmed, Wenchao Hao

On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> Add a per-cpu deferred free mechanism to zsmalloc with a callback
> interface that lets callers (zram, zswap) customize push and drain
> behavior.
>
> Each CPU owns a single-page buffer. The hot path (zs_free_deferred)
> writes a value into the current CPU's buffer via the push callback
> with preemption disabled — no locks, no atomics. When the buffer
> fills, it is swapped with a fresh page from a pre-allocated page
> pool and the full page is queued to a WQ_UNBOUND worker for drain.
>
> The drain worker invokes the drain callback which performs the actual
> expensive work (zs_free, slot_free, etc.) in batch, away from the
> original hot path.
>
> Page pool management:
>   - Pool is pre-allocated at enable time (ZS_DEFERRED_POOL_SIZE pages)
>   - Full buffers are drained and returned to the pool
>   - If no free page is available when buffer is full, the push falls
>     back to synchronous processing by the caller
>
> Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
> ---
> +#define ZS_DEFERRED_POOL_SIZE  (256 * 1024 / PAGE_SIZE)

Seems oddly specific? :) And this doesn't quite scale with number of
CPUs, or memory size?

> +
> +struct zs_deferred_percpu {
> +       unsigned int count;
> +       void *buf;
> +};
> +
>  struct zs_pool {
>         const char *name;
>
> @@ -217,6 +224,18 @@ struct zs_pool {
>         /* protect zspage migration/compaction */
>         rwlock_t lock;
>         atomic_t compaction_in_progress;
> +
> +       /* per-cpu deferred free */
> +       const struct zs_deferred_ops *deferred_ops;
> +       void *deferred_private;
> +       struct zs_deferred_percpu __percpu *deferred;
> +       struct work_struct deferred_work;
> +       struct workqueue_struct *deferred_wq;
> +       struct list_head deferred_pool;
> +       unsigned int deferred_pool_count;
> +       spinlock_t deferred_pool_lock;
> +       struct list_head deferred_drain_list;
> +       spinlock_t deferred_drain_lock;
>  };
>
>  static inline void zpdesc_set_first(struct zpdesc *zpdesc)
> @@ -1416,6 +1435,171 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>  }
>  EXPORT_SYMBOL_GPL(zs_free);
>
> +static struct page *deferred_pool_get(struct zs_pool *pool)
> +{
> +       struct page *page = NULL;
> +
> +       spin_lock(&pool->deferred_pool_lock);
> +       if (!list_empty(&pool->deferred_pool)) {
> +               page = list_first_entry(&pool->deferred_pool, struct page, lru);
> +               list_del(&page->lru);
> +               pool->deferred_pool_count--;
> +       }
> +       spin_unlock(&pool->deferred_pool_lock);
> +       return page;
> +}
> +
> +static void deferred_pool_put(struct zs_pool *pool, struct page *page)
> +{
> +       spin_lock(&pool->deferred_pool_lock);
> +       list_add_tail(&page->lru, &pool->deferred_pool);
> +       pool->deferred_pool_count++;
> +       spin_unlock(&pool->deferred_pool_lock);
> +}
> +
> +static void zs_deferred_work_fn(struct work_struct *work)
> +{
> +       struct zs_pool *pool = container_of(work, struct zs_pool, deferred_work);
> +       struct page *page;
> +
> +       while (true) {
> +               unsigned int count;
> +
> +               spin_lock(&pool->deferred_drain_lock);
> +               if (list_empty(&pool->deferred_drain_list)) {
> +                       spin_unlock(&pool->deferred_drain_lock);
> +                       break;
> +               }
> +               page = list_first_entry(&pool->deferred_drain_list,
> +                                       struct page, lru);
> +               list_del(&page->lru);
> +               count = page_private(page);
> +               spin_unlock(&pool->deferred_drain_lock);
> +
> +               pool->deferred_ops->drain(pool->deferred_private,
> +                                         page_address(page), count);
> +               deferred_pool_put(pool, page);
> +               cond_resched();
> +       }
> +}
> +
> +bool zs_free_deferred(struct zs_pool *pool, unsigned long value)
> +{
> +       struct zs_deferred_percpu *def;
> +       struct page *new_page, *full_page;
> +       enum zs_push_ret ret;
> +
> +       if (!pool->deferred)
> +               return false;
> +
> +       def = get_cpu_ptr(pool->deferred);
> +
> +       ret = pool->deferred_ops->push(def->buf, def->count, value);
> +       if (ret == ZS_PUSH_OK) {
> +               def->count++;
> +               put_cpu_ptr(pool->deferred);
> +               return true;
> +       }
> +
> +       if (ret == ZS_PUSH_FULL_QUEUED)
> +               def->count++;
> +
> +       new_page = deferred_pool_get(pool);
> +       if (new_page) {
> +               full_page = virt_to_page(def->buf);
> +               set_page_private(full_page, def->count);
> +               def->buf = page_address(new_page);
> +               def->count = 0;
> +
> +               if (ret == ZS_PUSH_FULL) {
> +                       pool->deferred_ops->push(def->buf, 0, value);
> +                       def->count = 1;
> +               }
> +               put_cpu_ptr(pool->deferred);
> +
> +               spin_lock(&pool->deferred_drain_lock);
> +               list_add_tail(&full_page->lru, &pool->deferred_drain_list);
> +               spin_unlock(&pool->deferred_drain_lock);
> +               queue_work(pool->deferred_wq, &pool->deferred_work);
> +               return true;
> +       }
> +       put_cpu_ptr(pool->deferred);
> +
> +       /* ret==2: value already queued, will be drained eventually */
> +       if (ret == 2)

== 2? :)

> +               return true;
> +
> +       /* ret==1: value not queued, caller must fallback */
> +       return false;
> +}
> +EXPORT_SYMBOL_GPL(zs_free_deferred);



* Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
  2026-05-08 20:12 ` [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Yosry Ahmed
@ 2026-05-09  8:32   ` Wenchao Hao
  2026-05-09  8:38     ` Wenchao Hao
  0 siblings, 1 reply; 12+ messages in thread
From: Wenchao Hao @ 2026-05-09  8:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Nhat Pham, Sergey Senozhatsky, Wenchao Hao

On Sat, May 9, 2026 at 4:13 AM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > Swap freeing can be expensive when unmapping a VMA containing many swap
> > entries. This has been reported to significantly delay memory reclamation
> > during Android's low-memory killing, especially when multiple processes
> > are terminated to free memory, with slot_free() accounting for more than
> > 80% of the total cost of freeing swap entries.
> >
> > This series introduces a callback-based deferred free framework in
> > zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> > define what gets buffered and how it gets drained. The entire free
> > path including caller-side bookkeeping (slot_free, zswap_entry_free)
> > is deferred to a background worker.
>
> How much of the speedup comes from avoiding the per-class lock,
> free_zspage(), other work in zswap, etc.

This series doesn't avoid the per-class lock. The pool->lock part
has been split out and posted as a separate series, so this series
focuses purely on the defer scheme:

https://lore.kernel.org/linux-mm/20260508061910.3882831-1-haowenchao@xiaomi.com/

>
> I ask because I think the design here is still fairly complex. I don't
> like how zswap and zram are registering callbacks into zsmalloc to do
> their own freeing work, and they fill the buffers on behalf of
> zsmalloc which seems like a layering violation.

The callback design was motivated by code reuse -- deferring only
zs_free() inside zsmalloc gave less speedup, and the machinery
needed to defer caller-side bookkeeping turns out to be the same
on both sides (per-cpu page buffer, drain worker, fallback). So I
folded the common parts into zsmalloc.

I agree it's not clean from a layering standpoint, and I'm happy to
revisit if the reuse isn't worth the cost.

>
> I wonder how much of the speedup we get by just deferring
> free_zspage()?

Below is the perf breakdown, sampled only during munmap() of a
256MB zram-filled VMA on a Raspberry Pi 4B.

Base kernel:

  # Samples: 491  of event 'cycles'
  # Event count (approx.): 214056923
  #
  # Children      Self  Symbol
  # ........  ........  ..........................................
      99.55%     0.41%  [k] __zap_vma_range
      97.27%     2.91%  [k] swap_put_entries_cluster
      94.37%     1.65%  [k] __swap_cluster_free_entries
      88.99%     8.91%  [k] zram_slot_free_notify
      79.87%    10.78%  [k] slot_free
      56.27%     5.99%  [k] zs_free
      47.61%     4.35%  [k] free_zspage
      36.85%     4.96%  [k] __free_zspage
      19.27%     0.21%  [k] __folio_put
      12.64%     2.91%  [k] __free_frozen_pages
       9.50%     6.40%  [k] kmem_cache_free
       8.28%     8.28%  [k] _raw_spin_unlock_irqrestore
       6.83%     1.85%  [k] dec_zone_page_state
       5.18%     5.18%  [k] _raw_spin_unlock
       5.18%     5.18%  [k] folio_unlock
       4.98%     4.98%  [k] mod_zone_state
       4.12%     4.12%  [k] _raw_spin_lock
       3.30%     3.30%  [k] __swap_cgroup_id_xchg

My first attempt for this RFC was exactly that -- defer only the
handle free inside zsmalloc and keep the zram/zswap caller-side
bookkeeping synchronous (I can post that version after this thread).
Perf of this zsmalloc-only variant (same 256MB zram workload):

  # Samples: 164  of event 'cycles'
  # Event count (approx.): 68803872
  #
  # Children      Self  Symbol
  # ........  ........  ..........................................
      99.24%     1.28%  [k] __zap_vma_range
      94.17%     4.49%  [k] swap_put_entries_cluster
      87.77%    12.09%  [k] __swap_cluster_free_entries
      43.62%    24.33%  [k] zram_slot_free_notify
      21.80%    21.80%  [k] slot_free_extract
      19.29%     6.42%  [k] zs_free_deferred
      12.23%     0.64%  [k] zs_free         <- sync fallback only
       8.96%     8.96%  [k] __swap_cgroup_id_xchg
       4.51%     1.93%  [k] __free_frozen_pages

Zsmalloc-internal items drop out or shrink dramatically. zs_free at
0.64% is only the synchronous fallback when the per-cpu page pool is
temporarily empty. zram_slot_free_notify remains high (24.33%)
because slot_free_extract() still runs synchronously on the hot path
-- it's a new helper this variant introduces to do the zram-side
cleanup (slot flag clears, atomic stats updates, handle extraction)
before the handle is queued.
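
For reference, the shape of that helper (simplified sketch: the
per-entry compressed-size accounting is elided, and the flag clearing
is shown using the mask that patch 4 of this series adds):

	static unsigned long slot_free_extract(struct zram *zram, u32 index)
	{
		unsigned long handle;

		slot_lock(zram, index);
		handle = zram->table[index].handle;
		zram->table[index].handle = 0;
		zram->table[index].attr.flags &= ~ZRAM_SLOT_FREE_CLEAR_MASK;
		atomic64_dec(&zram->stats.pages_stored);
		slot_unlock(zram, index);

		/* the returned handle is queued for deferred zs_free() */
		return handle;
	}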

The perf numbers showed that zram also has non-trivial caller-side
bookkeeping cost -- the work in slot_free_extract() in particular.
I tried to reduce that without deferring it (per-cpu stats, swapping
the bit-lock for a different primitive), but the results were
basically a wash and sometimes slightly worse. That's what led to v3
extending the defer to cover the caller side as well, via the
push/drain callbacks.

v3 (this series, deferring the whole zram slot-free notify path):

  # Samples: 82  of event 'cycles'
  # Event count (approx.): 33089591
  #
  # Children      Self  Symbol
  # ........  ........  ..........................................
      91.46%     1.32%  [k] __zap_vma_range
      75.77%     8.35%  [k] swap_put_entries_cluster
      64.71%    10.72%  [k] __swap_cluster_free_entries
      33.36%    17.43%  [k] zram_slot_free_notify
      18.03%    18.03%  [k] __swap_cgroup_id_xchg
      13.31%    11.82%  [k] zs_free_deferred
       9.10%     9.10%  [k] lookup_swap_cgroup_id
       4.03%     4.03%  [k] zswap_invalidate
       3.94%     3.94%  [k] swap_pte_batch

Absolute cycles in the unmap window drop from 214M to 33M (~6.5x),
matching the observed munmap latency (57ms -> 9ms). The defer path
moves the following items out of the hot path (base kernel Self%):

  _raw_spin_unlock_irqrestore  8.28   class/pool locks
  kmem_cache_free              6.40   zspage struct slab free
  zs_free                      5.99
  _raw_spin_unlock             5.18
  folio_unlock                 5.18
  __free_zspage                4.96   zspage teardown
  mod_zone_state               4.98   zone stats
  free_zspage                  4.35
  _raw_spin_lock               4.12
  __free_frozen_pages          2.91   buddy page release
  _raw_spin_trylock            2.48
  dec_zone_page_state          1.85
  free_frozen_page_commit      1.66

That accounts for ~55% of the base-kernel munmap hot path, all
moved to the drain worker.

Benchmark (zram, single process, avg of 3):

  size    base     v3       zs-only     v3/base   zs-only/base
  64MB     14.38    2.12     4.34        6.8x      3.3x
  128MB    29.73    4.26     8.54        7.0x      3.5x
  256MB    57.93    8.54    19.90        6.8x      2.9x
  512MB   116.77   55.41    47.90        2.1x      2.4x
  1024MB  234.43  150.11   105.06        1.6x      2.2x

> That part can be done much more simply by just putting
> the pages on a per-class list and having an async worker or a kthread
> consume them and batch-free them. If the rest of zs_free() is also
> expensive, we can do the deferred freeing on that level although it
> would be more complicated as we need to have a fixed size buffer to
> store them and handle running out of space.
>

I hesitated on per-class because there are ~255 classes, so a
worker walking them would often find single-entry or empty lists,
defeating the batching. Using a per-cpu buffer of handles and
sorting by class inside the drain gets "batched under one
class->lock" without the many-short-lists problem.
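
Roughly what the drain loop does (sketch only: cmp_by_class() and
handle_to_class() are hypothetical helpers, and the __zs_free_handle()
call is abridged from the locked primitive in the zs-only patches):

	struct size_class *class = NULL;
	unsigned int i;

	sort(handles, count, sizeof(*handles), cmp_by_class, NULL);
	for (i = 0; i < count; i++) {
		struct size_class *c = handle_to_class(pool, handles[i]);

		if (c != class) {	/* class boundary: swap locks */
			if (class)
				spin_unlock(&class->lock);
			class = c;
			spin_lock(&class->lock);
		}
		__zs_free_handle(pool, class, handles[i]);
	}
	if (class)
		spin_unlock(&class->lock);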

Thanks,
Wenchao

> [...]

* Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
  2026-05-09  8:32   ` Wenchao Hao
@ 2026-05-09  8:38     ` Wenchao Hao
  0 siblings, 0 replies; 12+ messages in thread
From: Wenchao Hao @ 2026-05-09  8:38 UTC (permalink / raw)
  To: haowenchao22
  Cc: 21cnbao, akpm, axboe, chengming.zhou, hannes, haowenchao,
	linux-block, linux-kernel, linux-mm, minchan, nphamcs,
	senozhatsky, yosry

The three patches below implement the zsmalloc-only variant --
deferring just zs_free(). They partially depend on the pool->lock
removal series, and also reduce the number of class->lock
acquire/release pairs on the drain path.

----- [1/3] mm/zsmalloc: introduce per-cpu deferred free with page pool -----

Introduce zs_free_deferred() that enqueues handles into per-cpu
buffers backed by single pages (PAGE_SIZE/8 entries each).

A pre-allocated page pool provides fresh pages for buffer swap on the
hot path without any allocation.  When a per-cpu buffer fills up, the
producer swaps in a page from the pool, moves the full page to a drain
list, and resets count — all within preempt_disable, no waiting for the
worker.

The drain worker runs on a WQ_UNBOUND workqueue to avoid preempting
the producer on its CPU.  It picks pages off the drain list one at a
time, drains them using consecutive-class batching (holding class->lock
across runs of same-class handles), and returns drained pages to the
pool.  It processes at most pool_size/2 pages per invocation to avoid
monopolizing CPU, rescheduling itself if more pages remain.

Extract __zs_free_handle() from zs_free() as the locked free primitive
shared by both synchronous and deferred paths.  Empty zspages are
collected on a list and released after dropping class->lock.

Also introduce zs_free_deferred_flush() for use before zs_compact()
and zs_deferred_free_all() for pool teardown.

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 include/linux/zsmalloc.h |   2 +
 mm/zsmalloc.c            | 342 +++++++++++++++++++++++++++++++++++----
 2 files changed, 316 insertions(+), 28 deletions(-)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 478410c880b1..1e5ac1a39d41 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -30,6 +30,8 @@ void zs_destroy_pool(struct zs_pool *pool);
 unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags,
 			const int nid);
 void zs_free(struct zs_pool *pool, unsigned long obj);
+void zs_free_deferred(struct zs_pool *pool, unsigned long handle);
+void zs_free_deferred_flush(struct zs_pool *pool);
 
 size_t zs_huge_class_size(struct zs_pool *pool);
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 176d3ad4f6e9..f483937cf34f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -42,6 +42,7 @@
 #include <linux/zsmalloc.h>
 #include <linux/fs.h>
 #include <linux/workqueue.h>
+#include <linux/percpu.h>
 #include "zpdesc.h"
 
 #define ZSPAGE_MAGIC	0x58
@@ -56,6 +57,9 @@
 
 #define ZS_HANDLE_SIZE (sizeof(unsigned long))
 
+#define ZS_DEFERRED_BUF_ENTRIES	(PAGE_SIZE / sizeof(unsigned long))
+#define ZS_DEFERRED_POOL_SIZE	(256 * 1024 / PAGE_SIZE)
+
 /*
  * Object location (<PFN>, <obj_idx>) is encoded as
  * a single (unsigned long) handle value.
@@ -174,6 +178,7 @@ static_assert(_PFN_BITS + OBJ_CLASS_BITS_NEEDED + OBJ_IDX_BITS_NEEDED
 #define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
 				      ZS_SIZE_CLASS_DELTA) + 1)
 
+
 /*
  * Pages are distinguished by the ratio of used memory (that is the ratio
  * of ->inuse objects to all objects that page can store). For example,
@@ -246,6 +251,11 @@ struct link_free {
 	};
 };
 
+struct zs_deferred_percpu {
+	unsigned int count;
+	unsigned long *handles;
+};
+
 static struct kmem_cache *handle_cachep;
 static struct kmem_cache *zspage_cachep;
 
@@ -270,6 +280,20 @@ struct zs_pool {
 	/* protect zspage migration/compaction */
 	rwlock_t lock;
 	atomic_t compaction_in_progress;
+
+	/* per-cpu deferred free */
+	struct zs_deferred_percpu __percpu *deferred;
+	struct work_struct deferred_drain_work;
+	struct workqueue_struct *drain_wq;
+
+	/* page pool: free pages available for buffer swap */
+	struct list_head page_pool;
+	unsigned int page_pool_count;
+	spinlock_t page_pool_lock;
+
+	/* drain list: full pages waiting to be drained */
+	struct list_head drain_list;
+	spinlock_t drain_list_lock;
 };
 
 static inline void zpdesc_set_first(struct zpdesc *zpdesc)
@@ -788,12 +812,6 @@ static unsigned int obj_to_class_idx(unsigned long obj)
 	return (obj >> OBJ_IDX_BITS) & OBJ_CLASS_MASK;
 }
 
-/**
- * location_to_obj - encode (<zpdesc>, <obj_idx>, <class_idx>) into obj value
- * @zpdesc: zpdesc object resides in zspage
- * @obj_idx: object index
- * @class_idx: size class index
- */
 static unsigned long location_to_obj(struct zpdesc *zpdesc, unsigned int obj_idx,
 				     unsigned int class_idx)
 {
@@ -1454,23 +1472,14 @@ static void obj_free(int class_size, unsigned long obj)
 	mod_zspage_inuse(zspage, -1);
 }
 
-void zs_free(struct zs_pool *pool, unsigned long handle)
+static void __zs_free_handle(struct zs_pool *pool, struct size_class *class,
+			     unsigned long handle, struct list_head *free_list)
 {
-	struct zspage *zspage;
-	struct zspage *zspage_to_free = NULL;
 	struct zpdesc *f_zpdesc;
+	struct zspage *zspage;
 	unsigned long obj;
-	struct size_class *class;
 	int fullness;
 
-	if (IS_ERR_OR_NULL((void *)handle))
-		return;
-
-	obj = handle_to_obj(handle);
-	class = pool->size_class[obj_to_class_idx(obj)];
-
-	spin_lock(&class->lock);
-
 	obj = handle_to_obj(handle);
 	obj_to_zpdesc(obj, &f_zpdesc);
 	zspage = get_zspage(f_zpdesc);
@@ -1480,31 +1489,231 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 
 	fullness = fix_fullness_group(class, zspage);
 	if (fullness == ZS_INUSE_RATIO_0) {
-		/*
-		 * Perform bookkeeping under class->lock, but defer the
-		 * actual page release (which may contend on zone->lock)
-		 * until after dropping class->lock.
-		 */
 		if (trylock_zspage(zspage)) {
 			remove_zspage(class, zspage);
 			class_stat_sub(class, ZS_OBJS_ALLOCATED,
 				       class->objs_per_zspage);
 			atomic_long_sub(class->pages_per_zspage,
 					&pool->pages_allocated);
-			zspage_to_free = zspage;
+			list_add(&zspage->list, free_list);
 		} else {
 			kick_deferred_free(pool);
 		}
 	}
+}
 
+static void free_zspage_list(struct zs_pool *pool, struct list_head *list)
+{
+	struct zspage *zspage, *tmp;
+
+	list_for_each_entry_safe(zspage, tmp, list, list) {
+		list_del(&zspage->list);
+		free_zspage_pages(pool, zspage);
+	}
+}
+
+void zs_free(struct zs_pool *pool, unsigned long handle)
+{
+	struct size_class *class;
+	unsigned long obj;
+	LIST_HEAD(free_list);
+
+	if (IS_ERR_OR_NULL((void *)handle))
+		return;
+
+	obj = handle_to_obj(handle);
+	class = pool->size_class[obj_to_class_idx(obj)];
+	spin_lock(&class->lock);
+
+	__zs_free_handle(pool, class, handle, &free_list);
 	spin_unlock(&class->lock);
 
-	if (zspage_to_free)
-		free_zspage_pages(pool, zspage_to_free);
+	free_zspage_list(pool, &free_list);
 	cache_free_handle(handle);
 }
 EXPORT_SYMBOL_GPL(zs_free);
 
+static void zs_deferred_drain_batch(struct zs_pool *pool,
+				    unsigned long *handles, unsigned int count)
+{
+	struct size_class *class = NULL;
+	unsigned int cur_cls = UINT_MAX;
+	LIST_HEAD(free_list);
+	unsigned int i;
+
+	for (i = 0; i < count; i++) {
+		unsigned long obj = handle_to_obj(handles[i]);
+		unsigned int cls = obj_to_class_idx(obj);
+
+		if (cls != cur_cls) {
+			if (class) {
+				spin_unlock(&class->lock);
+				free_zspage_list(pool, &free_list);
+				cond_resched();
+			}
+			cur_cls = cls;
+			class = pool->size_class[cls];
+			spin_lock(&class->lock);
+		}
+		__zs_free_handle(pool, class, handles[i], &free_list);
+	}
+
+	if (class) {
+		spin_unlock(&class->lock);
+		free_zspage_list(pool, &free_list);
+	}
+
+	for (i = 0; i < count; i++)
+		cache_free_handle(handles[i]);
+}
+
+static struct page *deferred_pool_get(struct zs_pool *pool)
+{
+	struct page *page = NULL;
+
+	spin_lock(&pool->page_pool_lock);
+	if (!list_empty(&pool->page_pool)) {
+		page = list_first_entry(&pool->page_pool, struct page, lru);
+		list_del(&page->lru);
+		pool->page_pool_count--;
+	}
+	spin_unlock(&pool->page_pool_lock);
+	return page;
+}
+
+static void deferred_pool_put(struct zs_pool *pool, struct page *page)
+{
+	spin_lock(&pool->page_pool_lock);
+	list_add_tail(&page->lru, &pool->page_pool);
+	pool->page_pool_count++;
+	spin_unlock(&pool->page_pool_lock);
+}
+
+static void deferred_drain_enqueue(struct zs_pool *pool, struct page *page)
+{
+	spin_lock(&pool->drain_list_lock);
+	list_add_tail(&page->lru, &pool->drain_list);
+	spin_unlock(&pool->drain_list_lock);
+}
+
+static struct page *deferred_drain_dequeue(struct zs_pool *pool)
+{
+	struct page *page = NULL;
+
+	spin_lock(&pool->drain_list_lock);
+	if (!list_empty(&pool->drain_list)) {
+		page = list_first_entry(&pool->drain_list, struct page, lru);
+		list_del(&page->lru);
+	}
+	spin_unlock(&pool->drain_list_lock);
+	return page;
+}
+
+static void zs_deferred_drain_work(struct work_struct *work)
+{
+	struct zs_pool *pool = container_of(work, struct zs_pool,
+					    deferred_drain_work);
+	struct page *page;
+	unsigned int drained = 0;
+	unsigned int max_drain = ZS_DEFERRED_POOL_SIZE / 2;
+
+	while (drained < max_drain) {
+		page = deferred_drain_dequeue(pool);
+		if (!page)
+			break;
+
+		zs_deferred_drain_batch(pool, page_address(page),
+					ZS_DEFERRED_BUF_ENTRIES);
+		deferred_pool_put(pool, page);
+		drained++;
+		cond_resched();
+	}
+
+	/* If drain list still has pages, reschedule */
+	spin_lock(&pool->drain_list_lock);
+	if (!list_empty(&pool->drain_list))
+		queue_work(pool->drain_wq, &pool->deferred_drain_work);
+	spin_unlock(&pool->drain_list_lock);
+}
+
+void zs_free_deferred(struct zs_pool *pool, unsigned long handle)
+{
+	struct zs_deferred_percpu *def;
+	struct page *new_page, *full_page;
+	bool queued = false;
+
+	if (IS_ERR_OR_NULL((void *)handle))
+		return;
+
+	def = get_cpu_ptr(pool->deferred);
+
+	if (likely(def->count < ZS_DEFERRED_BUF_ENTRIES)) {
+		def->handles[def->count++] = handle;
+		queued = true;
+		if (def->count < ZS_DEFERRED_BUF_ENTRIES) {
+			put_cpu_ptr(pool->deferred);
+			return;
+		}
+	}
+
+	/* Buffer is full, try to swap in a fresh page */
+	new_page = deferred_pool_get(pool);
+	if (new_page) {
+		full_page = virt_to_page(def->handles);
+		def->handles = page_address(new_page);
+		def->count = 0;
+		if (!queued)
+			def->handles[def->count++] = handle;
+		put_cpu_ptr(pool->deferred);
+		deferred_drain_enqueue(pool, full_page);
+		queue_work(pool->drain_wq, &pool->deferred_drain_work);
+		return;
+	}
+	put_cpu_ptr(pool->deferred);
+
+	if (!queued)
+		zs_free(pool, handle);
+}
+EXPORT_SYMBOL_GPL(zs_free_deferred);
+
+/*
+ * Called only from zs_destroy_pool() when no producers are running.
+ * Drains all per-cpu buffers regardless of whether they are full.
+ */
+static void zs_deferred_free_all(struct zs_pool *pool)
+{
+	struct page *page;
+	int cpu;
+
+	flush_work(&pool->deferred_drain_work);
+
+	/* Drain remaining pages on drain list */
+	while ((page = deferred_drain_dequeue(pool)) != NULL) {
+		zs_deferred_drain_batch(pool, page_address(page),
+					ZS_DEFERRED_BUF_ENTRIES);
+		deferred_pool_put(pool, page);
+	}
+
+	/* Drain partially-filled per-cpu buffers */
+	for_each_possible_cpu(cpu) {
+		struct zs_deferred_percpu *def;
+		unsigned int count;
+
+		def = per_cpu_ptr(pool->deferred, cpu);
+		count = def->count;
+		if (!count)
+			continue;
+		zs_deferred_drain_batch(pool, def->handles, count);
+		def->count = 0;
+	}
+}
+
+void zs_free_deferred_flush(struct zs_pool *pool)
+{
+	flush_work(&pool->deferred_drain_work);
+}
+EXPORT_SYMBOL_GPL(zs_free_deferred_flush);
+
 static void zs_object_copy(struct size_class *class, unsigned long dst,
 				unsigned long src)
 {
@@ -2053,6 +2262,8 @@ unsigned long zs_compact(struct zs_pool *pool)
 	if (atomic_xchg(&pool->compaction_in_progress, 1))
 		return 0;
 
+	zs_free_deferred_flush(pool);
+
 	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
 		class = pool->size_class[i];
 		if (class->index != i)
@@ -2161,9 +2372,11 @@ static int calculate_zspage_chain_size(int class_size)
  */
 struct zs_pool *zs_create_pool(const char *name)
 {
-	int i;
+	int i, cpu;
+	unsigned int pg_idx;
 	struct zs_pool *pool;
 	struct size_class *prev_class = NULL;
+	struct page *page, *tmp;
 
 	pool = kzalloc_obj(*pool);
 	if (!pool)
@@ -2172,11 +2385,67 @@ struct zs_pool *zs_create_pool(const char *name)
 	init_deferred_free(pool);
 	rwlock_init(&pool->lock);
 	atomic_set(&pool->compaction_in_progress, 0);
+	INIT_WORK(&pool->deferred_drain_work, zs_deferred_drain_work);
+
+	pool->drain_wq = alloc_workqueue("zs_drain", WQ_UNBOUND, 0);
+	if (!pool->drain_wq) {
+		kfree(pool);
+		return NULL;
+	}
+
+	/* Initialize page pool and drain list */
+	INIT_LIST_HEAD(&pool->page_pool);
+	spin_lock_init(&pool->page_pool_lock);
+	pool->page_pool_count = 0;
+	INIT_LIST_HEAD(&pool->drain_list);
+	spin_lock_init(&pool->drain_list_lock);
+
+	for (pg_idx = 0; pg_idx < ZS_DEFERRED_POOL_SIZE; pg_idx++) {
+		page = alloc_page(GFP_KERNEL);
+		if (!page)
+			goto err_pool_pages;
+		list_add_tail(&page->lru, &pool->page_pool);
+		pool->page_pool_count++;
+	}
+
+	pool->deferred = alloc_percpu(struct zs_deferred_percpu);
+	if (!pool->deferred)
+		goto err_pool_pages;
+	for_each_possible_cpu(cpu) {
+		struct zs_deferred_percpu *def = per_cpu_ptr(pool->deferred, cpu);
+
+		page = deferred_pool_get(pool);
+		if (!page) {
+			for_each_possible_cpu(cpu) {
+				def = per_cpu_ptr(pool->deferred, cpu);
+				if (def->handles)
+					deferred_pool_put(pool,
+						virt_to_page(def->handles));
+			}
+			free_percpu(pool->deferred);
+			goto err_pool_pages;
+		}
+		def->handles = page_address(page);
+		def->count = 0;
+	}
 
 	pool->name = kstrdup(name, GFP_KERNEL);
 	if (!pool->name)
 		goto err;
 
+	goto pool_init_done;
+
+err_pool_pages:
+	list_for_each_entry_safe(page, tmp, &pool->page_pool, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+	}
+	destroy_workqueue(pool->drain_wq);
+	kfree(pool);
+	return NULL;
+
+pool_init_done:
+
 	/*
 	 * Iterate reversely, because, size of size_class that we want to use
 	 * for merging should be larger or equal to current size.
@@ -2272,9 +2541,11 @@ EXPORT_SYMBOL_GPL(zs_create_pool);
 
 void zs_destroy_pool(struct zs_pool *pool)
 {
-	int i;
+	int i, cpu;
+	struct page *page, *tmp;
 
 	zs_unregister_shrinker(pool);
+	zs_deferred_free_all(pool);
 	zs_flush_migration(pool);
 	zs_pool_stat_destroy(pool);
 
@@ -2298,6 +2569,21 @@ void zs_destroy_pool(struct zs_pool *pool)
 		kfree(class);
 	}
 
+	/* Return per-cpu buffers to page pool */
+	for_each_possible_cpu(cpu) {
+		struct zs_deferred_percpu *def = per_cpu_ptr(pool->deferred, cpu);
+
+		if (def->handles)
+			deferred_pool_put(pool, virt_to_page(def->handles));
+	}
+
+	/* Free all pages in page pool */
+	list_for_each_entry_safe(page, tmp, &pool->page_pool, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+	}
+	free_percpu(pool->deferred);
+	destroy_workqueue(pool->drain_wq);
 	kfree(pool->name);
 	kfree(pool);
 }
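
For reference, a minimal caller sketch of the API above (hypothetical
backend function, sketch only; the real users are patches 2/3 and 3/3
below):

	static void my_backend_free(struct zs_pool *pool,
				    unsigned long handle)
	{
		/*
		 * Hot path: push the handle into this CPU's buffer.
		 * If the page pool is exhausted and the handle could
		 * not be queued, zs_free_deferred() falls back to a
		 * synchronous zs_free() internally, so no error
		 * handling is needed here.
		 */
		zs_free_deferred(pool, handle);
	}

zs_compact() and zs_destroy_pool() already flush/drain internally, so
zs_free_deferred_flush() is only needed when a caller must observe all
pending frees completed at a specific point.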


----- [2/3] mm/zswap: use zs_free_deferred() in entry free path -----

Replace zs_free() with zs_free_deferred() in zswap_entry_free() to
avoid the overhead of zsmalloc class->lock and potential zone->lock
contention in the zswap invalidation/reclaim hot path.

The store failure path still uses zs_free() directly since it is not
performance critical.

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 mm/zswap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 4b5149173b0e..f2a38c07579f 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -765,7 +765,7 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
 static void zswap_entry_free(struct zswap_entry *entry)
 {
 	zswap_lru_del(&zswap_list_lru, entry);
-	zs_free(entry->pool->zs_pool, entry->handle);
+	zs_free_deferred(entry->pool->zs_pool, entry->handle);
 	zswap_pool_put(entry->pool);
 	if (entry->objcg) {
 		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);


----- [3/3] zram: defer zs_free() in swap slot free notification path -----

zram_slot_free_notify() is called on the process exit path when
unmapping swap entries.  The zs_free() it invokes accounts for ~87%
of slot_free() cost due to zsmalloc locking, blocking memory release
during Android low-memory killing.

Split slot_free() into slot_free_extract() and the actual zs_free():

  slot_free_extract() handles slot metadata cleanup (flags, stats,
  handle/size zeroing) and returns the zsmalloc handle.

  The returned handle is passed to zs_free_deferred() in the
  notification path, deferring the expensive zs_free() to a
  workqueue so the exit path can release anon folios faster.

All other slot_free() callers (write, discard, meta_free) continue
to use synchronous zs_free() through the unchanged slot_free().

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 drivers/block/zram/zram_drv.c | 41 ++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index aebc710f0d6a..c67a7442d283 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2000,24 +2000,26 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 	return true;
 }
 
-static void slot_free(struct zram *zram, u32 index)
+/*
+ * Clear slot metadata and extract the zsmalloc handle that needs freeing.
+ * Returns the handle, or 0 if no zsmalloc free is required (e.g. same-filled
+ * or writeback slots).
+ */
+#define ZRAM_SLOT_CLEAR_MASK \
+	(BIT(ZRAM_IDLE) | BIT(ZRAM_INCOMPRESSIBLE) | BIT(ZRAM_PP_SLOT) | \
+	 (ZRAM_COMP_PRIORITY_MASK << ZRAM_COMP_PRIORITY_BIT1))
+
+static unsigned long slot_free_extract(struct zram *zram, u32 index)
 {
-	unsigned long handle;
+	unsigned long handle = 0;
 
 #ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
 	zram->table[index].attr.ac_time = 0;
 #endif
 
-	clear_slot_flag(zram, index, ZRAM_IDLE);
-	clear_slot_flag(zram, index, ZRAM_INCOMPRESSIBLE);
-	clear_slot_flag(zram, index, ZRAM_PP_SLOT);
-	set_slot_comp_priority(zram, index, 0);
+	zram->table[index].attr.flags &= ~ZRAM_SLOT_CLEAR_MASK;
 
 	if (test_slot_flag(zram, index, ZRAM_HUGE)) {
-		/*
-		 * Writeback completion decrements ->huge_pages but keeps
-		 * ZRAM_HUGE flag for deferred decompression path.
-		 */
 		if (!test_slot_flag(zram, index, ZRAM_WB))
 			atomic64_dec(&zram->stats.huge_pages);
 		clear_slot_flag(zram, index, ZRAM_HUGE);
@@ -2029,10 +2031,6 @@ static void slot_free(struct zram *zram, u32 index)
 		goto out;
 	}
 
-	/*
-	 * No memory is allocated for same element filled pages.
-	 * Simply clear same page flag.
-	 */
 	if (test_slot_flag(zram, index, ZRAM_SAME)) {
 		clear_slot_flag(zram, index, ZRAM_SAME);
 		atomic64_dec(&zram->stats.same_pages);
@@ -2041,9 +2039,7 @@ static void slot_free(struct zram *zram, u32 index)
 
 	handle = get_slot_handle(zram, index);
 	if (!handle)
-		return;
-
-	zs_free(zram->mem_pool, handle);
+		return 0;
 
 	atomic64_sub(get_slot_size(zram, index),
 		     &zram->stats.compr_data_size);
@@ -2051,6 +2047,15 @@ static void slot_free(struct zram *zram, u32 index)
 	atomic64_dec(&zram->stats.pages_stored);
 	set_slot_handle(zram, index, 0);
 	set_slot_size(zram, index, 0);
+
+	return handle;
+}
+
+static void slot_free(struct zram *zram, u32 index)
+{
+	unsigned long handle = slot_free_extract(zram, index);
+
+	zs_free(zram->mem_pool, handle);
 }
 
 static int read_same_filled_page(struct zram *zram, struct page *page,
@@ -2797,7 +2802,7 @@ static void zram_slot_free_notify(struct block_device *bdev,
 		return;
 	}
 
-	slot_free(zram, index);
+	zs_free_deferred(zram->mem_pool, slot_free_extract(zram, index));
 	slot_unlock(zram, index);
 }


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release
  2026-05-09  0:08 ` Nhat Pham
@ 2026-05-09  8:45   ` Wenchao Hao
  0 siblings, 0 replies; 12+ messages in thread
From: Wenchao Hao @ 2026-05-09  8:45 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Sergey Senozhatsky, Yosry Ahmed, Wenchao Hao

On Sat, May 9, 2026 at 8:08 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > Swap freeing can be expensive when unmapping a VMA containing many swap
> > entries. This has been reported to significantly delay memory reclamation
> > during Android's low-memory killing, especially when multiple processes
> > are terminated to free memory, with slot_free() accounting for more than
> > 80% of the total cost of freeing swap entries.
> >
> > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > to asynchronously collect and free swap entries [1][2], but the design
> > itself is fairly complex.
> >
> > When anon folios and swap entries are mixed within a process, reclaiming
> > anon folios from killed processes helps return memory to the system as
> > quickly as possible, so that newly launched applications can satisfy
> > their memory demands. It is not ideal for swap freeing to block anon
> > folio freeing. On the other hand, swap freeing can still return memory
> > to the system, although at a slower rate due to memory compression.
> >
> > This series introduces a callback-based deferred free framework in
> > zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> > define what gets buffered and how it gets drained. The entire free
> > path including caller-side bookkeeping (slot_free, zswap_entry_free)
> > is deferred to a background worker.
> >
> > Implementation:
> >   - Each CPU owns a single-page buffer. The hot path writes a value
> >     via the push callback with preemption disabled (no locks).
> >   - When the buffer fills, it is swapped with a fresh page from a
> >     pre-allocated page pool. The full page is queued to a WQ_UNBOUND
> >     worker for drain.
> >   - The drain callback performs the actual expensive work (zs_free,
> >     slot_free, zswap_entry_free, etc.) in batch, off the hot path.
> >   - If no free page is available, the caller falls back to synchronous
> >     processing.
> >
> > The speedup comes from moving expensive swap slot freeing off the
> > munmap hot path into a background worker, so that intact anonymous
> > folios are released back to the system without blocking. The worker
> > drains at a slower rate since compressed objects are small and freeing
> > a single handle may not release an entire page until the zspage is
> > fully empty.
> >
> > Performance results (Raspberry Pi 4B, ARM64, 8GB RAM):
> >
> > Test 1: munmap latency for 256MB swap-filled VMA (zram backend)
> >
> >   mode        Base       Patched     Speedup
> >   single      61.82ms    8.62ms      7.17x
> >   multi 2p    94.75ms    54.11ms     1.75x
> >   multi 3p    154.64ms   104.83ms    1.48x
> >
> > Test 2: munmap latency for different sizes (zram, single process)
> >
> >   Size       Base         Patched     Speedup
> >   64MB       14.11ms      2.18ms      6.47x
> >   128MB      29.45ms      4.48ms      6.57x
> >   192MB      43.85ms      6.62ms      6.62x
> >   256MB      57.01ms      9.08ms      6.28x
> >   512MB      115.13ms     55.58ms     2.07x
> >   1024MB     229.66ms     153.28ms    1.50x
> >
> > Test 3: munmap latency for 256MB swap-filled VMA (zswap backend)
> >
> >   mode        Base       Patched     Speedup
> >   single      152.14ms   51.26ms     2.97x
> >   multi 2p    186.56ms   105.42ms    1.77x
> >   multi 3p    205.83ms   153.32ms    1.34x
> >
> > Test 4: munmap latency for different sizes (zswap, single process)
> >
> >   Size       Base         Patched     Speedup
> >   64MB       37.83ms      13.26ms     2.85x
> >   128MB      75.11ms      26.73ms     2.81x
> >   256MB      150.78ms     52.97ms     2.85x
> >   512MB      303.04ms     130.38ms    2.32x
> >   1024MB     599.95ms     287.10ms    2.09x
> >
>
> Hmmm, why are we batching at the zswap/zsmalloc level like this? I
> agree with Yosry that this seems like somewhat of an unnecessary
> layering violation. For example, do we observe a lot more performance
> wins by doing this instead of just simply:
>

Thanks for the reply; see the following thread for the perf breakdown
and detailed data:

https://lore.kernel.org/linux-mm/CAOptpSPY3YL5VFJW9KKP99Yb17+_rdXKsKj93FdEn3_Zb350ow@mail.gmail.com/

> static void zswap_entry_free(swp_entry_t swp, bool deferred)
> {
>     ...
>     if (!deferred || !zs_deferred_free(entry->pool->zs_pool , entry->handle))
>         zs_free(entry->pool->zs_pool , entry->handle);
> }
>
> (basically what you had in the last version).
>
> One weird effect of doing deferred zswap entry freeing like what you
> are proposing here, is that the zswap LRU will be littered with stale
> zswap entries. Seems like you removed them from the zswap xarray, but
> they're still linked into the zswap LRU? At writeback time, that will
> throw off the statistics used in the heuristics, and will make
> writeback go through a bunch of stale entries, wasting more cycles :)
> Seems a bit inelegant, no?

You're right, that was an oversight -- thanks for pointing it out. The
zsmalloc-only variant avoids this entirely: zswap_lru_del() stays
synchronous before the handle is queued, so the LRU never contains
torn-down entries. I'll make sure v4 doesn't have this issue
regardless of which direction we go.
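
Concretely, in the callback variant that would mean keeping the LRU
unlink synchronous and deferring only the handle free -- a sketch,
reusing the bool-returning zs_deferred_free() form you suggested:

	zswap_lru_del(&zswap_list_lru, entry);
	if (!zs_deferred_free(entry->pool->zs_pool, entry->handle))
		zs_free(entry->pool->zs_pool, entry->handle);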


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops
  2026-05-09  0:29   ` Nhat Pham
@ 2026-05-09  8:47     ` Wenchao Hao
  0 siblings, 0 replies; 12+ messages in thread
From: Wenchao Hao @ 2026-05-09  8:47 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, Barry Song, Chengming Zhou, Jens Axboe,
	Johannes Weiner, linux-block, linux-kernel, linux-mm, Minchan Kim,
	Sergey Senozhatsky, Yosry Ahmed, Wenchao Hao

On Sat, May 9, 2026 at 8:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > Add a per-cpu deferred free mechanism to zsmalloc with a callback
> > interface that lets callers (zram, zswap) customize push and drain
> > behavior.
> >
> > Each CPU owns a single-page buffer. The hot path (zs_free_deferred)
> > writes a value into the current CPU's buffer via the push callback
> > with preemption disabled — no locks, no atomics. When the buffer
> > fills, it is swapped with a fresh page from a pre-allocated page
> > pool and the full page is queued to a WQ_UNBOUND worker for drain.
> >
> > The drain worker invokes the drain callback which performs the actual
> > expensive work (zs_free, slot_free, etc.) in batch, away from the
> > original hot path.
> >
> > Page pool management:
> >   - Pool is pre-allocated at enable time (ZS_DEFERRED_POOL_SIZE pages)
> >   - Full buffers are drained and returned to the pool
> >   - If no free page is available when buffer is full, the push falls
> >     back to synchronous processing by the caller
> >
> > Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
> > ---
> > +#define ZS_DEFERRED_POOL_SIZE  (256 * 1024 / PAGE_SIZE)
>
> Seems oddly specific? :) And this doesn't quite scale with number of
> CPUs, or memory size?
>

256KB of buffer pages holds the deferred metadata for roughly 128MB
of zswap entries or 256MB of zram entries, which matches what a killed
process typically has swapped out. Pages sitting in the pool are
memory that can't be used for anything else, so I didn't want it to
scale with RAM or CPU count. Happy to parameterize it if you'd
prefer.
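
A quick back-of-envelope, assuming 4KB pages (scale accordingly for
larger page sizes):

	pool pages:  256KB / 4KB      = 64
	zswap (u64): 64 * (4096 / 8)  = 32768 handles, ~128MB of 4KB folios
	zram (u32):  64 * (4096 / 4)  = 65536 slots, ~256MB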

> > +
> > +struct zs_deferred_percpu {
> > +       unsigned int count;
> > +       void *buf;
> > +};
> > +
> >  struct zs_pool {
> >         const char *name;
> >
> > @@ -217,6 +224,18 @@ struct zs_pool {
> >         /* protect zspage migration/compaction */
> >         rwlock_t lock;
> >         atomic_t compaction_in_progress;
> > +
> > +       /* per-cpu deferred free */
> > +       const struct zs_deferred_ops *deferred_ops;
> > +       void *deferred_private;
> > +       struct zs_deferred_percpu __percpu *deferred;
> > +       struct work_struct deferred_work;
> > +       struct workqueue_struct *deferred_wq;
> > +       struct list_head deferred_pool;
> > +       unsigned int deferred_pool_count;
> > +       spinlock_t deferred_pool_lock;
> > +       struct list_head deferred_drain_list;
> > +       spinlock_t deferred_drain_lock;
> >  };
> >
> >  static inline void zpdesc_set_first(struct zpdesc *zpdesc)
> > @@ -1416,6 +1435,171 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> >  }
> >  EXPORT_SYMBOL_GPL(zs_free);
> >
> > +static struct page *deferred_pool_get(struct zs_pool *pool)
> > +{
> > +       struct page *page = NULL;
> > +
> > +       spin_lock(&pool->deferred_pool_lock);
> > +       if (!list_empty(&pool->deferred_pool)) {
> > +               page = list_first_entry(&pool->deferred_pool, struct page, lru);
> > +               list_del(&page->lru);
> > +               pool->deferred_pool_count--;
> > +       }
> > +       spin_unlock(&pool->deferred_pool_lock);
> > +       return page;
> > +}
> > +
> > +static void deferred_pool_put(struct zs_pool *pool, struct page *page)
> > +{
> > +       spin_lock(&pool->deferred_pool_lock);
> > +       list_add_tail(&page->lru, &pool->deferred_pool);
> > +       pool->deferred_pool_count++;
> > +       spin_unlock(&pool->deferred_pool_lock);
> > +}
> > +
> > +static void zs_deferred_work_fn(struct work_struct *work)
> > +{
> > +       struct zs_pool *pool = container_of(work, struct zs_pool, deferred_work);
> > +       struct page *page;
> > +
> > +       while (true) {
> > +               unsigned int count;
> > +
> > +               spin_lock(&pool->deferred_drain_lock);
> > +               if (list_empty(&pool->deferred_drain_list)) {
> > +                       spin_unlock(&pool->deferred_drain_lock);
> > +                       break;
> > +               }
> > +               page = list_first_entry(&pool->deferred_drain_list,
> > +                                       struct page, lru);
> > +               list_del(&page->lru);
> > +               count = page_private(page);
> > +               spin_unlock(&pool->deferred_drain_lock);
> > +
> > +               pool->deferred_ops->drain(pool->deferred_private,
> > +                                         page_address(page), count);
> > +               deferred_pool_put(pool, page);
> > +               cond_resched();
> > +       }
> > +}
> > +
> > +bool zs_free_deferred(struct zs_pool *pool, unsigned long value)
> > +{
> > +       struct zs_deferred_percpu *def;
> > +       struct page *new_page, *full_page;
> > +       enum zs_push_ret ret;
> > +
> > +       if (!pool->deferred)
> > +               return false;
> > +
> > +       def = get_cpu_ptr(pool->deferred);
> > +
> > +       ret = pool->deferred_ops->push(def->buf, def->count, value);
> > +       if (ret == ZS_PUSH_OK) {
> > +               def->count++;
> > +               put_cpu_ptr(pool->deferred);
> > +               return true;
> > +       }
> > +
> > +       if (ret == ZS_PUSH_FULL_QUEUED)
> > +               def->count++;
> > +
> > +       new_page = deferred_pool_get(pool);
> > +       if (new_page) {
> > +               full_page = virt_to_page(def->buf);
> > +               set_page_private(full_page, def->count);
> > +               def->buf = page_address(new_page);
> > +               def->count = 0;
> > +
> > +               if (ret == ZS_PUSH_FULL) {
> > +                       pool->deferred_ops->push(def->buf, 0, value);
> > +                       def->count = 1;
> > +               }
> > +               put_cpu_ptr(pool->deferred);
> > +
> > +               spin_lock(&pool->deferred_drain_lock);
> > +               list_add_tail(&full_page->lru, &pool->deferred_drain_list);
> > +               spin_unlock(&pool->deferred_drain_lock);
> > +               queue_work(pool->deferred_wq, &pool->deferred_work);
> > +               return true;
> > +       }
> > +       put_cpu_ptr(pool->deferred);
> > +
> > +       /* ret==2: value already queued, will be drained eventually */
> > +       if (ret == 2)
>
> == 2? :)
>

Will replace with ZS_PUSH_FULL_QUEUED, if v4 still has
this logic.

Thanks,
Wenchao


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-05-09  8:48 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-08  6:07 [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 1/4] mm/zsmalloc: introduce deferred free framework with callback ops Wenchao Hao
2026-05-09  0:29   ` Nhat Pham
2026-05-09  8:47     ` Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 2/4] mm/zswap: use zsmalloc deferred free callback for async invalidate Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 3/4] zram: use zsmalloc deferred free callback for async slot free Wenchao Hao
2026-05-08  6:07 ` [RFC PATCH v3 4/4] zram: batch clear flags in slot_free with single write Wenchao Hao
2026-05-08 20:12 ` [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release Yosry Ahmed
2026-05-09  8:32   ` Wenchao Hao
2026-05-09  8:38     ` Wenchao Hao
2026-05-09  0:08 ` Nhat Pham
2026-05-09  8:45   ` Wenchao Hao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox