public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
@ 2026-04-21 12:16 Wenchao Hao
  2026-04-21 12:16 ` [RFC PATCH v2 1/4] mm:zsmalloc: drop class lock before freeing zspage Wenchao Hao
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Wenchao Hao @ 2026-04-21 12:16 UTC (permalink / raw)
  To: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm
  Cc: Barry Song, Xueyuan Chen, Wenchao Hao

Swap freeing can be expensive when unmapping a VMA containing
many swap entries. This has been reported to significantly
delay memory reclamation during Android's low-memory killing,
especially when multiple processes are terminated to free
memory, with slot_free() accounting for more than 80% of
the total cost of freeing swap entries.

Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
to asynchronously collect and free swap entries [1][2], but the
design itself is fairly complex.

When anon folios and swap entries are mixed within a
process, reclaiming anon folios from killed processes
helps return memory to the system as quickly as possible,
so that newly launched applications can satisfy their
memory demands. It is not ideal for swap freeing to block
anon folio freeing. On the other hand, swap freeing can
still return memory to the system, although at a slower
rate due to memory compression.

Therefore, we introduce a deferred-free (GC) worker so that
anon folio freeing and slot_free() can proceed in parallel:
slot_free() is performed asynchronously, maximizing the rate
at which memory is returned to the system.

This series takes two complementary approaches to reduce zs_free()
latency:

- Shrink zs_free() class->lock critical section by moving zspage
  freeing outside the lock.
- Defer zs_free() to a workqueue via zs_free_deferred(), benefiting
  both zram and zswap.

The deferred free approach builds on Barry Song's earlier RFC [3] with
changes based on community feedback: the optimization is moved to the
zsmalloc layer instead of zram; a fixed array stores handles (not
indices) with O(1) enqueue, avoiding memory allocation on the exit path
and data-consistency issues on slot reuse; and the array capacity
scales with PAGE_SIZE, derived from a fixed 128MB uncompressed-data
budget.
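
For reference, with 4KB pages that budget works out to
128MB / 4KB = 32768 handles, i.e. 256KB of handle array per pool;
with 16KB pages it is 8192 handles (64KB).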

Xueyuan's test on RK3588 with Barry's RFC v1 [3] shows that unmapping
a 256MB swap-filled VMA becomes 3.4x faster when pinning tasks to CPU2,
reducing the execution time from 63,102,982 ns to 18,570,726 ns.

A positive side effect is that async GC also slightly improves
do_swap_page() performance, as it no longer has to wait for
slot_free() to complete.

Xueyuan's test with Barry's RFC v1 [3] shows that swapping in 256MB of
data (each page filled with repeating patterns such as "1024 one",
"1024 two", "1024 three", and "1024 four") reduces execution time from
1,358,133,886 ns to 1,104,315,986 ns, achieving a ~1.23x speedup.

[1] https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@vivo.com/
[2] https://lore.kernel.org/all/20250909065349.574894-1-liulei.rjpt@vivo.com/
[3] https://lore.kernel.org/linux-mm/20260412060450.15813-1-baohua@kernel.org/

Xueyuan Chen (1):
  mm:zsmalloc: drop class lock before freeing zspage

Barry Song (Xiaomi) (1):
  zram: defer zs_free() in swap slot free notification path

Wenchao Hao (2):
  mm/zsmalloc: introduce zs_free_deferred() for async handle freeing
  mm/zswap: defer zs_free() in zswap_invalidate() path

 drivers/block/zram/zram_drv.c |  37 ++++++---
 include/linux/zsmalloc.h      |   2 +
 mm/zsmalloc.c                 | 141 ++++++++++++++++++++++++++++++++--
 mm/zswap.c                    |  16 ++++-
 4 files changed, 177 insertions(+), 19 deletions(-)

--
2.34.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH v2 1/4] mm:zsmalloc: drop class lock before freeing zspage
  2026-04-21 12:16 [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Wenchao Hao
@ 2026-04-21 12:16 ` Wenchao Hao
  2026-04-21 12:16 ` [RFC PATCH v2 2/4] mm/zsmalloc: introduce zs_free_deferred() for async handle freeing Wenchao Hao
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Wenchao Hao @ 2026-04-21 12:16 UTC (permalink / raw)
  To: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm
  Cc: Barry Song, Xueyuan Chen, Wenchao Hao

From: Xueyuan Chen <xueyuan.chen21@gmail.com>

Currently in zs_free(), the class->lock is held until the zspage is
completely freed and the counters are updated. However, freeing pages back
to the buddy allocator requires acquiring the zone lock.

Under heavy memory pressure, zone lock contention can be severe. When this
happens, the CPU holding the class->lock will stall waiting for the zone
lock, thereby blocking all other CPUs attempting to acquire the same
class->lock.

This patch shrinks the critical section of the class->lock to reduce lock
contention. By moving the actual page freeing process outside the
class->lock, we can improve the concurrency performance of zs_free().

Testing on the RADXA O6 platform shows that with 12 CPUs concurrently
performing zs_free() operations, the execution time is reduced by 20%.

Signed-off-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 mm/zsmalloc.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 63128ddb7959..40687c8a7469 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -801,13 +801,10 @@ static int trylock_zspage(struct zspage *zspage)
 	return 0;
 }
 
-static void __free_zspage(struct zs_pool *pool, struct size_class *class,
-				struct zspage *zspage)
+static inline void __free_zspage_lockless(struct zs_pool *pool, struct zspage *zspage)
 {
 	struct zpdesc *zpdesc, *next;
 
-	assert_spin_locked(&class->lock);
-
 	VM_BUG_ON(get_zspage_inuse(zspage));
 	VM_BUG_ON(zspage->fullness != ZS_INUSE_RATIO_0);
 
@@ -823,7 +820,13 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 	} while (zpdesc != NULL);
 
 	cache_free_zspage(zspage);
+}
 
+static void __free_zspage(struct zs_pool *pool, struct size_class *class,
+				struct zspage *zspage)
+{
+	assert_spin_locked(&class->lock);
+	__free_zspage_lockless(pool, zspage);
 	class_stat_sub(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
 	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
 }
@@ -1388,6 +1391,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 	unsigned long obj;
 	struct size_class *class;
 	int fullness;
+	struct zspage *zspage_to_free = NULL;
 
 	if (IS_ERR_OR_NULL((void *)handle))
 		return;
@@ -1408,10 +1412,22 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 	obj_free(class->size, obj);
 
 	fullness = fix_fullness_group(class, zspage);
-	if (fullness == ZS_INUSE_RATIO_0)
-		free_zspage(pool, class, zspage);
+	if (fullness == ZS_INUSE_RATIO_0) {
+		if (trylock_zspage(zspage)) {
+			remove_zspage(class, zspage);
+			class_stat_sub(class, ZS_OBJS_ALLOCATED,
+				class->objs_per_zspage);
+			zspage_to_free = zspage;
+		} else
+			kick_deferred_free(pool);
+	}
 
 	spin_unlock(&class->lock);
+
+	if (likely(zspage_to_free)) {
+		__free_zspage_lockless(pool, zspage_to_free);
+		atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
+	}
 	cache_free_handle(handle);
 }
 EXPORT_SYMBOL_GPL(zs_free);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH v2 2/4] mm/zsmalloc: introduce zs_free_deferred() for async handle freeing
  2026-04-21 12:16 [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Wenchao Hao
  2026-04-21 12:16 ` [RFC PATCH v2 1/4] mm:zsmalloc: drop class lock before freeing zspage Wenchao Hao
@ 2026-04-21 12:16 ` Wenchao Hao
  2026-04-21 19:46   ` Nhat Pham
  2026-04-21 12:16 ` [RFC PATCH v2 3/4] zram: defer zs_free() in swap slot free notification path Wenchao Hao
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Wenchao Hao @ 2026-04-21 12:16 UTC (permalink / raw)
  To: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm
  Cc: Barry Song, Xueyuan Chen, Wenchao Hao

zs_free() is expensive due to internal locking (pool->lock, class->lock)
and potential zspage freeing. On the process exit path, the slow
zs_free() blocks memory reclamation, delaying overall memory release.
This has been reported to significantly impact Android low-memory
killing where slot_free() accounts for over 80% of the total swap
entry freeing cost.

Introduce zs_free_deferred() which queues handles into a fixed-size
per-pool array for later processing by a workqueue. This allows callers
to defer the expensive zs_free() and return quickly, so the process
exit path can release memory faster. The array capacity is derived from
a 128MB uncompressed data budget (128MB >> PAGE_SHIFT entries), which
scales naturally with PAGE_SIZE. When the array reaches half capacity,
the workqueue is scheduled to drain pending handles.

zs_free_deferred() uses spin_trylock() to access the deferred queue.
If the lock is contended (e.g. drain in progress) or the queue is full,
it falls back to synchronous zs_free() to guarantee correctness.

Also introduce zs_free_deferred_flush() for use during pool teardown to
ensure all pending handles are freed.

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 include/linux/zsmalloc.h |   2 +
 mm/zsmalloc.c            | 111 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 478410c880b1..1e5ac1a39d41 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -30,6 +30,8 @@ void zs_destroy_pool(struct zs_pool *pool);
 unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags,
 			const int nid);
 void zs_free(struct zs_pool *pool, unsigned long obj);
+void zs_free_deferred(struct zs_pool *pool, unsigned long handle);
+void zs_free_deferred_flush(struct zs_pool *pool);
 
 size_t zs_huge_class_size(struct zs_pool *pool);
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 40687c8a7469..defc892555e4 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -53,6 +53,10 @@
 
 #define ZS_HANDLE_SIZE (sizeof(unsigned long))
 
+#define ZS_DEFERRED_FREE_MAX_BYTES	(128 << 20)
+#define ZS_DEFERRED_FREE_CAPACITY	(ZS_DEFERRED_FREE_MAX_BYTES >> PAGE_SHIFT)
+#define ZS_DEFERRED_FREE_THRESHOLD	(ZS_DEFERRED_FREE_CAPACITY / 2)
+
 /*
  * Object location (<PFN>, <obj_idx>) is encoded as
  * a single (unsigned long) handle value.
@@ -217,6 +221,13 @@ struct zs_pool {
 	/* protect zspage migration/compaction */
 	rwlock_t lock;
 	atomic_t compaction_in_progress;
+
+	/* deferred free support */
+	spinlock_t deferred_lock;
+	unsigned long *deferred_handles;
+	unsigned int deferred_count;
+	unsigned int deferred_capacity;
+	struct work_struct deferred_free_work;
 };
 
 static inline void zpdesc_set_first(struct zpdesc *zpdesc)
@@ -579,6 +590,19 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 }
 DEFINE_SHOW_ATTRIBUTE(zs_stats_size);
 
+static int zs_stats_deferred_show(struct seq_file *s, void *v)
+{
+	struct zs_pool *pool = s->private;
+
+	spin_lock(&pool->deferred_lock);
+	seq_printf(s, "pending: %u\n", pool->deferred_count);
+	seq_printf(s, "capacity: %u\n", pool->deferred_capacity);
+	spin_unlock(&pool->deferred_lock);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(zs_stats_deferred);
+
 static void zs_pool_stat_create(struct zs_pool *pool, const char *name)
 {
 	if (!zs_stat_root) {
@@ -590,6 +614,9 @@ static void zs_pool_stat_create(struct zs_pool *pool, const char *name)
 
 	debugfs_create_file("classes", S_IFREG | 0444, pool->stat_dentry, pool,
 			    &zs_stats_size_fops);
+	debugfs_create_file("deferred_free", S_IFREG | 0444,
+			    pool->stat_dentry, pool,
+			    &zs_stats_deferred_fops);
 }
 
 static void zs_pool_stat_destroy(struct zs_pool *pool)
@@ -1432,6 +1459,76 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 }
 EXPORT_SYMBOL_GPL(zs_free);
 
+static void zs_deferred_free_work(struct work_struct *work)
+{
+	struct zs_pool *pool = container_of(work, struct zs_pool,
+					    deferred_free_work);
+	unsigned long handle;
+
+	while (1) {
+		spin_lock(&pool->deferred_lock);
+		if (pool->deferred_count == 0) {
+			spin_unlock(&pool->deferred_lock);
+			break;
+		}
+		handle = pool->deferred_handles[--pool->deferred_count];
+		spin_unlock(&pool->deferred_lock);
+
+		zs_free(pool, handle);
+		cond_resched();
+	}
+}
+
+/**
+ * zs_free_deferred - queue a handle for asynchronous freeing
+ * @pool: pool to free from
+ * @handle: handle to free
+ *
+ * Place @handle into a deferred free queue for later processing by a
+ * workqueue.  This is intended for callers that are in atomic context
+ * (e.g. under a spinlock) and cannot afford the cost of zs_free()
+ * directly.  When the queue reaches a threshold the work is scheduled.
+ * Falls back to synchronous zs_free() if the lock is contended (drain
+ * in progress) or if the queue is full.
+ */
+void zs_free_deferred(struct zs_pool *pool, unsigned long handle)
+{
+	if (IS_ERR_OR_NULL((void *)handle))
+		return;
+
+	if (!spin_trylock(&pool->deferred_lock))
+		goto sync_free;
+
+	if (pool->deferred_count >= pool->deferred_capacity) {
+		spin_unlock(&pool->deferred_lock);
+		goto sync_free;
+	}
+
+	pool->deferred_handles[pool->deferred_count++] = handle;
+	if (pool->deferred_count >= ZS_DEFERRED_FREE_THRESHOLD)
+		queue_work(system_wq, &pool->deferred_free_work);
+	spin_unlock(&pool->deferred_lock);
+	return;
+
+sync_free:
+	zs_free(pool, handle);
+}
+EXPORT_SYMBOL_GPL(zs_free_deferred);
+
+/**
+ * zs_free_deferred_flush - flush all pending deferred frees
+ * @pool: pool to flush
+ *
+ * Wait for any scheduled work to complete, then drain any remaining
+ * handles.  Must be called from process context.
+ */
+void zs_free_deferred_flush(struct zs_pool *pool)
+{
+	flush_work(&pool->deferred_free_work);
+	zs_deferred_free_work(&pool->deferred_free_work);
+}
+EXPORT_SYMBOL_GPL(zs_free_deferred_flush);
+
 static void zs_object_copy(struct size_class *class, unsigned long dst,
 				unsigned long src)
 {
@@ -2099,6 +2196,18 @@ struct zs_pool *zs_create_pool(const char *name)
 	rwlock_init(&pool->lock);
 	atomic_set(&pool->compaction_in_progress, 0);
 
+	spin_lock_init(&pool->deferred_lock);
+	pool->deferred_capacity = ZS_DEFERRED_FREE_CAPACITY;
+	pool->deferred_handles = kvmalloc_array(pool->deferred_capacity,
+						sizeof(unsigned long),
+						GFP_KERNEL);
+	if (!pool->deferred_handles) {
+		kfree(pool);
+		return NULL;
+	}
+	pool->deferred_count = 0;
+	INIT_WORK(&pool->deferred_free_work, zs_deferred_free_work);
+
 	pool->name = kstrdup(name, GFP_KERNEL);
 	if (!pool->name)
 		goto err;
@@ -2201,6 +2310,7 @@ void zs_destroy_pool(struct zs_pool *pool)
 	int i;
 
 	zs_unregister_shrinker(pool);
+	zs_free_deferred_flush(pool);
 	zs_flush_migration(pool);
 	zs_pool_stat_destroy(pool);
 
@@ -2224,6 +2334,7 @@ void zs_destroy_pool(struct zs_pool *pool)
 		kfree(class);
 	}
 
+	kvfree(pool->deferred_handles);
 	kfree(pool->name);
 	kfree(pool);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH v2 3/4] zram: defer zs_free() in swap slot free notification path
  2026-04-21 12:16 [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Wenchao Hao
  2026-04-21 12:16 ` [RFC PATCH v2 1/4] mm:zsmalloc: drop class lock before freeing zspage Wenchao Hao
  2026-04-21 12:16 ` [RFC PATCH v2 2/4] mm/zsmalloc: introduce zs_free_deferred() for async handle freeing Wenchao Hao
@ 2026-04-21 12:16 ` Wenchao Hao
  2026-04-21 12:16 ` [RFC PATCH v2 4/4] mm/zswap: defer zs_free() in zswap_invalidate() path Wenchao Hao
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Wenchao Hao @ 2026-04-21 12:16 UTC (permalink / raw)
  To: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm
  Cc: Barry Song, Xueyuan Chen, Wenchao Hao

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

zram_slot_free_notify() is called on the process exit path when
unmapping swap entries. The slot_free() it calls internally invokes
zs_free(), which accounts for ~87% of slot_free() cost due to zsmalloc
internal locking (pool->lock, class->lock) and potential zspage freeing.
This blocks the process exit path, delaying overall memory release
during Android low-memory killing.

Split slot_free() into slot_free_extract() and the actual zs_free()
call. slot_free_extract() handles all slot metadata cleanup (clearing
flags, updating stats, zeroing handle/size) and returns the zsmalloc
handle that needs freeing. This separation has two benefits:

1. It makes the two responsibilities of slot_free() explicit: slot
   metadata management (must be done under slot lock) vs zsmalloc
   memory release (can be deferred).

2. It allows zram_slot_free_notify() to use zs_free_deferred() for
   the handle, deferring the expensive zs_free() to a workqueue so
   the exit path can release memory faster.

While at it, merge three separate clear_slot_flag() calls for
ZRAM_IDLE, ZRAM_INCOMPRESSIBLE, and ZRAM_PP_SLOT into a single
bitmask operation via clear_slot_flags_on_free(), reducing redundant
read-modify-write cycles on the same flags word.

All other slot_free() callers (write, discard, meta_free) continue
to use synchronous zs_free() through the unchanged slot_free()
wrapper.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 drivers/block/zram/zram_drv.c | 37 ++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index c2afd1c34f4a..382c4dc57c8d 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -165,6 +165,15 @@ static inline bool slot_allocated(struct zram *zram, u32 index)
 		test_slot_flag(zram, index, ZRAM_WB);
 }
 
+#define ZRAM_FLAGS_TO_CLEAR_ON_FREE	(BIT(ZRAM_IDLE) | \
+					 BIT(ZRAM_INCOMPRESSIBLE) | \
+					 BIT(ZRAM_PP_SLOT))
+
+static inline void clear_slot_flags_on_free(struct zram *zram, u32 index)
+{
+	zram->table[index].attr.flags &= ~ZRAM_FLAGS_TO_CLEAR_ON_FREE;
+}
+
 static inline void set_slot_comp_priority(struct zram *zram, u32 index,
 					  u32 prio)
 {
@@ -2000,17 +2009,20 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 	return true;
 }
 
-static void slot_free(struct zram *zram, u32 index)
+/*
+ * Clear slot metadata and extract the zsmalloc handle for freeing.
+ * Returns the handle that needs to be freed via zs_free(), or 0 if
+ * no zsmalloc freeing is needed (e.g. same-filled or writeback slots).
+ */
+static unsigned long slot_free_extract(struct zram *zram, u32 index)
 {
-	unsigned long handle;
+	unsigned long handle = 0;
 
 #ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
 	zram->table[index].attr.ac_time = 0;
 #endif
 
-	clear_slot_flag(zram, index, ZRAM_IDLE);
-	clear_slot_flag(zram, index, ZRAM_INCOMPRESSIBLE);
-	clear_slot_flag(zram, index, ZRAM_PP_SLOT);
+	clear_slot_flags_on_free(zram, index);
 	set_slot_comp_priority(zram, index, 0);
 
 	if (test_slot_flag(zram, index, ZRAM_HUGE)) {
@@ -2041,9 +2053,7 @@ static void slot_free(struct zram *zram, u32 index)
 
 	handle = get_slot_handle(zram, index);
 	if (!handle)
-		return;
-
-	zs_free(zram->mem_pool, handle);
+		return 0;
 
 	atomic64_sub(get_slot_size(zram, index),
 		     &zram->stats.compr_data_size);
@@ -2051,6 +2061,15 @@ static void slot_free(struct zram *zram, u32 index)
 	atomic64_dec(&zram->stats.pages_stored);
 	set_slot_handle(zram, index, 0);
 	set_slot_size(zram, index, 0);
+
+	return handle;
+}
+
+static void slot_free(struct zram *zram, u32 index)
+{
+	unsigned long handle = slot_free_extract(zram, index);
+
+	zs_free(zram->mem_pool, handle);
 }
 
 static int read_same_filled_page(struct zram *zram, struct page *page,
@@ -2794,7 +2813,7 @@ static void zram_slot_free_notify(struct block_device *bdev,
 		return;
 	}
 
-	slot_free(zram, index);
+	zs_free_deferred(zram->mem_pool, slot_free_extract(zram, index));
 	slot_unlock(zram, index);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH v2 4/4] mm/zswap: defer zs_free() in zswap_invalidate() path
  2026-04-21 12:16 [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Wenchao Hao
                   ` (2 preceding siblings ...)
  2026-04-21 12:16 ` [RFC PATCH v2 3/4] zram: defer zs_free() in swap slot free notification path Wenchao Hao
@ 2026-04-21 12:16 ` Wenchao Hao
  2026-04-21 17:03   ` Nhat Pham
  2026-04-21 15:54 ` [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Nhat Pham
  2026-04-26  4:13 ` Wenchao Hao
  5 siblings, 1 reply; 26+ messages in thread
From: Wenchao Hao @ 2026-04-21 12:16 UTC (permalink / raw)
  To: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm
  Cc: Barry Song, Xueyuan Chen, Wenchao Hao

zswap_invalidate() is called on the same process exit path as
zram_slot_free_notify(). The zswap_entry_free() it calls internally
performs zs_free() which is expensive due to zsmalloc internal locking.
Unlike zram which has a trylock fallback, zswap_invalidate() executes
unconditionally, making the latency impact potentially worse.

Like zram, the expensive zs_free() here blocks the process exit path,
delaying overall memory release. Additionally, zswap_entry_free()
performs extra work beyond zs_free(): list_lru_del() (takes its own
spinlock), obj_cgroup accounting, and kmem_cache_free for the entry
itself.

Use zs_free_deferred() in zswap_invalidate() path to defer the
expensive zsmalloc handle freeing to a workqueue, allowing the exit
path to release memory faster. All other callers (zswap_load,
zswap_writeback_entry, zswap_store error paths) run in process context
and continue to use synchronous zs_free().

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
---
 mm/zswap.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 0823cadd02b6..7291f6deb5b6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -713,11 +713,16 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
 /*
  * Carries out the common pattern of freeing an entry's zsmalloc allocation,
  * freeing the entry itself, and decrementing the number of stored pages.
+ * When @deferred is true, the zsmalloc handle is queued for async freeing
+ * instead of being freed immediately.
  */
-static void zswap_entry_free(struct zswap_entry *entry)
+static void __zswap_entry_free(struct zswap_entry *entry, bool deferred)
 {
 	zswap_lru_del(&zswap_list_lru, entry);
-	zs_free(entry->pool->zs_pool, entry->handle);
+	if (deferred)
+		zs_free_deferred(entry->pool->zs_pool, entry->handle);
+	else
+		zs_free(entry->pool->zs_pool, entry->handle);
 	zswap_pool_put(entry->pool);
 	if (entry->objcg) {
 		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
@@ -729,6 +734,11 @@ static void zswap_entry_free(struct zswap_entry *entry)
 	atomic_long_dec(&zswap_stored_pages);
 }
 
+static void zswap_entry_free(struct zswap_entry *entry)
+{
+	__zswap_entry_free(entry, false);
+}
+
 /*********************************
 * compressed storage functions
 **********************************/
@@ -1655,7 +1665,7 @@ void zswap_invalidate(swp_entry_t swp)
 
 	entry = xa_erase(tree, offset);
 	if (entry)
-		zswap_entry_free(entry);
+		__zswap_entry_free(entry, true);
 }
 
 int zswap_swapon(int type, unsigned long nr_pages)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-21 12:16 [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Wenchao Hao
                   ` (3 preceding siblings ...)
  2026-04-21 12:16 ` [RFC PATCH v2 4/4] mm/zswap: defer zs_free() in zswap_invalidate() path Wenchao Hao
@ 2026-04-21 15:54 ` Nhat Pham
  2026-04-21 17:17   ` Kairui Song
  2026-04-26  4:13 ` Wenchao Hao
  5 siblings, 1 reply; 26+ messages in thread
From: Nhat Pham @ 2026-04-21 15:54 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Sergey Senozhatsky, Yosry Ahmed, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao,
	Kairui Song

On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing
> many swap entries. This has been reported to significantly
> delay memory reclamation during Android's low-memory killing,
> especially when multiple processes are terminated to free
> memory, with slot_free() accounting for more than 80% of
> the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the
> design itself is fairly complex.
>
> When anon folios and swap entries are mixed within a
> process, reclaiming anon folios from killed processes
> helps return memory to the system as quickly as possible,
> so that newly launched applications can satisfy their
> memory demands. It is not ideal for swap freeing to block
> anon folio freeing. On the other hand, swap freeing can
> still return memory to the system, although at a slower
> rate due to memory compression.

Is this correct? I don't think we do decompression in
zswap_invalidate() path. We do decompression in zswap_load(), but as a
separate step from zswap_invalidate().

zswap/zsmalloc entry freeing is decoupled from decompression. For
example, on process teardown, we free the zsmalloc memory but never
decompress (if we do then it's a bug to be fixed lol, but I doubt it).

Zsmalloc freeing might not be worth as much bang-for-your-buck as
anon folio freeing, but if it's "expensive", then I think that points
to a different root cause: zsmalloc's poor scalability in the free
path.

I've stared at this code path for a bit, because my other patch series
(vswap - see [1]) was reported to display regression on the free path
on the usemem benchmark. And one of the issues was the contention
between compaction (both systemwide compaction, i.e. zs_page_migrate,
and zsmalloc's internal compaction, but mostly the former):

* zs_free read-acquires pool->lock, and compaction write-acquires the
same lock. So the compaction thread will make all zs free-ers wait for
it. I saw this read lock delay when I perfed the free step of usemem.

* If this lock has fair queueing semantics (I have not checked), then
if a compaction is queued behind a bunch of zs_free calls, all the
subsequent zs_free-ers are blocked :)

* I'm also curious about cache-friendliness of this rwlock, bouncing
across CPUs, if you have multiple processes being torn down
concurrently.

Have you perf-ed process teardown yet? Can I ask you for a perf trace
on this part? I'm not against async zs-freeing (might still be
required after all), but if it's something fixable on the zsmalloc
side, we should probably prioritize that :) Otherwise these swap
freeing workers will exhibit the same poor scalability behavior - we
might be better off because we manage to get rid of bigger chunks of
uncompressed memory first, but we will still be slow to release the
system's and the cgroup's (in zswap's case) compressed memory.

I'd love to hear thoughts from Yosry, Johannes, Sergey and Minchan
too.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 4/4] mm/zswap: defer zs_free() in zswap_invalidate() path
  2026-04-21 12:16 ` [RFC PATCH v2 4/4] mm/zswap: defer zs_free() in zswap_invalidate() path Wenchao Hao
@ 2026-04-21 17:03   ` Nhat Pham
  0 siblings, 0 replies; 26+ messages in thread
From: Nhat Pham @ 2026-04-21 17:03 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Sergey Senozhatsky, Yosry Ahmed, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao

On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> zswap_invalidate() is called on the same process exit path as
> zram_slot_free_notify(). The zswap_entry_free() it calls internally
> performs zs_free() which is expensive due to zsmalloc internal locking.
> Unlike zram which has a trylock fallback, zswap_invalidate() executes
> unconditionally, making the latency impact potentially worse.

Hmmm my understanding is that we don't have contention at this point,
because zswap mainly relies on swap cache to synchronize.

But yeah I can see the effect of slow zsmalloc entry freeing here.

>
> Like zram, the expensive zs_free() here blocks the process exit path,
> delaying overall memory release. Additionally, zswap_entry_free()
> performs extra work beyond zs_free(): list_lru_del() (takes its own
> spinlock), obj_cgroup accounting, and kmem_cache_free for the entry
> itself.
>
> Use zs_free_deferred() in zswap_invalidate() path to defer the
> expensive zsmalloc handle freeing to a workqueue, allowing the exit
> path to release memory faster. All other callers (zswap_load,
> zswap_writeback_entry, zswap_store error paths) run in process context
> and continue to use synchronous zs_free().

I wonder if this approach can speed up zswap_load() (i.e. page fault
latency) too?

Code LGTM correctness-wise (assuming zs_free_deferred works) :)

>
> Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
> ---
>  mm/zswap.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0823cadd02b6..7291f6deb5b6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -713,11 +713,16 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
>  /*
>   * Carries out the common pattern of freeing an entry's zsmalloc allocation,
>   * freeing the entry itself, and decrementing the number of stored pages.
> + * When @deferred is true, the zsmalloc handle is queued for async freeing
> + * instead of being freed immediately.
>   */
> -static void zswap_entry_free(struct zswap_entry *entry)
> +static void __zswap_entry_free(struct zswap_entry *entry, bool deferred)
>  {
>         zswap_lru_del(&zswap_list_lru, entry);
> -       zs_free(entry->pool->zs_pool, entry->handle);
> +       if (deferred)
> +               zs_free_deferred(entry->pool->zs_pool, entry->handle);
> +       else
> +               zs_free(entry->pool->zs_pool, entry->handle);
>         zswap_pool_put(entry->pool);
>         if (entry->objcg) {
>                 obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> @@ -729,6 +734,11 @@ static void zswap_entry_free(struct zswap_entry *entry)
>         atomic_long_dec(&zswap_stored_pages);
>  }
>
> +static void zswap_entry_free(struct zswap_entry *entry)
> +{
> +       __zswap_entry_free(entry, false);
> +}
> +
>  /*********************************
>  * compressed storage functions
>  **********************************/
> @@ -1655,7 +1665,7 @@ void zswap_invalidate(swp_entry_t swp)
>
>         entry = xa_erase(tree, offset);
>         if (entry)
> -               zswap_entry_free(entry);
> +               __zswap_entry_free(entry, true);
>  }
>
>  int zswap_swapon(int type, unsigned long nr_pages)
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-21 15:54 ` [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Nhat Pham
@ 2026-04-21 17:17   ` Kairui Song
  2026-04-21 18:07     ` Nhat Pham
  0 siblings, 1 reply; 26+ messages in thread
From: Kairui Song @ 2026-04-21 17:17 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Wenchao Hao, Andrew Morton, Chengming Zhou, Jens Axboe,
	Johannes Weiner, Minchan Kim, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm, Barry Song, Xueyuan Chen,
	Wenchao Hao

On Tue, Apr 21, 2026 at 11:55 PM Nhat Pham <nphamcs@gmail.com> wrote:
>

Thanks for adding me to the Cc list :) Barry started this idea with
ZRAM, and it looks very interesting to me.

> On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > Swap freeing can be expensive when unmapping a VMA containing
> > many swap entries. This has been reported to significantly
> > delay memory reclamation during Android's low-memory killing,
> > especially when multiple processes are terminated to free
> > memory, with slot_free() accounting for more than 80% of
> > the total cost of freeing swap entries.
> >
> > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > to asynchronously collect and free swap entries [1][2], but the
> > design itself is fairly complex.
> >
> > When anon folios and swap entries are mixed within a
> > process, reclaiming anon folios from killed processes
> > helps return memory to the system as quickly as possible,
> > so that newly launched applications can satisfy their
> > memory demands. It is not ideal for swap freeing to block
> > anon folio freeing. On the other hand, swap freeing can
> > still return memory to the system, although at a slower
> > rate due to memory compression.
>
> Is this correct? I don't think we do decompression in
> zswap_invalidate() path. We do decompression in zswap_load(), but as a
> separate step from zswap_invalidate().

It's not about decompression. I think what Wenchao means here is that
freeing the swap entry also releases the backing compressed data, but
compared to freeing an actual folio (which gives a whole free folio
back to reduce memory pressure), you may need to free a lot of swap
entries to get one folio's worth of memory back, because the
compressed data can be much smaller than a folio and is fragmented.
And swap entry freeing is still not fast enough to be ignored.

>
> zswap/zsmalloc entry freeing is decoupled from decompression. For
> example, on process teardown, we free the zsmalloc memory but never
> decompress (if we do then it's a bug to be fixed lol, but I doubt it).
>
> Zsmalloc freeing might not be worth as much bang-for-your-buck wise
> compared to anon folio freeing, but if it's "expensive", then I think
> that points to a different root-cause: zsmalloc's poor scalability in
> the free path.

That's a very nice insight. I had an idea previously: could we have
something like a bulk zs_free()? Freeing handles one by one does seem
expensive.
https://lore.kernel.org/linux-mm/adt3Q_SRToF6fb3W@KASONG-MC4/

It might be tricky to do so though.

It would be best if we could speed up everything: doing things async
doesn't reduce the total amount of work, and it might cause more
trouble, like worker overhead, or delayed freeing causing more memory
pressure if the workqueue doesn't run in time. Or a process might be
almost completely swapped out, in which case this won't help at all.

I'm not against the async idea, they might combine well.

>
> I've stared at this code path for a bit, because my other patch series
> (vswap - see [1]) was reported to display regression on the free path
> on the usemem benchmark. And one of the issues was the contention
> between compaction (both systemwide compaction, i.e zs_page_migrate,
> and zsmalloc's internal compaction, but mostly the former).:
>
> * zs_free read-acquires pool->lock, and compaction write-acquires the
> same lock. So the compaction thread will make all zs free-ers wait for
> it. I saw this read lock delay when I perfed the free step of usemem.
>
> * If this lock has fair queue-ing semantics (I have not checked), then
> if there a compaction is behind a bunch of zs_free in the queue, then
> all the subsequent zs_free's ers are blocked :)
>
> * I'm also curious about cache-friendliness of this rwlock, bouncing
> across CPUs, if you have multiple processes being torn down
> concurrently.

That's interesting. When I mentioned bulk zs free, I was thinking
that if we have a percpu queue, we could at least try to take the
read lock on every enqueue, free the whole queue if that succeeds,
then release the lock. I'm sure there are more ways to optimize that,
just a random idea :)
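
Something roughly like this, maybe (completely untested, just to
illustrate; zs_free_batch_locked() is a made-up helper that would do
the per-handle freeing without re-taking pool->lock, and a single
pool is assumed for simplicity):

struct zs_percpu_free_queue {
	unsigned long handles[64];
	unsigned int nr;
};
static DEFINE_PER_CPU(struct zs_percpu_free_queue, zs_free_queue);

static void zs_free_percpu(struct zs_pool *pool, unsigned long handle)
{
	struct zs_percpu_free_queue *q = get_cpu_ptr(&zs_free_queue);

	q->handles[q->nr++] = handle;

	/* try to drain the whole queue on every enqueue */
	if (read_trylock(&pool->lock)) {
		zs_free_batch_locked(pool, q->handles, q->nr);
		read_unlock(&pool->lock);
		q->nr = 0;
	} else if (q->nr == ARRAY_SIZE(q->handles)) {
		/* lock stayed contended and the queue is full: free one by one */
		while (q->nr)
			zs_free(pool, q->handles[--q->nr]);
	}
	put_cpu_ptr(&zs_free_queue);
}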

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-21 17:17   ` Kairui Song
@ 2026-04-21 18:07     ` Nhat Pham
  2026-04-21 18:25       ` Nhat Pham
  0 siblings, 1 reply; 26+ messages in thread
From: Nhat Pham @ 2026-04-21 18:07 UTC (permalink / raw)
  To: Kairui Song
  Cc: Wenchao Hao, Andrew Morton, Chengming Zhou, Jens Axboe,
	Johannes Weiner, Minchan Kim, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm, Barry Song, Xueyuan Chen,
	Wenchao Hao

On Tue, Apr 21, 2026 at 10:18 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Apr 21, 2026 at 11:55 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
>
> Thanks for adding me to the Cc list :), Barry started this idea with
> ZRAM, which looks very interesting to me.
>
> > On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > Swap freeing can be expensive when unmapping a VMA containing
> > > many swap entries. This has been reported to significantly
> > > delay memory reclamation during Android's low-memory killing,
> > > especially when multiple processes are terminated to free
> > > memory, with slot_free() accounting for more than 80% of
> > > the total cost of freeing swap entries.
> > >
> > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > to asynchronously collect and free swap entries [1][2], but the
> > > design itself is fairly complex.
> > >
> > > When anon folios and swap entries are mixed within a
> > > process, reclaiming anon folios from killed processes
> > > helps return memory to the system as quickly as possible,
> > > so that newly launched applications can satisfy their
> > > memory demands. It is not ideal for swap freeing to block
> > > anon folio freeing. On the other hand, swap freeing can
> > > still return memory to the system, although at a slower
> > > rate due to memory compression.
> >
> > Is this correct? I don't think we do decompression in
> > zswap_invalidate() path. We do decompression in zswap_load(), but as a
> > separate step from zswap_invalidate().
>
> It's not about decompression. I think what Wenchao means here is that:
> freeing the swap entry also releases the backing compression data, but
> compared to freeing an actual folio (which bring back a free folio to
> reduce memory pressure), you may need to free a lot of swap entries to
> free one whole folio, because the compressed data could be much
> smaller than folio and with fragmentation. And swap entry freeing is
> still not fast enough to be ignored.

Ah I see, yeah. That falls into the "not as much bang-for-your-buck
as folio freeing" category. I agree on this point.

>
> >
> > zswap/zsmalloc entry freeing is decoupled from decompression. For
> > example, on process teardown, we free the zsmalloc memory but never
> > decompress (if we do then it's a bug to be fixed lol, but I doubt it).
> >
> > Zsmalloc freeing might not be worth as much bang-for-your-buck wise
> > compared to anon folio freeing, but if it's "expensive", then I think
> > that points to a different root-cause: zsmalloc's poor scalability in
> > the free path.
>
> That's a very nice insight. I had an idea previously that can we have
> something like a zs free bulk? Freeing handles one by one does seem
> expensive.
> https://lore.kernel.org/linux-mm/adt3Q_SRToF6fb3W@KASONG-MC4/
>
> It might be tricky to do so though.
>
> It will be best if we can speed up everything, doing things async
> doesn't reduce the total amount of work, and might cause more trouble
> like worker overhead or delayed freeing causing more memory pressure,
> if the workqueue didn't run in time. Or maybe a process is almost
> completely swapped out, then this won't help at all.
>
> I'm not against the async idea, they might combine well.

Completely agree! I was thinking about batching the free operations
for zsmalloc. Right now it seems like even if we have a contiguous
range of swap slots to be freed, we call one
zram_slot_free_notify()/zswap_invalidate() at a time, which then calls
zs_free() one at a time? I wonder if there's any batching opportunity
here. Might be complicated with the pool lock and class lock dance in
zs_free() though :)

And yeah the async stuff is orthogonal too.

>
> >
> > I've stared at this code path for a bit, because my other patch series
> > (vswap - see [1]) was reported to display regression on the free path
> > on the usemem benchmark. And one of the issues was the contention
> > between compaction (both systemwide compaction, i.e zs_page_migrate,
> > and zsmalloc's internal compaction, but mostly the former).:
> >
> > * zs_free read-acquires pool->lock, and compaction write-acquires the
> > same lock. So the compaction thread will make all zs free-ers wait for
> > it. I saw this read lock delay when I perfed the free step of usemem.
> >
> > * If this lock has fair queue-ing semantics (I have not checked), then
> > if there a compaction is behind a bunch of zs_free in the queue, then
> > all the subsequent zs_free's ers are blocked :)
> >
> > * I'm also curious about cache-friendliness of this rwlock, bouncing
> > across CPUs, if you have multiple processes being torn down
> > concurrently.
>
> That's interesting, when I mentioned zs free bulk I was thinking that,
> if we have a percpu queue, at least we may try read lock that on every
> enqueue, free the whole queue if successful, then release the lock.
> I'm sure there are more ways to optimize that, just a random idea :)

Yep! Would be nice to have some perf trace to pinpoint where the overhead is.

On my end, I perfed the free phase of usemem. It varies a bit based on
exact build config, kernel version, or even between runs, but the
cheapest I've seen for the pool lock contention overhead is about 3%
of the free phase (this is on baseline, not vswap kernel). That's
pretty big (bigger than vswap overhead even on the kernels with vswap,
which is kinda silly). Obviously the host was very overcommitted, so
compaction was running in the background at the same time, but
still...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-21 18:07     ` Nhat Pham
@ 2026-04-21 18:25       ` Nhat Pham
  2026-04-22  0:34         ` Xueyuan Chen
  0 siblings, 1 reply; 26+ messages in thread
From: Nhat Pham @ 2026-04-21 18:25 UTC (permalink / raw)
  To: Kairui Song
  Cc: Wenchao Hao, Andrew Morton, Chengming Zhou, Jens Axboe,
	Johannes Weiner, Minchan Kim, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm, Barry Song, Xueyuan Chen,
	Wenchao Hao

On Tue, Apr 21, 2026 at 11:07 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Apr 21, 2026 at 10:18 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Tue, Apr 21, 2026 at 11:55 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> >
> > Thanks for adding me to the Cc list :), Barry started this idea with
> > ZRAM, which looks very interesting to me.
> >
> > > On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > > >
> > > > Swap freeing can be expensive when unmapping a VMA containing
> > > > many swap entries. This has been reported to significantly
> > > > delay memory reclamation during Android's low-memory killing,
> > > > especially when multiple processes are terminated to free
> > > > memory, with slot_free() accounting for more than 80% of
> > > > the total cost of freeing swap entries.
> > > >
> > > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > > to asynchronously collect and free swap entries [1][2], but the
> > > > design itself is fairly complex.
> > > >
> > > > When anon folios and swap entries are mixed within a
> > > > process, reclaiming anon folios from killed processes
> > > > helps return memory to the system as quickly as possible,
> > > > so that newly launched applications can satisfy their
> > > > memory demands. It is not ideal for swap freeing to block
> > > > anon folio freeing. On the other hand, swap freeing can
> > > > still return memory to the system, although at a slower
> > > > rate due to memory compression.
> > >
> > > Is this correct? I don't think we do decompression in
> > > zswap_invalidate() path. We do decompression in zswap_load(), but as a
> > > separate step from zswap_invalidate().
> >
> > It's not about decompression. I think what Wenchao means here is that:
> > freeing the swap entry also releases the backing compression data, but
> > compared to freeing an actual folio (which bring back a free folio to
> > reduce memory pressure), you may need to free a lot of swap entries to
> > free one whole folio, because the compressed data could be much
> > smaller than folio and with fragmentation. And swap entry freeing is
> > still not fast enough to be ignored.
>
> Ah I see yeah. That's the not "as much bang-for-your-buck" as folio
> freeing category. I agree on this point.
>
> >
> > >
> > > zswap/zsmalloc entry freeing is decoupled from decompression. For
> > > example, on process teardown, we free the zsmalloc memory but never
> > > decompress (if we do then it's a bug to be fixed lol, but I doubt it).
> > >
> > > Zsmalloc freeing might not be worth as much bang-for-your-buck wise
> > > compared to anon folio freeing, but if it's "expensive", then I think
> > > that points to a different root-cause: zsmalloc's poor scalability in
> > > the free path.
> >
> > That's a very nice insight. I had an idea previously that can we have
> > something like a zs free bulk? Freeing handles one by one does seem
> > expensive.
> > https://lore.kernel.org/linux-mm/adt3Q_SRToF6fb3W@KASONG-MC4/
> >
> > It might be tricky to do so though.
> >
> > It will be best if we can speed up everything, doing things async
> > doesn't reduce the total amount of work, and might cause more trouble
> > like worker overhead or delayed freeing causing more memory pressure,
> > if the workqueue didn't run in time. Or maybe a process is almost
> > completely swapped out, then this won't help at all.
> >
> > I'm not against the async idea, they might combine well.
>
> Completely agree! I was thinking about batching the free operations
> for zsmalloc. Right now seems like even if we have a contiguous range
> of swap slots to be freed, we call one
> zram_slot_free_notify/zswap_invalidate at a time, which then call
> zs_free one at a time? I wonder if there's any batching opportunity
> here. Might be complicated with the pool lock and class lock dance in
> zs_free() though :)
>
> And yeah the async stuff is orthogonal too.
>
> >
> > >
> > > I've stared at this code path for a bit, because my other patch series
> > > (vswap - see [1]) was reported to display regression on the free path
> > > on the usemem benchmark. And one of the issues was the contention
> > > between compaction (both systemwide compaction, i.e zs_page_migrate,
> > > and zsmalloc's internal compaction, but mostly the former).:
> > >
> > > * zs_free read-acquires pool->lock, and compaction write-acquires the
> > > same lock. So the compaction thread will make all zs free-ers wait for
> > > it. I saw this read lock delay when I perfed the free step of usemem.
> > >
> > > * If this lock has fair queue-ing semantics (I have not checked), then
> > > if there a compaction is behind a bunch of zs_free in the queue, then
> > > all the subsequent zs_free's ers are blocked :)
> > >
> > > * I'm also curious about cache-friendliness of this rwlock, bouncing
> > > across CPUs, if you have multiple processes being torn down
> > > concurrently.
> >
> > That's interesting, when I mentioned zs free bulk I was thinking that,
> > if we have a percpu queue, at least we may try read lock that on every
> > enqueue, free the whole queue if successful, then release the lock.
> > I'm sure there are more ways to optimize that, just a random idea :)
>
> Yep! Would be nice to have some perf trace to pinpoint where the overhead is.
>

Ah OK - I found this thread now:

https://lore.kernel.org/linux-mm/20260414054930.225853-1-xueyuan.chen21@gmail.com/

Hmm, free_zspage() and kmem_cache_free().

* kmem_cache_free() is just handle freeing. Bulk-freeing? (see the
snippet below)

* free_zspage() looks like just ordinary teardown work :( Seems like
we're not spinning on any lock here - we just trylock the backing
pages, and the rest is normal work. Not sure how to optimize this -
perhaps deferring is the only way.
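
On the kmem_cache_free() bulk-freeing point: slab already has a bulk
API, so if a deferred path collects handles into an array anyway,
something like this could hand them all back in one call (untested;
handles[]/n are whatever the caller batched up, and the handle cache
is pool->handle_cachep as today):

	/* handles[] holds n handles already unlinked from their zspages */
	kmem_cache_free_bulk(pool->handle_cachep, n, (void **)handles);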

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/zsmalloc: introduce zs_free_deferred() for async handle freeing
  2026-04-21 12:16 ` [RFC PATCH v2 2/4] mm/zsmalloc: introduce zs_free_deferred() for async handle freeing Wenchao Hao
@ 2026-04-21 19:46   ` Nhat Pham
  2026-04-21 21:42     ` Barry Song
  0 siblings, 1 reply; 26+ messages in thread
From: Nhat Pham @ 2026-04-21 19:46 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Sergey Senozhatsky, Yosry Ahmed, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao

On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> zs_free() is expensive due to internal locking (pool->lock, class->lock)
> and potential zspage freeing. On the process exit path, the slow
> zs_free() blocks memory reclamation, delaying overall memory release.
> This has been reported to significantly impact Android low-memory
> killing where slot_free() accounts for over 80% of the total swap
> entry freeing cost.
>
> Introduce zs_free_deferred() which queues handles into a fixed-size
> per-pool array for later processing by a workqueue. This allows callers
> to defer the expensive zs_free() and return quickly, so the process
> exit path can release memory faster. The array capacity is derived from
> a 128MB uncompressed data budget (128MB >> PAGE_SHIFT entries), which
> scales naturally with PAGE_SIZE. When the array reaches half capacity,
> the workqueue is scheduled to drain pending handles.
>
> zs_free_deferred() uses spin_trylock() to access the deferred queue.
> If the lock is contended (e.g. drain in progress) or the queue is full,
> it falls back to synchronous zs_free() to guarantee correctness.
>
> Also introduce zs_free_deferred_flush() for use during pool teardown to
> ensure all pending handles are freed.

Hmmm per-pool workqueue.

Does that mean that if you only have one zs pool (in the case of
zswap, or if you only have one zram device), you'll have less
concurrency in freeing up zsmalloc memory for process teardown? Would
this be problematic?

I think Kairui was also suggesting per-cpu-fying these batches/queues.

>
> Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
> ---
>  include/linux/zsmalloc.h |   2 +
>  mm/zsmalloc.c            | 111 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 113 insertions(+)
>
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index 478410c880b1..1e5ac1a39d41 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -30,6 +30,8 @@ void zs_destroy_pool(struct zs_pool *pool);
>  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags,
>                         const int nid);
>  void zs_free(struct zs_pool *pool, unsigned long obj);
> +void zs_free_deferred(struct zs_pool *pool, unsigned long handle);
> +void zs_free_deferred_flush(struct zs_pool *pool);
>
>  size_t zs_huge_class_size(struct zs_pool *pool);
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 40687c8a7469..defc892555e4 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -53,6 +53,10 @@
>
>  #define ZS_HANDLE_SIZE (sizeof(unsigned long))
>
> +#define ZS_DEFERRED_FREE_MAX_BYTES     (128 << 20)
> +#define ZS_DEFERRED_FREE_CAPACITY      (ZS_DEFERRED_FREE_MAX_BYTES >> PAGE_SHIFT)
> +#define ZS_DEFERRED_FREE_THRESHOLD     (ZS_DEFERRED_FREE_CAPACITY / 2)
> +
>  /*
>   * Object location (<PFN>, <obj_idx>) is encoded as
>   * a single (unsigned long) handle value.
> @@ -217,6 +221,13 @@ struct zs_pool {
>         /* protect zspage migration/compaction */
>         rwlock_t lock;
>         atomic_t compaction_in_progress;
> +
> +       /* deferred free support */
> +       spinlock_t deferred_lock;
> +       unsigned long *deferred_handles;
> +       unsigned int deferred_count;
> +       unsigned int deferred_capacity;
> +       struct work_struct deferred_free_work;
>  };
>
>  static inline void zpdesc_set_first(struct zpdesc *zpdesc)
> @@ -579,6 +590,19 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
>  }
>  DEFINE_SHOW_ATTRIBUTE(zs_stats_size);
>
> +static int zs_stats_deferred_show(struct seq_file *s, void *v)
> +{
> +       struct zs_pool *pool = s->private;
> +
> +       spin_lock(&pool->deferred_lock);
> +       seq_printf(s, "pending: %u\n", pool->deferred_count);
> +       seq_printf(s, "capacity: %u\n", pool->deferred_capacity);
> +       spin_unlock(&pool->deferred_lock);
> +
> +       return 0;
> +}
> +DEFINE_SHOW_ATTRIBUTE(zs_stats_deferred);
> +
>  static void zs_pool_stat_create(struct zs_pool *pool, const char *name)
>  {
>         if (!zs_stat_root) {
> @@ -590,6 +614,9 @@ static void zs_pool_stat_create(struct zs_pool *pool, const char *name)
>
>         debugfs_create_file("classes", S_IFREG | 0444, pool->stat_dentry, pool,
>                             &zs_stats_size_fops);
> +       debugfs_create_file("deferred_free", S_IFREG | 0444,
> +                           pool->stat_dentry, pool,
> +                           &zs_stats_deferred_fops);
>  }
>
>  static void zs_pool_stat_destroy(struct zs_pool *pool)
> @@ -1432,6 +1459,76 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>  }
>  EXPORT_SYMBOL_GPL(zs_free);
>
> +static void zs_deferred_free_work(struct work_struct *work)
> +{
> +       struct zs_pool *pool = container_of(work, struct zs_pool,
> +                                           deferred_free_work);
> +       unsigned long handle;
> +
> +       while (1) {
> +               spin_lock(&pool->deferred_lock);
> +               if (pool->deferred_count == 0) {
> +                       spin_unlock(&pool->deferred_lock);
> +                       break;
> +               }
> +               handle = pool->deferred_handles[--pool->deferred_count];
> +               spin_unlock(&pool->deferred_lock);

Any reason why we're locking, grabbing a handle, then unlocking, one
at a time? Why don't we just lock, grab all the handles (or at least
a batch of them), unlock, then process the handles one at a time?

We can also have a pair of handle arrays. Whenever the defer worker is
woken up, just swap the arrays under the lock, then free the handles
in the old array :)
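
i.e. something like this (untested sketch; deferred_spare is a
hypothetical second array sitting next to the fields added in this
patch):

static void zs_deferred_free_work(struct work_struct *work)
{
	struct zs_pool *pool = container_of(work, struct zs_pool,
					    deferred_free_work);
	unsigned long *handles;
	unsigned int i, count;

	spin_lock(&pool->deferred_lock);
	/*
	 * Detach the filled array and install the spare one, so
	 * zs_free_deferred() can keep enqueueing while we drain.
	 */
	handles = pool->deferred_handles;
	count = pool->deferred_count;
	pool->deferred_handles = pool->deferred_spare;
	pool->deferred_spare = handles;
	pool->deferred_count = 0;
	spin_unlock(&pool->deferred_lock);

	for (i = 0; i < count; i++) {
		zs_free(pool, handles[i]);
		cond_resched();
	}
}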


> +
> +               zs_free(pool, handle);
> +               cond_resched();
> +       }
> +}
> +
> +/**
> + * zs_free_deferred - queue a handle for asynchronous freeing
> + * @pool: pool to free from
> + * @handle: handle to free
> + *
> + * Place @handle into a deferred free queue for later processing by a
> + * workqueue.  This is intended for callers that are in atomic context
> + * (e.g. under a spinlock) and cannot afford the cost of zs_free()
> + * directly.  When the queue reaches a threshold the work is scheduled.
> + * Falls back to synchronous zs_free() if the lock is contended (drain
> + * in progress) or if the queue is full.
> + */
> +void zs_free_deferred(struct zs_pool *pool, unsigned long handle)
> +{
> +       if (IS_ERR_OR_NULL((void *)handle))
> +               return;
> +
> +       if (!spin_trylock(&pool->deferred_lock))
> +               goto sync_free;
> +
> +       if (pool->deferred_count >= pool->deferred_capacity) {
> +               spin_unlock(&pool->deferred_lock);
> +               goto sync_free;
> +       }
> +
> +       pool->deferred_handles[pool->deferred_count++] = handle;
> +       if (pool->deferred_count >= ZS_DEFERRED_FREE_THRESHOLD)
> +               queue_work(system_wq, &pool->deferred_free_work);
> +       spin_unlock(&pool->deferred_lock);
> +       return;
> +
> +sync_free:
> +       zs_free(pool, handle);
> +}
> +EXPORT_SYMBOL_GPL(zs_free_deferred);
> +
> +/**
> + * zs_free_deferred_flush - flush all pending deferred frees
> + * @pool: pool to flush
> + *
> + * Wait for any scheduled work to complete, then drain any remaining
> + * handles.  Must be called from process context.
> + */
> +void zs_free_deferred_flush(struct zs_pool *pool)
> +{
> +       flush_work(&pool->deferred_free_work);
> +       zs_deferred_free_work(&pool->deferred_free_work);
> +}
> +EXPORT_SYMBOL_GPL(zs_free_deferred_flush);
> +
>  static void zs_object_copy(struct size_class *class, unsigned long dst,
>                                 unsigned long src)
>  {
> @@ -2099,6 +2196,18 @@ struct zs_pool *zs_create_pool(const char *name)
>         rwlock_init(&pool->lock);
>         atomic_set(&pool->compaction_in_progress, 0);
>
> +       spin_lock_init(&pool->deferred_lock);
> +       pool->deferred_capacity = ZS_DEFERRED_FREE_CAPACITY;
> +       pool->deferred_handles = kvmalloc_array(pool->deferred_capacity,
> +                                               sizeof(unsigned long),
> +                                               GFP_KERNEL);
> +       if (!pool->deferred_handles) {
> +               kfree(pool);
> +               return NULL;
> +       }
> +       pool->deferred_count = 0;
> +       INIT_WORK(&pool->deferred_free_work, zs_deferred_free_work);
> +
>         pool->name = kstrdup(name, GFP_KERNEL);
>         if (!pool->name)
>                 goto err;
> @@ -2201,6 +2310,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>         int i;
>
>         zs_unregister_shrinker(pool);
> +       zs_free_deferred_flush(pool);
>         zs_flush_migration(pool);
>         zs_pool_stat_destroy(pool);
>
> @@ -2224,6 +2334,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>                 kfree(class);
>         }
>
> +       kvfree(pool->deferred_handles);
>         kfree(pool->name);
>         kfree(pool);
>  }
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/zsmalloc: introduce zs_free_deferred() for async handle freeing
  2026-04-21 19:46   ` Nhat Pham
@ 2026-04-21 21:42     ` Barry Song
  2026-04-23 16:40       ` Nhat Pham
  0 siblings, 1 reply; 26+ messages in thread
From: Barry Song @ 2026-04-21 21:42 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Wenchao Hao, Andrew Morton, Chengming Zhou, Jens Axboe,
	Johannes Weiner, Minchan Kim, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm, Xueyuan Chen, Wenchao Hao

On Wed, Apr 22, 2026 at 3:47 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > zs_free() is expensive due to internal locking (pool->lock, class->lock)
> > and potential zspage freeing. On the process exit path, the slow
> > zs_free() blocks memory reclamation, delaying overall memory release.
> > This has been reported to significantly impact Android low-memory
> > killing where slot_free() accounts for over 80% of the total swap
> > entry freeing cost.
> >
> > Introduce zs_free_deferred() which queues handles into a fixed-size
> > per-pool array for later processing by a workqueue. This allows callers
> > to defer the expensive zs_free() and return quickly, so the process
> > exit path can release memory faster. The array capacity is derived from
> > a 128MB uncompressed data budget (128MB >> PAGE_SHIFT entries), which
> > scales naturally with PAGE_SIZE. When the array reaches half capacity,
> > the workqueue is scheduled to drain pending handles.
> >
> > zs_free_deferred() uses spin_trylock() to access the deferred queue.
> > If the lock is contended (e.g. drain in progress) or the queue is full,
> > it falls back to synchronous zs_free() to guarantee correctness.
> >
> > Also introduce zs_free_deferred_flush() for use during pool teardown to
> > ensure all pending handles are freed.
>
> Hmmm per-pool workqueue.
>
> Does that mean that if you only have one zs pool (in the case of
> zswap, or if you only have one zram device), you'll have less
> concurrency in freeing up zsmalloc memory for process teardown? Would
> this be problematic?

I believe so, as reported in the original emails from Lei and Zhiguo,
which proposed introducing a swap-entry list for asynchronous freeing.

>
> I think Kairui was also suggesting per-cpu-fying these batches/queues.

I guess a per–size-class workqueue might strike a balance
between scalability and reducing lock contention across
multiple classes, where the locks actually reside.

Thanks
Barry

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-21 18:25       ` Nhat Pham
@ 2026-04-22  0:34         ` Xueyuan Chen
  0 siblings, 0 replies; 26+ messages in thread
From: Xueyuan Chen @ 2026-04-22  0:34 UTC (permalink / raw)
  To: nphamcs
  Cc: ryncsn, haowenchao22, akpm, chengming.zhou, axboe, hannes,
	minchan, senozhatsky, yosry, linux-block, linux-kernel, linux-mm,
	baohua, xueyuan.chen21, haowenchao


On Tue, Apr 21, 2026 at 11:25:17AM -0700, Nhat Pham wrote:

[...]

>Hmm, free_zspage() and kmem_cache_free().
>
>* kmem_cache_free() is just handle freeing. Bulk-freeing?
>
>* free_zspage() looks like just ordinary teardown work :( Seems like
>we're not spinning any lock here - we just try lock the backing pages,
>and the rest is normal work. Not sure how to optimize this - perhaps
>deferring is the only way.
>
>

Hi Nhat,

Currently, free_zspage() is called while holding the class->lock. 
However, free_zspage() eventually invokes folio_put(), which may acquire
the zone->lock.

This creates a nested lock dependency. If multiple CPUs contend for the
same class->lock and the current holder is stalled waiting for the
zone->lock, it significantly extends the hold time of the class->lock.
This causes other CPUs to wait much longer.

Here is the ftrace data showing the severe contention on class->lock.
Under contention, the time spent in queued_spin_lock_slowpath() jumps 
from ~1.3us to over 30us, significantly increasing the total latency
of zs_free().

  7)               |  zs_free() {
  7)   0.220 us    |    _raw_read_lock();
  7)               |    _raw_spin_lock() {
  7)   1.320 us    |      queued_spin_lock_slowpath();
  7)   1.820 us    |    }
  7)   0.170 us    |    _raw_read_unlock();
  7)   0.170 us    |    obj_free();
  7)   0.190 us    |    fix_fullness_group();
  7)   0.150 us    |    _raw_spin_unlock();
  7)   0.170 us    |    kmem_cache_free();
  7)   4.610 us    |  }

---------------------------------------------------------

  7)               |  zs_free() {
  7)   0.230 us    |    _raw_read_lock();
  7)               |    _raw_spin_lock() {
  7) + 30.100 us   |      queued_spin_lock_slowpath();
  7) + 30.600 us   |    }
  7)   0.200 us    |    _raw_read_unlock();
  7)   0.170 us    |    obj_free();
  7)   0.170 us    |    fix_fullness_group();
  7)   0.170 us    |    _raw_spin_unlock();
  7)   0.210 us    |    kmem_cache_free();
  7) + 33.850 us   |  }
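
Patch 1/4 shrinks that critical section by doing the zspage teardown
after class->lock is dropped. Roughly (a sketch of the idea only, not
the exact diff; the real change still has to detach the empty zspage
from the class lists while the lock is held):

    spin_lock(&class->lock);
    obj_free(class->size, obj);
    fullness = fix_fullness_group(class, zspage);
    /* an empty zspage is also unlinked from the class lists here */
    spin_unlock(&class->lock);

    if (fullness == ZS_INUSE_RATIO_0)
            /* folio_put() -> zone->lock now runs outside class->lock */
            free_zspage(pool, class, zspage);

With that, a holder stalled on zone->lock no longer extends the
class->lock hold time seen by other CPUs.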

Best regards,
Xueyuan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 2/4] mm/zsmalloc: introduce zs_free_deferred() for async handle freeing
  2026-04-21 21:42     ` Barry Song
@ 2026-04-23 16:40       ` Nhat Pham
  0 siblings, 0 replies; 26+ messages in thread
From: Nhat Pham @ 2026-04-23 16:40 UTC (permalink / raw)
  To: Barry Song
  Cc: Wenchao Hao, Andrew Morton, Chengming Zhou, Jens Axboe,
	Johannes Weiner, Minchan Kim, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm, Xueyuan Chen, Wenchao Hao

On Tue, Apr 21, 2026 at 2:42 PM Barry Song <baohua@kernel.org> wrote:
> On Wed, Apr 22, 2026 at 3:47 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > zs_free() is expensive due to internal locking (pool->lock, class->lock)
> > > and potential zspage freeing. On the process exit path, the slow
> > > zs_free() blocks memory reclamation, delaying overall memory release.
> > > This has been reported to significantly impact Android low-memory
> > > killing where slot_free() accounts for over 80% of the total swap
> > > entry freeing cost.
> > >
> > > Introduce zs_free_deferred() which queues handles into a fixed-size
> > > per-pool array for later processing by a workqueue. This allows callers
> > > to defer the expensive zs_free() and return quickly, so the process
> > > exit path can release memory faster. The array capacity is derived from
> > > a 128MB uncompressed data budget (128MB >> PAGE_SHIFT entries), which
> > > scales naturally with PAGE_SIZE. When the array reaches half capacity,
> > > the workqueue is scheduled to drain pending handles.
> > >
> > > zs_free_deferred() uses spin_trylock() to access the deferred queue.
> > > If the lock is contended (e.g. drain in progress) or the queue is full,
> > > it falls back to synchronous zs_free() to guarantee correctness.
> > >
> > > Also introduce zs_free_deferred_flush() for use during pool teardown to
> > > ensure all pending handles are freed.
> >
> > Hmmm per-pool workqueue.
> >
> > Does that mean that if you only have one zs pool (in the case of
> > zswap, or if you only have one zram device), you'll have less
> > concurrency in freeing up zsmalloc memory for process teardown? Would
> > this be problematic?
>
> I believe so, as reported in the original email from Lei and Zhiguo,
> which proposed introducing a swap entries list for async free.
>
> >
> > I think Kairui was also suggesting per-cpu-fying these batches/queues.
>
> I guess a per–size-class workqueue might strike a balance
> between scalability and reducing lock contention across
> multiple classes, where the locks actually reside.

Sounds good! Let the numbers decide :)

>
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-21 12:16 [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Wenchao Hao
                   ` (4 preceding siblings ...)
  2026-04-21 15:54 ` [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Nhat Pham
@ 2026-04-26  4:13 ` Wenchao Hao
  2026-04-26  8:50   ` Xueyuan Chen
  2026-04-27 18:17   ` Yosry Ahmed
  5 siblings, 2 replies; 26+ messages in thread
From: Wenchao Hao @ 2026-04-26  4:13 UTC (permalink / raw)
  To: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, Yosry Ahmed,
	linux-block, linux-kernel, linux-mm
  Cc: Barry Song, Xueyuan Chen, Wenchao Hao

On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing
> many swap entries. This has been reported to significantly
> delay memory reclamation during Android's low-memory killing,
> especially when multiple processes are terminated to free
> memory, with slot_free() accounting for more than 80% of
> the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the
> design itself is fairly complex.
>
Hi Nhat, Kairui, Barry, Xueyuan,

Thanks for the review. I agree with the direction and have some ideas for
an alternative approach.

My approach: first eliminate pool->lock from zs_free() itself, then defer
free to per-cpu buffers with a lockless handoff, and finally reduce
class->lock overhead during drain by exploiting natural class locality.
Achieving both per-cpu and per-class is difficult, so the class->lock
optimization is a compromise — but one that works well in practice.

1. Encode class_idx in obj to eliminate pool->lock

OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
(chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
for obj_idx, leaving 14 spare bits.
We can split OBJ_INDEX into class_idx + obj_idx:

    obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]

OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
(8 bits for 4K pages, 9 for 64K).
Since class_idx is invariant across migration (only PFN changes), zs_free()
can extract class_idx locklessly, then acquire class->lock and re-read obj for a
stable PFN. No pool->lock needed.
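
A rough sketch of the encode/decode side (the extra class_idx argument
and the helper names are illustrative, not the final patch):

    /* obj layout above the low tag bits:
     * [ PFN | class_idx : OBJ_CLASS_BITS | obj_idx : OBJ_IDX_BITS ]
     */
    #define OBJ_IDX_BITS    (OBJ_INDEX_BITS - OBJ_CLASS_BITS)

    static unsigned long location_to_obj(struct zpdesc *zpdesc,
                                         unsigned int class_idx,
                                         unsigned int obj_idx)
    {
            unsigned long obj;

            obj = zpdesc_pfn(zpdesc) << (OBJ_CLASS_BITS + OBJ_IDX_BITS);
            obj |= (unsigned long)class_idx << OBJ_IDX_BITS;
            obj |= obj_idx & ((1UL << OBJ_IDX_BITS) - 1);

            return obj << OBJ_TAG_BITS;
    }

    static unsigned int obj_to_class_idx(unsigned long obj)
    {
            return (obj >> (OBJ_TAG_BITS + OBJ_IDX_BITS)) &
                   ((1UL << OBJ_CLASS_BITS) - 1);
    }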

2. Per-cpu deferred free with lockless buffer swap

Defer zs_free() to per-cpu dynamically-allocated buffers (~2048 entries).
Enqueue: one array write + WRITE_ONCE under preempt_disable — no lock,
no atomic. When buffers full, schedule a drain worker; overflow falls back
to sync zs_free().

Drain: allocate a fresh buffer, swap it in, reset count. Since
the producer stops writing at count==SIZE, the handoff is
race-free without any lock.

Pseudo-code:

    /* enqueue - hot path */
    def = get_cpu_ptr(pool->deferred);
    if (def->count < SIZE) {
        def->handles[def->count] = handle;
        WRITE_ONCE(def->count, def->count + 1);
        if (def->count == SIZE)
            schedule_work(&pool->drain_work);
    } else {
        zs_free(pool, handle);  /* fallback */
    }
    put_cpu_ptr(pool->deferred);

    /* drain - worker */
    for_each_possible_cpu(cpu) {
        def = per_cpu_ptr(pool->deferred, cpu);
        if (def->count < SIZE)
            continue;
        new_buf = kvmalloc_array(SIZE, sizeof(long));
        old_buf = def->handles;
        old_count = def->count;
        def->handles = new_buf;
        WRITE_ONCE(def->count, 0);
        /* now drain old_buf[0..old_count-1] */
        ...
        kvfree(old_buf);
    }

3. Consecutive-class batching during drain

The drain worker extracts class_idx from each handle locklessly, and holds
class->lock across consecutive same-class handles.
On the exit path, compressed sizes tend to cluster, so consecutive handles
naturally share the same class — giving batch-like lock
amortization without sorting.

Pseudo-code:

    cur_cls = -1;
    for (i = 0; i < count; i++) {
        obj = handle_to_obj(handles[i]);
        cls = obj_to_class_idx(obj);
        if (cls != cur_cls) {
            if (cur_cls >= 0)
                spin_unlock(&pool->size_class[cur_cls]->lock);
            spin_lock(&pool->size_class[cls]->lock);
            cur_cls = cls;
        }
        __zs_free(pool, handles[i]);  /* free under lock */
    }
    if (cur_cls >= 0)
        spin_unlock(&pool->size_class[cur_cls]->lock);

---

Benefits over current mainline:
- Removes pool->lock from zs_free() entirely
- Deferred free path is nearly zero-cost
- class->lock is amortized across batches instead of acquired per-handle
- Producer-consumer handoff is fully lockless

I've prototyped this on 64-bit and it works. Still need to sort out
32-bit compatibility and Kconfig gating. Does this direction look reasonable?

Thanks,
Wenchao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-26  4:13 ` Wenchao Hao
@ 2026-04-26  8:50   ` Xueyuan Chen
  2026-04-27  3:10     ` Wenchao Hao
  2026-04-27 18:17   ` Yosry Ahmed
  1 sibling, 1 reply; 26+ messages in thread
From: Xueyuan Chen @ 2026-04-26  8:50 UTC (permalink / raw)
  To: haowenchao22
  Cc: akpm, chengming.zhou, axboe, hannes, minchan, nphamcs,
	senozhatsky, yosry, linux-block, linux-kernel, linux-mm, baohua,
	xueyuan.chen21, haowenchao


On Sun, Apr 26, 2026 at 12:13:02PM +0800, Wenchao Hao wrote:

[...]

>2. Per-cpu deferred free with lockless buffer swap
>
>Defer zs_free() to per-cpu dynamically-allocated buffers (~2048 entries).
>Enqueue: one array write + WRITE_ONCE under preempt_disable — no lock,
>no atomic. When buffers full, schedule a drain worker; overflow falls back
>to sync zs_free().
>
>Drain: allocate a fresh buffer, swap it in, reset count. Since
>the producer stops writing at count==SIZE, the handoff is
>race-free without any lock.
>
>Pseudo-code:
>
>    /* enqueue - hot path */
>    def = get_cpu_ptr(pool->deferred);
>    if (def->count < SIZE) {
>        def->handles[def->count] = handle;
>        WRITE_ONCE(def->count, def->count + 1);
>        if (def->count == SIZE)
>            schedule_work(&pool->drain_work);
>    } else {
>        zs_free(pool, handle);  /* fallback */
>    }
>    put_cpu_ptr(pool->deferred);
>
>    /* drain - worker */
>    for_each_possible_cpu(cpu) {
>        def = per_cpu_ptr(pool->deferred, cpu);
>        if (def->count < SIZE)
>            continue;
>        new_buf = kvmalloc_array(SIZE, sizeof(long));
>        old_buf = def->handles;
>        old_count = def->count;
>        def->handles = new_buf;
>        WRITE_ONCE(def->count, 0);
>        /* now drain old_buf[0..old_count-1] */
>        ...
>        kvfree(old_buf);
>    }
>

Hi Wenchao,

I suspect there is a memory ordering issue here:

def->handles = new_buf;
WRITE_ONCE(def->count, 0);

Since there are no explicit memory barriers, we cannot guarantee the 
order of these stores. If def->count is cleared to 0 first, an enqueue 
might end up operating on the old_buf.

This race condition is more likely to be triggered when the size is
smaller. Perhaps we should consider using smp_store_release() to enforce
the ordering?

Thanks
Xueyuan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-26  8:50   ` Xueyuan Chen
@ 2026-04-27  3:10     ` Wenchao Hao
  0 siblings, 0 replies; 26+ messages in thread
From: Wenchao Hao @ 2026-04-27  3:10 UTC (permalink / raw)
  To: Xueyuan Chen
  Cc: akpm, chengming.zhou, axboe, hannes, minchan, nphamcs,
	senozhatsky, yosry, linux-block, linux-kernel, linux-mm, baohua,
	haowenchao

On Sun, Apr 26, 2026 at 4:50 PM Xueyuan Chen <xueyuan.chen21@gmail.com> wrote:
>
>
> On Sun, Apr 26, 2026 at 12:13:02PM +0800, Wenchao Hao wrote:
>
> [...]
>
> >2. Per-cpu deferred free with lockless buffer swap
> >
> >Defer zs_free() to per-cpu dynamically-allocated buffers (~2048 entries).
> >Enqueue: one array write + WRITE_ONCE under preempt_disable — no lock,
> >no atomic. When buffers full, schedule a drain worker; overflow falls back
> >to sync zs_free().
> >
> >Drain: allocate a fresh buffer, swap it in, reset count. Since
> >the producer stops writing at count==SIZE, the handoff is
> >race-free without any lock.
> >
> >Pseudo-code:
> >
> >    /* enqueue - hot path */
> >    def = get_cpu_ptr(pool->deferred);
> >    if (def->count < SIZE) {
> >        def->handles[def->count] = handle;
> >        WRITE_ONCE(def->count, def->count + 1);
> >        if (def->count == SIZE)
> >            schedule_work(&pool->drain_work);
> >    } else {
> >        zs_free(pool, handle);  /* fallback */
> >    }
> >    put_cpu_ptr(pool->deferred);
> >
> >    /* drain - worker */
> >    for_each_possible_cpu(cpu) {
> >        def = per_cpu_ptr(pool->deferred, cpu);
> >        if (def->count < SIZE)
> >            continue;
> >        new_buf = kvmalloc_array(SIZE, sizeof(long));
> >        old_buf = def->handles;
> >        old_count = def->count;
> >        def->handles = new_buf;
> >        WRITE_ONCE(def->count, 0);
> >        /* now drain old_buf[0..old_count-1] */
> >        ...
> >        kvfree(old_buf);
> >    }
> >
>
> Hi Wenchao,
>
> I suspect there is a memory ordering issue here:
>
> def->handles = new_buf;
> WRITE_ONCE(def->count, 0);
>
> Since there are no explicit memory barriers, we cannot guarantee the
> order of these stores. If def->count is cleared to 0 first, an enqueue
> might end up operating on the old_buf.
>
> This race condition is more likely to be triggered when the size is
> smaller. Perhaps we should consider using smp_store_release() to enforce
> the ordering?
>

Hi Xueyuan,

Good catch! You are right — there is a memory ordering issue between
the handles pointer swap and the count reset.

I'll fix this in the next version by using smp_store_release() /
smp_load_acquire() pairs:

    /* drain - worker */
    def->handles = new_buf;
    smp_store_release(&def->count, 0);

    /* enqueue - producer */
    count = smp_load_acquire(&def->count);
    if (count < SIZE) {
        def->handles[count] = handle;
        smp_store_release(&def->count, count + 1);
    }

This ensures the producer always observes the new handles pointer
before it sees count reset to 0. Will include this fix when posting
the formal patch series.

Thanks,
Wenchao

> Thanks
> Xueyuan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-26  4:13 ` Wenchao Hao
  2026-04-26  8:50   ` Xueyuan Chen
@ 2026-04-27 18:17   ` Yosry Ahmed
  2026-04-28 13:51     ` Wenchao Hao
  1 sibling, 1 reply; 26+ messages in thread
From: Yosry Ahmed @ 2026-04-27 18:17 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao

On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > Swap freeing can be expensive when unmapping a VMA containing
> > many swap entries. This has been reported to significantly
> > delay memory reclamation during Android's low-memory killing,
> > especially when multiple processes are terminated to free
> > memory, with slot_free() accounting for more than 80% of
> > the total cost of freeing swap entries.
> >
> > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > to asynchronously collect and free swap entries [1][2], but the
> > design itself is fairly complex.
> >
> Hi Nhat, Kairui, Barry, Xueyuan,
>
> Thanks for the review. I agree with the direction and have some ideas for
> an alternative approach.
>
> My approach: first eliminate pool->lock from zs_free() itself, then defer
> free to per-cpu buffers with a lockless handoff, and finally reduce
> class->lock overhead during drain by exploiting natural class locality.
> Achieving both per-cpu and per-class is difficult, so the class->lock
> optimization is a compromise — but one that works well in practice.
>
> 1. Encode class_idx in obj to eliminate pool->lock
>
> OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> for obj_idx, leaving 14 spare bits.
> We can split OBJ_INDEX into class_idx + obj_idx:
>
>     obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
>
> OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> (8 bits for 4K pages, 9 for 64K).
> Since class_idx is invariant across migration (only PFN changes), zs_free()
> can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> stable PFN. No pool->lock needed.

How much of the benefit do we get with just these locking improvements
without having to defer any of the freeing work?

As others have pointed out, I don't want to just defer expensive work
without understanding why it's expensive and running into limitations
about why it cannot be improved without deferring.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-27 18:17   ` Yosry Ahmed
@ 2026-04-28 13:51     ` Wenchao Hao
  2026-04-28 13:55       ` Wenchao Hao
                         ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Wenchao Hao @ 2026-04-28 13:51 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao

On Tue, Apr 28, 2026 at 2:17 AM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> >
> > On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > Swap freeing can be expensive when unmapping a VMA containing
> > > many swap entries. This has been reported to significantly
> > > delay memory reclamation during Android's low-memory killing,
> > > especially when multiple processes are terminated to free
> > > memory, with slot_free() accounting for more than 80% of
> > > the total cost of freeing swap entries.
> > >
> > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > to asynchronously collect and free swap entries [1][2], but the
> > > design itself is fairly complex.
> > >
> > Hi Nhat, Kairui, Barry, Xueyuan,
> >
> > Thanks for the review. I agree with the direction and have some ideas for
> > an alternative approach.
> >
> > My approach: first eliminate pool->lock from zs_free() itself, then defer
> > free to per-cpu buffers with a lockless handoff, and finally reduce
> > class->lock overhead during drain by exploiting natural class locality.
> > Achieving both per-cpu and per-class is difficult, so the class->lock
> > optimization is a compromise — but one that works well in practice.
> >
> > 1. Encode class_idx in obj to eliminate pool->lock
> >
> > OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> > (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> > for obj_idx, leaving 14 spare bits.
> > We can split OBJ_INDEX into class_idx + obj_idx:
> >
> >     obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
> >
> > OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> > (8 bits for 4K pages, 9 for 64K).
> > Since class_idx is invariant across migration (only PFN changes), zs_free()
> > can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> > stable PFN. No pool->lock needed.
>
> How much of the benefit do we get with just these locking improvements
> without having to defer any of the freeing work?
>

Hi Yosry,

Thanks for the review. Great question — we tested exactly this.

With only the class_idx-in-obj encoding (eliminating pool->lock from
zs_free, no deferred freeing), we measured on two platforms.

Test: each process independently mmap 256MB, write data, madvise
MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.

Raspberry Pi 4B (4-core ARM64 Cortex-A72):

  mode        Base       ClassIdx-only   Speedup
  single      59.0ms     56.0ms          1.05x
  multi 2p    94.6ms     66.7ms          1.42x
  multi 4p    202.9ms    110.6ms         1.83x

x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):

  mode        Base       ClassIdx-only   Speedup
  single      11.7ms     9.8ms           1.19x
  multi 2p    24.1ms     17.2ms          1.40x
  multi 4p    63.0ms     45.3ms          1.39x

Single-process shows modest improvement. With multiple processes,
each read_lock/read_unlock atomically modifies the shared rwlock
reader count, and the cost of these atomic operations increases
with more CPUs accessing the same cacheline concurrently.
Eliminating pool->lock removes this overhead entirely.

This only works on 64-bit systems where OBJ_INDEX_BITS has enough
spare bits to fit class_idx. 32-bit systems don't have the room.
I'm still working on the compile-time gating to properly enable
this based on architecture and page size configuration.

> As others have pointed out, I don't want to just defer expensive work
> without understanding why it's expensive and running into limitations
> about why it cannot be improved without deferring.

For the deferred freeing part: the class_idx-in-obj optimization
addresses the multi-process scenario where concurrent atomic
operations on pool->lock become expensive, but does not help
single-process munmap. Deferred freeing moves the entire zs_free
cost (including class->lock and zspage freeing) off the munmap
hot path, which benefits even single-process workloads. The two
optimizations are complementary.

Thanks,
Wenchao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-28 13:51     ` Wenchao Hao
@ 2026-04-28 13:55       ` Wenchao Hao
  2026-04-29 22:44       ` Yosry Ahmed
  2026-05-02  7:21       ` Nhat Pham
  2 siblings, 0 replies; 26+ messages in thread
From: Wenchao Hao @ 2026-04-28 13:55 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao

On Tue, Apr 28, 2026 at 9:51 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> On Tue, Apr 28, 2026 at 2:17 AM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > > >
> > > > Swap freeing can be expensive when unmapping a VMA containing
> > > > many swap entries. This has been reported to significantly
> > > > delay memory reclamation during Android's low-memory killing,
> > > > especially when multiple processes are terminated to free
> > > > memory, with slot_free() accounting for more than 80% of
> > > > the total cost of freeing swap entries.
> > > >
> > > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > > to asynchronously collect and free swap entries [1][2], but the
> > > > design itself is fairly complex.
> > > >
> > > Hi Nhat, Kairui, Barry, Xueyuan,
> > >
> > > Thanks for the review. I agree with the direction and have some ideas for
> > > an alternative approach.
> > >
> > > My approach: first eliminate pool->lock from zs_free() itself, then defer
> > > free to per-cpu buffers with a lockless handoff, and finally reduce
> > > class->lock overhead during drain by exploiting natural class locality.
> > > Achieving both per-cpu and per-class is difficult, so the class->lock
> > > optimization is a compromise — but one that works well in practice.
> > >
> > > 1. Encode class_idx in obj to eliminate pool->lock
> > >
> > > OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> > > (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> > > for obj_idx, leaving 14 spare bits.
> > > We can split OBJ_INDEX into class_idx + obj_idx:
> > >
> > >     obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
> > >
> > > OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> > > (8 bits for 4K pages, 9 for 64K).
> > > Since class_idx is invariant across migration (only PFN changes), zs_free()
> > > can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> > > stable PFN. No pool->lock needed.
> >
> > How much of the benefit do we get with just these locking improvements
> > without having to defer any of the freeing work?
> >
>
> Hi Yosry,
>
> Thanks for the review. Great question — we tested exactly this.
>
> With only the class_idx-in-obj encoding (eliminating pool->lock from
> zs_free, no deferred freeing), we measured on two platforms.
>
> Test: each process independently mmap 256MB, write data, madvise
> MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
>
> Raspberry Pi 4B (4-core ARM64 Cortex-A72):
>
>   mode        Base       ClassIdx-only   Speedup
>   single      59.0ms     56.0ms          1.05x
>   multi 2p    94.6ms     66.7ms          1.42x
>   multi 4p    202.9ms    110.6ms         1.83x
>
> x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
>
>   mode        Base       ClassIdx-only   Speedup
>   single      11.7ms     9.8ms           1.19x
>   multi 2p    24.1ms     17.2ms          1.40x
>   multi 4p    63.0ms     45.3ms          1.39x
>

Correction on the x86 test description: the machine is a 20-core
Intel i7-12700, not 4-core. The test only ran 4 concurrent
processes. The multi 4p result (1.39x) is with 4 out of 20 cores
active — pool->lock contention would be higher with more
concurrent processes on this machine.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-28 13:51     ` Wenchao Hao
  2026-04-28 13:55       ` Wenchao Hao
@ 2026-04-29 22:44       ` Yosry Ahmed
  2026-04-30  7:38         ` Wenchao Hao
  2026-05-02  7:21       ` Nhat Pham
  2 siblings, 1 reply; 26+ messages in thread
From: Yosry Ahmed @ 2026-04-29 22:44 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao

> > How much of the benefit do we get with just these locking improvements
> > without having to defer any of the freeing work?
> >
>
> Hi Yosry,
>
> Thanks for the review. Great question — we tested exactly this.
>
> With only the class_idx-in-obj encoding (eliminating pool->lock from
> zs_free, no deferred freeing), we measured on two platforms.
>
> Test: each process independently mmap 256MB, write data, madvise
> MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
>
> Raspberry Pi 4B (4-core ARM64 Cortex-A72):
>
>   mode        Base       ClassIdx-only   Speedup
>   single      59.0ms     56.0ms          1.05x
>   multi 2p    94.6ms     66.7ms          1.42x
>   multi 4p    202.9ms    110.6ms         1.83x
>
> x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
>
>   mode        Base       ClassIdx-only   Speedup
>   single      11.7ms     9.8ms           1.19x
>   multi 2p    24.1ms     17.2ms          1.40x
>   multi 4p    63.0ms     45.3ms          1.39x
>
> Single-process shows modest improvement. With multiple processes,
> each read_lock/read_unlock atomically modifies the shared rwlock
> reader count, and the cost of these atomic operations increases
> with more CPUs accessing the same cacheline concurrently.
> Eliminating pool->lock removes this overhead entirely.
>
> This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> spare bits to fit class_idx. 32-bit systems don't have the room.
> I'm still working on the compile-time gating to properly enable
> this based on architecture and page size configuration.
>
> > As others have pointed out, I don't want to just defer expensive work
> > without understanding why it's expensive and running into limitations
> > about why it cannot be improved without deferring.
>
> For the deferred freeing part: the class_idx-in-obj optimization
> addresses the multi-process scenario where concurrent atomic
> operations on pool->lock become expensive, but does not help
> single-process munmap. Deferred freeing moves the entire zs_free
> cost (including class->lock and zspage freeing) off the munmap
> hot path, which benefits even single-process workloads. The two
> optimizations are complementary.

What is the extra speedup added by the deferred freeing on top of the
locking improvements? I couldn't immediately tell by looking at this
vs. the cover letter. I wonder what portion of the improvement comes
from the deferred freeing?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-29 22:44       ` Yosry Ahmed
@ 2026-04-30  7:38         ` Wenchao Hao
  2026-04-30  8:00           ` Kairui Song
  0 siblings, 1 reply; 26+ messages in thread
From: Wenchao Hao @ 2026-04-30  7:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Chengming Zhou, Jens Axboe, Johannes Weiner,
	Minchan Kim, Nhat Pham, Sergey Senozhatsky, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao

On Thu, Apr 30, 2026 at 6:44 AM Yosry Ahmed <yosry@kernel.org> wrote:
>
> > > How much of the benefit do we get with just these locking improvements
> > > without having to defer any of the freeing work?
> > >
> >
> > Hi Yosry,
> >
> > Thanks for the review. Great question — we tested exactly this.
> >
> > With only the class_idx-in-obj encoding (eliminating pool->lock from
> > zs_free, no deferred freeing), we measured on two platforms.
> >
> > Test: each process independently mmap 256MB, write data, madvise
> > MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
> >
> > Raspberry Pi 4B (4-core ARM64 Cortex-A72):
> >
> >   mode        Base       ClassIdx-only   Speedup
> >   single      59.0ms     56.0ms          1.05x
> >   multi 2p    94.6ms     66.7ms          1.42x
> >   multi 4p    202.9ms    110.6ms         1.83x
> >
> > x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
> >
> >   mode        Base       ClassIdx-only   Speedup
> >   single      11.7ms     9.8ms           1.19x
> >   multi 2p    24.1ms     17.2ms          1.40x
> >   multi 4p    63.0ms     45.3ms          1.39x
> >
> > Single-process shows modest improvement. With multiple processes,
> > each read_lock/read_unlock atomically modifies the shared rwlock
> > reader count, and the cost of these atomic operations increases
> > with more CPUs accessing the same cacheline concurrently.
> > Eliminating pool->lock removes this overhead entirely.
> >
> > This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> > spare bits to fit class_idx. 32-bit systems don't have the room.
> > I'm still working on the compile-time gating to properly enable
> > this based on architecture and page size configuration.
> >
> > > As others have pointed out, I don't want to just defer expensive work
> > > without understanding why it's expensive and running into limitations
> > > about why it cannot be improved without deferring.
> >
> > For the deferred freeing part: the class_idx-in-obj optimization
> > addresses the multi-process scenario where concurrent atomic
> > operations on pool->lock become expensive, but does not help
> > single-process munmap. Deferred freeing moves the entire zs_free
> > cost (including class->lock and zspage freeing) off the munmap
> > hot path, which benefits even single-process workloads. The two
> > optimizations are complementary.
>
> What is the extra speedup added by the deferred freeing
> on top of the locking improvements?

The data I shared earlier was class_idx-in-obj only — no
deferred freeing at all.

> I couldn't immediately tell by looking at this vs. the cover letter.  I wonder
> what portion of the improvement comes from the deferred freeing?

On top of that, we added deferred freeing in the zsmalloc
layer (per-cpu page-pool based buffer swap + WQ_UNBOUND
drain worker). With both class_idx + deferred:

Test 1: concurrent munmap (256MB/process, RPi 4B):

  mode      Base       Deferred    Speedup
  single    56.2ms     17.2ms      3.27x
  multi 3p  153.2ms    51.5ms      2.97x

Test 2: single process munmap (various sizes):

  size      Base       Deferred    Speedup
  64MB      15.0ms     4.3ms       3.47x
  128MB     28.7ms     8.5ms       3.37x
  192MB     43.2ms     13.0ms      3.32x
  256MB     57.0ms     17.3ms      3.30x
  512MB     114.4ms    38.5ms      2.97x

However, this is not the ceiling. Profiling with perf
shows that after deferred zs_free, zram_slot_free_notify
still accounts for ~65% of munmap time — mostly
slot_trylock/unlock and slot metadata operations.

To understand the theoretical limit, I tested an extreme
version that removes slot_trylock from the hot path
entirely (not safe for production, just benchmarking):

  size    Base     Deferred   No-lock    Speedup
  64MB    15.0ms   4.3ms      2.3ms      6.50x
  128MB   28.7ms   8.5ms      4.7ms      6.14x
  192MB   43.2ms   13.0ms     6.8ms      6.31x
  256MB   57.0ms   17.3ms     9.0ms      6.30x
  512MB   114.4ms  38.5ms     33.0ms     3.46x

I'm exploring ways to further reduce or eliminate the lock
from this path; any suggestions on how to approach this
would be appreciated.

Unless otherwise noted, all data is from Raspberry Pi 4B
(4-core ARM64 Cortex-A72, 8GB RAM, zram 2GB, lzo-rle).
Test: mmap + fill + madvise(MADV_PAGEOUT) to swap out
via zram, then measure munmap time.

Thanks,
Wenchao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-30  7:38         ` Wenchao Hao
@ 2026-04-30  8:00           ` Kairui Song
  2026-04-30 15:15             ` Wenchao Hao
  0 siblings, 1 reply; 26+ messages in thread
From: Kairui Song @ 2026-04-30  8:00 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Yosry Ahmed, Andrew Morton, Chengming Zhou, Jens Axboe,
	Johannes Weiner, Minchan Kim, Nhat Pham, Sergey Senozhatsky,
	linux-block, linux-kernel, linux-mm, Barry Song, Xueyuan Chen,
	Wenchao Hao

On Thu, Apr 30, 2026 at 3:43 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> The data I shared earlier was class_idx-in-obj only — no
> deferred freeing at all.
>
> > I couldn't immediately tell by looking at this vs. the cover letter.  I wonder
> > what portion of the improvement comes from the deferred freeing?
>
> On top of that, we added deferred freeing in the zsmalloc
> layer (per-cpu page-pool based buffer swap + WQ_UNBOUND
> drain worker). With both class_idx + deferred:
>
> Test 1: concurrent munmap (256MB/process, RPi 4B):
>
>   mode      Base       Deferred    Speedup
>   single    56.2ms     17.2ms      3.27x
>   multi 3p  153.2ms    51.5ms      2.97x
>
> Test 2: single process munmap (various sizes):
>
>   size      Base       Deferred    Speedup
>   64MB      15.0ms     4.3ms       3.47x
>   128MB     28.7ms     8.5ms       3.37x
>   192MB     43.2ms     13.0ms      3.32x
>   256MB     57.0ms     17.3ms      3.30x
>   512MB     114.4ms    38.5ms      2.97x

Hi Wenchao,

One concern here is that the total amount of work is unchanged. I mean
you observe a speedup because you offloaded the work to an async
worker. But when under pressure these workers could be a larger
burden. Is it possible for you to measure that part too?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-30  8:00           ` Kairui Song
@ 2026-04-30 15:15             ` Wenchao Hao
  0 siblings, 0 replies; 26+ messages in thread
From: Wenchao Hao @ 2026-04-30 15:15 UTC (permalink / raw)
  To: Kairui Song
  Cc: Yosry Ahmed, Andrew Morton, Chengming Zhou, Jens Axboe,
	Johannes Weiner, Minchan Kim, Nhat Pham, Sergey Senozhatsky,
	linux-block, linux-kernel, linux-mm, Barry Song, Xueyuan Chen,
	Wenchao Hao

On Thu, Apr 30, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Apr 30, 2026 at 3:43 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > The data I shared earlier was class_idx-in-obj only — no
> > deferred freeing at all.
> >
> > > I couldn't immediately tell by looking at this vs. the cover letter.  I wonder
> > > what portion of the improvement comes from the deferred freeing?
> >
> > On top of that, we added deferred freeing in the zsmalloc
> > layer (per-cpu page-pool based buffer swap + WQ_UNBOUND
> > drain worker). With both class_idx + deferred:
> >
> > Test 1: concurrent munmap (256MB/process, RPi 4B):
> >
> >   mode      Base       Deferred    Speedup
> >   single    56.2ms     17.2ms      3.27x
> >   multi 3p  153.2ms    51.5ms      2.97x
> >
> > Test 2: single process munmap (various sizes):
> >
> >   size      Base       Deferred    Speedup
> >   64MB      15.0ms     4.3ms       3.47x
> >   128MB     28.7ms     8.5ms       3.37x
> >   192MB     43.2ms     13.0ms      3.32x
> >   256MB     57.0ms     17.3ms      3.30x
> >   512MB     114.4ms    38.5ms      2.97x
>

Hi Kairui,

> One concern here is that the total amount of work is
> unchanged. But when under pressure these workers could
> be a larger burden.

The total CPU work is actually slightly reduced — the
batch drain eliminates pool->lock entirely, and holds
class->lock across consecutive same-class handles rather
than acquiring/releasing per handle. So the deferred
path does less lock work than synchronous per-handle
zs_free. I'm also exploring further reductions, such as
merging zram flags operations in the notify path (as you
suggested earlier) and reducing lock overhead. Suggestions
are welcome.

The key win is not reducing work but unblocking anon
folio freeing. Each folio free returns a full page
immediately, whereas zs_free may need many handle frees
before a zspage becomes empty (multiple compressed
objects share the same zspage). By not blocking folio
freeing with expensive zs_free, we improve the rate at
which usable memory returns to the system.

With parallelism (munmap + worker on different CPUs),
the process exits faster and memory is returned sooner.
For example, what used to take ~1s on one CPU can now
complete in ~400ms across two CPUs. Under memory
pressure, spending a bit more CPU to release memory
faster is a reasonable tradeoff.

> Is it possible for you to measure that part too?

Sure. Could you describe the specific scenario you're
concerned about — CPU contention, memory pressure, or
scheduling latency? I'm happy to design and run a test
around it.

Thanks,
Wenchao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-04-28 13:51     ` Wenchao Hao
  2026-04-28 13:55       ` Wenchao Hao
  2026-04-29 22:44       ` Yosry Ahmed
@ 2026-05-02  7:21       ` Nhat Pham
  2026-05-06 13:55         ` Wenchao Hao
  2 siblings, 1 reply; 26+ messages in thread
From: Nhat Pham @ 2026-05-02  7:21 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Yosry Ahmed, Andrew Morton, Chengming Zhou, Jens Axboe,
	Johannes Weiner, Minchan Kim, Sergey Senozhatsky, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao

On Tue, Apr 28, 2026 at 2:51 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
>
> On Tue, Apr 28, 2026 at 2:17 AM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > >
> > > On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@gmail.com> wrote:
> > > >
> > > > Swap freeing can be expensive when unmapping a VMA containing
> > > > many swap entries. This has been reported to significantly
> > > > delay memory reclamation during Android's low-memory killing,
> > > > especially when multiple processes are terminated to free
> > > > memory, with slot_free() accounting for more than 80% of
> > > > the total cost of freeing swap entries.
> > > >
> > > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > > to asynchronously collect and free swap entries [1][2], but the
> > > > design itself is fairly complex.
> > > >
> > > Hi Nhat, Kairui, Barry, Xueyuan,
> > >
> > > Thanks for the review. I agree with the direction and have some ideas for
> > > an alternative approach.
> > >
> > > My approach: first eliminate pool->lock from zs_free() itself, then defer
> > > free to per-cpu buffers with a lockless handoff, and finally reduce
> > > class->lock overhead during drain by exploiting natural class locality.
> > > Achieving both per-cpu and per-class is difficult, so the class->lock
> > > optimization is a compromise — but one that works well in practice.
> > >
> > > 1. Encode class_idx in obj to eliminate pool->lock
> > >
> > > OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> > > (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> > > for obj_idx, leaving 14 spare bits.
> > > We can split OBJ_INDEX into class_idx + obj_idx:
> > >
> > >     obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
> > >
> > > OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> > > (8 bits for 4K pages, 9 for 64K).
> > > Since class_idx is invariant across migration (only PFN changes), zs_free()
> > > can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> > > stable PFN. No pool->lock needed.
> >
> > How much of the benefit do we get with just these locking improvements
> > without having to defer any of the freeing work?
> >
>
> Hi Yosry,
>
> Thanks for the review. Great question — we tested exactly this.
>
> With only the class_idx-in-obj encoding (eliminating pool->lock from
> zs_free, no deferred freeing), we measured on two platforms.
>
> Test: each process independently mmap 256MB, write data, madvise
> MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
>
> Raspberry Pi 4B (4-core ARM64 Cortex-A72):
>
>   mode        Base       ClassIdx-only   Speedup
>   single      59.0ms     56.0ms          1.05x
>   multi 2p    94.6ms     66.7ms          1.42x
>   multi 4p    202.9ms    110.6ms         1.83x
>
> x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
>
>   mode        Base       ClassIdx-only   Speedup
>   single      11.7ms     9.8ms           1.19x
>   multi 2p    24.1ms     17.2ms          1.40x
>   multi 4p    63.0ms     45.3ms          1.39x

Oh man, you are eliminating pool lock here right? This would help my
other patch series a lot too :)

https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@mail.gmail.com/
https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@mail.gmail.com/

Well, the deferred freeing would completely move that contention out
of the way lol. But this would benefit all users, regardless of
whether we're deferring the free step or not (for instance, this will
reduce contention between page fault and compaction, IIUC?) I feel
like you'll get some good numbers testing in a system with compaction
and THP enabled, with lots of swap activity. Which is... a lot of
server setups :)

If the deferred freeing is too controversial, this smells like
something that should be upstreamed independently.

>
> Single-process shows modest improvement. With multiple processes,
> each read_lock/read_unlock atomically modifies the shared rwlock
> reader count, and the cost of these atomic operations increases
> with more CPUs accessing the same cacheline concurrently.
> Eliminating pool->lock removes this overhead entirely.
>
> This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> spare bits to fit class_idx. 32-bit systems don't have the room.
> I'm still working on the compile-time gating to properly enable
> this based on architecture and page size configuration.

/*
* The pool->lock protects the race with zpage's migration
* so it's safe to get the page from handle.
*/
read_lock(&pool->lock);
obj = handle_to_obj(handle);
obj_to_zpdesc(obj, &f_zpdesc);
zspage = get_zspage(f_zpdesc);
class = zspage_class(pool, zspage);
spin_lock(&class->lock);
read_unlock(&pool->lock);

It's basically just this blob right?

>
> > As others have pointed out, I don't want to just defer expensive work
> > without understanding why it's expensive and running into limitations
> > about why it cannot be improved without deferring.
>
> For the deferred freeing part: the class_idx-in-obj optimization
> addresses the multi-process scenario where concurrent atomic
> operations on pool->lock become expensive, but does not help
> single-process munmap. Deferred freeing moves the entire zs_free
> cost (including class->lock and zspage freeing) off the munmap
> hot path, which benefits even single-process workloads. The two
> optimizations are complementary.

+1 :)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
  2026-05-02  7:21       ` Nhat Pham
@ 2026-05-06 13:55         ` Wenchao Hao
  0 siblings, 0 replies; 26+ messages in thread
From: Wenchao Hao @ 2026-05-06 13:55 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Yosry Ahmed, Andrew Morton, Chengming Zhou, Jens Axboe,
	Johannes Weiner, Minchan Kim, Sergey Senozhatsky, linux-block,
	linux-kernel, linux-mm, Barry Song, Xueyuan Chen, Wenchao Hao

On Sat, May 2, 2026 at 3:21 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> > With only the class_idx-in-obj encoding (eliminating pool->lock from
> > zs_free, no deferred freeing), we measured on two platforms.
> >
> > Test: each process independently mmap 256MB, write data, madvise
> > MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
> >
> > Raspberry Pi 4B (4-core ARM64 Cortex-A72):
> >
> >   mode        Base       ClassIdx-only   Speedup
> >   single      59.0ms     56.0ms          1.05x
> >   multi 2p    94.6ms     66.7ms          1.42x
> >   multi 4p    202.9ms    110.6ms         1.83x
> >
> > x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
> >
> >   mode        Base       ClassIdx-only   Speedup
> >   single      11.7ms     9.8ms           1.19x
> >   multi 2p    24.1ms     17.2ms          1.40x
> >   multi 4p    63.0ms     45.3ms          1.39x
>
> Oh man, you are eliminating pool lock here right? This would help my
> other patch series a lot too :)
>
> https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@mail.gmail.com/
> https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@mail.gmail.com/
>

Yes, exactly. With class_idx encoded in the obj value,
zs_free() can determine the correct size_class without
any pool-level lock. The lockless read gives a valid
class_idx because it's invariant across migration (only
PFN changes), and we re-read obj under class->lock to
get a stable PFN.

> Well, the deferred freeing would completely move that contention out
> of the way lol. But this would benefit all users, regardless of
> whether we're deferring the free step or not (for instance, this will
> reduce contention between page fault and compaction, IIUC?) I feel
> like you'll get some good numbers testing in a system with compaction
> and THP enabled, with lots of swap activities. Which is... a lot of
> server setup :)
>
> If the deferred freeing is too controversial, this smells like
> something that should be upstreamed independently.
>

Agreed. We're planning to split the series so that the
class_idx encoding + pool->lock elimination can be
reviewed and merged independently of the deferred free
framework. It's a pure win with no behavioral change
— just less lock contention.

> >
> > Single-process shows modest improvement. With multiple processes,
> > each read_lock/read_unlock atomically modifies the shared rwlock
> > reader count, and the cost of these atomic operations increases
> > with more CPUs accessing the same cacheline concurrently.
> > Eliminating pool->lock removes this overhead entirely.
> >
> > This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> > spare bits to fit class_idx. 32-bit systems don't have the room.
> > I'm still working on the compile-time gating to properly enable
> > this based on architecture and page size configuration.
>
> /*
> * The pool->lock protects the race with zpage's migration
> * so it's safe to get the page from handle.
> */
> read_lock(&pool->lock);
> obj = handle_to_obj(handle);
> obj_to_zpdesc(obj, &f_zpdesc);
> zspage = get_zspage(f_zpdesc);
> class = zspage_class(pool, zspage);
> spin_lock(&class->lock);
> read_unlock(&pool->lock);
>
> It's basically just this blob right?
>

Yes, that's the blob being replaced. On the
ZS_OBJ_CLASS_IDX path (64-bit systems), it becomes:

    obj = handle_to_obj(handle);
    class = pool->size_class[obj_to_class_idx(obj)];
    spin_lock(&class->lock);
    obj = handle_to_obj(handle); /* re-read for stable PFN */

No pool->lock at all. We've also added compile-time
gating (#if BITS_PER_LONG >= 64) since 32-bit systems
lack the spare bits in OBJ_INDEX to fit class_idx. On
32-bit, it falls back to the original pool->lock path.
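
For reference, the gating in zs_free() ends up looking roughly like
this (sketch only; the final condition may also account for the page
size configuration):

    #ifdef ZS_OBJ_CLASS_IDX        /* defined when BITS_PER_LONG >= 64 */
            obj = handle_to_obj(handle);
            class = pool->size_class[obj_to_class_idx(obj)];
            spin_lock(&class->lock);
            obj = handle_to_obj(handle);    /* re-read for a stable PFN */
    #else
            /* original path: resolve the class via the zspage */
            read_lock(&pool->lock);
            obj = handle_to_obj(handle);
            obj_to_zpdesc(obj, &f_zpdesc);
            zspage = get_zspage(f_zpdesc);
            class = zspage_class(pool, zspage);
            spin_lock(&class->lock);
            read_unlock(&pool->lock);
    #endif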

> >
> > > As others have pointed out, I don't want to just defer expensive work
> > > without understanding why it's expensive and running into limitations
> > > about why it cannot be improved without deferring.
> >
> > For the deferred freeing part: the class_idx-in-obj optimization
> > addresses the multi-process scenario where concurrent atomic
> > operations on pool->lock become expensive, but does not help
> > single-process munmap. Deferred freeing moves the entire zs_free
> > cost (including class->lock and zspage freeing) off the munmap
> > hot path, which benefits even single-process workloads. The two
> > optimizations are complementary.
>
> +1 :)

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2026-05-06 13:55 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-21 12:16 [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Wenchao Hao
2026-04-21 12:16 ` [RFC PATCH v2 1/4] mm:zsmalloc: drop class lock before freeing zspage Wenchao Hao
2026-04-21 12:16 ` [RFC PATCH v2 2/4] mm/zsmalloc: introduce zs_free_deferred() for async handle freeing Wenchao Hao
2026-04-21 19:46   ` Nhat Pham
2026-04-21 21:42     ` Barry Song
2026-04-23 16:40       ` Nhat Pham
2026-04-21 12:16 ` [RFC PATCH v2 3/4] zram: defer zs_free() in swap slot free notification path Wenchao Hao
2026-04-21 12:16 ` [RFC PATCH v2 4/4] mm/zswap: defer zs_free() in zswap_invalidate() path Wenchao Hao
2026-04-21 17:03   ` Nhat Pham
2026-04-21 15:54 ` [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path Nhat Pham
2026-04-21 17:17   ` Kairui Song
2026-04-21 18:07     ` Nhat Pham
2026-04-21 18:25       ` Nhat Pham
2026-04-22  0:34         ` Xueyuan Chen
2026-04-26  4:13 ` Wenchao Hao
2026-04-26  8:50   ` Xueyuan Chen
2026-04-27  3:10     ` Wenchao Hao
2026-04-27 18:17   ` Yosry Ahmed
2026-04-28 13:51     ` Wenchao Hao
2026-04-28 13:55       ` Wenchao Hao
2026-04-29 22:44       ` Yosry Ahmed
2026-04-30  7:38         ` Wenchao Hao
2026-04-30  8:00           ` Kairui Song
2026-04-30 15:15             ` Wenchao Hao
2026-05-02  7:21       ` Nhat Pham
2026-05-06 13:55         ` Wenchao Hao
