* [PATCH v2 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg
2026-05-25 12:22 [PATCH v2 0/4] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
@ 2026-05-25 12:22 ` Hao Jia
2026-05-25 12:22 ` [PATCH v2 2/4] mm/zswap: Implement proactive writeback Hao Jia
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Hao Jia @ 2026-05-25 12:22 UTC
To: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny, nphamcs,
chengming.zhou, muchun.song, roman.gushchin
Cc: cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia
From: Hao Jia <jiahao1@lixiang.com>
The zswap background writeback worker shrink_worker() uses a global
cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
across the online memcgs under root_mem_cgroup.
Proactive writeback also wants a similar per-memcg cursor that is
scoped to the specified memcg, so that repeated invocations against
the same memcg make forward progress across its descendant memcgs
instead of restarting from the first child memcg each time.
Group the cursor and the spinlock that protects it into struct
zswap_wb_iter, and embed it in struct mem_cgroup so that each memcg
owns its own cursor. Accordingly, shrink_worker() now
uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.
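The per-memcg cursor described above can be modelled roughly in userspace as follows. This is only a sketch, not kernel code: `struct node`, `next_preorder()` and `cursor_next()` are hypothetical names, the spinlock and reference counting are omitted, and the walk order is assumed to match the pre-order, NULL-terminated rounds that the patch documents for zswap_mem_cgroup_iter().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel type. */
struct node {
	struct node *parent, *first_child, *next_sibling;
	bool online;
	struct node *cursor;	/* models zswap_wb_iter.pos for this subtree root */
};

/* Pre-order successor of @prev inside @root's subtree; NULL ends a round. */
static struct node *next_preorder(struct node *root, struct node *prev)
{
	if (!prev)
		return root;
	if (prev->first_child)
		return prev->first_child;
	while (prev != root) {
		if (prev->next_sibling)
			return prev->next_sibling;
		prev = prev->parent;
	}
	return NULL;
}

/* Advance @root's own cursor to the next online node and return it. */
static struct node *cursor_next(struct node *root)
{
	struct node *n;

	do {
		n = next_preorder(root, root->cursor);
		root->cursor = n;
	} while (n && !n->online);
	return n;
}
```

Because each subtree root keeps its own cursor, repeated calls to cursor_next() on an inner node resume where the previous call stopped, without disturbing the cursor stored in the top-level root.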
Because the cursor is now per-memcg, the offline cleanup must visit
every ancestor that could be holding a reference to the dying memcg.
Factor out __zswap_memcg_offline_cleanup() and walk from dead_memcg up
to the root.
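The ancestor walk can be illustrated with a small userspace sketch. The names `struct node` and `offline_cleanup()` are hypothetical, locking is left out, and the sketch simplifies one step: it resets a stale cursor to NULL, whereas the patch advances it to the next online memcg.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup with its per-memcg cursor. */
struct node {
	struct node *parent;
	struct node *cursor;	/* models zswap_wb_iter.pos */
};

/*
 * Visit @dead and every ancestor; drop any cursor that still points at @dead.
 * Simplification: the cursor is reset to NULL here, whereas the patch advances
 * it to the next online memcg. Returns how many cursors were fixed up.
 */
static int offline_cleanup(struct node *dead)
{
	int fixed = 0;

	for (struct node *n = dead; n; n = n->parent) {
		if (n->cursor == dead) {
			n->cursor = NULL;
			fixed++;
		}
	}
	return fixed;
}
```

Only the dying node and its ancestors are visited, since a cursor can only point into the subtree of the node that owns it.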
No functional change intended for shrink_worker().
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
include/linux/memcontrol.h | 3 +
include/linux/zswap.h | 9 +++
mm/memcontrol.c | 3 +
mm/zswap.c | 119 ++++++++++++++++++++++++++-----------
4 files changed, 98 insertions(+), 36 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..b8323c8d6565 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -228,6 +228,9 @@ struct mem_cgroup {
* swap, and from being swapped out on zswap store failures.
*/
bool zswap_writeback;
+
+ /* Per-memcg writeback cursor */
+ struct zswap_wb_iter zswap_wb_iter;
#endif
/* vmpressure notifications */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..efa6b551217e 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -11,6 +11,15 @@ extern atomic_long_t zswap_stored_pages;
#ifdef CONFIG_ZSWAP
+/* Iteration cursor for zswap writeback over a memcg's subtree. */
+struct zswap_wb_iter {
+ /* protects @pos against concurrent advances */
+ spinlock_t lock;
+ struct mem_cgroup *pos;
+};
+
+void zswap_wb_iter_init(struct zswap_wb_iter *iter);
+
struct zswap_lruvec_state {
/*
* Number of swapped in pages from disk, i.e not found in the zswap pool.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d466..409c41359dc8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4022,6 +4022,9 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
INIT_LIST_HEAD(&memcg->memory_peaks);
INIT_LIST_HEAD(&memcg->swap_peaks);
spin_lock_init(&memcg->peaks_lock);
+#ifdef CONFIG_ZSWAP
+ zswap_wb_iter_init(&memcg->zswap_wb_iter);
+#endif
memcg->socket_pressure = get_jiffies_64();
#if BITS_PER_LONG < 64
seqlock_init(&memcg->socket_pressure_seqlock);
diff --git a/mm/zswap.c b/mm/zswap.c
index 4b5149173b0e..6519f646b496 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -163,9 +163,6 @@ struct zswap_pool {
/* Global LRU lists shared by all zswap pools. */
static struct list_lru zswap_list_lru;
-/* The lock protects zswap_next_shrink updates. */
-static DEFINE_SPINLOCK(zswap_shrink_lock);
-static struct mem_cgroup *zswap_next_shrink;
static struct work_struct zswap_shrink_work;
static struct shrinker *zswap_shrinker;
@@ -717,28 +714,88 @@ void zswap_folio_swapin(struct folio *folio)
}
}
-/*
- * This function should be called when a memcg is being offlined.
+void zswap_wb_iter_init(struct zswap_wb_iter *iter)
+{
+ spin_lock_init(&iter->lock);
+}
+
+#ifdef CONFIG_MEMCG
+/**
+ * zswap_mem_cgroup_iter - advance the writeback cursor
+ * @root: subtree root whose cursor to advance
+ *
+ * Advance @root->zswap_wb_iter.pos to @root itself or the next online
+ * descendant. Passing root_mem_cgroup yields a global walk.
*
- * Since the global shrinker shrink_worker() may hold a reference
- * of the memcg, we must check and release the reference in
- * zswap_next_shrink.
+ * The cursor is retained across invocations, so successive calls walk
+ * @root's subtree cyclically in pre-order and, after %NULL, restart
+ * from the beginning.
*
- * shrink_worker() must handle the case where this function releases
- * the reference of memcg being shrunk.
+ * The returned memcg carries an extra reference; release it with
+ * mem_cgroup_put().
+ *
+ * Return: the next online memcg in @root's subtree, or @root itself,
+ * with an extra reference, or %NULL after a full round-trip.
*/
-void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
+static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
{
- /* lock out zswap shrinker walking memcg tree */
- spin_lock(&zswap_shrink_lock);
- if (zswap_next_shrink == memcg) {
+ struct mem_cgroup *memcg;
+
+ if (mem_cgroup_disabled())
+ return NULL;
+
+ spin_lock(&root->zswap_wb_iter.lock);
+ do {
+ memcg = mem_cgroup_iter(root, root->zswap_wb_iter.pos, NULL);
+ root->zswap_wb_iter.pos = memcg;
+ } while (memcg && !mem_cgroup_tryget_online(memcg));
+ spin_unlock(&root->zswap_wb_iter.lock);
+
+ return memcg;
+}
+
+/*
+ * If @root's cursor currently points at @dead_memcg, advance it to the
+ * next online descendant so @dead_memcg can be freed.
+ */
+static void __zswap_memcg_offline_cleanup(struct mem_cgroup *root,
+ struct mem_cgroup *dead_memcg)
+{
+ spin_lock(&root->zswap_wb_iter.lock);
+ if (root->zswap_wb_iter.pos == dead_memcg) {
do {
- zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
- } while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));
+ root->zswap_wb_iter.pos =
+ mem_cgroup_iter(root,
+ root->zswap_wb_iter.pos, NULL);
+ } while (root->zswap_wb_iter.pos &&
+ !mem_cgroup_online(root->zswap_wb_iter.pos));
}
- spin_unlock(&zswap_shrink_lock);
+ spin_unlock(&root->zswap_wb_iter.lock);
+}
+
+/*
+ * Called when a memcg is being offlined. If @memcg or any of its
+ * ancestors has a cursor pointing at @memcg, it must be advanced
+ * past @memcg before @memcg can be freed. Walk the chain and
+ * release such references.
+ */
+void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *parent = memcg;
+
+ do {
+ __zswap_memcg_offline_cleanup(parent, memcg);
+ } while ((parent = parent_mem_cgroup(parent)));
+}
+#else /* !CONFIG_MEMCG */
+static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
+{
+ return NULL;
}
+void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) { }
+#endif /* CONFIG_MEMCG */
+
/*********************************
* zswap entry functions
**********************************/
@@ -1328,38 +1385,28 @@ static void shrink_worker(struct work_struct *w)
* - No writeback-candidate memcgs found in a memcg tree walk.
* - Shrinking a writeback-candidate memcg failed.
*
- * We save iteration cursor memcg into zswap_next_shrink,
+ * We save the iteration cursor in root_mem_cgroup->zswap_wb_iter.pos,
* which can be modified by the offline memcg cleaner
* zswap_memcg_offline_cleanup().
*
* Since the offline cleaner is called only once, we cannot leave an
- * offline memcg reference in zswap_next_shrink.
+ * offline memcg reference in root_mem_cgroup->zswap_wb_iter.pos.
* We can rely on the cleaner only if we get online memcg under lock.
*
* If we get an offline memcg, we cannot determine if the cleaner has
* already been called or will be called later. We must put back the
* reference before returning from this function. Otherwise, the
- * offline memcg left in zswap_next_shrink will hold the reference
- * until the next run of shrink_worker().
+ * offline memcg left in root_mem_cgroup->zswap_wb_iter.pos will hold
+ * the reference until the next run of shrink_worker().
*/
do {
/*
- * Start shrinking from the next memcg after zswap_next_shrink.
- * When the offline cleaner has already advanced the cursor,
- * advancing the cursor here overlooks one memcg, but this
- * should be negligibly rare.
- *
- * If we get an online memcg, keep the extra reference in case
- * the original one obtained by mem_cgroup_iter() is dropped by
- * zswap_memcg_offline_cleanup() while we are shrinking the
- * memcg.
+ * Start shrinking from the next memcg after
+ * root_mem_cgroup->zswap_wb_iter.pos. When the offline cleaner
+ * has already advanced the cursor, advancing the cursor here
+ * overlooks one memcg, but this should be negligibly rare.
*/
- spin_lock(&zswap_shrink_lock);
- do {
- memcg = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
- zswap_next_shrink = memcg;
- } while (memcg && !mem_cgroup_tryget_online(memcg));
- spin_unlock(&zswap_shrink_lock);
+ memcg = zswap_mem_cgroup_iter(root_mem_cgroup);
if (!memcg) {
/*
--
2.34.1
* [PATCH v2 2/4] mm/zswap: Implement proactive writeback
From: Hao Jia @ 2026-05-25 12:22 UTC
To: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny, nphamcs,
chengming.zhou, muchun.song, roman.gushchin
Cc: cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia
From: Hao Jia <jiahao1@lixiang.com>
Zswap currently writes back pages to backing swap reactively, triggered
either by the shrinker or when the pool reaches its size limit. There is
no mechanism to control the amount of writeback for a specific memory
cgroup. However, users may want to proactively write back zswap pages,
e.g., to free up memory for other applications or to prepare for
memory-intensive workloads.
Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
interface. When specified, this key bypasses standard memory reclaim
and exclusively performs proactive zswap writeback up to the requested
budget. If omitted, the default reclaim behavior remains unchanged.
Example usage:

  # Write back 100MB of pages from zswap to the backing swap
  echo "100M zswap_writeback_only" > memory.reclaim

Note that the actual amount written back may be less than requested due
to the zswap second-chance algorithm: referenced entries are rotated on
the LRU on the first encounter and only written back on a second pass.
The interface returns -EAGAIN if no pages were successfully written back.
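From a C program, the same request could be issued along the following lines. This is a minimal sketch: `build_wb_cmd()` and `write_reclaim()` are hypothetical helper names, and the caller is assumed to supply the path to the target cgroup's memory.reclaim file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Format "<bytes> zswap_writeback_only"; returns 0, or -1 if @buf is too small. */
static int build_wb_cmd(char *buf, size_t len, unsigned long long bytes)
{
	int n = snprintf(buf, len, "%llu zswap_writeback_only", bytes);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write @cmd to @path (a memory.reclaim file); returns 0 or -errno. */
static int write_reclaim(const char *path, const char *cmd)
{
	int fd = open(path, O_WRONLY);
	int ret = 0;

	if (fd < 0)
		return -errno;
	if (write(fd, cmd, strlen(cmd)) < 0)
		ret = -errno;	/* -EAGAIN here means no pages were written back */
	close(fd);
	return ret;
}
```

A caller would typically treat -EAGAIN as "nothing could be written back this time" rather than as a hard failure, since second-chance rotation may make a later attempt succeed.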
Internally, extend user_proactive_reclaim() to parse the new
"zswap_writeback_only" token and invoke the dedicated handler. Add
zswap_proactive_writeback() to walk the target memcg subtree via the
per-memcg writeback cursor, draining per-node zswap LRUs through
list_lru_walk_one() with the shrink_memcg_cb() callback.
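The per-node scan limit used by this walk can be worked through with a small sketch. The constants mirror ZSWAP_PROACTIVE_WB_SCAN_RATIO and ZSWAP_PROACTIVE_WB_BATCH from the diff in this patch; the helper names `scan_cap()` and `nr_walks()` are hypothetical, and `nr_walks()` gives only an upper bound because the real loop can stop early.

```c
#include <assert.h>

#define WB_SCAN_RATIO 16UL	/* mirrors ZSWAP_PROACTIVE_WB_SCAN_RATIO */
#define WB_BATCH 128UL		/* mirrors ZSWAP_PROACTIVE_WB_BATCH */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Entries to scan on one node: capped by LRU length and by 16x the budget left. */
static unsigned long scan_cap(unsigned long lru_len, unsigned long nr_to_write,
			      unsigned long nr_written)
{
	return min_ul(lru_len, (nr_to_write - nr_written) * WB_SCAN_RATIO);
}

/* Upper bound on list_lru walks needed to consume @nr_to_scan in WB_BATCH steps. */
static unsigned long nr_walks(unsigned long nr_to_scan)
{
	return (nr_to_scan + WB_BATCH - 1) / WB_BATCH;
}
```

For example, with a 128-page budget and an LRU of 10000 entries, at most 128 * 16 = 2048 entries are scanned on that node, in at most 16 walks of 128 entries each.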
Suggested-by: Yosry Ahmed <yosry@kernel.org>
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
Documentation/admin-guide/cgroup-v2.rst | 18 +++-
Documentation/admin-guide/mm/zswap.rst | 11 +-
include/linux/zswap.h | 7 ++
mm/vmscan.c | 14 +++
mm/zswap.c | 138 ++++++++++++++++++++++++
5 files changed, 185 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..6564abf0dec5 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
The following nested keys are defined.
- ========== ================================
+ ==================== ==================================================
swappiness Swappiness value to reclaim with
- ========== ================================
+ zswap_writeback_only Only perform proactive zswap writeback
+ ==================== ==================================================
Specifying a swappiness value instructs the kernel to perform
the reclaim with that swappiness value. Note that this has the
@@ -1437,6 +1438,19 @@ The following nested keys are defined.
The valid range for swappiness is [0-200, max], setting
swappiness=max exclusively reclaims anonymous memory.
+ The zswap_writeback_only key skips ordinary memory reclaim and
+ writes back pages from zswap to the backing swap device until
+ the requested amount has been written or no further candidates
+ are found. This is useful to proactively offload cold pages from
+ the zswap pool to the swap device. It is only available if
+ zswap writeback is enabled. zswap_writeback_only cannot be combined
+ with swappiness; specifying both returns -EINVAL.
+
+ Example::
+
+ # Write back up to 100MB of pages from zswap to the backing swap
+ echo "100M zswap_writeback_only" > memory.reclaim
+
memory.peak
A read-write single value file which exists on non-root cgroups.
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 2464425c783d..1c0598e77958 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -131,7 +131,16 @@ User can enable it as follows::
echo Y > /sys/module/zswap/parameters/shrinker_enabled
This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
-selected.
+selected. Once enabled, the shrinker automatically writes back zswap pages to
+backing swap during memory reclaim.
+
+If users want to explicitly trigger proactive zswap writeback for a specific
+memory cgroup without invoking standard page reclaim, it can be done as follows::
+
+ echo "100M zswap_writeback_only" > /sys/fs/cgroup/<cgroup-name>/memory.reclaim
+
+Both of the methods mentioned above are subject to the ``memory.zswap.writeback``
+control. This means that ``memory.zswap.writeback`` can reject all zswap writeback.
A debugfs interface is provided for various statistic about pool size, number
of pages stored, same-value filled pages and various counters for the reasons
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index efa6b551217e..98434d39339a 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -44,6 +44,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
void zswap_folio_swapin(struct folio *folio);
bool zswap_is_enabled(void);
bool zswap_never_enabled(void);
+int zswap_proactive_writeback(struct mem_cgroup *memcg, unsigned long nr_to_writeback);
#else
struct zswap_lruvec_state {};
@@ -78,6 +79,12 @@ static inline bool zswap_never_enabled(void)
return true;
}
+static inline int zswap_proactive_writeback(struct mem_cgroup *memcg,
+ unsigned long nr_to_writeback)
+{
+ return -EOPNOTSUPP;
+}
+
#endif
#endif /* _LINUX_ZSWAP_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..6249176b9886 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -64,6 +64,7 @@
#include <linux/swapops.h>
#include <linux/sched/sysctl.h>
+#include <linux/zswap.h>
#include "internal.h"
#include "swap.h"
@@ -7894,11 +7895,13 @@ static unsigned long __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
enum {
MEMORY_RECLAIM_SWAPPINESS = 0,
MEMORY_RECLAIM_SWAPPINESS_MAX,
+ MEMORY_RECLAIM_ZSWAP_WRITEBACK_ONLY,
MEMORY_RECLAIM_NULL,
};
static const match_table_t tokens = {
{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
+ { MEMORY_RECLAIM_ZSWAP_WRITEBACK_ONLY, "zswap_writeback_only"},
{ MEMORY_RECLAIM_NULL, NULL },
};
@@ -7908,6 +7911,7 @@ int user_proactive_reclaim(char *buf,
unsigned int nr_retries = MAX_RECLAIM_RETRIES;
unsigned long nr_to_reclaim, nr_reclaimed = 0;
int swappiness = -1;
+ bool zswap_writeback_only = false;
char *old_buf, *start;
substring_t args[MAX_OPT_ARGS];
gfp_t gfp_mask = GFP_KERNEL;
@@ -7938,11 +7942,21 @@ int user_proactive_reclaim(char *buf,
case MEMORY_RECLAIM_SWAPPINESS_MAX:
swappiness = SWAPPINESS_ANON_ONLY;
break;
+ case MEMORY_RECLAIM_ZSWAP_WRITEBACK_ONLY:
+ zswap_writeback_only = true;
+ break;
default:
return -EINVAL;
}
}
+ if (zswap_writeback_only) {
+ /* zswap_writeback_only and swappiness are mutually exclusive. */
+ if (swappiness != -1)
+ return -EINVAL;
+ return zswap_proactive_writeback(memcg, nr_to_reclaim);
+ }
+
while (nr_reclaimed < nr_to_reclaim) {
/* Will converge on zero, but reclaim enforces a minimum */
unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
diff --git a/mm/zswap.c b/mm/zswap.c
index 6519f646b496..947507b9a185 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1684,6 +1684,144 @@ int zswap_load(struct folio *folio)
return 0;
}
+/*
+ * Maximum LRU scan limit:
+ * number of entries to scan per page of remaining budget.
+ */
+#define ZSWAP_PROACTIVE_WB_SCAN_RATIO 16UL
+/*
+ * Batch size for proactive writeback:
+ * - As the per-memcg writeback target in the outer memcg loop.
+ * - As the per-walk budget passed to list_lru_walk_one().
+ */
+#define ZSWAP_PROACTIVE_WB_BATCH 128UL
+
+/*
+ * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
+ * Returns the number of pages written back, or -ENOENT if @memcg is a
+ * zombie or has writeback disabled.
+ */
+static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
+ unsigned long nr_to_write)
+{
+ unsigned long nr_written = 0;
+ int nid;
+
+ if (!mem_cgroup_zswap_writeback_enabled(memcg))
+ return -ENOENT;
+
+ if (!mem_cgroup_online(memcg))
+ return -ENOENT;
+
+ for_each_node_state(nid, N_NORMAL_MEMORY) {
+ bool encountered_page_in_swapcache = false;
+ unsigned long nr_to_scan, nr_scanned = 0;
+
+ /*
+ * Cap by LRU length: bounds rewalks when referenced
+ * entries keep rotating to the tail.
+ */
+ nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
+ if (!nr_to_scan)
+ continue;
+
+ /*
+ * Cap by SCAN_RATIO * remaining budget: bounds scan cost
+ * to the remaining writeback budget.
+ */
+ nr_to_scan = min(nr_to_scan,
+ (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
+
+ while (nr_scanned < nr_to_scan) {
+ unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
+ nr_to_scan - nr_scanned);
+
+ if (signal_pending(current))
+ return nr_written;
+
+ /*
+ * Account for the committed budget rather than the walker's
+ * actual delta. If the list is emptied concurrently, the
+ * walker visits nothing and nr_scanned would never advance.
+ */
+ nr_scanned += nr_to_walk;
+
+ nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
+ &shrink_memcg_cb,
+ &encountered_page_in_swapcache,
+ &nr_to_walk);
+
+ if (nr_written >= nr_to_write)
+ return nr_written;
+ if (encountered_page_in_swapcache)
+ break;
+
+ cond_resched();
+ }
+ }
+
+ return nr_written;
+}
+
+int zswap_proactive_writeback(struct mem_cgroup *memcg,
+ unsigned long nr_to_writeback)
+{
+ struct mem_cgroup *iter_memcg;
+ unsigned long nr_written = 0;
+ int failures = 0, attempts = 0;
+
+ if (!memcg)
+ return -EINVAL;
+ if (!nr_to_writeback)
+ return 0;
+
+ /*
+ * Writeback will be aborted with -EAGAIN if @nr_written is still
+ * zero and we encounter the following MAX_RECLAIM_RETRIES times:
+ * - No writeback-candidate memcgs found in a subtree walk.
+ * - A writeback-candidate memcg wrote back zero pages.
+ */
+ while (nr_written < nr_to_writeback) {
+ unsigned long batch_size;
+ long shrunk;
+
+ if (signal_pending(current))
+ return -EINTR;
+
+ iter_memcg = zswap_mem_cgroup_iter(memcg);
+
+ if (!iter_memcg) {
+ /*
+ * Continue without incrementing failures if we found
+ * candidate memcgs in the last subtree walk.
+ */
+ if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
+ goto out;
+ attempts = 0;
+ continue;
+ }
+
+ batch_size = min(nr_to_writeback - nr_written,
+ ZSWAP_PROACTIVE_WB_BATCH);
+ shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
+ mem_cgroup_put(iter_memcg);
+
+ /* Writeback-disabled or offline: skip without counting. */
+ if (shrunk == -ENOENT)
+ continue;
+
+ ++attempts;
+ if (shrunk > 0)
+ nr_written += shrunk;
+ else if (++failures == MAX_RECLAIM_RETRIES)
+ goto out;
+
+ cond_resched();
+ }
+out:
+ return nr_written ? 0 : -EAGAIN;
+}
+
void zswap_invalidate(swp_entry_t swp)
{
pgoff_t offset = swp_offset(swp);
--
2.34.1
* [PATCH v2 3/4] mm/zswap: Add per-memcg stat for proactive writeback
From: Hao Jia @ 2026-05-25 12:22 UTC
To: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny, nphamcs,
chengming.zhou, muchun.song, roman.gushchin
Cc: cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia
From: Hao Jia <jiahao1@lixiang.com>
Currently, zswap writeback can be triggered either by the pool limit
being hit or by the proactive writeback mechanism. However, the
existing 'zswpwb' metric in memory.stat and /proc/vmstat counts all
written-back pages, making it difficult to distinguish pages written
back due to the pool limit from those written back proactively.
Add a new statistic 'zswpwb_proactive' to memory.stat and /proc/vmstat.
This counter tracks the number of pages written back due to proactive
writeback. This allows users to better monitor and tune the proactive
writeback mechanism.
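Since zswpwb_proactive is documented in this patch as a subset of zswpwb, a monitoring tool could derive the non-proactive share by subtraction. The sketch below parses memory.stat-style text that has already been read into a buffer; `stat_value()` is a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the value of @key in memory.stat-style @text, or -1 if absent. */
static long stat_value(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		long val;

		/* Require a space after the key so "zswpwb" skips "zswpwb_proactive". */
		if (!strncmp(line, key, klen) && line[klen] == ' ' &&
		    sscanf(line + klen, "%ld", &val) == 1)
			return val;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

With both values in hand, `stat_value(text, "zswpwb") - stat_value(text, "zswpwb_proactive")` approximates the pages written back by the non-proactive paths, provided both keys are present.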
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
Documentation/admin-guide/cgroup-v2.rst | 4 +++
include/linux/vm_event_item.h | 1 +
mm/memcontrol.c | 1 +
mm/vmstat.c | 1 +
mm/zswap.c | 41 ++++++++++++++++++-------
5 files changed, 37 insertions(+), 11 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6564abf0dec5..7d65aef83f7b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1748,6 +1748,10 @@ The following nested keys are defined.
zswpwb
Number of pages written from zswap to swap.
+ zswpwb_proactive
+ Number of pages written from zswap to swap by proactive
+ writeback. This is a subset of zswpwb.
+
zswap_incomp
Number of incompressible pages currently stored in zswap
without compression. These pages could not be compressed to
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7a5bee0a20b6 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -138,6 +138,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
ZSWPIN,
ZSWPOUT,
ZSWPWB,
+ ZSWPWB_PROACTIVE,
#endif
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 409c41359dc8..67de71b2a659 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -571,6 +571,7 @@ static const unsigned int memcg_vm_event_stat[] = {
ZSWPIN,
ZSWPOUT,
ZSWPWB,
+ ZSWPWB_PROACTIVE,
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
THP_FAULT_ALLOC,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..66fd06d1bb01 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1452,6 +1452,7 @@ const char * const vmstat_text[] = {
[I(ZSWPIN)] = "zswpin",
[I(ZSWPOUT)] = "zswpout",
[I(ZSWPWB)] = "zswpwb",
+ [I(ZSWPWB_PROACTIVE)] = "zswpwb_proactive",
#endif
#ifdef CONFIG_X86
[I(DIRECT_MAP_LEVEL2_SPLIT)] = "direct_map_level2_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index 947507b9a185..78190631e2c4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -160,6 +160,11 @@ struct zswap_pool {
char tfm_name[CRYPTO_MAX_ALG_NAME];
};
+struct zswap_shrink_walk_arg {
+ bool proactive;
+ bool encountered_page_in_swapcache;
+};
+
/* Global LRU lists shared by all zswap pools. */
static struct list_lru zswap_list_lru;
@@ -1042,7 +1047,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
* freed.
*/
static int zswap_writeback_entry(struct zswap_entry *entry,
- swp_entry_t swpentry)
+ swp_entry_t swpentry,
+ bool proactive)
{
struct xarray *tree;
pgoff_t offset = swp_offset(swpentry);
@@ -1102,6 +1108,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
if (entry->objcg)
count_objcg_events(entry->objcg, ZSWPWB, 1);
+ if (proactive) {
+ count_vm_event(ZSWPWB_PROACTIVE);
+ if (entry->objcg)
+ count_objcg_events(entry->objcg, ZSWPWB_PROACTIVE, 1);
+ }
+
zswap_entry_free(entry);
/* folio is up to date */
@@ -1151,7 +1163,8 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
void *arg)
{
struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
- bool *encountered_page_in_swapcache = (bool *)arg;
+ struct zswap_shrink_walk_arg *walk_arg = arg;
+ bool proactive_wb = walk_arg && walk_arg->proactive;
swp_entry_t swpentry;
enum lru_status ret = LRU_REMOVED_RETRY;
int writeback_result;
@@ -1206,7 +1219,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
*/
spin_unlock(&l->lock);
- writeback_result = zswap_writeback_entry(entry, swpentry);
+ writeback_result = zswap_writeback_entry(entry, swpentry, proactive_wb);
if (writeback_result) {
zswap_reject_reclaim_fail++;
@@ -1217,9 +1230,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
* into the warmer region. We should terminate shrinking (if we're in the dynamic
* shrinker context).
*/
- if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
+ if (writeback_result == -EEXIST && walk_arg) {
ret = LRU_STOP;
- *encountered_page_in_swapcache = true;
+ walk_arg->encountered_page_in_swapcache = true;
}
} else {
zswap_written_back_pages++;
@@ -1231,8 +1244,11 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
struct shrink_control *sc)
{
+ struct zswap_shrink_walk_arg walk_arg = {
+ .proactive = false,
+ .encountered_page_in_swapcache = false,
+ };
unsigned long shrink_ret;
- bool encountered_page_in_swapcache = false;
if (!zswap_shrinker_enabled ||
!mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
@@ -1241,9 +1257,9 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
}
shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
- &encountered_page_in_swapcache);
+ &walk_arg);
- if (encountered_page_in_swapcache)
+ if (walk_arg.encountered_page_in_swapcache)
return SHRINK_STOP;
return shrink_ret ? shrink_ret : SHRINK_STOP;
@@ -1714,7 +1730,10 @@ static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
return -ENOENT;
for_each_node_state(nid, N_NORMAL_MEMORY) {
- bool encountered_page_in_swapcache = false;
+ struct zswap_shrink_walk_arg walk_arg = {
+ .proactive = true,
+ .encountered_page_in_swapcache = false,
+ };
unsigned long nr_to_scan, nr_scanned = 0;
/*
@@ -1748,12 +1767,12 @@ static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
&shrink_memcg_cb,
- &encountered_page_in_swapcache,
+ &walk_arg,
&nr_to_walk);
if (nr_written >= nr_to_write)
return nr_written;
- if (encountered_page_in_swapcache)
+ if (walk_arg.encountered_page_in_swapcache)
break;
cond_resched();
--
2.34.1
* [PATCH v2 4/4] selftests/cgroup: Add tests for zswap proactive writeback
From: Hao Jia @ 2026-05-25 12:22 UTC
To: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny, nphamcs,
chengming.zhou, muchun.song, roman.gushchin
Cc: cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia
From: Hao Jia <jiahao1@lixiang.com>
Add test_zswap_proactive_writeback() to cover the new memory.reclaim
"zswap_writeback_only" key. The test populates a memory cgroup's zswap
pool, triggers proactive writeback, and verifies the behavior by
observing the change in zswpwb_proactive. Invalid input combinations
are also covered.
Extend test_zswap_writeback_one() to assert that the existing
non-proactive writeback path leaves zswpwb_proactive at zero.
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
tools/testing/selftests/cgroup/test_zswap.c | 161 +++++++++++++++++++-
1 file changed, 153 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
index a7bdcdd09d62..b80ed13bc5e2 100644
--- a/tools/testing/selftests/cgroup/test_zswap.c
+++ b/tools/testing/selftests/cgroup/test_zswap.c
@@ -57,6 +57,11 @@ static long get_cg_wb_count(const char *cg)
return cg_read_key_long(cg, "memory.stat", "zswpwb");
}
+static long get_cg_pwb_count(const char *cg)
+{
+ return cg_read_key_long(cg, "memory.stat", "zswpwb_proactive");
+}
+
static long get_zswpout(const char *cgroup)
{
return cg_read_key_long(cgroup, "memory.stat", "zswpout ");
@@ -323,11 +328,17 @@ static int attempt_writeback(const char *cgroup, void *arg)
static int test_zswap_writeback_one(const char *cgroup, bool wb)
{
- long zswpwb_before, zswpwb_after;
+ long wb_cnt, pwb_cnt;
+
+ wb_cnt = get_cg_wb_count(cgroup);
+ if (wb_cnt != 0) {
+ ksft_print_msg("zswpwb_before = %ld instead of 0\n", wb_cnt);
+ return -1;
+ }
- zswpwb_before = get_cg_wb_count(cgroup);
- if (zswpwb_before != 0) {
- ksft_print_msg("zswpwb_before = %ld instead of 0\n", zswpwb_before);
+ pwb_cnt = get_cg_pwb_count(cgroup);
+ if (pwb_cnt != 0) {
+ ksft_print_msg("zswpwb_proactive_before = %ld instead of 0\n", pwb_cnt);
return -1;
}
@@ -335,13 +346,24 @@ static int test_zswap_writeback_one(const char *cgroup, bool wb)
return -1;
/* Verify that zswap writeback occurred only if writeback was enabled */
- zswpwb_after = get_cg_wb_count(cgroup);
- if (zswpwb_after < 0)
+ wb_cnt = get_cg_wb_count(cgroup);
+ if (wb_cnt < 0)
return -1;
- if (wb != !!zswpwb_after) {
+ if (wb != !!wb_cnt) {
ksft_print_msg("zswpwb_after is %ld while wb is %s\n",
- zswpwb_after, wb ? "enabled" : "disabled");
+ wb_cnt, wb ? "enabled" : "disabled");
+ return -1;
+ }
+
+ /*
+ * attempt_writeback() does not use the proactive writeback path, so
+ * zswpwb_proactive must stay at zero regardless of whether writeback
+ * was enabled.
+ */
+ pwb_cnt = get_cg_pwb_count(cgroup);
+ if (pwb_cnt != 0) {
+ ksft_print_msg("zswpwb_proactive_after is %ld, expected 0\n", pwb_cnt);
return -1;
}
@@ -709,6 +731,128 @@ static int test_zswap_incompressible(const char *root)
return ret;
}
+/*
+ * Trigger proactive zswap writeback with the following steps:
+ * 1. Allocate memory.
+ * 2. Push allocated memory into zswap.
+ * 3. Proactively write back zswap pages to swap
+ * using "zswap_writeback_only".
+ */
+static int proactive_writeback_workload(const char *cgroup, void *arg)
+{
+ long pagesize = sysconf(_SC_PAGESIZE);
+ size_t memsize = MB(4);
+ char reclaim_cmd[64];
+ char buf[pagesize];
+ int ret = -1;
+ char *mem;
+
+ mem = (char *)malloc(memsize);
+ if (!mem)
+ return ret;
+
+ for (int i = 0; i < pagesize; i++)
+ buf[i] = i < pagesize / 2 ? (char)i : 0;
+ for (int i = 0; i < memsize; i += pagesize)
+ memcpy(&mem[i], buf, pagesize);
+
+ /* Evict allocated memory into zswap. */
+ if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) {
+ ksft_print_msg("Failed to push pages into zswap\n");
+ goto out;
+ }
+
+ /* Trigger proactive zswap writeback for the same amount. */
+ snprintf(reclaim_cmd, sizeof(reclaim_cmd), "%zu zswap_writeback_only", memsize);
+ if (cg_write(cgroup, "memory.reclaim", reclaim_cmd)) {
+ ksft_print_msg("memory.reclaim zswap_writeback_only failed\n");
+ goto out;
+ }
+
+ ret = 0;
+out:
+ free(mem);
+ return ret;
+}
+
+static int check_writeback_invalid_inputs(const char *cgroup)
+{
+ static char * const bad_inputs[] = {
+ "zswap_writeback_only",
+ "1M zswap_writeback_only swappiness=60",
+ "1M swappiness=60 zswap_writeback_only",
+ "1M zswap_writeback_only swappiness=max",
+ "1M swappiness=max zswap_writeback_only",
+ };
+ int i, rc;
+
+ for (i = 0; i < ARRAY_SIZE(bad_inputs); i++) {
+ rc = cg_write(cgroup, "memory.reclaim", bad_inputs[i]);
+ if (rc != -EINVAL) {
+ ksft_print_msg("memory.reclaim '%s': returned %d, expected %d\n",
+ bad_inputs[i], rc, -EINVAL);
+ return -1;
+ }
+ }
+ return 0;
+}
+
+static int test_zswap_proactive_writeback(const char *root)
+{
+ long pwb_before, wb_before, pwb_after, wb_after;
+ long pwb_delta, wb_delta;
+ int ret = KSFT_FAIL;
+ char *test_group;
+
+ if (cg_read_strcmp(root, "memory.zswap.writeback", "1"))
+ return KSFT_SKIP;
+
+ test_group = cg_name(root, "zswap_proactive_test");
+ if (!test_group)
+ return KSFT_FAIL;
+ if (cg_create(test_group))
+ goto out;
+ if (check_writeback_invalid_inputs(test_group))
+ goto out;
+
+ pwb_before = get_cg_pwb_count(test_group);
+ wb_before = get_cg_wb_count(test_group);
+ if (pwb_before < 0 || wb_before < 0)
+ goto out;
+
+ if (cg_run(test_group, proactive_writeback_workload, NULL))
+ goto out;
+
+ pwb_after = get_cg_pwb_count(test_group);
+ wb_after = get_cg_wb_count(test_group);
+ if (pwb_after < 0 || wb_after < 0)
+ goto out;
+
+ pwb_delta = pwb_after - pwb_before;
+ wb_delta = wb_after - wb_before;
+
+ if (pwb_delta <= 0) {
+ ksft_print_msg("zswpwb_proactive did not increase: delta=%ld\n",
+ pwb_delta);
+ goto out;
+ }
+ if (wb_delta <= 0) {
+ ksft_print_msg("zswpwb did not increase: delta=%ld\n", wb_delta);
+ goto out;
+ }
+ if (pwb_delta > wb_delta) {
+ ksft_print_msg("zswpwb_proactive delta (%ld) > zswpwb delta (%ld)\n",
+ pwb_delta, wb_delta);
+ goto out;
+ }
+
+ ret = KSFT_PASS;
+out:
+ cg_destroy(test_group);
+ free(test_group);
+ return ret;
+}
+
#define T(x) { x, #x }
struct zswap_test {
int (*fn)(const char *root);
@@ -722,6 +866,7 @@ struct zswap_test {
T(test_no_kmem_bypass),
T(test_no_invasive_cgroup_shrink),
T(test_zswap_incompressible),
+ T(test_zswap_proactive_writeback),
};
#undef T
--
2.34.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v2 0/4] mm/zswap: Implement per-cgroup proactive writeback
2026-05-25 12:22 [PATCH v2 0/4] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
` (3 preceding siblings ...)
2026-05-25 12:22 ` [PATCH v2 4/4] selftests/cgroup: Add tests for zswap " Hao Jia
@ 2026-05-25 19:24 ` Andrew Morton
4 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2026-05-25 19:24 UTC (permalink / raw)
To: Hao Jia
Cc: tj, hannes, shakeel.butt, mhocko, yosry, mkoutny, nphamcs,
chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
linux-kernel, linux-doc, Hao Jia
On Mon, 25 May 2026 20:22:38 +0800 Hao Jia <jiahao.kernel@gmail.com> wrote:
> Zswap currently writes back pages to backing swap reactively, triggered
> either by the shrinker or by the pool reaching its size limit. Although
> proactive memory reclaim can automatically write back a portion of zswap
> pages via the shrinker, it cannot explicitly control the amount of
> writeback for a specific memory cgroup. Moreover, proactive memory reclaim
> may not always be triggered during a steady state.
>
> In certain scenarios, it is desirable to trigger writeback in advance to
> free up memory. For example, users may want to prepare for an upcoming
> memory-intensive workload by flushing cold memory to the backing storage
> when the system is relatively idle.
>
> This patch series introduces a "zswap_writeback_only" key to memory.reclaim
> cgroup interface, allowing users to proactively write back cold compressed
> pages from zswap to the backing swap device. When specified, this key
> bypasses standard memory reclaim and exclusively performs proactive zswap
> writeback up to the requested budget. If omitted, the default reclaim
> behavior remains unchanged.
Thanks. AI review found a few things to complain about, one of them
described as "preexisting".
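
For readers following the thread, the interface described in the quoted
cover letter could be driven from userspace roughly as follows. This is
only a sketch, assuming a kernel with this series applied and a cgroup v2
directory path supplied by the caller. The helper names
build_writeback_cmd() and request_zswap_writeback() are hypothetical and
are not part of the series. The command format "<bytes> zswap_writeback_only"
mirrors what the selftest in patch 4/4 writes to memory.reclaim.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: format "<bytes> zswap_writeback_only" into buf.
 * Returns 0 on success, -1 if the output was truncated or formatting failed.
 */
static int build_writeback_cmd(char *buf, size_t len, size_t bytes)
{
	int n = snprintf(buf, len, "%zu zswap_writeback_only", bytes);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write the command to <cgroup_dir>/memory.reclaim.
 * Assumes the "zswap_writeback_only" key from this series is available.
 * Returns 0 if the whole command was written, -1 otherwise.
 */
static int request_zswap_writeback(const char *cgroup_dir, size_t bytes)
{
	char path[4096], cmd[64];
	ssize_t written;
	int fd, n;

	n = snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	if (build_writeback_cmd(cmd, sizeof(cmd), bytes))
		return -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	written = write(fd, cmd, strlen(cmd));
	close(fd);

	return written == (ssize_t)strlen(cmd) ? 0 : -1;
}
```

Per the selftest's check_writeback_invalid_inputs(), the key without a
size, or combined with swappiness=, is expected to be rejected with
-EINVAL, so a caller would pass only the byte count alongside the key.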
^ permalink raw reply [flat|nested] 6+ messages in thread