Linux cgroups development
* [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback
@ 2026-05-11 10:51 Hao Jia
  2026-05-11 10:51 ` [PATCH 1/3] mm/zswap: Make shrink_worker writeback cursor per-memcg Hao Jia
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Hao Jia @ 2026-05-11 10:51 UTC (permalink / raw)
  To: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin
  Cc: cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia

From: Hao Jia <jiahao1@lixiang.com>

Zswap currently writes back pages to backing swap devices reactively,
triggered either by memory pressure via the shrinker or by the pool
reaching its size limit. However, this reactive approach makes writeback
timing indeterminate and can disrupt latency-sensitive workloads when
eviction happens to coincide with a critical execution window.

Furthermore, in certain scenarios, it is desirable to trigger writeback
in advance to free up memory. For example, users may want to prepare for
an upcoming memory-intensive workload by flushing cold memory to the
backing storage when the system is relatively idle.

To address these issues, this patch series introduces a per-cgroup
interface that allows users to proactively write back cold compressed
pages from zswap to the backing swap device.

Users can trigger writeback by writing to this interface with the following
parameters:

- "max=<bytes>" : Optional. The maximum amount of data to write back
    (default: unlimited).
- "<age>" : Required. The minimum age of the pages to write back
    (in seconds). Only pages that have been in the zswap pool for at
    least this amount of time will be written back.

Example usage:
  # Write back pages older than 1 hour (3600 seconds), max 10MB
  echo "max=10M 3600" > memory.zswap.proactive_writeback
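
To make the accepted syntax concrete, here is a small stand-alone model of
the interface's input parsing (a hypothetical Python sketch, not part of the
series; the suffix handling imitates the kernel's memparse() K/M/G suffixes,
and the error cases mirror the -EINVAL checks described below):

```python
def parse_proactive_writeback(buf):
    """Model of the memory.zswap.proactive_writeback input syntax:
    an optional 'max=<bytes>' token (K/M/G suffixes accepted) and a
    required non-zero '<age>' token in seconds."""
    suffixes = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    max_bytes = None          # None means unlimited (the default)
    age_sec = None            # required; at most one age token
    for token in buf.split():
        if token.startswith("max="):
            val = token[len("max="):]
            mult = suffixes.get(val[-1:].upper(), 1)
            digits = val[:-1] if mult != 1 else val
            max_bytes = int(digits) * mult   # raises on bad input (-EINVAL)
        elif age_sec is None:
            age_sec = int(token)             # raises on non-numeric token
        else:
            raise ValueError("duplicate age")
    if not age_sec:
        raise ValueError("age is required and must be non-zero")
    return max_bytes, age_sec
```

For example, parse_proactive_writeback("max=10M 3600") yields
(10485760, 3600), matching the echo invocation above.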

Patch 1: Move the global zswap shrink cursor into struct mem_cgroup as a
  per-memcg zswap_wb_iter, so patch 2 can scope writeback to a given memcg
  and make forward progress across its subtree on repeated invocations.

Patch 2: Add the memory.zswap.proactive_writeback cgroupv2 interface,
  allowing users to trigger writeback with optional size limit and
  age threshold.

Patch 3: Add a zswpwb_proactive counter to memory.stat and /proc/vmstat
  to count the pages written back by proactive writeback.

Hao Jia (3):
  mm/zswap: Make shrink_worker writeback cursor per-memcg
  mm/zswap: Implement proactive writeback
  mm/zswap: Add per-memcg stat for proactive writeback

 Documentation/admin-guide/cgroup-v2.rst |  28 +++
 include/linux/memcontrol.h              |   6 +
 include/linux/vm_event_item.h           |   1 +
 include/linux/zswap.h                   |  17 ++
 mm/memcontrol.c                         |  80 +++++++
 mm/vmstat.c                             |   1 +
 mm/zswap.c                              | 303 ++++++++++++++++++++----
 7 files changed, 390 insertions(+), 46 deletions(-)

--
2.34.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] mm/zswap: Make shrink_worker writeback cursor per-memcg
  2026-05-11 10:51 [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
@ 2026-05-11 10:51 ` Hao Jia
  2026-05-11 10:51 ` [PATCH 2/3] mm/zswap: Implement proactive writeback Hao Jia
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Hao Jia @ 2026-05-11 10:51 UTC (permalink / raw)
  To: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin
  Cc: cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia

From: Hao Jia <jiahao1@lixiang.com>

The zswap background writeback worker shrink_worker() uses a global
cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
across the online memcgs under root_mem_cgroup.

Proactive writeback, about to be introduced by
memory.zswap.proactive_writeback, also wants a similar per-memcg cursor
that is scoped to the specified memcg, so that repeated invocations
against the same memcg make forward progress across its descendant
memcgs instead of restarting from the first child memcg each time.

To that end, group the cursor and its protecting spinlock into a
zswap_wb_iter struct and embed it in struct mem_cgroup, so that each
memcg manages its own cursor. Accordingly, shrink_worker() now uses the
lock and cursor in root_mem_cgroup->zswap_wb_iter.

Because the cursor is now per-memcg, the offline cleanup must visit
every ancestor that could be holding a reference to the dying memcg.
Factor out __zswap_memcg_offline_cleanup() and walk from dead_memcg up
to the root.

No functional change intended for shrink_worker().
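
The retained-cursor semantics can be illustrated with a toy model
(a hypothetical Python sketch, not kernel code; the kernel equivalent is
mem_cgroup_iter() advancing root->zswap_wb_iter.pos under the iter lock):
successive calls walk the subtree cyclically in pre-order, yield None once
per full round trip, and then restart from the subtree root.

```python
class Cgroup:
    """Minimal stand-in for a memcg subtree node."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def preorder(root):
    yield root
    for child in root.children:
        yield from preorder(child)

class WbIter:
    """Retained cursor: each next() call advances the saved position;
    after the walk wraps, one None is returned and the walk restarts."""
    def __init__(self, root):
        self.root = root
        self.walk = None
    def next(self):
        if self.walk is None:
            self.walk = preorder(self.root)
        pos = next(self.walk, None)
        if pos is None:
            self.walk = None   # restart on the following call
        return pos
```

For a subtree root(a(b), c), successive calls yield root, a, b, c,
None, root, ... so repeated invocations make forward progress instead
of restarting from the first child each time.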

Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
 include/linux/memcontrol.h |   6 ++
 include/linux/zswap.h      |   9 +++
 mm/memcontrol.c            |   3 +
 mm/zswap.c                 | 116 +++++++++++++++++++++++++------------
 4 files changed, 98 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..00ae646a3a15 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -228,6 +228,12 @@ struct mem_cgroup {
 	 * swap, and from being swapped out on zswap store failures.
 	 */
 	bool zswap_writeback;
+
+	/*
+	 * Per-memcg writeback cursor: used by shrink_worker() on the root
+	 * memcg, and by proactive writeback on non-root memcgs.
+	 */
+	struct zswap_wb_iter zswap_wb_iter;
 #endif
 
 	/* vmpressure notifications */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..efa6b551217e 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -11,6 +11,15 @@ extern atomic_long_t zswap_stored_pages;
 
 #ifdef CONFIG_ZSWAP
 
+/* Iteration cursor for zswap writeback over a memcg's subtree. */
+struct zswap_wb_iter {
+	/* protects @pos against concurrent advances */
+	spinlock_t lock;
+	struct mem_cgroup *pos;
+};
+
+void zswap_wb_iter_init(struct zswap_wb_iter *iter);
+
 struct zswap_lruvec_state {
 	/*
 	 * Number of swapped in pages from disk, i.e not found in the zswap pool.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d466..409c41359dc8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4022,6 +4022,9 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	INIT_LIST_HEAD(&memcg->memory_peaks);
 	INIT_LIST_HEAD(&memcg->swap_peaks);
 	spin_lock_init(&memcg->peaks_lock);
+#ifdef CONFIG_ZSWAP
+	zswap_wb_iter_init(&memcg->zswap_wb_iter);
+#endif
 	memcg->socket_pressure = get_jiffies_64();
 #if BITS_PER_LONG < 64
 	seqlock_init(&memcg->socket_pressure_seqlock);
diff --git a/mm/zswap.c b/mm/zswap.c
index 4b5149173b0e..19538d6f169a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -163,9 +163,6 @@ struct zswap_pool {
 /* Global LRU lists shared by all zswap pools. */
 static struct list_lru zswap_list_lru;
 
-/* The lock protects zswap_next_shrink updates. */
-static DEFINE_SPINLOCK(zswap_shrink_lock);
-static struct mem_cgroup *zswap_next_shrink;
 static struct work_struct zswap_shrink_work;
 static struct shrinker *zswap_shrinker;
 
@@ -717,28 +714,85 @@ void zswap_folio_swapin(struct folio *folio)
 	}
 }
 
-/*
- * This function should be called when a memcg is being offlined.
+void zswap_wb_iter_init(struct zswap_wb_iter *iter)
+{
+	spin_lock_init(&iter->lock);
+}
+
+#ifdef CONFIG_MEMCG
+/**
+ * zswap_mem_cgroup_iter - advance the writeback cursor
+ * @root: subtree root whose cursor to advance
+ *
+ * Advance @root->zswap_wb_iter.pos to @root itself or the next online
+ * descendant. Passing root_mem_cgroup yields a global walk.
+ *
+ * The cursor is retained across invocations, so successive calls walk
+ * @root's subtree cyclically in pre-order and, after %NULL, restart
+ * from the beginning.
  *
- * Since the global shrinker shrink_worker() may hold a reference
- * of the memcg, we must check and release the reference in
- * zswap_next_shrink.
+ * The returned memcg carries an extra reference; release it with
+ * mem_cgroup_put().
  *
- * shrink_worker() must handle the case where this function releases
- * the reference of memcg being shrunk.
+ * Return: the next online memcg in @root's subtree, or @root itself,
+ * with an extra reference, or %NULL after a full round-trip.
  */
-void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
+static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
 {
-	/* lock out zswap shrinker walking memcg tree */
-	spin_lock(&zswap_shrink_lock);
-	if (zswap_next_shrink == memcg) {
+	struct mem_cgroup *memcg;
+
+	spin_lock(&root->zswap_wb_iter.lock);
+	do {
+		memcg = mem_cgroup_iter(root, root->zswap_wb_iter.pos, NULL);
+		root->zswap_wb_iter.pos = memcg;
+	} while (memcg && !mem_cgroup_tryget_online(memcg));
+	spin_unlock(&root->zswap_wb_iter.lock);
+
+	return memcg;
+}
+
+/*
+ * If @root's cursor currently points at @dead_memcg, advance it to the
+ * next online descendant so @dead_memcg can be freed.
+ */
+static void __zswap_memcg_offline_cleanup(struct mem_cgroup *root,
+					  struct mem_cgroup *dead_memcg)
+{
+	spin_lock(&root->zswap_wb_iter.lock);
+	if (root->zswap_wb_iter.pos == dead_memcg) {
 		do {
-			zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
-		} while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));
+			root->zswap_wb_iter.pos =
+				mem_cgroup_iter(root,
+						root->zswap_wb_iter.pos, NULL);
+		} while (root->zswap_wb_iter.pos &&
+			 !mem_cgroup_online(root->zswap_wb_iter.pos));
 	}
-	spin_unlock(&zswap_shrink_lock);
+	spin_unlock(&root->zswap_wb_iter.lock);
 }
 
+/*
+ * Called when a memcg is being offlined. If @memcg or any of its
+ * ancestors has a cursor pointing at @memcg, it must be advanced
+ * past @memcg before @memcg can be freed. Walk the chain and
+ * release such references.
+ */
+void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	do {
+		__zswap_memcg_offline_cleanup(parent, memcg);
+	} while ((parent = parent_mem_cgroup(parent)));
+}
+#else /* !CONFIG_MEMCG */
+static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
+{
+	return NULL;
+}
+
+void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) { }
+#endif /* CONFIG_MEMCG */
+
 /*********************************
 * zswap entry functions
 **********************************/
@@ -1328,38 +1382,28 @@ static void shrink_worker(struct work_struct *w)
 	 * - No writeback-candidate memcgs found in a memcg tree walk.
 	 * - Shrinking a writeback-candidate memcg failed.
 	 *
-	 * We save iteration cursor memcg into zswap_next_shrink,
+	 * We save the iteration cursor in root_mem_cgroup->zswap_wb_iter.pos,
 	 * which can be modified by the offline memcg cleaner
 	 * zswap_memcg_offline_cleanup().
 	 *
 	 * Since the offline cleaner is called only once, we cannot leave an
-	 * offline memcg reference in zswap_next_shrink.
+	 * offline memcg reference in root_mem_cgroup->zswap_wb_iter.pos.
 	 * We can rely on the cleaner only if we get online memcg under lock.
 	 *
 	 * If we get an offline memcg, we cannot determine if the cleaner has
 	 * already been called or will be called later. We must put back the
 	 * reference before returning from this function. Otherwise, the
-	 * offline memcg left in zswap_next_shrink will hold the reference
-	 * until the next run of shrink_worker().
+	 * offline memcg left in root_mem_cgroup->zswap_wb_iter.pos will hold
+	 * the reference until the next run of shrink_worker().
 	 */
 	do {
 		/*
-		 * Start shrinking from the next memcg after zswap_next_shrink.
-		 * When the offline cleaner has already advanced the cursor,
-		 * advancing the cursor here overlooks one memcg, but this
-		 * should be negligibly rare.
-		 *
-		 * If we get an online memcg, keep the extra reference in case
-		 * the original one obtained by mem_cgroup_iter() is dropped by
-		 * zswap_memcg_offline_cleanup() while we are shrinking the
-		 * memcg.
+		 * Start shrinking from the next memcg after
+		 * root_mem_cgroup->zswap_wb_iter.pos. When the offline cleaner
+		 * has already advanced the cursor, advancing the cursor here
+		 * overlooks one memcg, but this should be negligibly rare.
 		 */
-		spin_lock(&zswap_shrink_lock);
-		do {
-			memcg = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
-			zswap_next_shrink = memcg;
-		} while (memcg && !mem_cgroup_tryget_online(memcg));
-		spin_unlock(&zswap_shrink_lock);
+		memcg = zswap_mem_cgroup_iter(root_mem_cgroup);
 
 		if (!memcg) {
 			/*
-- 
2.34.1



* [PATCH 2/3] mm/zswap: Implement proactive writeback
  2026-05-11 10:51 [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
  2026-05-11 10:51 ` [PATCH 1/3] mm/zswap: Make shrink_worker writeback cursor per-memcg Hao Jia
@ 2026-05-11 10:51 ` Hao Jia
  2026-05-11 19:49   ` Nhat Pham
  2026-05-11 19:54   ` Nhat Pham
  2026-05-11 10:51 ` [PATCH 3/3] mm/zswap: Add per-memcg stat for " Hao Jia
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 12+ messages in thread
From: Hao Jia @ 2026-05-11 10:51 UTC (permalink / raw)
  To: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin
  Cc: cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia

From: Hao Jia <jiahao1@lixiang.com>

Zswap currently writes back pages to backing swap devices reactively,
triggered either by memory pressure via the shrinker or by the pool
reaching its size limit. This reactive approach offers no precise
control over when writeback happens, which can disturb latency-sensitive
workloads, and it cannot direct writeback at a specific memory cgroup.
Moreover, there are scenarios where users may want to proactively
write back cold pages from zswap to the backing swap device, for
example, to free up memory for other applications or to prepare for
upcoming memory-intensive workloads.

Therefore, implement a proactive writeback mechanism for zswap by
adding a new cgroup interface file memory.zswap.proactive_writeback
within the memory controller.

Users can trigger writeback by writing to this file with the following
parameters:
- max=<bytes>: The maximum amount of memory to write back (optional,
  default: unlimited).
- <age>: The minimum age of the pages to write back. Only pages that
  have been in zswap for at least this duration will be written back.

Example usage:
  # Write back pages older than 1 hour (3600 seconds), max 10MB
  echo "max=10M 3600" > memory.zswap.proactive_writeback

The implementation consists of:
1. Add store_time to struct zswap_entry to record when each entry was
   inserted into zswap, used for proactive writeback age comparison.
2. Introduce struct zswap_shrink_walk_arg, passed as the cb_arg to
   list_lru_walk_one() in both the shrinker and proactive paths. It
   carries the per-invocation cutoff_time and proactive flag down to
   shrink_memcg_cb(), and propagates the encountered_page_in_swapcache
   out-signal from the callback back to the caller.
3. Modify the callback shrink_memcg_cb() to proactively write back
   zswap_entries that meet the age threshold.
4. Add zswap_proactive_writeback() as the proactive writeback driver:
   a per-node batched list_lru_walk_one() loop bounded by the
   writeback budget.
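
The bounded-walk arithmetic in item 4 can be sketched as a small model
(hypothetical Python, not kernel code; the constants mirror
ZSWAP_PROACTIVE_WB_SCAN_RATIO and ZSWAP_PROACTIVE_WB_BATCH, and the LRU
walk is simulated by a per-batch hit rate):

```python
SCAN_RATIO = 16   # mirrors ZSWAP_PROACTIVE_WB_SCAN_RATIO
BATCH = 128       # mirrors ZSWAP_PROACTIVE_WB_BATCH

def shrink_node(lru_len, hit_rate, nr_to_write):
    """Model one per-node walk: the scan is capped both by the LRU
    length and by SCAN_RATIO * remaining write budget; each batch
    walks at most BATCH entries, of which a hit_rate fraction are old
    enough to be written back."""
    nr_written = 0
    nr_to_scan = min(lru_len, nr_to_write * SCAN_RATIO)
    nr_scanned = 0
    while nr_scanned < nr_to_scan:
        nr_to_walk = min(BATCH, nr_to_scan - nr_scanned)
        nr_scanned += nr_to_walk            # commit the budget up front
        nr_written += int(nr_to_walk * hit_rate)
        if nr_written >= nr_to_write:       # budget exhausted, stop early
            break
    return nr_written, nr_scanned
```

With an all-cold LRU the walk stops after one batch meets the budget;
with an all-young LRU the scan-ratio cap bounds how much of the list is
touched before giving up on this node.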

Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  24 ++++
 include/linux/zswap.h                   |   8 ++
 mm/memcontrol.c                         |  76 ++++++++++
 mm/zswap.c                              | 176 ++++++++++++++++++++++--
 4 files changed, 276 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..05b664b3b3e8 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1908,6 +1908,30 @@ The following nested keys are defined.
 	This setting has no effect if zswap is disabled, and swapping
 	is allowed unless memory.swap.max is set to 0.
 
+  memory.zswap.proactive_writeback
+	A write-only nested-keyed file which exists in non-root cgroups.
+
+	This interface allows proactive writeback of pages from the zswap
+	pool to the backing swap device. This is useful to offload cold
+	pages from the zswap pool to the slower swap device. It is only
+	available if zswap writeback is enabled.
+
+	Users can trigger writeback by writing to this file with the following
+	parameters:
+
+	- "max=<bytes>" : Optional. The maximum amount of data to write back
+	  (default: unlimited). Note that the kernel may write back somewhat
+	  more or less than this value.
+
+	- "<age>" : Required. The minimum age of the pages to write back
+	  (in seconds). Only pages that have been in the zswap pool for at
+	  least this amount of time will be written back.
+
+	Example::
+
+	  # Write back pages older than 1 hour (3600 seconds), max 10MB
+	  echo "max=10M 3600" > memory.zswap.proactive_writeback
+
   memory.pressure
 	A read-only nested-keyed file.
 
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index efa6b551217e..7a51b4f95017 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -44,6 +44,8 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_folio_swapin(struct folio *folio);
 bool zswap_is_enabled(void);
 bool zswap_never_enabled(void);
+int zswap_proactive_writeback(struct mem_cgroup *root, unsigned long nr_max_writeback,
+			      ktime_t cutoff);
 #else
 
 struct zswap_lruvec_state {};
@@ -78,6 +80,12 @@ static inline bool zswap_never_enabled(void)
 	return true;
 }
 
+static inline int zswap_proactive_writeback(struct mem_cgroup *root,
+					    unsigned long nr_max_writeback, ktime_t cutoff)
+{
+	return 0;
+}
+
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 409c41359dc8..ba7f7b1954a8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -70,6 +70,7 @@
 #include "memcontrol-v1.h"
 
 #include <linux/uaccess.h>
+#include <linux/parser.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/memcg.h>
@@ -5891,6 +5892,76 @@ static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	ZSWAP_WRITEBACK_MAX,
+	ZSWAP_WRITEBACK_AGE,
+	ZSWAP_WRITEBACK_ERR,
+};
+
+static const match_table_t zswap_writeback_tokens = {
+	{ ZSWAP_WRITEBACK_MAX, "max=%s" },
+	{ ZSWAP_WRITEBACK_AGE, "%u" },
+	{ ZSWAP_WRITEBACK_ERR, NULL },
+};
+
+static ssize_t zswap_proactive_writeback_write(struct kernfs_open_file *of,
+					       char *buf, size_t nbytes,
+					       loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long nr_max_writeback = ULONG_MAX;
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int age_sec;
+	bool age_set = false;
+	ktime_t cutoff_time;
+	char *token, *end;
+	int err;
+
+	if (!mem_cgroup_zswap_writeback_enabled(memcg))
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((token = strsep(&buf, " ")) != NULL) {
+		if (!strlen(token))
+			continue;
+
+		switch (match_token(token, zswap_writeback_tokens, args)) {
+		case ZSWAP_WRITEBACK_MAX:
+			nr_max_writeback = memparse(args[0].from, &end);
+			if (*end != '\0')
+				return -EINVAL;
+			nr_max_writeback >>= PAGE_SHIFT;
+			break;
+		case ZSWAP_WRITEBACK_AGE:
+			if (age_set)
+				return -EINVAL;
+
+			if (match_uint(&args[0], &age_sec))
+				return -EINVAL;
+			age_set = true;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	if (!age_set || !age_sec || !nr_max_writeback)
+		return -EINVAL;
+
+	cutoff_time = ktime_sub(ktime_get_boottime(),
+				ns_to_ktime((u64)age_sec * NSEC_PER_SEC));
+	/* age_sec >= uptime: no entry can be that old, skip the walk. */
+	if (ktime_to_ns(cutoff_time) <= 0)
+		return nbytes;
+
+	err = zswap_proactive_writeback(memcg, nr_max_writeback, cutoff_time);
+	if (err)
+		return err;
+
+	return nbytes;
+}
+
 static struct cftype zswap_files[] = {
 	{
 		.name = "zswap.current",
@@ -5908,6 +5979,11 @@ static struct cftype zswap_files[] = {
 		.seq_show = zswap_writeback_show,
 		.write = zswap_writeback_write,
 	},
+	{
+		.name = "zswap.proactive_writeback",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = zswap_proactive_writeback_write,
+	},
 	{ }	/* terminate */
 };
 #endif /* CONFIG_ZSWAP */
diff --git a/mm/zswap.c b/mm/zswap.c
index 19538d6f169a..1173ac6836fa 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -36,6 +36,7 @@
 #include <linux/workqueue.h>
 #include <linux/list_lru.h>
 #include <linux/zsmalloc.h>
+#include <linux/timekeeping.h>
 
 #include "swap.h"
 #include "internal.h"
@@ -160,6 +161,12 @@ struct zswap_pool {
 	char tfm_name[CRYPTO_MAX_ALG_NAME];
 };
 
+struct zswap_shrink_walk_arg {
+	ktime_t cutoff_time;
+	bool proactive;
+	bool encountered_page_in_swapcache;
+};
+
 /* Global LRU lists shared by all zswap pools. */
 static struct list_lru zswap_list_lru;
 
@@ -183,6 +190,7 @@ static struct shrinker *zswap_shrinker;
  * handle - zsmalloc allocation handle that stores the compressed page data
  * objcg - the obj_cgroup that the compressed memory is charged to
  * lru - handle to the pool's lru used to evict pages.
+ * store_time - Time when the entry was stored, for proactive writeback.
  */
 struct zswap_entry {
 	swp_entry_t swpentry;
@@ -192,6 +200,7 @@ struct zswap_entry {
 	unsigned long handle;
 	struct obj_cgroup *objcg;
 	struct list_head lru;
+	ktime_t store_time;
 };
 
 static struct xarray *zswap_trees[MAX_SWAPFILES];
@@ -1148,10 +1157,19 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 				       void *arg)
 {
 	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
-	bool *encountered_page_in_swapcache = (bool *)arg;
-	swp_entry_t swpentry;
+	struct zswap_shrink_walk_arg *walk_arg = arg;
+	bool proactive_wb = walk_arg && walk_arg->proactive;
 	enum lru_status ret = LRU_REMOVED_RETRY;
 	int writeback_result;
+	swp_entry_t swpentry;
+
+	/*
+	 * For proactive writeback, rotate young entries to the LRU tail
+	 * so that subsequent list_lru_walk_one() batches start past
+	 * them.
+	 */
+	if (proactive_wb && ktime_after(entry->store_time, walk_arg->cutoff_time))
+		return LRU_ROTATE;
 
 	/*
 	 * Second chance algorithm: if the entry has its referenced bit set, give it
@@ -1160,7 +1178,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	 */
 	if (entry->referenced) {
 		entry->referenced = false;
-		return LRU_ROTATE;
+		/* Proactive writeback is an explicit hint; don't rotate. */
+		if (!proactive_wb)
+			return LRU_ROTATE;
 	}
 
 	/*
@@ -1214,9 +1234,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 		 * into the warmer region. We should terminate shrinking (if we're in the dynamic
 		 * shrinker context).
 		 */
-		if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
+		if (writeback_result == -EEXIST && walk_arg) {
 			ret = LRU_STOP;
-			*encountered_page_in_swapcache = true;
+			walk_arg->encountered_page_in_swapcache = true;
 		}
 	} else {
 		zswap_written_back_pages++;
@@ -1228,8 +1248,12 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		struct shrink_control *sc)
 {
+	struct zswap_shrink_walk_arg walk_arg = {
+		.cutoff_time = KTIME_MAX,
+		.proactive = false,
+		.encountered_page_in_swapcache = false,
+	};
 	unsigned long shrink_ret;
-	bool encountered_page_in_swapcache = false;
 
 	if (!zswap_shrinker_enabled ||
 			!mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
@@ -1238,9 +1262,9 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 	}
 
 	shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
-		&encountered_page_in_swapcache);
+					  &walk_arg);
 
-	if (encountered_page_in_swapcache)
+	if (walk_arg.encountered_page_in_swapcache)
 		return SHRINK_STOP;
 
 	return shrink_ret ? shrink_ret : SHRINK_STOP;
@@ -1508,6 +1532,7 @@ static bool zswap_store_page(struct page *page,
 	entry->swpentry = page_swpentry;
 	entry->objcg = objcg;
 	entry->referenced = true;
+	entry->store_time = ktime_get_boottime();
 	if (entry->length) {
 		INIT_LIST_HEAD(&entry->lru);
 		zswap_lru_add(&zswap_list_lru, entry);
@@ -1681,6 +1706,141 @@ int zswap_load(struct folio *folio)
 	return 0;
 }
 
+/* Cap LRU scan to this many entries per page of remaining budget. */
+#define ZSWAP_PROACTIVE_WB_SCAN_RATIO	16UL
+/*
+ * Batch size for proactive writeback, used both as the per-memcg
+ * writeback target in the outer memcg loop and as the per-walk budget
+ * for list_lru_walk_one().
+ */
+#define ZSWAP_PROACTIVE_WB_BATCH	128UL
+
+/*
+ * Walk @memcg's per-node LRUs, writing back entries older than @cutoff
+ * up to @nr_to_write pages. Returns the number of pages written back,
+ * or -ENOENT if @memcg is a zombie or has writeback disabled.
+ */
+static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
+					 ktime_t cutoff,
+					 unsigned long nr_to_write)
+{
+	unsigned long nr_written = 0;
+	int nid;
+
+	if (!mem_cgroup_zswap_writeback_enabled(memcg))
+		return -ENOENT;
+
+	if (!mem_cgroup_online(memcg))
+		return -ENOENT;
+
+	for_each_node_state(nid, N_NORMAL_MEMORY) {
+		struct zswap_shrink_walk_arg walk_arg = {
+			.cutoff_time = cutoff,
+			.proactive = true,
+			.encountered_page_in_swapcache = false,
+		};
+		unsigned long nr_to_scan, nr_scanned = 0;
+
+		/*
+		 * Cap by LRU length: bounds rewalks when entries keep
+		 * rotating (young or referenced).
+		 */
+		nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
+		if (!nr_to_scan)
+			continue;
+
+		/*
+		 * Cap by SCAN_RATIO * remaining budget: bounds scan cost
+		 * to the remaining writeback budget.
+		 */
+		nr_to_scan = min(nr_to_scan,
+				 (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
+
+		while (nr_scanned < nr_to_scan) {
+			unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
+						       nr_to_scan - nr_scanned);
+
+			if (signal_pending(current))
+				return nr_written;
+
+			/*
+			 * Account the committed budget rather than the walker's
+			 * actual delta: if the list empties under us the walker
+			 * visits nothing and nr_scanned would never advance.
+			 */
+			nr_scanned += nr_to_walk;
+
+			nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
+							&shrink_memcg_cb, &walk_arg,
+							&nr_to_walk);
+
+			if (nr_written >= nr_to_write)
+				return nr_written;
+			if (walk_arg.encountered_page_in_swapcache)
+				break;
+
+			cond_resched();
+		}
+	}
+
+	return nr_written;
+}
+
+int zswap_proactive_writeback(struct mem_cgroup *root,
+			      unsigned long nr_max_writeback,
+			      ktime_t cutoff)
+{
+	struct mem_cgroup *memcg;
+	unsigned long nr_written = 0;
+	int failures = 0, attempts = 0;
+
+	/*
+	 * Writeback will be aborted with -EAGAIN if @nr_written is still
+	 * zero and we encounter the following MAX_RECLAIM_RETRIES times:
+	 * - No writeback-candidate memcgs found in a subtree walk.
+	 * - A writeback-candidate memcg wrote back zero pages.
+	 */
+	while (nr_written < nr_max_writeback) {
+		unsigned long nr_to_write;
+		long shrunk;
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		memcg = zswap_mem_cgroup_iter(root);
+
+		if (!memcg) {
+			/*
+			 * Continue without incrementing failures if we found
+			 * candidate memcgs in the last subtree walk.
+			 */
+			if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
+				goto out;
+			attempts = 0;
+			continue;
+		}
+
+		nr_to_write = min(nr_max_writeback - nr_written,
+				  ZSWAP_PROACTIVE_WB_BATCH);
+		shrunk = zswap_proactive_shrink_memcg(memcg, cutoff, nr_to_write);
+		mem_cgroup_put(memcg);
+
+		/* Writeback-disabled or offline: skip without counting. */
+		if (shrunk == -ENOENT)
+			continue;
+
+		++attempts;
+		if (shrunk > 0)
+			nr_written += shrunk;
+		else if (++failures == MAX_RECLAIM_RETRIES)
+			goto out;
+
+		cond_resched();
+	}
+out:
+	return nr_written ? 0 : -EAGAIN;
+}
+
 void zswap_invalidate(swp_entry_t swp)
 {
 	pgoff_t offset = swp_offset(swp);
-- 
2.34.1



* [PATCH 3/3] mm/zswap: Add per-memcg stat for proactive writeback
  2026-05-11 10:51 [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
  2026-05-11 10:51 ` [PATCH 1/3] mm/zswap: Make shrink_worker writeback cursor per-memcg Hao Jia
  2026-05-11 10:51 ` [PATCH 2/3] mm/zswap: Implement proactive writeback Hao Jia
@ 2026-05-11 10:51 ` Hao Jia
  2026-05-11 11:39 ` [PATCH 0/3] mm/zswap: Implement per-cgroup " Michal Koutný
  2026-05-11 19:53 ` Nhat Pham
  4 siblings, 0 replies; 12+ messages in thread
From: Hao Jia @ 2026-05-11 10:51 UTC (permalink / raw)
  To: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin
  Cc: cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia

From: Hao Jia <jiahao1@lixiang.com>

Currently, zswap writeback can be triggered reactively, by the shrinker
or by the pool limit being hit, or explicitly by the new proactive
writeback mechanism. However, the existing 'zswpwb' metric in
memory.stat and /proc/vmstat counts all written-back pages, making it
impossible to distinguish pages written back under memory pressure from
those written back proactively.

Add a new statistic 'zswpwb_proactive' to memory.stat and /proc/vmstat.
This counter tracks the number of pages written back due to proactive
writeback. This allows users to better monitor and tune the proactive
writeback mechanism.

Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  4 ++++
 include/linux/vm_event_item.h           |  1 +
 mm/memcontrol.c                         |  1 +
 mm/vmstat.c                             |  1 +
 mm/zswap.c                              | 11 +++++++++--
 5 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 05b664b3b3e8..29a189b18efc 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1734,6 +1734,10 @@ The following nested keys are defined.
 	  zswpwb
 		Number of pages written from zswap to swap.
 
+	  zswpwb_proactive
+		Number of pages written from zswap to swap by proactive
+		writeback. This is a subset of zswpwb.
+
 	  zswap_incomp
 		Number of incompressible pages currently stored in zswap
 		without compression. These pages could not be compressed to
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7a5bee0a20b6 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -138,6 +138,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		ZSWPIN,
 		ZSWPOUT,
 		ZSWPWB,
+		ZSWPWB_PROACTIVE,
 #endif
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba7f7b1954a8..830d895e77c3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -572,6 +572,7 @@ static const unsigned int memcg_vm_event_stat[] = {
 	ZSWPIN,
 	ZSWPOUT,
 	ZSWPWB,
+	ZSWPWB_PROACTIVE,
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	THP_FAULT_ALLOC,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..66fd06d1bb01 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1452,6 +1452,7 @@ const char * const vmstat_text[] = {
 	[I(ZSWPIN)]				= "zswpin",
 	[I(ZSWPOUT)]				= "zswpout",
 	[I(ZSWPWB)]				= "zswpwb",
+	[I(ZSWPWB_PROACTIVE)]			= "zswpwb_proactive",
 #endif
 #ifdef CONFIG_X86
 	[I(DIRECT_MAP_LEVEL2_SPLIT)]		= "direct_map_level2_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index 1173ac6836fa..bf23c46e838e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1048,7 +1048,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
  * freed.
  */
 static int zswap_writeback_entry(struct zswap_entry *entry,
-				 swp_entry_t swpentry)
+				 swp_entry_t swpentry,
+				 bool proactive)
 {
 	struct xarray *tree;
 	pgoff_t offset = swp_offset(swpentry);
@@ -1108,6 +1109,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	if (entry->objcg)
 		count_objcg_events(entry->objcg, ZSWPWB, 1);
 
+	if (proactive) {
+		count_vm_event(ZSWPWB_PROACTIVE);
+		if (entry->objcg)
+			count_objcg_events(entry->objcg, ZSWPWB_PROACTIVE, 1);
+	}
+
 	zswap_entry_free(entry);
 
 	/* folio is up to date */
@@ -1223,7 +1230,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	 */
 	spin_unlock(&l->lock);
 
-	writeback_result = zswap_writeback_entry(entry, swpentry);
+	writeback_result = zswap_writeback_entry(entry, swpentry, proactive_wb);
 
 	if (writeback_result) {
 		zswap_reject_reclaim_fail++;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback
  2026-05-11 10:51 [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
                   ` (2 preceding siblings ...)
  2026-05-11 10:51 ` [PATCH 3/3] mm/zswap: Add per-memcg stat for " Hao Jia
@ 2026-05-11 11:39 ` Michal Koutný
  2026-05-12 11:23   ` Hao Jia
  2026-05-11 19:53 ` Nhat Pham
  4 siblings, 1 reply; 12+ messages in thread
From: Michal Koutný @ 2026-05-11 11:39 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia


On Mon, May 11, 2026 at 06:51:46PM +0800, Hao Jia <jiahao.kernel@gmail.com> wrote:
> From: Hao Jia <jiahao1@lixiang.com>
> 
> Zswap currently writes back pages to backing swap devices reactively,
> triggered either by memory pressure via the shrinker or by the pool
> reaching its size limit. However, this reactive approach makes writeback
> timing indeterminate and can disrupt latency-sensitive workloads when
> eviction happens to coincide with a critical execution window.
> 
> Furthermore, in certain scenarios, it is desirable to trigger writeback
> in advance to free up memory. For example, users may want to prepare for
> an upcoming memory-intensive workload by flushing cold memory to the
> backing storage when the system is relatively idle.

I can imagine the zswap writeout can come at the least opportune
moment...

> To address these issues, this patch series introduces a per-cgroup
> interface that allows users to proactively write back cold compressed
> pages from zswap to the backing swap device.

...but I see this series is not only per-cgroup proactive reclaim but
it's also age-based reclaim.

The per-cg consumption and limits (and regular memory reclaim) are all
measured in sizes. These age-based invocations don't seem commensurable
with that (e.g. how would users in practice determine the desired input
here).

Could you explain more reasoning behind this design?

Thanks,
Michal



* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
  2026-05-11 10:51 ` [PATCH 2/3] mm/zswap: Implement proactive writeback Hao Jia
@ 2026-05-11 19:49   ` Nhat Pham
  2026-05-11 19:57     ` Yosry Ahmed
  2026-05-11 19:54   ` Nhat Pham
  1 sibling, 1 reply; 12+ messages in thread
From: Nhat Pham @ 2026-05-11 19:49 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia

On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
> From: Hao Jia <jiahao1@lixiang.com>
>
> Zswap currently writes back pages to backing swap devices reactively,
> triggered either by memory pressure via the shrinker or by the pool
> reaching its size limit. This reactive approach offers no precise
> control over when writeback happens, which can disturb latency-sensitive
> workloads, and it cannot direct writeback at a specific memory cgroup.
> However, there are scenarios where users might want to proactively
> write back cold pages from zswap to the backing swap device, for
> example, to free up memory for other applications or to prepare for
> upcoming memory-intensive workloads.
>
> Therefore, implement a proactive writeback mechanism for zswap by
> adding a new cgroup interface file memory.zswap.proactive_writeback
> within the memory controller.


We already have memory.reclaim, no? Would that not work to create
headroom generally for your use case? Is there a reason why we are
treating zswap memory as special here?


* Re: [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback
  2026-05-11 10:51 [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
                   ` (3 preceding siblings ...)
  2026-05-11 11:39 ` [PATCH 0/3] mm/zswap: Implement per-cgroup " Michal Koutný
@ 2026-05-11 19:53 ` Nhat Pham
  4 siblings, 0 replies; 12+ messages in thread
From: Nhat Pham @ 2026-05-11 19:53 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia

On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
> From: Hao Jia <jiahao1@lixiang.com>
>
> Zswap currently writes back pages to backing swap devices reactively,
> triggered either by memory pressure via the shrinker or by the pool
> reaching its size limit. However, this reactive approach makes writeback
> timing indeterminate and can disrupt latency-sensitive workloads when
> eviction happens to coincide with a critical execution window.

You can make the same argument about ordinary memory reclaim :) That's
why we have kswapd (asynchronous reclaim ahead of time) and proactive
reclaim solutions (memory.reclaim), which would all target zswap as
well.

>
> Furthermore, in certain scenarios, it is desirable to trigger writeback
> in advance to free up memory. For example, users may want to prepare for
> an upcoming memory-intensive workload by flushing cold memory to the
> backing storage when the system is relatively idle.

Would memory.reclaim not work here? Why are we treating zswap memory
footprint as special here, and spare file and anon?


* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
  2026-05-11 10:51 ` [PATCH 2/3] mm/zswap: Implement proactive writeback Hao Jia
  2026-05-11 19:49   ` Nhat Pham
@ 2026-05-11 19:54   ` Nhat Pham
  2026-05-12  9:37     ` Hao Jia
  1 sibling, 1 reply; 12+ messages in thread
From: Nhat Pham @ 2026-05-11 19:54 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia

On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
> From: Hao Jia <jiahao1@lixiang.com>
>
> Zswap currently writes back pages to backing swap devices reactively,
> triggered either by memory pressure via the shrinker or by the pool
> reaching its size limit. This reactive approach offers no precise
> control over when writeback happens, which can disturb latency-sensitive
> workloads, and it cannot direct writeback at a specific memory cgroup.
> However, there are scenarios where users might want to proactively
> write back cold pages from zswap to the backing swap device, for
> example, to free up memory for other applications or to prepare for
> upcoming memory-intensive workloads.
>
> Therefore, implement a proactive writeback mechanism for zswap by
> adding a new cgroup interface file memory.zswap.proactive_writeback
> within the memory controller.
>
> Users can trigger writeback by writing to this file with the following
> parameters:
> - max=<bytes>: The maximum amount of memory to write back (optional,
>   default: unlimited).
> - <age>: The minimum age of the pages to write back. Only pages that
>   have been in zswap for at least this duration will be written back.
>
> Example usage:
>   # Write back pages older than 1 hour (3600 seconds), max 10MB
>   echo "max=10M 3600" > memory.zswap.proactive_writeback
>
> The implementation consists of:
> 1. Add store_time to struct zswap_entry to record when each entry was
>    inserted into zswap, used for proactive writeback age comparison.
> 2. Introduce struct zswap_shrink_walk_arg, passed as the cb_arg to
>    list_lru_walk_one() in both the shrinker and proactive paths. It
>    carries the per-invocation cutoff_time and proactive flag down to
>    shrink_memcg_cb(), and propagates the encountered_page_in_swapcache
>    out-signal from the callback back to the caller.
> 3. Modify the callback function shrink_memcg_cb() to proactively
>    writeback zswap_entries that meet the time threshold.
> 4. Add zswap_proactive_writeback() as the proactive writeback driver:
>    a per-node batched list_lru_walk_one() loop bounded by the
>    writeback budget.
>
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  24 ++++
>  include/linux/zswap.h                   |   8 ++
>  mm/memcontrol.c                         |  76 ++++++++++
>  mm/zswap.c                              | 176 ++++++++++++++++++++++--
>  4 files changed, 276 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6efd0095ed99..05b664b3b3e8 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1908,6 +1908,30 @@ The following nested keys are defined.
>         This setting has no effect if zswap is disabled, and swapping
>         is allowed unless memory.swap.max is set to 0.
>
> +  memory.zswap.proactive_writeback
> +       A write-only nested-keyed file which exists in non-root cgroups.
> +
> +       This interface allows proactive writeback of pages from the zswap
> +       pool to the backing swap device. This is useful to offload cold
> +       pages from the zswap pool to the slower swap device. It is only
> +       available if zswap writeback is enabled.
> +
> +       Users can trigger writeback by writing to this file with the following
> +       parameters:
> +
> +       - "max=<bytes>" : Optional. The maximum amount of data to write back.
> +         (default: unlimited). Please note that the kernel can over or under
> +         writeback this value.
> +
> +       - "<age>" : Required. The minimum age of the pages to write back
> +         (in seconds). Only pages that have been in the zswap pool for at
> +         least this amount of time will be written back.
> +
> +       Example::
> +
> +         # Write back pages older than 1 hour (3600 seconds), max 10MB
> +         echo "max=10M 3600" > memory.zswap.proactive_writeback
> +
>    memory.pressure
>         A read-only nested-keyed file.
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index efa6b551217e..7a51b4f95017 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -44,6 +44,8 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
>  void zswap_folio_swapin(struct folio *folio);
>  bool zswap_is_enabled(void);
>  bool zswap_never_enabled(void);
> +int zswap_proactive_writeback(struct mem_cgroup *root, unsigned long nr_max_writeback,
> +                             ktime_t cutoff);
>  #else
>
>  struct zswap_lruvec_state {};
> @@ -78,6 +80,12 @@ static inline bool zswap_never_enabled(void)
>         return true;
>  }
>
> +static inline int zswap_proactive_writeback(struct mem_cgroup *root,
> +                                           unsigned long nr_max_writeback, ktime_t cutoff)
> +{
> +       return 0;
> +}
> +
>  #endif
>
>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 409c41359dc8..ba7f7b1954a8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -70,6 +70,7 @@
>  #include "memcontrol-v1.h"
>
>  #include <linux/uaccess.h>
> +#include <linux/parser.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/memcg.h>
> @@ -5891,6 +5892,76 @@ static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> +enum {
> +       ZSWAP_WRITEBACK_MAX,
> +       ZSWAP_WRITEBACK_AGE,
> +       ZSWAP_WRITEBACK_ERR,
> +};
> +
> +static const match_table_t zswap_writeback_tokens = {
> +       { ZSWAP_WRITEBACK_MAX, "max=%s" },
> +       { ZSWAP_WRITEBACK_AGE, "%u" },
> +       { ZSWAP_WRITEBACK_ERR, NULL },
> +};
> +
> +static ssize_t zswap_proactive_writeback_write(struct kernfs_open_file *of,
> +                                              char *buf, size_t nbytes,
> +                                              loff_t off)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +       unsigned long nr_max_writeback = ULONG_MAX;
> +       substring_t args[MAX_OPT_ARGS];
> +       unsigned int age_sec;
> +       bool age_set = false;
> +       ktime_t cutoff_time;
> +       char *token, *end;
> +       int err;
> +
> +       if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +               return -EINVAL;
> +
> +       buf = strstrip(buf);
> +
> +       while ((token = strsep(&buf, " ")) != NULL) {
> +               if (!strlen(token))
> +                       continue;
> +
> +               switch (match_token(token, zswap_writeback_tokens, args)) {
> +               case ZSWAP_WRITEBACK_MAX:
> +                       nr_max_writeback = memparse(args[0].from, &end);
> +                       if (*end != '\0')
> +                               return -EINVAL;
> +                       nr_max_writeback >>= PAGE_SHIFT;
> +                       break;
> +               case ZSWAP_WRITEBACK_AGE:
> +                       if (age_set)
> +                               return -EINVAL;
> +
> +                       if (match_uint(&args[0], &age_sec))
> +                               return -EINVAL;
> +                       age_set = true;
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       if (!age_set || !age_sec || !nr_max_writeback)
> +               return -EINVAL;
> +
> +       cutoff_time = ktime_sub(ktime_get_boottime(),
> +                               ns_to_ktime((u64)age_sec * NSEC_PER_SEC));
> +       /* age_sec >= uptime: no entry can be that old, skip the walk. */
> +       if (ktime_to_ns(cutoff_time) <= 0)
> +               return nbytes;
> +
> +       err = zswap_proactive_writeback(memcg, nr_max_writeback, cutoff_time);
> +       if (err)
> +               return err;
> +
> +       return nbytes;
> +}
> +
>  static struct cftype zswap_files[] = {
>         {
>                 .name = "zswap.current",
> @@ -5908,6 +5979,11 @@ static struct cftype zswap_files[] = {
>                 .seq_show = zswap_writeback_show,
>                 .write = zswap_writeback_write,
>         },
> +       {
> +               .name = "zswap.proactive_writeback",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .write = zswap_proactive_writeback_write,
> +       },
>         { }     /* terminate */
>  };
>  #endif /* CONFIG_ZSWAP */
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 19538d6f169a..1173ac6836fa 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -36,6 +36,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/list_lru.h>
>  #include <linux/zsmalloc.h>
> +#include <linux/timekeeping.h>
>
>  #include "swap.h"
>  #include "internal.h"
> @@ -160,6 +161,12 @@ struct zswap_pool {
>         char tfm_name[CRYPTO_MAX_ALG_NAME];
>  };
>
> +struct zswap_shrink_walk_arg {
> +       ktime_t cutoff_time;
> +       bool proactive;
> +       bool encountered_page_in_swapcache;
> +};
> +
>  /* Global LRU lists shared by all zswap pools. */
>  static struct list_lru zswap_list_lru;
>
> @@ -183,6 +190,7 @@ static struct shrinker *zswap_shrinker;
>   * handle - zsmalloc allocation handle that stores the compressed page data
>   * objcg - the obj_cgroup that the compressed memory is charged to
>   * lru - handle to the pool's lru used to evict pages.
> + * store_time - Time when the entry was stored, for proactive writeback.
>   */
>  struct zswap_entry {
>         swp_entry_t swpentry;
> @@ -192,6 +200,7 @@ struct zswap_entry {
>         unsigned long handle;
>         struct obj_cgroup *objcg;
>         struct list_head lru;
> +       ktime_t store_time;

On the implementation side - will this blow up struct zswap_entry
memory footprint? If so, can you guard this behind a CONFIG option, if
we are to go this route?


* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
  2026-05-11 19:49   ` Nhat Pham
@ 2026-05-11 19:57     ` Yosry Ahmed
  2026-05-12  9:32       ` Hao Jia
  0 siblings, 1 reply; 12+ messages in thread
From: Yosry Ahmed @ 2026-05-11 19:57 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Hao Jia, akpm, tj, hannes, shakeel.butt, mhocko, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia

On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
> >
> > From: Hao Jia <jiahao1@lixiang.com>
> >
> > Zswap currently writes back pages to backing swap devices reactively,
> > triggered either by memory pressure via the shrinker or by the pool
> > reaching its size limit. This reactive approach offers no precise
> > control over when writeback happens, which can disturb latency-sensitive
> > workloads, and it cannot direct writeback at a specific memory cgroup.
> > However, there are scenarios where users might want to proactively
> > write back cold pages from zswap to the backing swap device, for
> > example, to free up memory for other applications or to prepare for
> > upcoming memory-intensive workloads.
> >
> > Therefore, implement a proactive writeback mechanism for zswap by
> > adding a new cgroup interface file memory.zswap.proactive_writeback
> > within the memory controller.
>
>
> We already have memory.reclaim, no? Would that not work to create
> headroom generally for your use case? Is there a reason why we are
> treating zswap memory as special here?

+1, why do we need to specifically proactively reclaim the compressed memory?

Also, if we do need to minimize the compressed memory and force higher
writeback rates, we can do so with memory.zswap.max, right?


* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
  2026-05-11 19:57     ` Yosry Ahmed
@ 2026-05-12  9:32       ` Hao Jia
  0 siblings, 0 replies; 12+ messages in thread
From: Hao Jia @ 2026-05-12  9:32 UTC (permalink / raw)
  To: Yosry Ahmed, Nhat Pham
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, chengming.zhou,
	muchun.song, roman.gushchin, cgroups, linux-mm, linux-kernel,
	linux-doc, Hao Jia



On 2026/5/12 03:57, Yosry Ahmed wrote:
> On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>
>> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>>
>>> From: Hao Jia <jiahao1@lixiang.com>
>>>
>>> Zswap currently writes back pages to backing swap devices reactively,
>>> triggered either by memory pressure via the shrinker or by the pool
>>> reaching its size limit. This reactive approach offers no precise
>>> control over when writeback happens, which can disturb latency-sensitive
>>> workloads, and it cannot direct writeback at a specific memory cgroup.
>>> However, there are scenarios where users might want to proactively
>>> write back cold pages from zswap to the backing swap device, for
>>> example, to free up memory for other applications or to prepare for
>>> upcoming memory-intensive workloads.
>>>
>>> Therefore, implement a proactive writeback mechanism for zswap by
>>> adding a new cgroup interface file memory.zswap.proactive_writeback
>>> within the memory controller.
>>

Thanks Nhat, Yosry — let me address both comments together.

>>
>> We already have memory.reclaim, no? Would that not work to create
>> headroom generally for your use case? Is there a reason why we are
>> treating zswap memory as special here?
> 

Apologies for the lack of detailed explanation in the patch description, 
which led to the confusion.

While we are already utilizing memory.reclaim, it does not fully
address our requirements.

Our deployment runs a userspace proactive reclaimer that drives
memory.reclaim based on the system's runtime state (memory/CPU/IO
pressure, refault rate, ...) and workload-specific policy. That first
stage compresses cold anon pages into zswap. Entries that then remain
in zswap past a policy-defined age threshold are considered "twice
cold", and the reclaimer wants to write them back to the backing swap
device at a moment of its own choosing, to further reclaim the DRAM
still held by the compressed data.

This is the "second-level offloading" pattern described in Meta's TMO
paper [1]. zswap proactive writeback is what this series introduces to
address that second-level offloading stage.

[1] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf


> +1, why do we need to specifically proactively reclaim the compressed memory?
> 
> Also, if we do need to minimize the compressed memory and force higher
> writeback rates, we can do so with memory.zswap.max, right?

Here are a few reasons why memory.zswap.max is not enough:

1. Writing memory.zswap.max itself does not trigger any writeback
immediately. For a memcg that has reached steady state (one the
userspace reclaimer is no longer invoking memory.reclaim on), lowering
memory.zswap.max gives the reclaimer no way to trigger proactive
writeback for second-level offloading, because in steady state nothing
drives the zswap_store() -> shrink_memcg() path. The userspace
reclaimer still has no control over when proactive writeback happens.

2. memory.zswap.max currently triggers zswap writeback via
zswap_store() -> shrink_memcg(), and each over-limit event can write
back at most NR_NODES entries. If zswap residency is far above
memory.zswap.max, converging to the target size requires at least
O(over-limit pages / NR_NODES) zswap_store() events with no batching,
so proactive writeback has significant latency.

3. memory.zswap.max is a stateful interface. If the userspace
reclaimer crashes mid-operation, it may leave memory.zswap.max at a
lowered value, keeping the application in a persistently throttled
state.

4. Once the userspace reclaimer has lowered memory.zswap.max, a
rapidly expanding workload that triggers memory reclaim via
memory.high / kswapd / etc. can cause more to be written back than was
intended.

Thanks,
Hao


* Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
  2026-05-11 19:54   ` Nhat Pham
@ 2026-05-12  9:37     ` Hao Jia
  0 siblings, 0 replies; 12+ messages in thread
From: Hao Jia @ 2026-05-12  9:37 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia



On 2026/5/12 03:54, Nhat Pham wrote:
> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 19538d6f169a..1173ac6836fa 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -36,6 +36,7 @@
>>   #include <linux/workqueue.h>
>>   #include <linux/list_lru.h>
>>   #include <linux/zsmalloc.h>
>> +#include <linux/timekeeping.h>
>>
>>   #include "swap.h"
>>   #include "internal.h"
>> @@ -160,6 +161,12 @@ struct zswap_pool {
>>          char tfm_name[CRYPTO_MAX_ALG_NAME];
>>   };
>>
>> +struct zswap_shrink_walk_arg {
>> +       ktime_t cutoff_time;
>> +       bool proactive;
>> +       bool encountered_page_in_swapcache;
>> +};
>> +
>>   /* Global LRU lists shared by all zswap pools. */
>>   static struct list_lru zswap_list_lru;
>>
>> @@ -183,6 +190,7 @@ static struct shrinker *zswap_shrinker;
>>    * handle - zsmalloc allocation handle that stores the compressed page data
>>    * objcg - the obj_cgroup that the compressed memory is charged to
>>    * lru - handle to the pool's lru used to evict pages.
>> + * store_time - Time when the entry was stored, for proactive writeback.
>>    */
>>   struct zswap_entry {
>>          swp_entry_t swpentry;
>> @@ -192,6 +200,7 @@ struct zswap_entry {
>>          unsigned long handle;
>>          struct obj_cgroup *objcg;
>>          struct list_head lru;
>> +       ktime_t store_time;
> 
> On the implementation side - will this blow up struct zswap_entry
> memory footprint? If so, can you guard this behind a CONFIG option, if
> we are to go this route?

Thanks for the review. I'll address this in v2.

Thanks,
Hao


* Re: [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback
  2026-05-11 11:39 ` [PATCH 0/3] mm/zswap: Implement per-cgroup " Michal Koutný
@ 2026-05-12 11:23   ` Hao Jia
  0 siblings, 0 replies; 12+ messages in thread
From: Hao Jia @ 2026-05-12 11:23 UTC (permalink / raw)
  To: Michal Koutný
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia



On 2026/5/11 19:39, Michal Koutný wrote:
> On Mon, May 11, 2026 at 06:51:46PM +0800, Hao Jia <jiahao.kernel@gmail.com> wrote:
>> From: Hao Jia <jiahao1@lixiang.com>
>>
>> Zswap currently writes back pages to backing swap devices reactively,
>> triggered either by memory pressure via the shrinker or by the pool
>> reaching its size limit. However, this reactive approach makes writeback
>> timing indeterminate and can disrupt latency-sensitive workloads when
>> eviction happens to coincide with a critical execution window.
>>
>> Furthermore, in certain scenarios, it is desirable to trigger writeback
>> in advance to free up memory. For example, users may want to prepare for
>> an upcoming memory-intensive workload by flushing cold memory to the
>> backing storage when the system is relatively idle.
> 
> I can imagine the zswap writeout can come at the least possible
> moment...
> 
>> To address these issues, this patch series introduces a per-cgroup
>> interface that allows users to proactively write back cold compressed
>> pages from zswap to the backing swap device.
> 
> ...but I see this series is not only per-cgroup proactive reclaim but
> it's also age-based reclaim.
> 
> The per-cg consumption and limits (and regular memory reclaim) are all
> measured in sizes. This age-based invocations don't seem commensurable
> (e.g. how would users in practice determine what is the desired input to
> here).
> 

Thanks Michal — you are right. The series is both per-memcg *and*
age-based.

The interface carries a size budget, like memory.reclaim. The two
parameters play different roles:

   "write back up to <max> bytes, chosen from entries whose residency
    in zswap is at least <age>"

Size stays the unit of *amount*; age is just how we describe *which*
entries are eligible.


> Could you explain more reasoning behind this design?
> 

Context on the use case:

Our deployment runs a userspace proactive reclaimer driven by the
system's runtime state (memory/CPU/IO pressure, refault rate, ...)
and workload-specific policy. It uses memory.reclaim to drive
reclaim, which compresses cold anon pages into zswap as the first
stage. For entries that then remain in zswap past a policy-defined
age threshold, the reclaimer wants to write them back to the backing
swap device at a moment of its own choosing, to further reclaim the
DRAM still held by the compressed data.

Why age is a reasonable selector at this stage:

Pages in zswap have already passed a first-stage coldness judgement
(otherwise they would not have been compressed). For second-level
offloading, the question is which of them are cold *enough*.
Time-in-zswap is a natural proxy for that. A swap-in invalidates the
corresponding zswap entry and resets the clock, so by construction
an entry that has sat in zswap for N seconds has not been faulted in
for at least N seconds. Residency in zswap is therefore a strong
signal that the entry is not about to refault.

In our deployment the userspace reclaimer starts from a conservative 
threshold (the starting value depends on the workload) and adjusts it 
through closed-loop feedback:

   - on one side, the age distribution of zswap entries, to see
     whether there is a meaningful population past the threshold;
   - on the other side, the post-writeback refault rate and related
     signals, to confirm that entries written back were in fact cold
     enough.

Both <age> and max=<bytes> are tuned against these signals until the
realized writeback volume matches the target. This is the same
control-loop style already used to drive the first-stage
memory.reclaim budget.

Thanks,
Hao

