From: Yosry Ahmed <yosry@kernel.org>
To: Hao Jia <jiahao.kernel@gmail.com>
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
nphamcs@gmail.com, chengming.zhou@linux.dev,
muchun.song@linux.dev, roman.gushchin@linux.dev,
cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
Hao Jia <jiahao1@lixiang.com>
Subject: Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
Date: Sat, 30 May 2026 01:37:18 +0000
Message-ID: <aho-Z6wshceTAYd9@google.com>
In-Reply-To: <20260526114601.67041-3-jiahao.kernel@gmail.com>
On Tue, May 26, 2026 at 07:45:59PM +0800, Hao Jia wrote:
> From: Hao Jia <jiahao1@lixiang.com>
>
> Zswap currently writes back pages to backing swap reactively, triggered
> either by the shrinker or when the pool reaches its size limit. There is
> no mechanism to control the amount of writeback for a specific memory
> cgroup. However, users may want to proactively write back zswap pages,
> e.g., to free up memory for other applications or to prepare for
> memory-intensive workloads.
>
> Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> interface. When specified, this key bypasses standard memory reclaim
> and exclusively performs proactive zswap writeback up to the requested
> budget. If omitted, the default reclaim behavior remains unchanged.
>
> Example usage:
> # Write back 100MB of pages from zswap to the backing swap
> echo "100M zswap_writeback_only" > memory.reclaim
>
> Note that the actual amount written back may be less than requested due
> to the zswap second-chance algorithm: referenced entries are rotated on
> the LRU on the first encounter and only written back on a second pass.
> If fewer bytes are written back than requested, -EAGAIN is returned,
> matching the existing memory.reclaim semantics.
>
> Internally, extend user_proactive_reclaim() to parse the new
> "zswap_writeback_only" token and invoke the dedicated handler. Add
> zswap_proactive_writeback() to walk the target memcg subtree via the
> per-memcg writeback cursor, draining per-node zswap LRUs through
> list_lru_walk_one() with the shrink_memcg_cb() callback.
>
> Suggested-by: Yosry Ahmed <yosry@kernel.org>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 18 +++-
> Documentation/admin-guide/mm/zswap.rst | 11 +-
> include/linux/zswap.h | 7 ++
> mm/vmscan.c | 14 +++
> mm/zswap.c | 138 ++++++++++++++++++++++++
> 5 files changed, 185 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6efd0095ed99..6564abf0dec5 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
>
> The following nested keys are defined.
>
> - ========== ================================
> + ==================== ==================================================
> swappiness Swappiness value to reclaim with
> - ========== ================================
> + zswap_writeback_only Only perform proactive zswap writeback
> + ==================== ==================================================
>
> Specifying a swappiness value instructs the kernel to perform
> the reclaim with that swappiness value. Note that this has the
> @@ -1437,6 +1438,19 @@ The following nested keys are defined.
> The valid range for swappiness is [0-200, max], setting
> swappiness=max exclusively reclaims anonymous memory.
>
> + The zswap_writeback_only key skips ordinary memory reclaim and
> + writes back pages from zswap to the backing swap device until
> + the requested amount has been written or no further candidates
> + are found. This is useful to proactively offload cold pages from
> + the zswap pool to the swap device. It is only available if
> + zswap writeback is enabled. zswap_writeback_only cannot be combined
> + with swappiness; specifying both returns -EINVAL.
> +
> + Example::
> +
> + # Write back up to 100MB of pages from zswap to the backing swap
> + echo "100M zswap_writeback_only" > memory.reclaim
The memcg folks need to chime in about the interface here. An alternative
would be a separate interface file (e.g. memory.zswap.do_writeback or
memory.zswap.writeback.reclaim, or something along those lines).
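
To make the interface question concrete, here is a minimal userspace
sketch of what a caller would write under the memory.reclaim variant.
It assumes this patch is applied (the "zswap_writeback_only" key does not
exist otherwise); build_reclaim_request() and write_request() are
hypothetical helper names, not an existing API. With a separate file, the
payload would presumably be just the amount, and only the path would change.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: format the string written to the cgroup file.
 * @amount is passed through as-is (e.g. "100M"). The key name comes
 * from the patch under review and is an assumption on any other kernel.
 */
static int build_reclaim_request(char *buf, size_t len, const char *amount,
				 bool zswap_writeback_only)
{
	assert(buf && amount && len > 0);

	if (zswap_writeback_only)
		return snprintf(buf, len, "%s zswap_writeback_only", amount);
	return snprintf(buf, len, "%s", amount);
}

/*
 * Hypothetical helper: write @req to @path in a single write().
 * Returns 0 on success or a negative errno. With the patch, -EAGAIN
 * would mean that less than the requested amount was written back.
 */
static int write_request(const char *path, const char *req)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;
	int err;

	if (fd < 0)
		return -errno;

	ret = write(fd, req, strlen(req));
	err = ret < 0 ? -errno : 0;
	close(fd);
	return err;
}
```

A caller would build the request once and pass it to write_request() with
the path of the target cgroup's memory.reclaim file, retrying or giving up
on -EAGAIN as it sees fit.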
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 73e64a635690..7bcbf788f634 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1679,6 +1679,144 @@ int zswap_load(struct folio *folio)
> return 0;
> }
>
> +/*
> + * Maximum LRU scan limit:
> + * number of entries to scan per page of remaining budget.
> + */
> +#define ZSWAP_PROACTIVE_WB_SCAN_RATIO 16UL
> +/*
> + * Batch size for proactive writeback:
> + * - As the per-memcg writeback target in the outer memcg loop.
> + * - As the per-walk budget passed to list_lru_walk_one().
> + */
> +#define ZSWAP_PROACTIVE_WB_BATCH 128UL
> +
> +/*
> + * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
> + * Returns the number of pages written back, or -ENOENT if @memcg is a
> + * zombie or has writeback disabled.
> + */
> +static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
> + unsigned long nr_to_write)
> +{
> + unsigned long nr_written = 0;
> + int nid;
> +
> + if (!mem_cgroup_zswap_writeback_enabled(memcg))
> + return -ENOENT;
> +
> + if (!mem_cgroup_online(memcg))
> + return -ENOENT;
> +
> + for_each_node_state(nid, N_NORMAL_MEMORY) {
> + bool encountered_page_in_swapcache = false;
> + unsigned long nr_to_scan, nr_scanned = 0;
> +
> + /*
> + * Cap by LRU length: bounds rewalks when referenced
> + * entries keep rotating to the tail.
> + */
> + nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
> + if (!nr_to_scan)
> + continue;
> +
> + /*
> + * Cap by SCAN_RATIO * remaining budget: bounds scan cost
> + * to the remaining writeback budget.
> + */
> + nr_to_scan = min(nr_to_scan,
> + (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
> +
> + while (nr_scanned < nr_to_scan) {
> + unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
> + nr_to_scan - nr_scanned);
> +
> + if (signal_pending(current))
> + return nr_written;
> +
> + /*
> + * Account for the committed budget rather than the walker's
> + * actual delta. If the list is emptied concurrently, the
> + * walker visits nothing and nr_scanned would never advance.
> + */
> + nr_scanned += nr_to_walk;
> +
> + nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> + &shrink_memcg_cb,
> + &encountered_page_in_swapcache,
> + &nr_to_walk);
> +
> + if (nr_written >= nr_to_write)
> + return nr_written;
> + if (encountered_page_in_swapcache)
> + break;
> +
> + cond_resched();
> + }
> + }
> +
> + return nr_written;
> +}
> +
> +int zswap_proactive_writeback(struct mem_cgroup *memcg,
> + unsigned long nr_to_writeback)
> +{
> + struct mem_cgroup *iter_memcg;
> + unsigned long nr_written = 0;
> + int failures = 0, attempts = 0;
> +
> + if (!memcg)
> + return -EINVAL;
> + if (!nr_to_writeback)
> + return 0;
> +
> + /*
> + * Writeback will be aborted with -EAGAIN if we encounter
> + * the following MAX_RECLAIM_RETRIES times:
> + * - No writeback-candidate memcgs found in a subtree walk.
> + * - A writeback-candidate memcg wrote back zero pages.
> + */
> + while (nr_written < nr_to_writeback) {
> + unsigned long batch_size;
> + long shrunk;
> +
> + if (signal_pending(current))
> + return -EINTR;
> +
> + iter_memcg = zswap_mem_cgroup_iter(memcg);
> +
> + if (!iter_memcg) {
> + /*
> + * Continue without incrementing failures if we found
> + * candidate memcgs in the last subtree walk.
> + */
> + if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
> + return -EAGAIN;
> + attempts = 0;
> + continue;
> + }
> +
> + batch_size = min(nr_to_writeback - nr_written,
> + ZSWAP_PROACTIVE_WB_BATCH);
> + shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
> + mem_cgroup_put(iter_memcg);
> +
> + /* Writeback-disabled or offline: skip without counting. */
> + if (shrunk == -ENOENT)
> + continue;
> +
> + ++attempts;
> + if (shrunk > 0)
> + nr_written += shrunk;
> + else if (++failures == MAX_RECLAIM_RETRIES)
> + return -EAGAIN;
> +
> + cond_resched();
> + }
> +
> + return 0;
> +}
> +
There is a lot of copy+paste from shrink_worker() and shrink_memcg()
here. We really should be able to reuse shrink_memcg().

Is the main difference that we are scanning in batches here? I think we
can have shrink_memcg() do that too. If anything, it might make the
shrinker more efficient. Over-reclaim is of course a concern, especially
in the zswap_store() path, where the overhead can be noticeable. Maybe we
can parameterize the batch size based on the code path.

Nhat, what do you think?
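
To illustrate the batch size tradeoff, here is a rough userspace model,
not kernel code. All names (model_entry, model_lru, model_walk,
model_shrink) are hypothetical, and the LRU is simplified to a ring with
a second-chance bit. It only shows that one shrink routine can take the
batch size as a parameter, and that a larger batch can overshoot a small
writeback target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a zswap LRU entry. */
struct model_entry {
	bool referenced;	/* second-chance bit */
	bool written;		/* already written back */
};

/* Hypothetical stand-in for one per-memcg, per-node LRU (a ring). */
struct model_lru {
	struct model_entry *entries;
	size_t nr;		/* number of slots */
	size_t head;		/* next slot to visit */
};

/*
 * Visit up to @batch entries. A referenced entry gets its bit cleared
 * and is skipped (second chance); an unreferenced one is "written back".
 * Returns the number of entries written back in this walk.
 */
static size_t model_walk(struct model_lru *lru, size_t batch)
{
	size_t written = 0;

	for (size_t i = 0; i < batch && lru->nr > 0; i++) {
		struct model_entry *e = &lru->entries[lru->head];

		lru->head = (lru->head + 1) % lru->nr;
		if (e->written)
			continue;
		if (e->referenced) {
			e->referenced = false;
			continue;
		}
		e->written = true;
		written++;
	}
	return written;
}

/*
 * Walk in batches of @batch until @nr_to_write entries are written or
 * @max_scan entries have been scanned. The target is only checked
 * between walks, so the result can exceed @nr_to_write by up to
 * @batch - 1 entries.
 */
static size_t model_shrink(struct model_lru *lru, size_t nr_to_write,
			   size_t batch, size_t max_scan)
{
	size_t written = 0, scanned = 0;

	assert(batch > 0);

	while (written < nr_to_write && scanned < max_scan) {
		size_t walk = batch;

		if (walk > max_scan - scanned)
			walk = max_scan - scanned;
		scanned += walk;
		written += model_walk(lru, walk);
	}
	return written;
}
```

In this model, a store-path caller would pass a batch of 1 to avoid writing
back more than it needs, while a proactive or background caller could pass
a larger batch and accept some overshoot.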