From: Yosry Ahmed <yosry@kernel.org>
To: Hao Jia <jiahao.kernel@gmail.com>
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
nphamcs@gmail.com, chengming.zhou@linux.dev,
muchun.song@linux.dev, roman.gushchin@linux.dev,
cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
Hao Jia <jiahao1@lixiang.com>
Subject: Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
Date: Sat, 30 May 2026 01:37:18 +0000
Message-ID: <aho-Z6wshceTAYd9@google.com>
In-Reply-To: <20260526114601.67041-3-jiahao.kernel@gmail.com>

On Tue, May 26, 2026 at 07:45:59PM +0800, Hao Jia wrote:
> From: Hao Jia <jiahao1@lixiang.com>
>
> Zswap currently writes back pages to backing swap reactively, triggered
> either by the shrinker or when the pool reaches its size limit. There is
> no mechanism to control the amount of writeback for a specific memory
> cgroup. However, users may want to proactively write back zswap pages,
> e.g., to free up memory for other applications or to prepare for
> memory-intensive workloads.
>
> Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> interface. When specified, this key bypasses standard memory reclaim
> and exclusively performs proactive zswap writeback up to the requested
> budget. If omitted, the default reclaim behavior remains unchanged.
>
> Example usage:
> # Write back 100MB of pages from zswap to the backing swap
> echo "100M zswap_writeback_only" > memory.reclaim
>
> Note that the actual amount written back may be less than requested due
> to the zswap second-chance algorithm: referenced entries are rotated on
> the LRU on the first encounter and only written back on a second pass.
> If fewer bytes are written back than requested, -EAGAIN is returned,
> matching the existing memory.reclaim semantics.
>
> Internally, extend user_proactive_reclaim() to parse the new
> "zswap_writeback_only" token and invoke the dedicated handler. Add
> zswap_proactive_writeback() to walk the target memcg subtree via the
> per-memcg writeback cursor, draining per-node zswap LRUs through
> list_lru_walk_one() with the shrink_memcg_cb() callback.
>
> Suggested-by: Yosry Ahmed <yosry@kernel.org>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 18 +++-
> Documentation/admin-guide/mm/zswap.rst | 11 +-
> include/linux/zswap.h | 7 ++
> mm/vmscan.c | 14 +++
> mm/zswap.c | 138 ++++++++++++++++++++++++
> 5 files changed, 185 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6efd0095ed99..6564abf0dec5 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
>
> The following nested keys are defined.
>
> - ========== ================================
> + ==================== ==================================================
> swappiness Swappiness value to reclaim with
> - ========== ================================
> + zswap_writeback_only Only perform proactive zswap writeback
> + ==================== ==================================================
>
> Specifying a swappiness value instructs the kernel to perform
> the reclaim with that swappiness value. Note that this has the
> @@ -1437,6 +1438,19 @@ The following nested keys are defined.
> The valid range for swappiness is [0-200, max], setting
> swappiness=max exclusively reclaims anonymous memory.
>
> + The zswap_writeback_only key skips ordinary memory reclaim and
> + writes back pages from zswap to the backing swap device until
> + the requested amount has been written or no further candidates
> + are found. This is useful to proactively offload cold pages from
> + the zswap pool to the swap device. It is only available if
> + zswap writeback is enabled. zswap_writeback_only cannot be combined
> + with swappiness; specifying both returns -EINVAL.
> +
> + Example::
> +
> + # Write back up to 100MB of pages from zswap to the backing swap
> + echo "100M zswap_writeback_only" > memory.reclaim

The memcg folks need to chime in on the interface here. An alternative
would be a separate interface file (e.g. memory.zswap.do_writeback,
memory.zswap.writeback.reclaim, or something similar).

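To make the comparison concrete, here is a rough userspace model of the
semantics documented in the hunk above: a size followed by optional keys,
with zswap_writeback_only and swappiness rejected when combined. It is only
a sketch of the documented behavior, not the vmscan.c change (which is not
quoted here). model_reclaim_req and model_parse_reclaim() are hypothetical
names, and the K/M/G suffix handling is a simplified stand-in for the
kernel's size parsing.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model of a parsed memory.reclaim request. */
struct model_reclaim_req {
	unsigned long long bytes;	/* requested amount in bytes */
	bool has_swappiness;		/* "swappiness=<val>" was given */
	bool zswap_writeback_only;	/* "zswap_writeback_only" was given */
};

/* Returns 0 on success, -EINVAL on malformed or conflicting input. */
static int model_parse_reclaim(const char *buf, struct model_reclaim_req *req)
{
	char copy[128], *tok, *end;

	memset(req, 0, sizeof(*req));
	if (strlen(buf) >= sizeof(copy))
		return -EINVAL;
	strcpy(copy, buf);

	/* First token: the size, with an optional K/M/G suffix. */
	tok = strtok(copy, " \n");
	if (!tok)
		return -EINVAL;
	req->bytes = strtoull(tok, &end, 10);
	if (end == tok)
		return -EINVAL;
	switch (*end) {
	case 'G':
		req->bytes <<= 10;
		/* fall through */
	case 'M':
		req->bytes <<= 10;
		/* fall through */
	case 'K':
		req->bytes <<= 10;
		end++;
		break;
	}
	if (*end)
		return -EINVAL;

	/* Remaining tokens: nested keys (the swappiness value is ignored here). */
	while ((tok = strtok(NULL, " \n")) != NULL) {
		if (!strncmp(tok, "swappiness=", 11))
			req->has_swappiness = true;
		else if (!strcmp(tok, "zswap_writeback_only"))
			req->zswap_writeback_only = true;
		else
			return -EINVAL;
	}

	/* Documented rule: the two keys cannot be combined. */
	if (req->has_swappiness && req->zswap_writeback_only)
		return -EINVAL;
	return 0;
}
```

In this model "100M zswap_writeback_only" parses to a 100 MiB
writeback-only request, while "100M swappiness=60 zswap_writeback_only"
fails with -EINVAL. A separate interface file would not need the
mutual-exclusion check at all.
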
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 73e64a635690..7bcbf788f634 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1679,6 +1679,144 @@ int zswap_load(struct folio *folio)
> return 0;
> }
>
> +/*
> + * Maximum LRU scan limit:
> + * number of entries to scan per page of remaining budget.
> + */
> +#define ZSWAP_PROACTIVE_WB_SCAN_RATIO 16UL
> +/*
> + * Batch size for proactive writeback:
> + * - As the per-memcg writeback target in the outer memcg loop.
> + * - As the per-walk budget passed to list_lru_walk_one().
> + */
> +#define ZSWAP_PROACTIVE_WB_BATCH 128UL
> +
> +/*
> + * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
> + * Returns the number of pages written back, or -ENOENT if @memcg is a
> + * zombie or has writeback disabled.
> + */
> +static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
> + unsigned long nr_to_write)
> +{
> + unsigned long nr_written = 0;
> + int nid;
> +
> + if (!mem_cgroup_zswap_writeback_enabled(memcg))
> + return -ENOENT;
> +
> + if (!mem_cgroup_online(memcg))
> + return -ENOENT;
> +
> + for_each_node_state(nid, N_NORMAL_MEMORY) {
> + bool encountered_page_in_swapcache = false;
> + unsigned long nr_to_scan, nr_scanned = 0;
> +
> + /*
> + * Cap by LRU length: bounds rewalks when referenced
> + * entries keep rotating to the tail.
> + */
> + nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
> + if (!nr_to_scan)
> + continue;
> +
> + /*
> + * Cap by SCAN_RATIO * remaining budget: bounds scan cost
> + * to the remaining writeback budget.
> + */
> + nr_to_scan = min(nr_to_scan,
> + (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
> +
> + while (nr_scanned < nr_to_scan) {
> + unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
> + nr_to_scan - nr_scanned);
> +
> + if (signal_pending(current))
> + return nr_written;
> +
> + /*
> + * Account for the committed budget rather than the walker's
> + * actual delta. If the list is emptied concurrently, the
> + * walker visits nothing and nr_scanned would never advance.
> + */
> + nr_scanned += nr_to_walk;
> +
> + nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> + &shrink_memcg_cb,
> + &encountered_page_in_swapcache,
> + &nr_to_walk);
> +
> + if (nr_written >= nr_to_write)
> + return nr_written;
> + if (encountered_page_in_swapcache)
> + break;
> +
> + cond_resched();
> + }
> + }
> +
> + return nr_written;
> +}
> +
> +int zswap_proactive_writeback(struct mem_cgroup *memcg,
> + unsigned long nr_to_writeback)
> +{
> + struct mem_cgroup *iter_memcg;
> + unsigned long nr_written = 0;
> + int failures = 0, attempts = 0;
> +
> + if (!memcg)
> + return -EINVAL;
> + if (!nr_to_writeback)
> + return 0;
> +
> + /*
> + * Writeback will be aborted with -EAGAIN if we encounter
> + * the following MAX_RECLAIM_RETRIES times:
> + * - No writeback-candidate memcgs found in a subtree walk.
> + * - A writeback-candidate memcg wrote back zero pages.
> + */
> + while (nr_written < nr_to_writeback) {
> + unsigned long batch_size;
> + long shrunk;
> +
> + if (signal_pending(current))
> + return -EINTR;
> +
> + iter_memcg = zswap_mem_cgroup_iter(memcg);
> +
> + if (!iter_memcg) {
> + /*
> + * Continue without incrementing failures if we found
> + * candidate memcgs in the last subtree walk.
> + */
> + if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
> + return -EAGAIN;
> + attempts = 0;
> + continue;
> + }
> +
> + batch_size = min(nr_to_writeback - nr_written,
> + ZSWAP_PROACTIVE_WB_BATCH);
> + shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
> + mem_cgroup_put(iter_memcg);
> +
> + /* Writeback-disabled or offline: skip without counting. */
> + if (shrunk == -ENOENT)
> + continue;
> +
> + ++attempts;
> + if (shrunk > 0)
> + nr_written += shrunk;
> + else if (++failures == MAX_RECLAIM_RETRIES)
> + return -EAGAIN;
> +
> + cond_resched();
> + }
> +
> + return 0;
> +}
> +

There is a lot of copy+paste from shrink_worker() and shrink_memcg()
here. We really should be able to reuse shrink_memcg().

Is the main difference that we are scanning in batches here? I think we
can have shrink_memcg() do that too. If anything, it might make the
shrinker more efficient. Over-reclaim is of course a concern, especially
in the zswap_store() path, where the overhead can be noticeable. Maybe we
can parameterize the batch size based on the code path.

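A rough illustration of the idea, as a userspace toy model rather than
kernel code: one shrink helper takes the batch size as a parameter, so a
proactive caller can pass a large batch and a latency-sensitive caller a
small one. The ring-buffer LRU, model_walk() and model_shrink() are all
hypothetical stand-ins (for the list_lru, for list_lru_walk_one() with
shrink_memcg_cb(), and for a shared shrink_memcg()). Second chance is
reduced to a single referenced bit, and the scan-ratio cap from the patch
is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy LRU: a ring buffer holding one referenced bit per entry. */
struct model_lru {
	bool *ref;	/* ref[i] is true if entry i was recently referenced */
	size_t cap;	/* capacity of the ring buffer */
	size_t head;	/* index of the entry that is visited next */
	size_t len;	/* number of entries currently on the LRU */
};

/*
 * Visit up to *nr_to_walk entries from the head. A referenced entry gets a
 * second chance: its bit is cleared and it is rotated to the tail. An
 * unreferenced entry is "written back" and removed. Returns the number of
 * entries written back and decrements *nr_to_walk per entry visited.
 */
static size_t model_walk(struct model_lru *lru, size_t *nr_to_walk)
{
	size_t written = 0;

	while (*nr_to_walk && lru->len) {
		bool referenced = lru->ref[lru->head];

		/* Pop the head entry. */
		lru->head = (lru->head + 1) % lru->cap;
		lru->len--;
		(*nr_to_walk)--;

		if (referenced) {
			/* Rotate to the tail with the bit cleared. */
			lru->ref[(lru->head + lru->len) % lru->cap] = false;
			lru->len++;
			continue;
		}
		written++;
	}
	return written;
}

/*
 * Shared shrink helper with a caller-chosen batch size. The scan is capped
 * at the LRU length on entry, so rotated entries are visited at most once
 * per call. A whole batch is walked before the target is rechecked, so the
 * result can exceed nr_to_write by up to batch - 1.
 */
static size_t model_shrink(struct model_lru *lru, size_t nr_to_write,
			   size_t batch)
{
	size_t nr_to_scan = lru->len;
	size_t nr_scanned = 0, nr_written = 0;

	if (!batch)
		return 0;

	while (nr_scanned < nr_to_scan && nr_written < nr_to_write) {
		size_t nr_to_walk = nr_to_scan - nr_scanned;

		if (nr_to_walk > batch)
			nr_to_walk = batch;

		/* Account for the committed budget, as the patch does. */
		nr_scanned += nr_to_walk;
		nr_written += model_walk(lru, &nr_to_walk);
	}
	return nr_written;
}
```

With eight unreferenced entries and a target of one page, a batch of 4
writes back 4 pages (over-reclaim bounded by the batch), while a batch of 1
writes back exactly 1. With every entry referenced, the first call only
rotates entries and writes nothing, which corresponds to the short-writeback
case described in the commit message.
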
Nhat, what do you think?