Linux-mm Archive on lore.kernel.org
From: Yosry Ahmed <yosry@kernel.org>
To: Hao Jia <jiahao.kernel@gmail.com>
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
	 shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
	nphamcs@gmail.com,  chengming.zhou@linux.dev,
	muchun.song@linux.dev, roman.gushchin@linux.dev,
	 cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,  linux-doc@vger.kernel.org,
	Hao Jia <jiahao1@lixiang.com>
Subject: Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
Date: Sat, 30 May 2026 01:37:18 +0000
Message-ID: <aho-Z6wshceTAYd9@google.com>
In-Reply-To: <20260526114601.67041-3-jiahao.kernel@gmail.com>

On Tue, May 26, 2026 at 07:45:59PM +0800, Hao Jia wrote:
> From: Hao Jia <jiahao1@lixiang.com>
> 
> Zswap currently writes back pages to backing swap reactively, triggered
> either by the shrinker or when the pool reaches its size limit. There is
> no mechanism to control the amount of writeback for a specific memory
> cgroup. However, users may want to proactively write back zswap pages,
> e.g., to free up memory for other applications or to prepare for
> memory-intensive workloads.
> 
> Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> interface. When specified, this key bypasses standard memory reclaim
> and exclusively performs proactive zswap writeback up to the requested
> budget. If omitted, the default reclaim behavior remains unchanged.
> 
> Example usage:
>   # Write back 100MB of pages from zswap to the backing swap
>   echo "100M zswap_writeback_only" > memory.reclaim
> 
> Note that the actual amount written back may be less than requested due
> to the zswap second-chance algorithm: referenced entries are rotated on
> the LRU on the first encounter and only written back on a second pass.
> If fewer bytes are written back than requested, -EAGAIN is returned,
> matching the existing memory.reclaim semantics.
> 
> Internally, extend user_proactive_reclaim() to parse the new
> "zswap_writeback_only" token and invoke the dedicated handler. Add
> zswap_proactive_writeback() to walk the target memcg subtree via the
> per-memcg writeback cursor, draining per-node zswap LRUs through
> list_lru_walk_one() with the shrink_memcg_cb() callback.
> 
> Suggested-by: Yosry Ahmed <yosry@kernel.org>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  18 +++-
>  Documentation/admin-guide/mm/zswap.rst  |  11 +-
>  include/linux/zswap.h                   |   7 ++
>  mm/vmscan.c                             |  14 +++
>  mm/zswap.c                              | 138 ++++++++++++++++++++++++
>  5 files changed, 185 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6efd0095ed99..6564abf0dec5 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
>  
>  The following nested keys are defined.
>  
> -	  ==========            ================================
> +	  ====================  ==================================================
>  	  swappiness            Swappiness value to reclaim with
> -	  ==========            ================================
> +	  zswap_writeback_only  Only perform proactive zswap writeback
> +	  ====================  ==================================================
>  
>  	Specifying a swappiness value instructs the kernel to perform
>  	the reclaim with that swappiness value. Note that this has the
> @@ -1437,6 +1438,19 @@ The following nested keys are defined.
>  	The valid range for swappiness is [0-200, max], setting
>  	swappiness=max exclusively reclaims anonymous memory.
>  
> +	The zswap_writeback_only key skips ordinary memory reclaim and
> +	writes back pages from zswap to the backing swap device until
> +	the requested amount has been written or no further candidates
> +	are found. This is useful to proactively offload cold pages from
> +	the zswap pool to the swap device. It is only available if
> +	zswap writeback is enabled. zswap_writeback_only cannot be combined
> +	with swappiness; specifying both returns -EINVAL.
> +
> +	Example::
> +
> +	  # Write back up to 100MB of pages from zswap to the backing swap
> +	  echo "100M zswap_writeback_only" > memory.reclaim


The memcg folks need to chime in about the interface here. An alternative
would be a separate interface (e.g. memory.zswap.do_writeback,
memory.zswap.writeback.reclaim, or something similar).

> diff --git a/mm/zswap.c b/mm/zswap.c
> index 73e64a635690..7bcbf788f634 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1679,6 +1679,144 @@ int zswap_load(struct folio *folio)
>  	return 0;
>  }
>  
> +/*
> + * Maximum LRU scan limit:
> + * number of entries to scan per page of remaining budget.
> + */
> +#define ZSWAP_PROACTIVE_WB_SCAN_RATIO	16UL
> +/*
> + * Batch size for proactive writeback:
> + * - As the per-memcg writeback target in the outer memcg loop.
> + * - As the per-walk budget passed to list_lru_walk_one().
> + */
> +#define ZSWAP_PROACTIVE_WB_BATCH	128UL
> +
> +/*
> + * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
> + * Returns the number of pages written back, or -ENOENT if @memcg is a
> + * zombie or has writeback disabled.
> + */
> +static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
> +					 unsigned long nr_to_write)
> +{
> +	unsigned long nr_written = 0;
> +	int nid;
> +
> +	if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +		return -ENOENT;
> +
> +	if (!mem_cgroup_online(memcg))
> +		return -ENOENT;
> +
> +	for_each_node_state(nid, N_NORMAL_MEMORY) {
> +		bool encountered_page_in_swapcache = false;
> +		unsigned long nr_to_scan, nr_scanned = 0;
> +
> +		/*
> +		 * Cap by LRU length: bounds rewalks when referenced
> +		 * entries keep rotating to the tail.
> +		 */
> +		nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
> +		if (!nr_to_scan)
> +			continue;
> +
> +		/*
> +		 * Cap by SCAN_RATIO * remaining budget: bounds scan cost
> +		 * to the remaining writeback budget.
> +		 */
> +		nr_to_scan = min(nr_to_scan,
> +				 (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
> +
> +		while (nr_scanned < nr_to_scan) {
> +			unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
> +						       nr_to_scan - nr_scanned);
> +
> +			if (signal_pending(current))
> +				return nr_written;
> +
> +			/*
> +			 * Account for the committed budget rather than the walker's
> +			 * actual delta. If the list is emptied concurrently, the
> +			 * walker visits nothing and nr_scanned would never advance.
> +			 */
> +			nr_scanned += nr_to_walk;
> +
> +			nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> +							&shrink_memcg_cb,
> +							&encountered_page_in_swapcache,
> +							&nr_to_walk);
> +
> +			if (nr_written >= nr_to_write)
> +				return nr_written;
> +			if (encountered_page_in_swapcache)
> +				break;
> +
> +			cond_resched();
> +		}
> +	}
> +
> +	return nr_written;
> +}
> +
> +int zswap_proactive_writeback(struct mem_cgroup *memcg,
> +			      unsigned long nr_to_writeback)
> +{
> +	struct mem_cgroup *iter_memcg;
> +	unsigned long nr_written = 0;
> +	int failures = 0, attempts = 0;
> +
> +	if (!memcg)
> +		return -EINVAL;
> +	if (!nr_to_writeback)
> +		return 0;
> +
> +	/*
> +	 * Writeback will be aborted with -EAGAIN if we encounter
> +	 * the following MAX_RECLAIM_RETRIES times:
> +	 * - No writeback-candidate memcgs found in a subtree walk.
> +	 * - A writeback-candidate memcg wrote back zero pages.
> +	 */
> +	while (nr_written < nr_to_writeback) {
> +		unsigned long batch_size;
> +		long shrunk;
> +
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		iter_memcg = zswap_mem_cgroup_iter(memcg);
> +
> +		if (!iter_memcg) {
> +			/*
> +			 * Continue without incrementing failures if we found
> +			 * candidate memcgs in the last subtree walk.
> +			 */
> +			if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
> +				return -EAGAIN;
> +			attempts = 0;
> +			continue;
> +		}
> +
> +		batch_size = min(nr_to_writeback - nr_written,
> +				 ZSWAP_PROACTIVE_WB_BATCH);
> +		shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
> +		mem_cgroup_put(iter_memcg);
> +
> +		/* Writeback-disabled or offline: skip without counting. */
> +		if (shrunk == -ENOENT)
> +			continue;
> +
> +		++attempts;
> +		if (shrunk > 0)
> +			nr_written += shrunk;
> +		else if (++failures == MAX_RECLAIM_RETRIES)
> +			return -EAGAIN;
> +
> +		cond_resched();
> +	}
> +
> +	return 0;
> +}
> +

There is a lot of copy+paste from shrink_worker() and shrink_memcg()
here. We really should be able to reuse shrink_memcg().

Is the main difference that we are scanning in batches here? I think we
can have shrink_memcg() do that too. If anything, it might make the
shrinker more efficient. Over-reclaim is of course a concern, especially
in the zswap_store() path, where the overhead can be noticeable. Maybe we
can parameterize the batch size based on the code path.
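
To make that concrete, here is a rough userspace model of what I mean. It
is not kernel code: the toy_* names, enum shrink_path, and the batch values
1 and 32 are made up for illustration; only 128 mirrors
ZSWAP_PROACTIVE_WB_BATCH from this patch. The point is that one shrink
routine can serve every caller if the batch size comes from the code path,
and that a larger batch can overshoot a small target.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: which caller is driving the shrink. */
enum shrink_path {
	SHRINK_PATH_STORE,	/* pool limit hit while storing: keep it cheap */
	SHRINK_PATH_SHRINKER,	/* background shrinker */
	SHRINK_PATH_PROACTIVE,	/* user-requested proactive writeback */
};

/* Hypothetical batch sizes; only 128 is taken from the patch. */
static unsigned long shrink_batch_size(enum shrink_path path)
{
	switch (path) {
	case SHRINK_PATH_STORE:
		return 1;
	case SHRINK_PATH_SHRINKER:
		return 32;
	case SHRINK_PATH_PROACTIVE:
		return 128;
	}
	return 1;
}

/* Toy LRU: entries[i] is true when entry i can be written back. */
struct toy_lru {
	const bool *entries;
	size_t len;
	size_t pos;	/* next entry to visit */
};

/* Visit up to nr_to_walk entries; return how many were written back. */
static unsigned long toy_lru_walk(struct toy_lru *lru, unsigned long nr_to_walk)
{
	unsigned long written = 0;
	unsigned long i;

	for (i = 0; i < nr_to_walk && lru->pos < lru->len; i++) {
		if (lru->entries[lru->pos++])
			written++;
	}
	return written;
}

/*
 * Single shrink routine shared by all paths. The target is only checked
 * between batches, so a batch larger than the target can over-reclaim.
 */
static unsigned long toy_shrink(struct toy_lru *lru, unsigned long nr_to_write,
				enum shrink_path path)
{
	unsigned long batch = shrink_batch_size(path);
	unsigned long nr_written = 0;

	while (nr_written < nr_to_write && lru->pos < lru->len)
		nr_written += toy_lru_walk(lru, batch);

	return nr_written;
}
```

In this model, asking for one page on the store path writes back exactly
one entry, while the same request on the proactive path walks a whole batch
and writes back everything writable in it. That is the over-reclaim
trade-off the batch size would have to be tuned for.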

Nhat, what do you think?



Thread overview: 12+ messages
2026-05-26 11:45 [PATCH v3 0/4] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
2026-05-26 11:45 ` [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg Hao Jia
2026-05-29 19:51   ` Nhat Pham
2026-05-30  1:24   ` Yosry Ahmed
2026-05-26 11:45 ` [PATCH v3 2/4] mm/zswap: Implement proactive writeback Hao Jia
2026-05-29 19:58   ` Nhat Pham
2026-05-30  1:40     ` Yosry Ahmed
2026-05-30  1:37   ` Yosry Ahmed [this message]
2026-05-26 11:46 ` [PATCH v3 3/4] mm/zswap: Add per-memcg stat for " Hao Jia
2026-05-29 20:01   ` Nhat Pham
2026-05-26 11:46 ` [PATCH v3 4/4] selftests/cgroup: Add tests for zswap " Hao Jia
2026-05-29 20:02   ` Nhat Pham
