From: Hao Jia <jiahao.kernel@gmail.com>
To: Yosry Ahmed <yosry@kernel.org>
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
nphamcs@gmail.com, chengming.zhou@linux.dev,
muchun.song@linux.dev, roman.gushchin@linux.dev,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, Hao Jia <jiahao1@lixiang.com>
Subject: Re: [PATCH v4 1/5] mm/zswap: Extend shrink_memcg() writeback capability
Date: Wed, 24 Jun 2026 19:58:59 +0800
Message-ID: <057ea303-4c27-1a6e-08de-cce26c699097@gmail.com>
In-Reply-To: <CAO9r8zMgaqP=n6rmhnMU+qhp1Www1Y5kdbLTLX1v=fj_ybHyiw@mail.gmail.com>

On 2026/6/24 02:17, Yosry Ahmed wrote:
>> My initial thought was that if cold memory is evenly distributed across
>> nodes and we are doing a large writeback, it would be better to balance
>> the zswap entry writeback across all nodes rather than just draining
>> node 0 first. However, since we currently lack a proper metric to
>> represent hot/cold memory (such as age-based tracking), doing this
>> probably doesn't make much sense right now.
>
> Yeah let's start simple and go from there.
>
>>
>> So, perhaps we want something like this? Please correct me if I'm wrong.
>>
>> static long shrink_memcg(struct mem_cgroup *memcg,
>> 			 unsigned long nr_to_scan)
>> {
>> 	struct zswap_shrink_walk_arg walk_arg = {
>> 		.bytes_written = 0,
>> 		.encountered_page_in_swapcache = false,
>> 	};
>> 	unsigned long nr_remaining = nr_to_scan;
>> 	bool memcg_list_is_empty = true;
>> 	int nid;
>>
>> 	if (!mem_cgroup_zswap_writeback_enabled(memcg))
>> 		return -ENOENT;
>>
>> 	if (memcg && !mem_cgroup_online(memcg))
>> 		return -ENOENT;
>>
>> 	for_each_node_state(nid, N_NORMAL_MEMORY) {
>> 		unsigned long nr_to_walk;
>>
>> 		/*
>> 		 * Cap the per-node scan by the current LRU length. A referenced
>> 		 * entry is only rotated to the tail (second chance) and may be
>> 		 * revisited within a single walk; without this cap those rotated
>> 		 * entries could drain the shared scan budget on one node.
>> 		 */
>
> The comment here is a bit misleading. It's not just about draining one
> node. One call to shrink_memcg() should only scan entries once. The
> caller can then choose to scan the memcg again, or scan a different
> one. In this case, the caller should iterate all memcgs first before
> retrying memcgs again and reclaiming rotated entries.

I have updated the comment. Please see below.

>
>> 		nr_to_walk = min(nr_remaining,
>> 				 list_lru_count_one(&zswap_list_lru, nid, memcg));
>> 		if (!nr_to_walk)
>> 			continue;
>> 		memcg_list_is_empty = false;
>>
>> 		nr_remaining -= nr_to_walk;
>> 		list_lru_walk_one(&zswap_list_lru, nid, memcg,
>> 				  &shrink_memcg_cb, &walk_arg, &nr_to_walk);
>> 		/* Return the unused share of the budget to the pool. */
>> 		nr_remaining += nr_to_walk;
>>
>> 		/* Bail out once the whole scan budget has been spent. */
>
> The comment is unnecessary.

I'll drop it, thanks.

>
>> 		if (!nr_remaining)
>> 			break;
>>
>> 		cond_resched();
>
> Did you observe a problem here or did you just add this due to an
> abundance of caution?

The cond_resched() here was added only out of caution. Since both callers
(shrink_worker() and zswap_proactive_writeback()) already have
rescheduling checks, I think we can remove it from here.

>
>> 	}
>>
>> 	if (memcg_list_is_empty)
>
> Do we need memcg_list_is_empty? Can we just check if nr_remaining
> matches nr_to_scan?
>
Indeed. I'll drop memcg_list_is_empty and compare nr_remaining against
nr_to_scan instead.

>> 		return -ENOENT;
>>
>> 	return walk_arg.bytes_written;
>> }
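
To make the second-chance point above concrete, here is a small userspace sketch (not kernel code). It assumes a toy circular queue in place of a per-node list_lru; `toy_entry`, `toy_lru`, `toy_lru_add()` and `toy_walk()` are hypothetical names used only for illustration. A referenced entry loses its bit and is rotated to the tail; an unreferenced one counts as written back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_LRU_MAX 16

/* Hypothetical stand-in for one zswap LRU entry. */
struct toy_entry {
	bool referenced;	/* second-chance bit */
	unsigned int visits;	/* how many times a walk looked at it */
};

/* Hypothetical circular queue standing in for one per-node LRU. */
struct toy_lru {
	struct toy_entry *slot[TOY_LRU_MAX];
	size_t head;
	size_t len;
};

/* Append an entry at the tail of the toy LRU. */
static void toy_lru_add(struct toy_lru *lru, struct toy_entry *e)
{
	assert(lru->len < TOY_LRU_MAX);
	lru->slot[(lru->head + lru->len) % TOY_LRU_MAX] = e;
	lru->len++;
}

/*
 * Visit at most @nr_to_walk entries starting from the head.
 * Referenced entries get a second chance: the bit is cleared and the
 * entry is rotated to the tail. Unreferenced entries are "written back"
 * (removed). Returns the number of entries written back.
 */
static size_t toy_walk(struct toy_lru *lru, size_t nr_to_walk)
{
	size_t written = 0;

	for (; nr_to_walk && lru->len; nr_to_walk--) {
		struct toy_entry *e = lru->slot[lru->head];

		lru->head = (lru->head + 1) % TOY_LRU_MAX;
		lru->len--;
		e->visits++;

		if (e->referenced) {
			e->referenced = false;
			toy_lru_add(lru, e);	/* rotate to the tail */
		} else {
			written++;
		}
	}
	return written;
}
```

If the walk is capped at the queue length sampled before it starts, every entry is visited exactly once and the rotated ones are left for a later call. With a larger budget the same call would reach the tail and visit the rotated entries a second time.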

/*
 * Scan up to @nr_to_scan pages across the per-node zswap LRUs of @memcg
 * and write back the reclaimable ones.
 *
 * Since the second-chance algorithm rotates referenced entries to the
 * LRU tail, the per-node scan is capped at the current LRU length so
 * each entry is scanned at most once per call. It is up to the caller
 * to handle retries, deciding whether to scan the next memcg to complete
 * the full iteration, or to rescan the current memcg to drain its zswap
 * entries.
 *
 * Return: The number of compressed bytes written back (>= 0), or -ENOENT
 * if @memcg has writeback disabled, is a zombie cgroup, or has empty
 * zswap LRUs.
 */
static long shrink_memcg(struct mem_cgroup *memcg, unsigned long nr_to_scan)
{
	struct zswap_shrink_walk_arg walk_arg = {
		.bytes_written = 0,
		.encountered_page_in_swapcache = false,
	};
	unsigned long nr_remaining = nr_to_scan;
	int nid;

	if (!mem_cgroup_zswap_writeback_enabled(memcg))
		return -ENOENT;

	/*
	 * Skip zombies because their LRUs are reparented and we would be
	 * reclaiming from the parent instead of the dead memcg.
	 */
	if (memcg && !mem_cgroup_online(memcg))
		return -ENOENT;

	for_each_node_state(nid, N_NORMAL_MEMORY) {
		unsigned long nr_to_walk;

		/*
		 * Cap the walk at the current LRU length to ensure each entry is
		 * scanned at most once per call. Referenced entries are rotated
		 * to the tail for a second chance, and this bound prevents them
		 * from being revisited within a single call. Retries are left to
		 * the caller, which can choose to rescan the current memcg or
		 * move on to the next one.
		 */
		nr_to_walk = min(nr_remaining,
				 list_lru_count_one(&zswap_list_lru, nid, memcg));
		if (!nr_to_walk)
			continue;

		nr_remaining -= nr_to_walk;
		list_lru_walk_one(&zswap_list_lru, nid, memcg, &shrink_memcg_cb,
				  &walk_arg, &nr_to_walk);
		/* Return the unused share of the budget to the pool. */
		nr_remaining += nr_to_walk;

		if (!nr_remaining)
			break;
	}

	/* Nothing was scanned: every LRU under @memcg was empty. */
	if (nr_remaining == nr_to_scan)
		return -ENOENT;

	return walk_arg.bytes_written;
}

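
For reference, the budget accounting in the loop above can be exercised in isolation with a userspace sketch. This is only an illustration under stated assumptions: `toy_node`, `toy_walk_node()` and `toy_shrink()` are hypothetical, the walk is modelled as "visit up to `max_visits` entries, then stop early", and the return value counts scanned entries rather than compressed bytes.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-node state. */
struct toy_node {
	unsigned long lru_len;		/* stands in for the LRU count */
	unsigned long max_visits;	/* entries visited before stopping early */
};

/*
 * Toy walk: consume up to min(*nr_to_walk, max_visits) entries and leave
 * the unused part of the request in *nr_to_walk, in the same way the
 * loop above reads nr_to_walk back after the walk.
 */
static unsigned long toy_walk_node(const struct toy_node *node,
				   unsigned long *nr_to_walk)
{
	unsigned long visited = *nr_to_walk;

	if (visited > node->max_visits)
		visited = node->max_visits;
	*nr_to_walk -= visited;
	return visited;
}

/*
 * Mirror of the accounting only: cap each node at its LRU length, charge
 * the request up front, refund what the walk did not use, and report
 * -ENOENT when the budget is untouched at the end.
 */
static long toy_shrink(const struct toy_node *nodes, size_t nr_nodes,
		       unsigned long nr_to_scan)
{
	unsigned long nr_remaining = nr_to_scan;
	long scanned = 0;
	size_t i;

	for (i = 0; i < nr_nodes; i++) {
		unsigned long nr_to_walk = nodes[i].lru_len;

		if (nr_to_walk > nr_remaining)
			nr_to_walk = nr_remaining;
		if (!nr_to_walk)
			continue;

		nr_remaining -= nr_to_walk;
		scanned += (long)toy_walk_node(&nodes[i], &nr_to_walk);
		/* Return the unused share of the budget to the pool. */
		nr_remaining += nr_to_walk;

		if (!nr_remaining)
			break;
	}

	if (nr_remaining == nr_to_scan)
		return -ENOENT;

	return scanned;
}
```

In this sketch, comparing nr_remaining with nr_to_scan gives the same answer as a separate "all lists empty" flag as long as every walk on a non-empty node consumes at least one unit of budget. A node whose walk visits nothing is indistinguishable from an empty one.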
Thanks,
Hao