From: Hao Jia <jiahao.kernel@gmail.com>
To: Yosry Ahmed <yosry@kernel.org>
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
	nphamcs@gmail.com, chengming.zhou@linux.dev,
	muchun.song@linux.dev, roman.gushchin@linux.dev,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, Hao Jia <jiahao1@lixiang.com>
Subject: Re: [PATCH v4 1/5] mm/zswap: Extend shrink_memcg() writeback capability
Date: Wed, 24 Jun 2026 19:58:59 +0800
Message-ID: <057ea303-4c27-1a6e-08de-cce26c699097@gmail.com>
In-Reply-To: <CAO9r8zMgaqP=n6rmhnMU+qhp1Www1Y5kdbLTLX1v=fj_ybHyiw@mail.gmail.com>



On 2026/6/24 02:17, Yosry Ahmed wrote:
>> My initial thought was that if cold memory is evenly distributed across
>> nodes and we are doing a large writeback, it would be better to balance
>> the zswap entry writeback across all nodes rather than just draining
>> node 0 first. However, since we currently lack a proper metric to
>> represent hot/cold memory (such as age-based tracking), doing this
>> probably doesn't make much sense right now.
> 
> Yeah let's start simple and go from there.
> 
>>
>> So, perhaps we want something like this? Please correct me if I'm wrong.
>>
>> static long shrink_memcg(struct mem_cgroup *memcg,
>>          unsigned long nr_to_scan)
>> {
>>     struct zswap_shrink_walk_arg walk_arg = {
>>       .bytes_written = 0,
>>       .encountered_page_in_swapcache = false,
>>     };
>>     unsigned long nr_remaining = nr_to_scan;
>>     bool memcg_list_is_empty = true;
>>     int nid;
>>
>>     if (!mem_cgroup_zswap_writeback_enabled(memcg))
>>       return -ENOENT;
>>
>>     if (memcg && !mem_cgroup_online(memcg))
>>       return -ENOENT;
>>
>>     for_each_node_state(nid, N_NORMAL_MEMORY) {
>>       unsigned long nr_to_walk;
>>
>>       /*
>>        * Cap the per-node scan by the current LRU length. A referenced
>>        * entry is only rotated to the tail (second chance) and may be
>>        * revisited within a single walk; without this cap those rotated
>>        * entries could drain the shared scan budget on one node.
>>        */
> 
> The comment here is a bit misleading. It's not just about draining one
> node. One call to shrink_memcg() should only scan entries once. The
> caller can then choose to scan the memcg again, or scan a different
> one. In this case, the caller should iterate all memcgs first before
> retrying memcgs again and reclaiming rotated entries.

I have updated the comment. Please see below.
> 
>>       nr_to_walk = min(nr_remaining,
>>            list_lru_count_one(&zswap_list_lru, nid, memcg));
>>       if (!nr_to_walk)
>>         continue;
>>       memcg_list_is_empty = false;
>>
>>       nr_remaining -= nr_to_walk;
>>       list_lru_walk_one(&zswap_list_lru, nid, memcg,
>>             &shrink_memcg_cb, &walk_arg, &nr_to_walk);
>>       /* Return the unused share of the budget to the pool. */
>>       nr_remaining += nr_to_walk;
>>
>>       /* Bail out once the whole scan budget has been spent. */
> 
> The comment is unnecessary.

I'll remove it, thanks.
> 
>>       if (!nr_remaining)
>>         break;
>>
>>       cond_resched();
> 
> Did you observe a problem here or did you just add this due to an
> abundance of caution?

The cond_resched() here was added purely out of caution. Since both
callers (shrink_worker() and zswap_proactive_writeback()) already have
rescheduling checks, I think we can remove it from here.
> 
>>     }
>>
>>     if (memcg_list_is_empty)
> 
> Do we need memcg_list_is_empty? Can we just check if nr_remaining
> matches nr_to_scan?
> 

Indeed. I've replaced it with a nr_remaining == nr_to_scan check.
>>       return -ENOENT;
>>
>>     return walk_arg.bytes_written;
>> }


/*
  * Scan up to @nr_to_scan pages across the per-node zswap LRUs of @memcg
  * and write back the reclaimable ones.
  *
  * Since the second-chance algorithm rotates referenced entries to the
  * LRU tail, the per-node scan is capped at the current LRU length so
  * each entry is scanned at most once per call. It is up to the caller
  * to handle retries, deciding whether to scan the next memcg to complete
  * the full iteration, or to rescan the current memcg to drain its zswap
  * entries.
  *
  * Return: The number of compressed bytes written back (>= 0), or -ENOENT
  * if @memcg has writeback disabled, is a zombie cgroup, or has empty
  * zswap LRUs.
  */
static long shrink_memcg(struct mem_cgroup *memcg, unsigned long nr_to_scan)
{
     struct zswap_shrink_walk_arg walk_arg = {
         .bytes_written = 0,
         .encountered_page_in_swapcache = false,
     };
     unsigned long nr_remaining = nr_to_scan;
     int nid;

     if (!mem_cgroup_zswap_writeback_enabled(memcg))
         return -ENOENT;

     /*
      * Skip zombies because their LRUs are reparented and we would be
      * reclaiming from the parent instead of the dead memcg.
      */
     if (memcg && !mem_cgroup_online(memcg))
         return -ENOENT;

     for_each_node_state(nid, N_NORMAL_MEMORY) {
         unsigned long nr_to_walk;

         /*
          * Cap the walk at the current LRU length to ensure each entry is
          * scanned at most once per call. Referenced entries are rotated
          * to the tail for a second chance, and this bound prevents them
          * from being revisited within a single call. Retries are left to
          * the caller, which can choose to rescan the current memcg or
          * move on to the next one.
          */
         nr_to_walk = min(nr_remaining,
                  list_lru_count_one(&zswap_list_lru, nid, memcg));
         if (!nr_to_walk)
             continue;

         nr_remaining -= nr_to_walk;
         list_lru_walk_one(&zswap_list_lru, nid, memcg, &shrink_memcg_cb,
                   &walk_arg, &nr_to_walk);
         /* Return the unused share of the budget to the pool. */
         nr_remaining += nr_to_walk;

         if (!nr_remaining)
             break;
     }

     /* Nothing was scanned: every LRU under @memcg was empty. */
     if (nr_remaining == nr_to_scan)
         return -ENOENT;

     return walk_arg.bytes_written;
}
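
For anyone following the budget accounting, here is a minimal userspace
sketch of the loop above. It is not kernel code: struct fake_lru,
fake_walk() and scan_budget() are hypothetical stand-ins, and fake_walk()
only assumes that list_lru_walk_one() decrements *nr_to_walk once per
visited entry and leaves the unused remainder in place. The sketch returns
the number of entries scanned rather than bytes written, and -1 stands in
for -ENOENT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one per-node LRU list. */
struct fake_lru {
	unsigned long len;        /* entries currently on the list */
	unsigned long stop_after; /* walk gives up after this many visits */
};

/*
 * Assumed walk contract: consume one unit of *nr_to_walk per visited
 * entry and leave whatever was not used in *nr_to_walk.
 */
static void fake_walk(const struct fake_lru *lru, unsigned long *nr_to_walk)
{
	unsigned long visited = *nr_to_walk;

	if (visited > lru->stop_after)
		visited = lru->stop_after;
	*nr_to_walk -= visited;
}

/* Returns the number of entries scanned, or -1 if nothing was scanned. */
static long scan_budget(const struct fake_lru *lrus, size_t nr_nodes,
			unsigned long nr_to_scan)
{
	unsigned long nr_remaining = nr_to_scan;
	size_t nid;

	for (nid = 0; nid < nr_nodes; nid++) {
		/* Cap the walk at the list length and at the budget left. */
		unsigned long nr_to_walk = lrus[nid].len;

		if (nr_to_walk > nr_remaining)
			nr_to_walk = nr_remaining;
		if (!nr_to_walk)
			continue;

		/* Reserve the share up front, then hand back the unused part. */
		nr_remaining -= nr_to_walk;
		fake_walk(&lrus[nid], &nr_to_walk);
		nr_remaining += nr_to_walk;
		assert(nr_remaining <= nr_to_scan);

		if (!nr_remaining)
			break;
	}

	if (nr_remaining == nr_to_scan)
		return -1;

	return (long)(nr_to_scan - nr_remaining);
}
```

With lists of length 3, 0 and 5 and a budget of 6, the sketch scans 3
entries on the first node, skips the empty one, and spends the remaining 3
on the last node.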


Thanks,
Hao
