Linux Documentation
 help / color / mirror / Atom feed
From: Yosry Ahmed <yosry@kernel.org>
To: Hao Jia <jiahao.kernel@gmail.com>
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
	 shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
	nphamcs@gmail.com,  chengming.zhou@linux.dev,
	muchun.song@linux.dev, roman.gushchin@linux.dev,
	 cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,  linux-doc@vger.kernel.org,
	Hao Jia <jiahao1@lixiang.com>
Subject: Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg
Date: Tue, 2 Jun 2026 00:31:42 +0000	[thread overview]
Message-ID: <ah4ZZGl7GYJf54Wz@google.com> (raw)
In-Reply-To: <8c0e60e1-5713-69f0-a687-088c87e75764@gmail.com>

On Mon, Jun 01, 2026 at 07:07:45PM +0800, Hao Jia wrote:
> 
> 
> On 2026/5/30 09:24, Yosry Ahmed wrote:
> > On Tue, May 26, 2026 at 07:45:58PM +0800, Hao Jia wrote:
> > > From: Hao Jia <jiahao1@lixiang.com>
> > > 
> > > The zswap background writeback worker shrink_worker() uses a global
> > > cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
> > > across the online memcgs under root_mem_cgroup.
> > > 
> > > Proactive writeback also wants a similar per-memcg cursor that is
> > > scoped to the specified memcg, so that repeated invocations against
> > > the same memcg make forward progress across its descendant memcgs
> > > instead of restarting from the first child memcg each time.
> > 
> > Is this a problem in practice?
> > 
> > Is the concern the overhead of scanning memcgs repeatedly, or lack of
> > fairness? I wonder if we should just do writeback in batches from all
> > memcgs, similar to how reclaim does it, then evaluate at the end if we
> > need to start over?
> > 
> 
> Not using a per-cgroup cursor will cause issues for "repeated small-budget
> calls" cases. For example, repeatedly triggering a 2MB writeback might
> result in only writing back pages from the first few child memcgs every
> time. In the worst-case scenario (where the writeback amount is less than
> WB_BATCH), it might only ever write back from the first child memcg.

Right, so a fairness concern?

I wonder if we should just reclaim a batch from each memcg, then check
if we reached the goal, otherwise start over. If the batch size is small
enough that should work?

> 
> Similar to how memory reclaim uses mem_cgroup_iter() (via struct
> mem_cgroup_reclaim_iter) and the old shrink_worker() used zswap_next_shrink,
> we need a shared cursor here.

Right, I understand that in theory we need a cursor. I am just wondering
if the complexity is justified in practice. Reclaim is a much larger
beast than zswap writeback. I wonder if we can just get away with
scanning a batch from each child memcg -- for per-memcg reclaim, not
global.

We can always improve it later with a cursor if there's an actual need.

> 
> 
> > > 
> > > Naturally, group the cursor and its protecting spinlock into a
> > > zswap_wb_iter struct, and make it a member of struct mem_cgroup to
> > > realize per-memcg cursor management. Accordingly, shrink_worker() now
> > > uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.
> > 
> > If we really need to have per-memcg cursors (I am not a big fan), I
> > think we can minimize the overhead by making the cursor updates use
> > atomic cmpxchg instead of having a per-memcg lock.
> > 
> 
> Because mem_cgroup_iter() always calls css_put(&prev->css), we cannot simply
> update zswap_wb_iter.pos via cmpxchg() after calling it. Doing so could lead
> to a double css_put() issue on prev->css.
> 
> Therefore, if we switch to the cmpxchg() approach, we wouldn't be able to
> reuse the existing mem_cgroup_iter() logic. We would have to write a new
> function similar to cgroup_iter(), and its implementation might end up
> looking a bit obscure/complex.

What if we do something like this (for the global cursor):

	do {
		memcg = xchg(zswap_next_shrink, NULL);
		memcg = mem_cgroup_iter(NULL, memcg, NULL);
		/* If the cursor was advanced from under us, try again */
		if (!try_cmpxchg(zswap_next_shrink, NULL, memcg))
			continue;
	} while (..);
			

There is a window where a racing shrinker will see the cursor as NULL
and start over, but that should be fine. We can generalize this for the
per-memcg cursor.

That being said..

> 
> Currently, this lock is only used in shrink_memcg(), proactive writeback,
> and mem_cgroup_css_offline(). Note that shrink_memcg() only acquires the
> lock of the root cgroup, and mem_cgroup_css_offline() is unlikely to be a
> hot path.

..this made me realize it's probably fine to just use a global lock for
now?

IIUC the only additional contention to the existing lock will be from
userspace proactive writeback, and that shouldn't be a big deal
especially with the critical section being short?

> 
> So, should we keep the spin_lock or go with the cmpxchg() approach?
> Yosry and Nhat, what are your thoughts on this?

I think we should experiment with the global lock first. See if you
observe any regressions with workloads that put a lot of pressure on the
lock (a lot of threads in reclaim doing writeback + a few userspace
threads doing proactive writeback). See if the userspace threads
actually cause a meaningful regression.

> 
> 
> 
> > > 
> > > Because the cursor is now per-memcg, the offline cleanup must visit
> > > every ancestor that could be holding a reference to the dying memcg.
> > > Factor out __zswap_memcg_offline_cleanup() and walk from dead_memcg up
> > > to the root.
> > 
> > Another reason why I don't like per-memcg cursors. There is too much
> > complexity and I wonder if it's warranted. If we stick with per-memcg
> > cursors please do the refactoring in separate patches to make the
> > patches easier to review.
> 
> 
> Sorry about that. I will try to keep each patch as simple as possible in the
> next version.

No worries, thanks!

> 
> 
> Thanks,
> Hao
> 
> 

  parent reply	other threads:[~2026-06-02  0:31 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-26 11:45 [PATCH v3 0/4] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
2026-05-26 11:45 ` [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg Hao Jia
2026-05-29 19:51   ` Nhat Pham
2026-05-30  1:24   ` Yosry Ahmed
2026-06-01 11:07     ` Hao Jia
2026-06-01 16:44       ` Nhat Pham
2026-06-01 16:47         ` Nhat Pham
2026-06-01 17:08       ` Nhat Pham
2026-06-02  0:31       ` Yosry Ahmed [this message]
2026-05-26 11:45 ` [PATCH v3 2/4] mm/zswap: Implement proactive writeback Hao Jia
2026-05-29 19:58   ` Nhat Pham
2026-05-30  1:40     ` Yosry Ahmed
2026-05-30  1:37   ` Yosry Ahmed
2026-05-26 11:46 ` [PATCH v3 3/4] mm/zswap: Add per-memcg stat for " Hao Jia
2026-05-29 20:01   ` Nhat Pham
2026-05-26 11:46 ` [PATCH v3 4/4] selftests/cgroup: Add tests for zswap " Hao Jia
2026-05-29 20:02   ` Nhat Pham

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ah4ZZGl7GYJf54Wz@google.com \
    --to=yosry@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=chengming.zhou@linux.dev \
    --cc=hannes@cmpxchg.org \
    --cc=jiahao.kernel@gmail.com \
    --cc=jiahao1@lixiang.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mkoutny@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nphamcs@gmail.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox