From: Hao Jia <jiahao.kernel@gmail.com>
To: Yosry Ahmed <yosry@kernel.org>
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
nphamcs@gmail.com, chengming.zhou@linux.dev,
muchun.song@linux.dev, roman.gushchin@linux.dev,
cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
Hao Jia <jiahao1@lixiang.com>
Subject: Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg
Date: Tue, 2 Jun 2026 19:33:28 +0800 [thread overview]
Message-ID: <ff344c9f-51da-8b3a-e7a9-c4a7f4702ef8@gmail.com> (raw)
In-Reply-To: <ah4ZZGl7GYJf54Wz@google.com>
On 2026/6/2 08:31, Yosry Ahmed wrote:
> On Mon, Jun 01, 2026 at 07:07:45PM +0800, Hao Jia wrote:
>>
>>
>> On 2026/5/30 09:24, Yosry Ahmed wrote:
>>> On Tue, May 26, 2026 at 07:45:58PM +0800, Hao Jia wrote:
>>>> From: Hao Jia <jiahao1@lixiang.com>
>>>>
>>>> The zswap background writeback worker shrink_worker() uses a global
>>>> cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
>>>> across the online memcgs under root_mem_cgroup.
>>>>
>>>> Proactive writeback also wants a similar per-memcg cursor that is
>>>> scoped to the specified memcg, so that repeated invocations against
>>>> the same memcg make forward progress across its descendant memcgs
>>>> instead of restarting from the first child memcg each time.
>>>
>>> Is this a problem in practice?
>>>
>>> Is the concern the overhead of scanning memcgs repeatedly, or lack of
>>> fairness? I wonder if we should just do writeback in batches from all
>>> memcgs, similar to how reclaim does it, then evaluate at the end if we
>>> need to start over?
>>>
>>
>> Not using a per-cgroup cursor will cause issues for "repeated small-budget
>> calls" cases. For example, repeatedly triggering a 2MB writeback might
>> result in only writing back pages from the first few child memcgs every
>> time. In the worst-case scenario (where the writeback amount is less than
>> WB_BATCH), it might only ever write back from the first child memcg.
>
> Right, so a fairness concern?
>
> I wonder if we should just reclaim a batch from each memcg, then check
> if we reached the goal, otherwise start over. If the batch size is small
> enough that should work?
Even with a small batch size, for small writeback requests triggered by
user-space (e.g., 2MB, which is batch size * N), it might still
repeatedly write back from only the first N child memcgs. This could
cause the user-space agent to prematurely give up on zswap writeback.
>
>>
>> Similar to how memory reclaim uses mem_cgroup_iter() (via struct
>> mem_cgroup_reclaim_iter) and the old shrink_worker() used zswap_next_shrink,
>> we need a shared cursor here.
>
> Right, I understand that in theory we need a cursor. I am just wondering
> if the complexity is justified in practice. Reclaim is a much larger
> beast than zswap writeback. I wonder if we can just get away with
> scanning a batch from each child memcg -- for per-memcg reclaim, not
> global.
>
> We can always improve it later with a cursor if there's an actual need.
>
>>
>>
>>>>
>>>> Naturally, group the cursor and its protecting spinlock into a
>>>> zswap_wb_iter struct, and make it a member of struct mem_cgroup to
>>>> realize per-memcg cursor management. Accordingly, shrink_worker() now
>>>> uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.
>>>
>>> If we really need to have per-memcg cursors (I am not a big fan), I
>>> think we can minimize the overhead by making the cursor updates use
>>> atomic cmpxchg instead of having a per-memcg lock.
>>>
>>
>> Because mem_cgroup_iter() always calls css_put(&prev->css), we cannot simply
>> update zswap_wb_iter.pos via cmpxchg() after calling it. Doing so could lead
>> to a double css_put() issue on prev->css.
>>
>> Therefore, if we switch to the cmpxchg() approach, we wouldn't be able to
>> reuse the existing mem_cgroup_iter() logic. We would have to write a new
>> function similar to cgroup_iter(), and its implementation might end up
>> looking a bit obscure/complex.
>
> What if we do something like this (for the global cursor):
>
> do {
> memcg = xchg(zswap_next_shrink, NULL);
> memcg = mem_cgroup_iter(NULL, memcg, NULL);
> /* If the cursor was advanced from under us, try again */
> if (!try_cmpxchg(zswap_next_shrink, NULL, memcg))
> continue;
> } while (..);
>
>
Regarding the code above, IIRC, both the global and per-cgroup cursors
suffer from race conditions. This race can cause mem_cgroup_iter(NULL,
NULL, NULL) to return the root memcg or its descendants, leading zswap
to write back pages from the wrong memcg.
Additionally, since mem_cgroup_iter() puts the prev memcg ref and gets
the next memcg ref, a try_cmpxchg() failure on CPU1 might also lead to a
ref leak for memcg1.
CPU1 CPU2
memcg1 = xchg(pos, NULL)
memcg2 = xchg(pos, NULL) memcg2 = NULL;
memcg1 = mem_cgroup_iter()
mem_cgroup_iter(NULL, **NULL**, NULL) error memcg
try_cmpxchg(pos,NULL,memcg2) succeed
try_cmpxchg(pos,NULL,memcg1) **fail**
I took a stab at implementing a cmpxchg()-based zswap_mem_cgroup_iter()
modeled after mem_cgroup_iter(), and it actually doesn't look that
complex after all :)
Of course, as Nhat mentioned, we definitely need to add plenty of
comments for this function.
static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
{
struct cgroup_subsys_state *css;
struct mem_cgroup *pos, *next;
if (mem_cgroup_disabled())
return NULL;
if (!root)
root = root_mem_cgroup;
rcu_read_lock();
restart:
pos = READ_ONCE(root->zswap_wb_iter.pos);
css = pos ? &pos->css : NULL;
next = NULL;
while ((css = css_next_descendant_pre(css, &root->css))) {
if (css_tryget_online(css))
break;
}
next = css ? mem_cgroup_from_css(css) : NULL;
if (cmpxchg(&root->zswap_wb_iter.pos, pos, next) != pos) {
if (next)
css_put(&next->css);
goto restart;
}
rcu_read_unlock();
return next;
}
> There is a window where a racing shrinker will see the cursor as NULL
> and start over, but that should be fine. We can generalize this for the
> per-memcg cursor.
>
> That being said..
>
>>
>> Currently, this lock is only used in shrink_memcg(), proactive writeback,
>> and mem_cgroup_css_offline(). Note that shrink_memcg() only acquires the
>> lock of the root cgroup, and mem_cgroup_css_offline() is unlikely to be a
>> hot path.
>
> ..this made me realize it's probably fine to just use a global lock for
> now?
>
> IIUC the only additional contention to the existing lock will be from
> userspace proactive writeback, and that shouldn't be a big deal
> especially with the critical section being short?
>
In the current patch implementation, this lock protects the cgroup's own
cursor variable. During each writeback, we only acquire the spin_lock of
the target cgroup itself; we do not attempt to **spin on any child
cgroup's lock while iterating through the descendants**.
Specifically:
- shrink_memcg() will only attempt to acquire the root cgroup's lock
throughout the entire process.
- Proactive writeback will only acquire the lock of the target cgroup
**itself**.
- Only mem_cgroup_css_offline() might attempt to hold locks of other
cgroups, but normally, this shouldn't be a hot path.
Therefore, even if proactive writebacks are triggered concurrently on a
parent cgroup and its child cgroup, there will be **no** lock contention
at all (specifically referring to zswap_wb_iter.lock).
Lock contention would only occur if user-space **concurrently** triggers
proactive writeback on the exact **same** cgroup. And IIRC, in such a
scenario, the bottleneck is more likely to be on other locks anyway.
>>
>> So, should we keep the spin_lock or go with the cmpxchg() approach?
>> Yosry and Nhat, what are your thoughts on this?
>
> I think we should experiment with the global lock first. See if you
> observe any regressions with workloads that put a lot of pressure on the
> lock (a lot of threads in reclaim doing writeback + a few userspace
> threads doing proactive writeback). See if the userspace threads
> actually cause a meaningful regression.
Sorry, it seems there are some implementation issues with the global
lock approach.
In practice, our user-space agent mostly operates in the following two
scenarios:
- Triggering proactive writeback on the same cgroup at different times
(sequentially).
- Triggering proactive writeback on different cgroups at the same time
(concurrently).
In both cases, there is no lock contention. So, the current lock works
perfectly fine for us.
However, if we really hate zswap_wb_iter.lock, I can try replacing it
with the cmpxchg() approach.
Thanks,
Hao
next prev parent reply other threads:[~2026-06-02 11:33 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-26 11:45 [PATCH v3 0/4] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
2026-05-26 11:45 ` [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg Hao Jia
2026-05-29 19:51 ` Nhat Pham
2026-05-30 1:24 ` Yosry Ahmed
2026-06-01 11:07 ` Hao Jia
2026-06-01 16:44 ` Nhat Pham
2026-06-01 16:47 ` Nhat Pham
2026-06-01 17:08 ` Nhat Pham
2026-06-02 11:32 ` Hao Jia
2026-06-02 0:31 ` Yosry Ahmed
2026-06-02 11:33 ` Hao Jia [this message]
2026-06-02 23:19 ` Yosry Ahmed
2026-06-03 3:02 ` Hao Jia
2026-06-03 17:53 ` Yosry Ahmed
2026-05-26 11:45 ` [PATCH v3 2/4] mm/zswap: Implement proactive writeback Hao Jia
2026-05-29 19:58 ` Nhat Pham
2026-05-30 1:40 ` Yosry Ahmed
2026-06-03 11:22 ` Hao Jia
2026-06-03 17:58 ` Yosry Ahmed
2026-06-03 18:14 ` Nhat Pham
2026-05-30 1:37 ` Yosry Ahmed
2026-06-03 11:27 ` Hao Jia
2026-06-03 17:55 ` Yosry Ahmed
2026-06-03 18:23 ` Nhat Pham
2026-06-03 18:26 ` Yosry Ahmed
2026-06-03 18:34 ` Nhat Pham
2026-06-03 18:43 ` Yosry Ahmed
2026-06-03 18:51 ` Nhat Pham
2026-06-03 18:54 ` Yosry Ahmed
2026-05-26 11:46 ` [PATCH v3 3/4] mm/zswap: Add per-memcg stat for " Hao Jia
2026-05-29 20:01 ` Nhat Pham
2026-06-03 11:29 ` Hao Jia
2026-05-26 11:46 ` [PATCH v3 4/4] selftests/cgroup: Add tests for zswap " Hao Jia
2026-05-29 20:02 ` Nhat Pham
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ff344c9f-51da-8b3a-e7a9-c4a7f4702ef8@gmail.com \
--to=jiahao.kernel@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=cgroups@vger.kernel.org \
--cc=chengming.zhou@linux.dev \
--cc=hannes@cmpxchg.org \
--cc=jiahao1@lixiang.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=mkoutny@suse.com \
--cc=muchun.song@linux.dev \
--cc=nphamcs@gmail.com \
--cc=roman.gushchin@linux.dev \
--cc=shakeel.butt@linux.dev \
--cc=tj@kernel.org \
--cc=yosry@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox