From: Johannes Weiner <hannes@cmpxchg.org>
To: Yunzhao Li <yunzhao@cloudflare.com>
Cc: linux-mm@kvack.org, nphamcs@gmail.com, yosryahmed@google.com,
shakeel.butt@linux.dev, hawk@kernel.org,
akpm@linux-foundation.org, zhouchengming@bytedance.com
Subject: Re: [PATCH] mm/zswap: use ratelimited stats flush in zswap_shrinker_count()
Date: Thu, 2 Jul 2026 16:16:05 -0400
Message-ID: <akbHBZP9HxOJMtQB@cmpxchg.org>
In-Reply-To: <20260702180908.150136-1-yunzhao@cloudflare.com>

On Thu, Jul 02, 2026 at 11:07:35AM -0700, Yunzhao Li wrote:
> zswap_shrinker_count() calls mem_cgroup_flush_stats(), which takes the
> global cgroup rstat lock synchronously. On machines with many CPUs and
> NUMA nodes, this creates severe lock contention in the kswapd reclaim
> path:
>
> - Multiple kswapd threads (one per NUMA node) run concurrently.
> - do_shrink_slab() invokes zswap_shrinker_count() for each
>   memcg-aware shrinker pass.
> - Each call flushes the full cgroup rstat hierarchy under the global
>   lock.
>
> On AMD EPYC 9684X machines (96 cores, 192 threads, 12 NUMA nodes)
> running production workloads with zswap enabled, perf shows 2.88% of
> kernel cycles in osq_lock contention from this path:
>
>   2.88%  [k] osq_lock
>     --__mutex_lock.constprop.0
>       --__cgroup_rstat_lock
>         --cgroup_rstat_flush_locked
>           --cgroup_rstat_flush
>             --zswap_shrinker_count
>               do_shrink_slab
>               shrink_slab
>               shrink_node
>               balance_pgdat
>               kswapd
>
> 84% of kswapd kernel cycles are spent in
> shrink_slab -> zswap_shrinker_count -> cgroup_rstat_flush, not in actual
> page reclaim (shrink_lruvec).
>
> Controlled A/B on identical hardware and workload:
>
> shrinker=Y: 2.88% osq_lock, memory PSI 1.58%
> shrinker=N: 0.00% osq_lock, memory PSI 0.57%
>
> eBPF-based rstat lock wait measurement across 8 production metals
> confirms the contention splits cleanly along shrinker enablement:
>
> shrinker=Y: 50-250x more contended lock acquisitions (248/s vs 1.1/s)
> shrinker=N: baseline lock wait (0.0017 s/s vs 1.04 s/s)
>
> zswap_shrinker_count() only produces a heuristic estimate, scaled by
> compression ratio via mult_frac(). The actual writeback happens in
> zswap_shrinker_scan(). Slightly stale stats are acceptable here.
>
> Switch to mem_cgroup_flush_stats_ratelimited(), which only flushes if
> the periodic 2-second flusher is one full cycle late. This matches the
> approach already used in prepare_scan_control() (mm/vmscan.c) for the
> same reclaim path.
>
> After applying this patch, rstat flush latency and lock wait time on
> shrinker=Y machines dropped to the same level as shrinker=N controls,
> while the zswap shrinker continues to function (pool size remains
> bounded under the max_pool_percent cap).
>
> Previously discussed:
> - Chengming Zhou (Dec 2023): rstat contention from
>   zswap_shrinker_count [1]
> - Shakeel Butt (Aug 2024): zswap_shrinker_count still uses sync
>   flush [2]
> - Yosry Ahmed (Aug 2024): suggested eliminating in-kernel
>   flushers [3]
> - Jesper Dangaard Brouer (Sep 2024): cgroup/rstat V11 patch [4]
>
> [1] https://lore.kernel.org/linux-mm/20231206103935.3440502-1-zhouchengming@bytedance.com/
> [2] https://lore.kernel.org/linux-mm/CALvZod7LFxLCxVpOFH8b2Ppm8T40HPGMKQwX_=NPCWB_mFW+oQ@mail.gmail.com/
> [3] https://lore.kernel.org/linux-mm/CAJD7tkYvFyOSX+rP_FKGBhxvZiCDxtpsNp-c5CGOA-4Bq9oXSg@mail.gmail.com/
> [4] https://lore.kernel.org/linux-mm/172616070094.2055617.17676042522679701515.stgit@firesoul/
>
> Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Yunzhao Li <yunzhao@cloudflare.com>
> Tested-by: Yunzhao Li <yunzhao@cloudflare.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

A lot can happen in 2s, but I agree doing this every time is
silly. vmscan has been good with 2s for a while, so this should be
fine as well. We can re-evaluate if we run into weird behavior.
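
For readers following along, the ratelimiting discussed above can be
modelled in a few lines of userspace C. This is only a sketch, not the
kernel implementation: the struct, the function names and the
millisecond clock are hypothetical, and it assumes that "one full cycle
late" means more than two 2-second periods have passed since the last
flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a ratelimited stats flush. */
#define FLUSH_PERIOD_MS 2000u	/* the 2-second periodic flush interval */

struct stats_flusher {
	uint64_t last_flush_ms;	/* time of the most recent flush */
	unsigned long flushes;	/* number of expensive flushes performed */
};

/* Stand-in for the synchronous flush that takes the global lock. */
static void flush_stats(struct stats_flusher *f, uint64_t now_ms)
{
	f->last_flush_ms = now_ms;
	f->flushes++;
}

/*
 * Flush only if the periodic flusher is a full cycle late, i.e. more
 * than two periods have elapsed since the last flush (assumption, see
 * above). Returns true if a flush was performed.
 */
static bool flush_stats_ratelimited(struct stats_flusher *f, uint64_t now_ms)
{
	assert(now_ms >= f->last_flush_ms);	/* model uses a monotonic clock */

	if (now_ms - f->last_flush_ms > 2 * FLUSH_PERIOD_MS) {
		flush_stats(f, now_ms);
		return true;
	}
	return false;
}
```

In this model a caller on a hot path, such as a shrinker count
callback, pays for the flush at most once per late cycle and otherwise
reads whatever the periodic flusher last published.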