Linux-mm Archive on lore.kernel.org
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yunzhao Li <yunzhao@cloudflare.com>
Cc: linux-mm@kvack.org, nphamcs@gmail.com, yosryahmed@google.com,
	shakeel.butt@linux.dev, hawk@kernel.org,
	akpm@linux-foundation.org, zhouchengming@bytedance.com
Subject: Re: [PATCH] mm/zswap: use ratelimited stats flush in zswap_shrinker_count()
Date: Thu, 2 Jul 2026 16:16:05 -0400
Message-ID: <akbHBZP9HxOJMtQB@cmpxchg.org>
In-Reply-To: <20260702180908.150136-1-yunzhao@cloudflare.com>

On Thu, Jul 02, 2026 at 11:07:35AM -0700, Yunzhao Li wrote:
> zswap_shrinker_count() calls mem_cgroup_flush_stats(), which takes the
> global cgroup rstat lock synchronously. On machines with many CPUs and
> NUMA nodes, this creates severe lock contention in the kswapd reclaim
> path:
> 
>   - Multiple kswapd threads (one per NUMA node) run concurrently.
>   - do_shrink_slab() invokes zswap_shrinker_count() for each
>     memcg-aware shrinker pass.
>   - Each call flushes the full cgroup rstat hierarchy under the global
>     lock.
> 
> On AMD EPYC 9684X machines (96 cores, 192 threads, 12 NUMA nodes)
> running production workloads with zswap enabled, perf shows 2.88% of
> kernel cycles in osq_lock contention from this path:
> 
>      2.88%  [k] osq_lock
>               --__mutex_lock.constprop.0
>                   --__cgroup_rstat_lock
>                       --cgroup_rstat_flush_locked
>                           --cgroup_rstat_flush
>                               --zswap_shrinker_count
>                                   do_shrink_slab
>                                   shrink_slab
>                                   shrink_node
>                                   balance_pgdat
>                                   kswapd
> 
> 84% of kswapd kernel cycles are spent in
> shrink_slab -> zswap_shrinker_count -> cgroup_rstat_flush, not in actual
> page reclaim (shrink_lruvec).
> 
> Controlled A/B on identical hardware and workload:
> 
>   shrinker=Y: 2.88% osq_lock, memory PSI 1.58%
>   shrinker=N: 0.00% osq_lock, memory PSI 0.57%
> 
> eBPF-based rstat lock wait measurement across 8 production metals
> confirms the contention splits cleanly along shrinker enablement:
> 
>   shrinker=Y: 50-250x more contended lock acquisitions (248/s vs 1.1/s)
>   shrinker=N: baseline lock wait (0.0017 s/s vs 1.04 s/s)
> 
> zswap_shrinker_count() only produces a heuristic estimate, scaled by
> compression ratio via mult_frac(). The actual writeback happens in
> zswap_shrinker_scan(). Slightly stale stats are acceptable here.
> 
> Switch to mem_cgroup_flush_stats_ratelimited(), which only flushes if
> the periodic 2-second flusher is one full cycle late. This matches the
> approach already used in prepare_scan_control() (mm/vmscan.c) for the
> same reclaim path.
> 
> After applying this patch, rstat flush latency and lock wait time on
> shrinker=Y machines dropped to the same level as shrinker=N controls,
> while the zswap shrinker continues to function (pool size remains
> bounded under the max_pool_percent cap).
> 
> Previously discussed:
>   - Chengming Zhou (Dec 2023): rstat contention from
>     zswap_shrinker_count [1]
>   - Shakeel Butt (Aug 2024): zswap_shrinker_count still uses sync
>     flush [2]
>   - Yosry Ahmed (Aug 2024): suggested eliminating in-kernel
>     flushers [3]
>   - Jesper Dangaard Brouer (Sep 2024): cgroup/rstat V11 patch [4]
> 
> [1] https://lore.kernel.org/linux-mm/20231206103935.3440502-1-zhouchengming@bytedance.com/
> [2] https://lore.kernel.org/linux-mm/CALvZod7LFxLCxVpOFH8b2Ppm8T40HPGMKQwX_=NPCWB_mFW+oQ@mail.gmail.com/
> [3] https://lore.kernel.org/linux-mm/CAJD7tkYvFyOSX+rP_FKGBhxvZiCDxtpsNp-c5CGOA-4Bq9oXSg@mail.gmail.com/
> [4] https://lore.kernel.org/linux-mm/172616070094.2055617.17676042522679701515.stgit@firesoul/
> 
> Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Yunzhao Li <yunzhao@cloudflare.com>
> Tested-by: Yunzhao Li <yunzhao@cloudflare.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

A lot can happen in 2s, but I agree doing this every time is
silly. vmscan has been good with 2s for a while, so this should be
fine as well. We can re-evaluate if we run into weird behavior.
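
For anyone following along, the ratelimiting discussed above can be
modeled in userspace. This is a minimal sketch, not the kernel
implementation: struct stats_model, flush_stats(),
flush_stats_ratelimited() and count_flushes() are hypothetical names,
and the 2000 ms period is taken from the patch description. It assumes
the periodic flusher is stalled, so the only flushes are the ones the
caller triggers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interval of the periodic flusher, per the patch description (assumed). */
#define FLUSH_PERIOD_MS 2000ULL

/* Hypothetical stand-in for the shared stats state. */
struct stats_model {
	uint64_t last_flush_ms;	/* time of the most recent flush */
	unsigned long flushes;	/* number of expensive flushes performed */
};

/* Models the expensive, lock-taking flush. */
static void flush_stats(struct stats_model *s, uint64_t now_ms)
{
	s->last_flush_ms = now_ms;
	s->flushes++;
}

/* Flush only if the periodic flusher is one full cycle late. */
static bool flush_stats_ratelimited(struct stats_model *s, uint64_t now_ms)
{
	if (now_ms > s->last_flush_ms + 2 * FLUSH_PERIOD_MS) {
		flush_stats(s, now_ms);
		return true;
	}
	return false;
}

/*
 * Simulate one shrinker count call per millisecond for duration_ms,
 * with no periodic flusher running, and return how many flushes happened.
 */
static unsigned long count_flushes(uint64_t duration_ms, bool ratelimited)
{
	struct stats_model s = { .last_flush_ms = 0, .flushes = 0 };
	uint64_t t;

	for (t = 1; t <= duration_ms; t++) {
		if (ratelimited)
			flush_stats_ratelimited(&s, t);
		else
			flush_stats(&s, t);
	}
	return s.flushes;
}
```

With one caller per millisecond for 10 seconds,
count_flushes(10000, true) performs 2 flushes where
count_flushes(10000, false) performs 10000. The price is that a count
call may see stats up to two flusher periods old in this model.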


Thread overview: 3+ messages
2026-07-02 18:07 [PATCH] mm/zswap: use ratelimited stats flush in zswap_shrinker_count() Yunzhao Li
2026-07-02 20:16 ` Johannes Weiner [this message]
2026-07-02 20:51   ` Jesper Dangaard Brouer
