From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>,
	Yunzhao Li <yunzhao@cloudflare.com>
Cc: linux-mm@kvack.org, nphamcs@gmail.com, yosryahmed@google.com,
	shakeel.butt@linux.dev, akpm@linux-foundation.org,
	zhouchengming@bytedance.com
Subject: Re: [PATCH] mm/zswap: use ratelimited stats flush in zswap_shrinker_count()
Date: Thu, 2 Jul 2026 22:51:47 +0200
Message-ID: <6dbee972-566c-4aa7-9fb3-dbc5f90e923f@kernel.org>
In-Reply-To: <akbHBZP9HxOJMtQB@cmpxchg.org>



On 02/07/2026 22.16, Johannes Weiner wrote:
> On Thu, Jul 02, 2026 at 11:07:35AM -0700, Yunzhao Li wrote:
>> zswap_shrinker_count() calls mem_cgroup_flush_stats(), which takes the
>> global cgroup rstat lock synchronously. On machines with many CPUs and
>> NUMA nodes, this creates severe lock contention in the kswapd reclaim
>> path:
>>
>>    - Multiple kswapd threads (one per NUMA node) run concurrently.
>>    - do_shrink_slab() invokes zswap_shrinker_count() for each
>>      memcg-aware shrinker pass.
>>    - Each call flushes the full cgroup rstat hierarchy under the global
>>      lock.
>>
>> On AMD EPYC 9684X machines (96 cores, 192 threads, 12 NUMA nodes)
>> running production workloads with zswap enabled, perf shows 2.88% of
>> kernel cycles in osq_lock contention from this path:
>>
>>       2.88%  [k] osq_lock
>>                --__mutex_lock.constprop.0
>>                    --__cgroup_rstat_lock
>>                        --cgroup_rstat_flush_locked
>>                            --cgroup_rstat_flush
>>                                --zswap_shrinker_count
>>                                    do_shrink_slab
>>                                    shrink_slab
>>                                    shrink_node
>>                                    balance_pgdat
>>                                    kswapd
>>
>> 84% of kswapd kernel cycles are spent in
>> shrink_slab -> zswap_shrinker_count -> cgroup_rstat_flush, not in actual
>> page reclaim (shrink_lruvec).
>>
>> Controlled A/B on identical hardware and workload:
>>
>>    shrinker=Y: 2.88% osq_lock, memory PSI 1.58%
>>    shrinker=N: 0.00% osq_lock, memory PSI 0.57%
>>
>> eBPF-based rstat lock wait measurement across 8 production metals
>> confirms the contention splits cleanly along shrinker enablement:
>>
>>    shrinker=Y: 50-250x more contended lock acquisitions (248/s vs 1.1/s)
>>    shrinker=N: baseline lock wait (0.0017 s/s vs 1.04 s/s with shrinker=Y)
>>
>> zswap_shrinker_count() only produces a heuristic estimate, scaled by
>> compression ratio via mult_frac(). The actual writeback happens in
>> zswap_shrinker_scan(). Slightly stale stats are acceptable here.
>>
>> Switch to mem_cgroup_flush_stats_ratelimited(), which only flushes if
>> the periodic 2-second flusher is one full cycle late. This matches the
>> approach already used in prepare_scan_control() (mm/vmscan.c) for the
>> same reclaim path.
>>
>> After applying this patch, rstat flush latency and lock wait time on
>> shrinker=Y machines dropped to the same level as shrinker=N controls,
>> while the zswap shrinker continues to function (pool size remains
>> bounded under the max_pool_percent cap).
>>
>> Previously discussed:
>>    - Chengming Zhou (Dec 2023): rstat contention from
>>      zswap_shrinker_count [1]
>>    - Shakeel Butt (Aug 2024): zswap_shrinker_count still uses sync
>>      flush [2]
>>    - Yosry Ahmed (Aug 2024): suggested eliminating in-kernel
>>      flushers [3]
>>    - Jesper Dangaard Brouer (Sep 2024): cgroup/rstat V11 patch [4]
>>
>> [1] https://lore.kernel.org/linux-mm/20231206103935.3440502-1-zhouchengming@bytedance.com/
>> [2] https://lore.kernel.org/linux-mm/CALvZod7LFxLCxVpOFH8b2Ppm8T40HPGMKQwX_=NPCWB_mFW+oQ@mail.gmail.com/
>> [3] https://lore.kernel.org/linux-mm/CAJD7tkYvFyOSX+rP_FKGBhxvZiCDxtpsNp-c5CGOA-4Bq9oXSg@mail.gmail.com/
>> [4] https://lore.kernel.org/linux-mm/172616070094.2055617.17676042522679701515.stgit@firesoul/
>>
>> Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Yunzhao Li <yunzhao@cloudflare.com>
>> Tested-by: Yunzhao Li <yunzhao@cloudflare.com>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> A lot can happen in 2s, but I agree doing this every time is
> silly. vmscan has been good with 2s for a while, so this should be
> fine as well. We can re-evaluate if we run into weird behavior.

I don't know whether an ACK from me is needed, but I want to acknowledge
that I helped Yunzhao develop this patch internally at Cloudflare.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

--Jesper
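
For readers following the thread, the rate-limiting idea described in the
quoted patch text (flush only when the periodic 2-second flusher is one
full cycle late) can be sketched as a small userspace model. This is an
illustration under stated assumptions, not the kernel implementation: the
names stats_model, model_flush(), model_flush_ratelimited() and
FLUSH_PERIOD_MS are hypothetical, and time is a plain millisecond counter
rather than jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed period of the periodic flusher: 2 seconds, per the patch text. */
#define FLUSH_PERIOD_MS 2000u

/* Hypothetical stand-in for the stats state; not a kernel structure. */
struct stats_model {
	uint64_t last_flush_ms;    /* time of the most recent flush */
	unsigned long flush_count; /* number of (expensive) flushes performed */
};

/* Models the unconditional flush, i.e. the expensive, lock-taking path. */
static void model_flush(struct stats_model *s, uint64_t now_ms)
{
	s->last_flush_ms = now_ms;
	s->flush_count++;
}

/*
 * Models the ratelimited variant: flush only if the last flush is more
 * than two periods old, meaning the periodic flusher is a full cycle
 * late. Returns true if a flush was performed.
 */
static bool model_flush_ratelimited(struct stats_model *s, uint64_t now_ms)
{
	if (now_ms > s->last_flush_ms + 2 * FLUSH_PERIOD_MS) {
		model_flush(s, now_ms);
		return true;
	}
	return false;
}
```

In this model, a caller that runs many times per second (as the shrinker
count callback does in the report above) almost never takes the expensive
path while the periodic flusher keeps up; it falls back to flushing itself
only once the stats are more than 4 seconds old.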

