Linux-mm Archive on lore.kernel.org
* [PATCH] mm/zswap: use ratelimited stats flush in zswap_shrinker_count()
From: Yunzhao Li @ 2026-07-02 18:07 UTC
  To: linux-mm
  Cc: nphamcs, yosryahmed, shakeel.butt, hannes, hawk, akpm,
	zhouchengming, Yunzhao Li

zswap_shrinker_count() calls mem_cgroup_flush_stats(), which takes the
global cgroup rstat lock synchronously. On machines with many CPUs and
NUMA nodes, this creates severe lock contention in the kswapd reclaim
path:

  - Multiple kswapd threads (one per NUMA node) run concurrently.
  - do_shrink_slab() invokes zswap_shrinker_count() for each
    memcg-aware shrinker pass.
  - Each call flushes the full cgroup rstat hierarchy under the global
    lock.

On AMD EPYC 9684X machines (96 cores, 192 threads, 12 NUMA nodes)
running production workloads with zswap enabled, perf shows 2.88% of
kernel cycles in osq_lock contention from this path:

     2.88%  [k] osq_lock
              --__mutex_lock.constprop.0
                  --__cgroup_rstat_lock
                      --cgroup_rstat_flush_locked
                          --cgroup_rstat_flush
                              --zswap_shrinker_count
                                  do_shrink_slab
                                  shrink_slab
                                  shrink_node
                                  balance_pgdat
                                  kswapd

84% of kswapd kernel cycles are spent in
shrink_slab -> zswap_shrinker_count -> cgroup_rstat_flush, not in actual
page reclaim (shrink_lruvec).

Controlled A/B on identical hardware and workload:

  shrinker=Y: 2.88% osq_lock, memory PSI 1.58%
  shrinker=N: 0.00% osq_lock, memory PSI 0.57%

eBPF-based measurement of rstat lock wait across 8 production bare-metal
hosts confirms that the contention splits cleanly along shrinker
enablement:

  shrinker=Y: 248/s contended lock acquisitions, 1.04 s/s lock wait
  shrinker=N: 1.1/s contended lock acquisitions, 0.0017 s/s lock wait

That is 50-250x more contended lock acquisitions with the shrinker
enabled.

zswap_shrinker_count() only produces a heuristic estimate, scaled by
compression ratio via mult_frac(). The actual writeback happens in
zswap_shrinker_scan(). Slightly stale stats are acceptable here.
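
To illustrate why slightly stale inputs only shift the estimate a
little, here is a minimal userspace sketch of the scaling step. It is
not the kernel code: mult_frac_ul() mirrors the arithmetic of the
kernel's mult_frac() macro for unsigned long only, and
scaled_freeable() and its zero check are hypothetical names and
assumptions made for this example.

```c
#include <assert.h>

/*
 * Userspace stand-in for the kernel's mult_frac(x, n, d): computes
 * x * n / d by splitting x into quotient and remainder, which keeps
 * the intermediate product smaller than a naive x * n.
 */
static unsigned long mult_frac_ul(unsigned long x, unsigned long n,
				  unsigned long d)
{
	assert(d != 0);

	unsigned long q = x / d;
	unsigned long r = x % d;

	return q * n + r * n / d;
}

/*
 * Hypothetical helper: scale the number of freeable entries by the
 * compression ratio (backing pages / stored pages).
 */
static unsigned long scaled_freeable(unsigned long nr_freeable,
				     unsigned long nr_backing,
				     unsigned long nr_stored)
{
	if (nr_stored == 0)
		return 0;	/* nothing stored, nothing to write back */

	return mult_frac_ul(nr_freeable, nr_backing, nr_stored);
}
```

With 1000 freeable entries, nr_backing = 250 and nr_stored = 1000 (a
4:1 compression ratio), scaled_freeable() returns 250. A small drift
in nr_backing or nr_stored changes the result proportionally, which
is why the count tolerates stale stats.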

Switch to mem_cgroup_flush_stats_ratelimited(), which only flushes if
the periodic 2-second flusher is one full cycle late. This matches the
approach already used in prepare_scan_control() (mm/vmscan.c) for the
same reclaim path.
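
The ratelimiting rule can be modelled in userspace as follows. This
is a sketch under stated assumptions, not the memcg implementation:
struct flush_state, flush_ratelimited() and the millisecond clock are
hypothetical, and in the kernel the last-flush timestamp is also
advanced by the periodic flusher, which this model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLUSH_PERIOD_MS 2000u	/* period of the periodic flusher */

/* Hypothetical bookkeeping for the model. */
struct flush_state {
	uint64_t last_flush_ms;	/* time of the most recent flush */
	unsigned long flushes;	/* number of flushes performed */
};

/*
 * Flush only if more than two periods have passed since the last
 * flush, i.e. the periodic flusher is a full cycle late.
 * Returns true if a flush was performed.
 */
static bool flush_ratelimited(struct flush_state *st, uint64_t now_ms)
{
	assert(st != NULL);

	if (now_ms <= st->last_flush_ms + 2 * FLUSH_PERIOD_MS)
		return false;	/* stats are recent enough, skip */

	st->last_flush_ms = now_ms;
	st->flushes++;
	return true;
}
```

Starting from last_flush_ms = 0, calls at 1000 ms and 4000 ms are
skipped and a call at 4001 ms flushes. As long as the periodic flusher
keeps up, callers on the reclaim path never take the lock themselves.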

After applying this patch, rstat flush latency and lock wait time on
shrinker=Y machines dropped to the same level as shrinker=N controls,
while the zswap shrinker continues to function (pool size remains
bounded under the max_pool_percent cap).

Previously discussed:
  - Chengming Zhou (Dec 2023): rstat contention from
    zswap_shrinker_count [1]
  - Shakeel Butt (Aug 2024): zswap_shrinker_count still uses sync
    flush [2]
  - Yosry Ahmed (Aug 2024): suggested eliminating in-kernel
    flushers [3]
  - Jesper Dangaard Brouer (Sep 2024): cgroup/rstat V11 patch [4]

[1] https://lore.kernel.org/linux-mm/20231206103935.3440502-1-zhouchengming@bytedance.com/
[2] https://lore.kernel.org/linux-mm/CALvZod7LFxLCxVpOFH8b2Ppm8T40HPGMKQwX_=NPCWB_mFW+oQ@mail.gmail.com/
[3] https://lore.kernel.org/linux-mm/CAJD7tkYvFyOSX+rP_FKGBhxvZiCDxtpsNp-c5CGOA-4Bq9oXSg@mail.gmail.com/
[4] https://lore.kernel.org/linux-mm/172616070094.2055617.17676042522679701515.stgit@firesoul/

Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yunzhao Li <yunzhao@cloudflare.com>
Tested-by: Yunzhao Li <yunzhao@cloudflare.com>
---
 mm/zswap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e..b5a17ea20 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1217,7 +1217,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	 * Without memcg, use the zswap pool-wide metrics.
 	 */
 	if (!mem_cgroup_disabled()) {
-		mem_cgroup_flush_stats(memcg);
+		mem_cgroup_flush_stats_ratelimited(memcg);
 		nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
 		nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
 	} else {
-- 
2.43.0




* Re: [PATCH] mm/zswap: use ratelimited stats flush in zswap_shrinker_count()
From: Johannes Weiner @ 2026-07-02 20:16 UTC
  To: Yunzhao Li
  Cc: linux-mm, nphamcs, yosryahmed, shakeel.butt, hawk, akpm,
	zhouchengming

On Thu, Jul 02, 2026 at 11:07:35AM -0700, Yunzhao Li wrote:
> [...]

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

A lot can happen in 2s, but I agree doing this every time is
silly. vmscan has been good with 2s for a while, so this should be
fine as well. We can re-evaluate if we run into weird behavior.



* Re: [PATCH] mm/zswap: use ratelimited stats flush in zswap_shrinker_count()
From: Jesper Dangaard Brouer @ 2026-07-02 20:51 UTC
  To: Johannes Weiner, Yunzhao Li
  Cc: linux-mm, nphamcs, yosryahmed, shakeel.butt, akpm, zhouchengming



On 02/07/2026 22.16, Johannes Weiner wrote:
> On Thu, Jul 02, 2026 at 11:07:35AM -0700, Yunzhao Li wrote:
>> [...]
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> A lot can happen in 2s, but I agree doing this every time is
> silly. vmscan has been good with 2s for a while, so this should be
> fine as well. We can re-evaluate if we run into weird behavior.

I don't know if an ACK from me is needed, but I'm just acknowledging
that I helped Yunzhao with developing this patch internally at Cloudflare.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

--Jesper

