[PATCH v2] mm/memcg: scale memory.high penalty based on refault recency

* [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
@ 2025-12-29  3:39 Jiayuan Chen
  2025-12-29 10:42 ` Markus Elfring
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Jiayuan Chen @ 2025-12-29  3:39 UTC (permalink / raw)
  To: linux-mm
  Cc: Jiayuan Chen, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Qi Zheng, Lorenzo Stoakes, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel

From: Jiayuan Chen <jiayuan.chen@shopee.com>

Problem
-------
We observed an issue in production where a workload continuously
triggering memory.high also generates massive disk read IO, causing
system-wide performance degradation.

This happens because memory.high penalty is currently based solely on
the overage amount, not the actual impact of that overage:

1. A memcg over memory.high reclaiming cold/unused pages
   → minimal system impact, light penalty is appropriate

2. A memcg over memory.high with hot pages being continuously
   reclaimed and refaulted → severe IO pressure, needs heavy penalty

Both cases receive identical penalties today. Users are forced to
combine memory.high with io.max as a workaround, but this is:
- The wrong abstraction level (memory policy shouldn't require IO tuning)
- Hard to configure correctly across different storage devices
- Unintuitive for users who only want memory control

Reproduction
------------
A simple test program demonstrates the issue:

    int fd = open("./200MB.file", O_RDWR|O_CREAT, 0777); /* mode is octal */
    char *mem = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    while (1) {
        for (size_t i = 0; i < size; i += 4096) {
            if (mem[rand() % size] != 0)
                return -1;
        }
    }

Run with memory.high constraint:

    cgcreate -g io,cpu,cpuset,memory:/always_high
    cgset -r cpuset.cpus=0 always_high
    cgset -r memory.high=150M always_high
    cgexec -g cpu,cpuset,memory:/always_high ./high_test 200 &
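
For readers who want to build the test program, here is a minimal,
bounded sketch of the snippet above. It is not the author's high_test
source: the function name touch_random_bytes(), the passes parameter,
the 0644 mode, and the ftruncate() call that sizes the backing file are
all assumptions added so the fragment compiles and terminates.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `size` bytes of `path` and read random bytes,
 * one read per 4096-byte step, for `passes` sweeps.
 * Returns 0 on success, -1 on error or if a nonzero byte is read.
 */
int touch_random_bytes(const char *path, size_t size, unsigned long passes)
{
    char *mem;
    int rc = 0;
    int fd;

    if (size == 0)
        return -1;

    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* Assumption: size the file here; a freshly extended file reads as zeros. */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }

    mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after close */
    if (mem == MAP_FAILED)
        return -1;

    for (unsigned long p = 0; p < passes && rc == 0; p++) {
        for (size_t i = 0; i < size; i += 4096) {
            if (mem[(size_t)rand() % size] != 0) {
                rc = -1;
                break;
            }
        }
    }

    munmap(mem, size);
    return rc;
}
```

Calling it with a 200 MB size inside the 150M cgroup and a large pass
count should approximate the endless loop in the original snippet.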

Solution
--------
Incorporate refault recency into the penalty calculation. If a refault
occurred recently when memory.high is triggered, it indicates active
thrashing and warrants additional throttling.

Why not use refault counters directly?
- Refault statistics (WORKINGSET_REFAULT_*) are aggregated periodically,
  not available in real-time for accurate delta calculation
- Calling mem_cgroup_flush_stats() on every charge would be prohibitively
  expensive in the hot path
- Due to readahead, the same refault count can represent vastly different
  IO loads, making counter-based estimation unreliable

The timestamp-based approach is:
- O(1) cost: single timestamp read and comparison
- Self-calibrating: penalty scales naturally with refault frequency
- Conservative: only triggers when refault and memory.high event
  occur in close temporal proximity
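
The recency-to-penalty mapping can be modelled in userspace as a rough
sketch. The name model_refault_penalty() and the constant MODEL_HZ
(fixed at 1000 here) are assumptions for illustration only; the kernel
code in the diff below uses HZ, jiffies and time_before() instead.

```c
/* Assumption: 1000 ticks per second, so one tick is 1 ms. */
#define MODEL_HZ 1000L

/*
 * Userspace model of the refault recency penalty.
 * `now` and `last_refault` are tick counts; 0 means "no refault seen".
 */
unsigned long model_refault_penalty(unsigned long now,
                                    unsigned long last_refault)
{
    long diff, penalty;

    if (last_refault == 0)
        return 0;

    /* Wraparound-safe equivalent of time_before(now, last_refault + HZ). */
    if ((long)(now - (last_refault + MODEL_HZ)) >= 0)
        return 0;

    /* Treat a refault in the current tick as one tick old. */
    diff = (long)(now - last_refault);
    if (diff < 1)
        diff = 1;

    /* Inverse scaling with age, capped at HZ/10 (100 ms at HZ = 1000). */
    penalty = MODEL_HZ / diff;
    if (penalty > MODEL_HZ / 10)
        penalty = MODEL_HZ / 10;

    return (unsigned long)penalty;
}
```

With these assumptions, a refault 1 tick ago yields the 100-tick cap,
100 ticks ago yields 10, 500 ticks ago yields 2, and anything a full
second old or older yields 0.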

When refault_penalty is active:
- Skip the "reclaim made progress" retry loop to apply throttling sooner
- Skip the "penalty too small" bypass to ensure some delay is applied
- Add refault-based delay to the overage-based delay
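
How the three changes combine can be sketched as a small decision
function. This is a simplified userspace model, not the kernel code:
model_over_high(), the enum and struct, and the single can_retry_reclaim
flag (standing in for "nr_reclaimed || nr_retries--") are hypothetical,
and MODEL_MAX_HIGH_DELAY assumes the kernel's clamp is 2 * HZ.

```c
#include <stdbool.h>

#define MODEL_HZ 1000UL                       /* assumption: HZ = 1000 */
#define MODEL_MAX_HIGH_DELAY (2UL * MODEL_HZ) /* assumption: 2 s clamp */

enum high_action { HIGH_SKIP, HIGH_RETRY_RECLAIM, HIGH_SLEEP };

struct high_decision {
    enum high_action action;
    unsigned long sleep_jiffies; /* only meaningful for HIGH_SLEEP */
};

struct high_decision model_over_high(unsigned long overage_penalty,
                                     unsigned long refault_penalty,
                                     bool can_retry_reclaim)
{
    struct high_decision d = { HIGH_SKIP, 0 };
    unsigned long penalty = overage_penalty + refault_penalty;

    /* The refault delay is added before the overall clamp. */
    if (penalty > MODEL_MAX_HIGH_DELAY)
        penalty = MODEL_MAX_HIGH_DELAY;

    /* "Penalty too small" bypass applies only without a refault penalty. */
    if (!refault_penalty && penalty <= MODEL_HZ / 100)
        return d;

    /* "Reclaim made progress" retry applies only without a refault penalty. */
    if (!refault_penalty && can_retry_reclaim) {
        d.action = HIGH_RETRY_RECLAIM;
        return d;
    }

    d.action = HIGH_SLEEP;
    d.sleep_jiffies = penalty;
    return d;
}
```

In this model a small overage with a recent refault still sleeps, while
the same overage without a refault returns early, as it does today.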

Results
-------
Before this patch (memory.high triggered, severe thrashing):

    sar -d 1
    Time          DEV       tps     rkB/s    %util
    04:17:42      sda   3242.00  272684.00   89.60
    04:17:43      sda   3412.00  251160.00   91.60
    04:17:44      sda   3185.00  254532.00   88.00
    04:17:45      sda   3230.00  253332.00   88.40
    04:17:46      sda   3416.00  224712.00   92.40
    04:17:47      sda   3613.00  206612.00   94.40

After this patch with MADV_RANDOM (no readahead):

    sar -d 1
    Time          DEV       tps     rkB/s    %util
    04:08:57      sda    512.00    2048.00    5.60
    04:08:58      sda    576.00    2304.00    6.80
    04:08:59      sda    512.00    2048.00    6.80
    04:09:00      sda    536.00    2144.00    4.80
    04:09:01      sda    552.00    2208.00   10.40
    04:09:02      sda    512.00    2048.00    9.20

After this patch (memory.high triggered, thrashing mitigated):

    sar -d 1
    Time          DEV       tps     rkB/s    %util
    04:27:03      sda     40.00    5880.00    0.00
    04:27:04      sda     41.00    6472.00    0.00
    04:27:05      sda     37.00    4716.00    0.00
    04:27:06      sda     48.00    8512.00    0.00
    04:27:07      sda     33.00    4556.00    0.00

The patch reduces disk utilization from ~90% to ~5-10% with MADV_RANDOM
and to near 0% without it, effectively preventing memory.high-induced
thrashing from overwhelming the IO subsystem.

Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>

---
v1 -> v2 : fix compile error when CONFIG_MEMCG is disabled
---
 include/linux/memcontrol.h | 26 ++++++++++++++++++++++++
 mm/memcontrol.c            | 41 +++++++++++++++++++++++++++++++++++---
 mm/workingset.c            |  4 ++++
 3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..98d4268457c0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -321,6 +321,9 @@ struct mem_cgroup {
 	spinlock_t event_list_lock;
 #endif /* CONFIG_MEMCG_V1 */
 
+	/* Timestamp of most recent refault, for thrashing detection */
+	u64 last_refault;
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
@@ -1038,6 +1041,20 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
 }
 
 extern int mem_cgroup_init(void);
+
+static inline void mem_cgroup_update_last_refault(struct mem_cgroup *memcg)
+{
+	if (memcg)
+		WRITE_ONCE(memcg->last_refault, jiffies);
+}
+
+static inline unsigned long mem_cgroup_get_last_refault(struct mem_cgroup *memcg)
+{
+	if (memcg)
+		return READ_ONCE(memcg->last_refault);
+
+	return 0;
+}
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
@@ -1433,6 +1450,15 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
 }
 
 static inline int mem_cgroup_init(void) { return 0; }
+
+static inline void mem_cgroup_update_last_refault(struct mem_cgroup *memcg)
+{
+}
+
+static inline unsigned long mem_cgroup_get_last_refault(struct mem_cgroup *memcg)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG */
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..04f3a2511cbb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2226,6 +2226,38 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
+/*
+ * Check if a refault occurred recently, indicating active thrashing.
+ * Returns additional penalty jiffies based on refault recency.
+ *
+ * We use timestamp rather than refault counters because:
+ * 1. Counter aggregation is periodic and expensive to flush
+ * 2. Readahead makes counter-to-IO correlation unreliable
+ * 3. Timestamp gives us recency which directly reflects thrashing intensity
+ */
+static unsigned long calculate_refault(struct mem_cgroup *memcg)
+{
+	unsigned long last_refault = mem_cgroup_get_last_refault(memcg);
+	unsigned long now = jiffies;
+	long diff;
+
+	/*
+	 * Only care about refaults within the last second. The closer
+	 * the refault is to now, the higher the penalty (for HZ = 1000):
+	 *
+	 *   diff = 1 tick   -> penalty = HZ      (capped to HZ/10 = 100ms)
+	 *   diff = HZ/10    -> penalty = 10 ticks = 10ms
+	 *   diff = HZ/2     -> penalty = 2 ticks  = 2ms
+	 *   diff >= HZ      -> penalty = 0        (too old, not thrashing)
+	 */
+	if (last_refault && time_before(now, last_refault + HZ)) {
+		diff = max((long)now - (long)last_refault, 1L);
+		/* Cap at 100ms to avoid excessive delays */
+		return min(HZ / diff, HZ / 10);
+	}
+	return 0;
+}
+
 /*
  * Reclaims memory over the high limit. Called directly from
  * try_charge() (context permitting), as well as from the userland
@@ -2233,6 +2265,7 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
  */
 void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 {
+	unsigned long refault_penalty;
 	unsigned long penalty_jiffies;
 	unsigned long pflags;
 	unsigned long nr_reclaimed;
@@ -2279,12 +2312,14 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
 						swap_find_max_overage(memcg));
 
+	refault_penalty = calculate_refault(memcg);
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
 	 * extremely slowly.
 	 */
-	penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+	penalty_jiffies = min(penalty_jiffies + refault_penalty, MEMCG_MAX_HIGH_DELAY_JIFFIES);
 
 	/*
 	 * Don't sleep if the amount of jiffies this memcg owes us is so low
@@ -2292,7 +2327,7 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * go only a small amount over their memory.high value and maybe haven't
 	 * been aggressively reclaimed enough yet.
 	 */
-	if (penalty_jiffies <= HZ / 100)
+	if (!refault_penalty && penalty_jiffies <= HZ / 100)
 		goto out;
 
 	/*
@@ -2300,7 +2335,7 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * memory.high, we want to encourage that rather than doing allocator
 	 * throttling.
 	 */
-	if (nr_reclaimed || nr_retries--) {
+	if (!refault_penalty && (nr_reclaimed || nr_retries--)) {
 		in_retry = true;
 		goto retry_reclaim;
 	}
diff --git a/mm/workingset.c b/mm/workingset.c
index e9f05634747a..597fcab497f4 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -297,6 +297,8 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 	if (lruvec != folio_lruvec(folio))
 		goto unlock;
 
+	mem_cgroup_update_last_refault(folio_memcg(folio));
+
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
 
 	if (!recent)
@@ -561,6 +563,8 @@ void workingset_refault(struct folio *folio, void *shadow)
 	pgdat = folio_pgdat(folio);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
+	mem_cgroup_update_last_refault(memcg);
+
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
 	if (!workingset_test_recent(shadow, file, &workingset, true))
-- 
2.43.0


* Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
  2025-12-29  3:39 [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency Jiayuan Chen
@ 2025-12-29 10:42 ` Markus Elfring
  2025-12-30 11:37 ` Michal Koutný
  2026-01-05 17:08 ` Shakeel Butt
  2 siblings, 0 replies; 5+ messages in thread
From: Markus Elfring @ 2025-12-29 10:42 UTC (permalink / raw)
  To: Jiayuan Chen, linux-mm, cgroups
  Cc: Jiayuan Chen, LKML, Andrew Morton, Axel Rasmussen,
	David Hildenbrand, Johannes Weiner, Lorenzo Stoakes, Michal Hocko,
	Muchun Song, Qi Zheng, Roman Gushchin, Shakeel Butt, Wei Xu,
	Yuanchu Xie

…
> We observed an issue in production where a workload continuously
…

Please avoid a typo in the summary phrase.

Regards,
Markus

* Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
  2025-12-29  3:39 [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency Jiayuan Chen
  2025-12-29 10:42 ` Markus Elfring
@ 2025-12-30 11:37 ` Michal Koutný
  2026-01-05 17:08 ` Shakeel Butt
  2 siblings, 0 replies; 5+ messages in thread
From: Michal Koutný @ 2025-12-30 11:37 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, cgroups, linux-kernel

Hello Jiayuan.

On Mon, Dec 29, 2025 at 11:39:55AM +0800, Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
<snip>
> Users are forced to combine memory.high with io.max as a workaround,
> but this is:
> - The wrong abstraction level (memory policy shouldn't require IO tuning)
> - Hard to configure correctly across different storage devices
> - Unintuitive for users who only want memory control

I'd say the need for IO control is as designed, not a workaround. When
you apply control on one type of resource it may manifest by increased
consumption of another like in communicating vessels. (Johannes may
explain it better.)

IIUC, the injection of extra refault_penalty slows down the thrashing
task and in effect reduces the excessive IO.
Naïvely thinking, wouldn't it have the same effect if memory.high was
lowered (to start high throttling earlier)?

<snip>
> This happens because memory.high penalty is currently based solely on
> the overage amount, not the actual impact of that overage:
> 
> 1. A memcg over memory.high reclaiming cold/unused pages
>    → minimal system impact, light penalty is appropriate
> 
> 2. A memcg over memory.high with hot pages being continuously
>    reclaimed and refaulted → severe IO pressure, needs heavy penalty
> 
> Both cases receive identical penalties today.

(If you want to avoid IO control,) the latter case indicates the memcg's
memory.high is underprovisioned given its needs, so the solution would
be to increase the memory.high (this sounds more natural than the
opposite conjecture above). In theory (don't quote me on that), it
should be visible in PSI since the latter case would accumulate more
stalls than the former, so the cases could be treated accordingly.


> Solution
> --------
> Incorporate refault recency into the penalty calculation. If a refault
> occurred recently when memory.high is triggered, it indicates active
> thrashing and warrants additional throttling.

I find it a little inconsistent that IO induced by memory.high would
have this refault scaling, but IO induced by memory.max, which is equal
in principle, could still grow without limit :-/

> 
> Why not use refault counters directly?
> - Refault statistics (WORKINGSET_REFAULT_*) are aggregated periodically,
>   not available in real-time for accurate delta calculation
> - Calling mem_cgroup_flush_stats() on every charge would be prohibitively
>   expensive in the hot path
> - Due to readahead, the same refault count can represent vastly different
>   IO loads, making counter-based estimation unreliable
> 
> The timestamp-based approach is:
> - O(1) cost: single timestamp read and comparison
> - Self-calibrating: penalty scales naturally with refault frequency

Can you explain whether this would work universally?
IIUC, you measure frequency per memcg but the scaling is applied
per task, so I imagine there is a discrepancy for multi-task
(multi-process) workloads.

Regards,
Michal

* Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
  2025-12-29  3:39 [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency Jiayuan Chen
  2025-12-29 10:42 ` Markus Elfring
  2025-12-30 11:37 ` Michal Koutný
@ 2026-01-05 17:08 ` Shakeel Butt
  2026-01-06  3:14   ` Jiayuan Chen
  2 siblings, 1 reply; 5+ messages in thread
From: Shakeel Butt @ 2026-01-05 17:08 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, David Hildenbrand,
	Qi Zheng, Lorenzo Stoakes, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel, Hui Zhu

+Hui Zhu

Hi Jiayuan,

On Mon, Dec 29, 2025 at 11:39:55AM +0800, Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> Problem
> -------
> We observed an issue in production where a workload continuously
> triggering memory.high also generates massive disk IO READ, causing
> system-wide performance degradation.
> 
> This happens because memory.high penalty is currently based solely on
> the overage amount, not the actual impact of that overage:
> 
> 1. A memcg over memory.high reclaiming cold/unused pages
>    → minimal system impact, light penalty is appropriate
> 
> 2. A memcg over memory.high with hot pages being continuously
>    reclaimed and refaulted → severe IO pressure, needs heavy penalty
> 
> Both cases receive identical penalties today. Users are forced to
> combine memory.high with io.max as a workaround, but this is:
> - The wrong abstraction level (memory policy shouldn't require IO tuning)
> - Hard to configure correctly across different storage devices
> - Unintuitive for users who only want memory control
>

Thanks for raising and reporting this use-case. Overall I am supportive
of making memory.high more useful but instead of adding more
heuristics in the kernel, I would prefer to make the enforcement of
memory.high more flexible with BPF.

At the moment, Hui Zhu is working on adding BPF support for memcg but it
is very generic and I would prefer to start with specific and real
use-case. I think your use-case is real and will be beneficial to many
other users. Can you please follow up on Hui's RFC to present your
use-case? I will also try to push the effort from the review side.

thanks,
Shakeel

* Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
  2026-01-05 17:08 ` Shakeel Butt
@ 2026-01-06  3:14   ` Jiayuan Chen
  0 siblings, 0 replies; 5+ messages in thread
From: Jiayuan Chen @ 2026-01-06  3:14 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: linux-mm, Jiayuan Chen, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, David Hildenbrand,
	Qi Zheng, Lorenzo Stoakes, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel, Hui Zhu

January 6, 2026 at 01:08, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:


> +Hui Zhu
> 
> Hi Jiayuan,
> 
> On Mon, Dec 29, 2025 at 11:39:55AM +0800, Jiayuan Chen wrote:
> > From: Jiayuan Chen <jiayuan.chen@shopee.com>
> > 
> > Problem
> > -------
> > We observed an issue in production where a workload continuously
> > triggering memory.high also generates massive disk IO READ, causing
> > system-wide performance degradation.
> > 
> > This happens because memory.high penalty is currently based solely on
> > the overage amount, not the actual impact of that overage:
> > 
> > 1. A memcg over memory.high reclaiming cold/unused pages
> >    → minimal system impact, light penalty is appropriate
> > 
> > 2. A memcg over memory.high with hot pages being continuously
> >    reclaimed and refaulted → severe IO pressure, needs heavy penalty
> > 
> > Both cases receive identical penalties today. Users are forced to
> > combine memory.high with io.max as a workaround, but this is:
> > - The wrong abstraction level (memory policy shouldn't require IO tuning)
> > - Hard to configure correctly across different storage devices
> > - Unintuitive for users who only want memory control
> > 
> Thanks for raising and reporting this use-case. Overall I am supportive
> of making memory.high more useful but instead of adding more
> heuristics in the kernel, I would prefer to make the enforcement of
> memory.high more flexible with BPF.
> 
> At the moment, Hui Zhu is working on adding BPF support for memcg but it
> is very generic and I would prefer to start with specific and real
> use-case. I think your use-case is real and will be beneficial to many
> other users. Can you please follow up on Hui's RFC to present your
> use-case? I will also try to push the effort from the review side.
> 
> thanks,
> Shakeel
>

Hi Shakeel,

Thanks for the feedback and pointing to Hui's RFC.

I noticed Michal has already forwarded my patch to that thread, and
Hui has responded. I'll wait to see how that discussion evolves and
whether there's an opportunity to integrate my use-case into his
BPF framework.

You're right that my timestamp-based approach is heuristic. It was
designed as a simple, low-overhead approximation to detect active
thrashing without the cost of flushing refault counters on every
charge. But I agree that a more flexible BPF-based solution could
be cleaner in the long term.

I'll follow up on Hui's thread once there's more progress.

Thanks,
Jiayuan
