From: Jiayuan Chen <jiayuan.chen@linux.dev>
To: linux-mm@kvack.org
Cc: Jiayuan Chen <jiayuan.chen@shopee.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
Date: Mon, 29 Dec 2025 11:39:55 +0800
Message-ID: <20251229033957.296257-1-jiayuan.chen@linux.dev>
From: Jiayuan Chen <jiayuan.chen@shopee.com>
Problem
-------
In production we observed that a workload which continuously triggers
memory.high also generates massive disk read IO, causing system-wide
performance degradation.
This happens because the memory.high penalty is currently based solely
on the overage amount, not on the actual impact of that overage:
1. A memcg over memory.high reclaiming cold/unused pages
→ minimal system impact, light penalty is appropriate
2. A memcg over memory.high with hot pages being continuously
reclaimed and refaulted → severe IO pressure, needs heavy penalty
Both cases receive identical penalties today. Users are forced to
combine memory.high with io.max as a workaround, but this is:
- The wrong abstraction level (memory policy shouldn't require IO tuning)
- Hard to configure correctly across different storage devices
- Unintuitive for users who only want memory control
Reproduction
------------
A simple test program demonstrates the issue:
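    /* The mode must be octal: a decimal 777 sets unintended permission bits. */
    int fd = open("./200MB.file", O_RDWR|O_CREAT, 0777);
    char *mem = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    /* Read random offsets forever so reclaimed pages keep refaulting. */
    while (1) {
        for (size_t i = 0; i < size; i += 4096) {
            if (mem[rand() % size] != 0)
                return -1;
        }
    }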
Run with memory.high constraint:
cgcreate -g io,cpu,cpuset,memory:/always_high
cgset -r cpuset.cpus=0 always_high
cgset -r memory.high=150M always_high
cgexec -g cpu,cpuset,memory:/always_high ./high_test 200 &
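To confirm that the test cgroup really is hitting memory.high while the reproducer runs, one option is to watch the "high" counter in the cgroup v2 memory.events file. The helper below is only a sketch: the function names are hypothetical, and the path passed to read_high_events() (for example /sys/fs/cgroup/always_high/memory.events) is an assumption about where the cgroup is mounted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the value of the "high" key from the text of a cgroup v2
 * memory.events file, or -1 if the key is not present.
 */
long parse_high_events(const char *text)
{
	const char *line = text;

	assert(text != NULL);

	while (line && *line) {
		long val;

		/* Matches only a line of the form "high <number>". */
		if (sscanf(line, "high %ld", &val) == 1)
			return val;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Read the file at 'path' (hypothetical location, see above) and
 * return its "high" counter, or -1 on any failure.
 */
long read_high_events(const char *path)
{
	char buf[512];
	size_t n;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_high_events(buf);
}
```

Sampling read_high_events() once per second and checking that the value keeps growing indicates the workload is continuously over memory.high.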
Solution
--------
Incorporate refault recency into the penalty calculation. If a refault
occurred recently when memory.high is triggered, it indicates active
thrashing and warrants additional throttling.
Why not use refault counters directly?
- Refault statistics (WORKINGSET_REFAULT_*) are aggregated periodically,
not available in real-time for accurate delta calculation
- Calling mem_cgroup_flush_stats() on every charge would be prohibitively
expensive in the hot path
- Due to readahead, the same refault count can represent vastly different
IO loads, making counter-based estimation unreliable
The timestamp-based approach is:
- O(1) cost: single timestamp read and comparison
- Self-calibrating: penalty scales naturally with refault frequency
- Conservative: only triggers when refault and memory.high event
occur in close temporal proximity
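As a userspace illustration of the timestamp comparison described above, the sketch below models the penalty as a function of how many ticks have passed since the last refault. It is not the kernel code: the function name is hypothetical, the tick rate is passed in as a parameter instead of using the kernel's HZ, and a last_refault of 0 is assumed to mean "no refault recorded".

```c
#include <assert.h>

/*
 * Userspace model of a refault-recency penalty.
 *
 * last_refault: tick count of the most recent refault (0 = none seen)
 * now:          current tick count
 * hz:           ticks per second (assumed parameter for this sketch)
 *
 * Returns extra penalty ticks: hz / age, capped at hz / 10, and 0 when
 * the refault is a full second old or older.
 */
unsigned long refault_penalty_sketch(unsigned long last_refault,
				     unsigned long now, long hz)
{
	long diff, penalty, cap;

	assert(hz >= 10);	/* keeps the hz / 10 cap non-zero */

	if (last_refault == 0)
		return 0;

	/* Signed difference stays correct across counter wraparound. */
	diff = (long)(now - last_refault);
	if (diff >= hz)
		return 0;	/* too old, not treated as thrashing */
	if (diff < 1)
		diff = 1;	/* same tick: avoid dividing by zero */

	penalty = hz / diff;
	cap = hz / 10;
	return (unsigned long)(penalty < cap ? penalty : cap);
}
```

With hz = 1000, a refault 1 tick ago yields the capped value of 100 ticks, 100 ticks ago yields 10, 500 ticks ago yields 2, and 1000 or more ticks ago yields 0.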
When refault_penalty is active:
- Skip the "reclaim made progress" retry loop to apply throttling sooner
- Skip the "penalty too small" bypass to ensure some delay is applied
- Add refault-based delay to the overage-based delay
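The three rules above can be read as a small decision function. The sketch below is a simplified userspace model, not the patched kernel function: all names are hypothetical, SKETCH_HZ assumes a 1000 Hz tick, and SKETCH_MAX_DELAY is assumed to play the role of MEMCG_MAX_HIGH_DELAY_JIFFIES.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_HZ        1000UL			/* assumed tick rate */
#define SKETCH_MAX_DELAY (2UL * SKETCH_HZ)	/* assumed clamp on the total delay */

struct throttle_decision {
	unsigned long sleep_ticks;	/* 0 means the task does not sleep */
	bool retry_reclaim;		/* true means go back and reclaim again */
};

/*
 * overage_penalty: delay derived from how far usage is over memory.high
 * refault_penalty: extra delay derived from refault recency (0 = none)
 * can_retry:       reclaim made progress or retries remain
 */
struct throttle_decision decide_throttle(unsigned long overage_penalty,
					 unsigned long refault_penalty,
					 bool can_retry)
{
	struct throttle_decision d = { 0, false };
	unsigned long total = overage_penalty + refault_penalty;

	assert(refault_penalty <= SKETCH_HZ / 10);

	/* The refault delay is added before the overall clamp. */
	if (total > SKETCH_MAX_DELAY)
		total = SKETCH_MAX_DELAY;

	/* "Penalty too small" bypass applies only without a refault penalty. */
	if (!refault_penalty && total <= SKETCH_HZ / 100)
		return d;

	/* The retry loop is also skipped when a refault penalty is active. */
	if (!refault_penalty && can_retry) {
		d.retry_reclaim = true;
		return d;
	}

	d.sleep_ticks = total;
	return d;
}
```

In this model a small overage with no recent refault returns without sleeping, while the same overage with a recent refault always sleeps for the combined delay.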
Results
-------
Before this patch (memory.high triggered, severe thrashing):
sar -d 1
Time      DEV      tps      rkB/s  %util
04:17:42  sda  3242.00  272684.00  89.60
04:17:43  sda  3412.00  251160.00  91.60
04:17:44  sda  3185.00  254532.00  88.00
04:17:45  sda  3230.00  253332.00  88.40
04:17:46  sda  3416.00  224712.00  92.40
04:17:47  sda  3613.00  206612.00  94.40
After this patch with MADV_RANDOM (no readahead):
sar -d 1
Time      DEV     tps    rkB/s  %util
04:08:57  sda  512.00  2048.00   5.60
04:08:58  sda  576.00  2304.00   6.80
04:08:59  sda  512.00  2048.00   6.80
04:09:00  sda  536.00  2144.00   4.80
04:09:01  sda  552.00  2208.00  10.40
04:09:02  sda  512.00  2048.00   9.20
After this patch (memory.high triggered, thrashing mitigated):
sar -d 1
Time      DEV    tps    rkB/s  %util
04:27:03  sda  40.00  5880.00   0.00
04:27:04  sda  41.00  6472.00   0.00
04:27:05  sda  37.00  4716.00   0.00
04:27:06  sda  48.00  8512.00   0.00
04:27:07  sda  33.00  4556.00   0.00
The patch reduces disk utilization from ~90% to ~6-10%, effectively
preventing memory.high-induced thrashing from overwhelming the IO
subsystem.
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
v1 -> v2 : fix compile error when CONFIG_MEMCG is disabled
---
include/linux/memcontrol.h | 26 ++++++++++++++++++++++++
mm/memcontrol.c | 41 +++++++++++++++++++++++++++++++++++---
mm/workingset.c | 4 ++++
3 files changed, 68 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..98d4268457c0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -321,6 +321,9 @@ struct mem_cgroup {
spinlock_t event_list_lock;
#endif /* CONFIG_MEMCG_V1 */
+ /* Timestamp of most recent refault, for thrashing detection */
+ u64 last_refault;
+
struct mem_cgroup_per_node *nodeinfo[];
};
@@ -1038,6 +1041,20 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
}
extern int mem_cgroup_init(void);
+
+static inline void mem_cgroup_update_last_refault(struct mem_cgroup *memcg)
+{
+ if (memcg)
+ WRITE_ONCE(memcg->last_refault, jiffies);
+}
+
+static inline unsigned long mem_cgroup_get_last_refault(struct mem_cgroup *memcg)
+{
+ if (memcg)
+ return READ_ONCE(memcg->last_refault);
+
+ return 0;
+}
#else /* CONFIG_MEMCG */
#define MEM_CGROUP_ID_SHIFT 0
@@ -1433,6 +1450,15 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
}
static inline int mem_cgroup_init(void) { return 0; }
+
+static inline void mem_cgroup_update_last_refault(struct mem_cgroup *memcg)
+{
+}
+
+static inline unsigned long mem_cgroup_get_last_refault(struct mem_cgroup *memcg)
+{
+ return 0;
+}
#endif /* CONFIG_MEMCG */
/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..04f3a2511cbb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2226,6 +2226,38 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
}
+/*
+ * Check if a refault occurred recently, indicating active thrashing.
+ * Returns additional penalty jiffies based on refault recency.
+ *
+ * We use timestamp rather than refault counters because:
+ * 1. Counter aggregation is periodic and expensive to flush
+ * 2. Readahead makes counter-to-IO correlation unreliable
+ * 3. Timestamp gives us recency which directly reflects thrashing intensity
+ */
+static unsigned long calculate_refault(struct mem_cgroup *memcg)
+{
+ unsigned long last_refault = mem_cgroup_get_last_refault(memcg);
+ unsigned long now = jiffies;
+ long diff;
+
+ /*
+ * Only care about refaults within the last second. The closer
+ * the refault is to now, the higher the penalty:
+ *
+ * diff = 1 tick -> penalty = HZ (capped to HZ/10 = 100ms)
+ * diff = HZ/10 -> penalty = 10 ticks = 10ms
+ * diff = HZ/2 -> penalty = 2 ticks = 2ms
+ * diff >= HZ -> penalty = 0 (too old, not thrashing)
+ */
+ if (last_refault && time_before(now, last_refault + HZ)) {
+ diff = max((long)now - (long)last_refault, 1L);
+ /* Cap at 100ms to avoid excessive delays */
+ return min(HZ / diff, HZ / 10);
+ }
+ return 0;
+}
+
/*
* Reclaims memory over the high limit. Called directly from
* try_charge() (context permitting), as well as from the userland
@@ -2233,6 +2265,7 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
*/
void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
{
+ unsigned long refault_penalty;
unsigned long penalty_jiffies;
unsigned long pflags;
unsigned long nr_reclaimed;
@@ -2279,12 +2312,14 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
penalty_jiffies += calculate_high_delay(memcg, nr_pages,
swap_find_max_overage(memcg));
+ refault_penalty = calculate_refault(memcg);
+
/*
* Clamp the max delay per usermode return so as to still keep the
* application moving forwards and also permit diagnostics, albeit
* extremely slowly.
*/
- penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+ penalty_jiffies = min(penalty_jiffies + refault_penalty, MEMCG_MAX_HIGH_DELAY_JIFFIES);
/*
* Don't sleep if the amount of jiffies this memcg owes us is so low
@@ -2292,7 +2327,7 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
* go only a small amount over their memory.high value and maybe haven't
* been aggressively reclaimed enough yet.
*/
- if (penalty_jiffies <= HZ / 100)
+ if (!refault_penalty && penalty_jiffies <= HZ / 100)
goto out;
/*
@@ -2300,7 +2335,7 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
* memory.high, we want to encourage that rather than doing allocator
* throttling.
*/
- if (nr_reclaimed || nr_retries--) {
+ if (!refault_penalty && (nr_reclaimed || nr_retries--)) {
in_retry = true;
goto retry_reclaim;
}
diff --git a/mm/workingset.c b/mm/workingset.c
index e9f05634747a..597fcab497f4 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -297,6 +297,8 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
if (lruvec != folio_lruvec(folio))
goto unlock;
+ mem_cgroup_update_last_refault(folio_memcg(folio));
+
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
if (!recent)
@@ -561,6 +563,8 @@ void workingset_refault(struct folio *folio, void *shadow)
pgdat = folio_pgdat(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ mem_cgroup_update_last_refault(memcg);
+
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset, true))
--
2.43.0