* [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
@ 2025-12-29  3:39 Jiayuan Chen
  2025-12-29 10:42 ` Markus Elfring
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Jiayuan Chen @ 2025-12-29  3:39 UTC (permalink / raw)
  To: linux-mm
  Cc: Jiayuan Chen, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Qi Zheng, Lorenzo Stoakes, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel

From: Jiayuan Chen <jiayuan.chen@shopee.com>

Problem
-------
We observed an issue in production where a workload that continuously
triggers memory.high also generates massive disk read IO, causing
system-wide performance degradation.

This happens because the memory.high penalty is currently based solely
on the overage amount, not on the actual impact of that overage:

1. A memcg over memory.high reclaiming cold/unused pages
   → minimal system impact, light penalty is appropriate

2. A memcg over memory.high with hot pages being continuously
   reclaimed and refaulted → severe IO pressure, needs heavy penalty

Both cases receive identical penalties today. Users are forced to
combine memory.high with io.max as a workaround, but this is:
- The wrong abstraction level (memory policy shouldn't require IO tuning)
- Hard to configure correctly across different storage devices
- Unintuitive for users who only want memory control

Reproduction
------------
A simple test program demonstrates the issue:

    int fd = open("./200MB.file", O_RDWR|O_CREAT, 0777); /* octal mode */
    char *mem = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    while (1) {
        for (size_t i = 0; i < size; i += 4096) {
            if (mem[rand() % size] != 0)
                return -1;
        }
    }

Run with memory.high constraint:

    cgcreate -g io,cpu,cpuset,memory:/always_high
    cgset -r cpuset.cpus=0 always_high
    cgset -r memory.high=150M always_high
    cgexec -g cpu,cpuset,memory:/always_high ./high_test 200 &
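
The test program above is a fragment: `size` is not defined and the loop
never ends. The results below also include a run with MADV_RANDOM, but the
patch does not show how the test program was changed for it. A bounded,
self-contained variant might look like the following sketch. The helper
name touch_random_pages(), the ftruncate() call, the file mode, and the
bounded loop are assumptions for illustration, not the author's program.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read `touches` random bytes from a file-backed
 * shared mapping of `size` bytes. Returns 0 on success, -1 if setup
 * fails or a byte is unexpectedly nonzero (the file is expected to be
 * zero-filled, as in the original fragment).
 */
int touch_random_pages(const char *path, size_t size, long touches)
{
	char *mem;
	int rc = 0;
	int fd;

	assert(size > 0);

	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return -1;

	/* Make sure the file backs every offset of the mapping. */
	if (ftruncate(fd, (off_t)size) != 0) {
		close(fd);
		return -1;
	}

	mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after close() */
	if (mem == MAP_FAILED)
		return -1;

	/* Assumed way to get the "MADV_RANDOM (no readahead)" behaviour. */
	(void)madvise(mem, size, MADV_RANDOM);

	for (long i = 0; i < touches; i++) {
		if (mem[(size_t)rand() % size] != 0) {
			rc = -1;
			break;
		}
	}

	munmap(mem, size);
	return rc;
}
```

Calling touch_random_pages("./200MB.file", 200UL << 20, ...) in a loop
inside the cgroup would approximate the original workload. Dropping the
madvise() call gives the default readahead behaviour.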

Solution
--------
Incorporate refault recency into the penalty calculation. If a refault
occurred recently when memory.high is triggered, it indicates active
thrashing and warrants additional throttling.

Why not use refault counters directly?
- Refault statistics (WORKINGSET_REFAULT_*) are aggregated periodically,
  not available in real-time for accurate delta calculation
- Calling mem_cgroup_flush_stats() on every charge would be prohibitively
  expensive in the hot path
- Due to readahead, the same refault count can represent vastly different
  IO loads, making counter-based estimation unreliable

The timestamp-based approach is:
- O(1) cost: single timestamp read and comparison
- Self-calibrating: penalty scales naturally with refault frequency
- Conservative: only triggers when refault and memory.high event
  occur in close temporal proximity
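
The recency-to-penalty mapping can be modelled in userspace as a rough
sketch. This is not the kernel code. MODEL_HZ = 1000 is an assumption
chosen to match the tick-to-millisecond figures in the patch comment, and
refault_penalty_ticks() is a hypothetical name.

```c
#include <assert.h>

/* Assumption: 1000 ticks per second, so one tick is 1ms. */
#define MODEL_HZ 1000L

/*
 * Userspace model of the penalty derived from refault recency.
 * `now` and `last_refault` are tick counts; 0 for `last_refault`
 * means no refault has been recorded. Returns extra penalty ticks.
 */
long refault_penalty_ticks(unsigned long now, unsigned long last_refault)
{
	long diff, penalty, cap = MODEL_HZ / 10;

	if (last_refault == 0)
		return 0;

	/* Signed difference, so counter wraparound is tolerated. */
	diff = (long)(now - last_refault);
	if (diff >= MODEL_HZ)
		return 0;	/* older than one second: not thrashing */
	if (diff < 1)
		diff = 1;	/* same tick: avoid dividing by zero */

	/* The closer the refault, the larger the penalty, capped at 100ms. */
	penalty = MODEL_HZ / diff;
	return penalty < cap ? penalty : cap;
}

/* Worked examples under the MODEL_HZ = 1000 assumption. */
void refault_penalty_examples(void)
{
	assert(refault_penalty_ticks(5000, 4999) == 100); /* 1 tick ago, capped */
	assert(refault_penalty_ticks(5000, 4900) == 10);  /* 100ms ago */
	assert(refault_penalty_ticks(5000, 4500) == 2);   /* 500ms ago */
	assert(refault_penalty_ticks(5000, 4000) == 0);   /* 1s ago, too old */
	assert(refault_penalty_ticks(5000, 0) == 0);      /* never refaulted */
}
```

The division makes the penalty fall off quickly: refaults up to 10 ticks
old all hit the 100ms cap, while a refault half a second old adds only 2
ticks.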

When refault_penalty is active:
- Skip the "reclaim made progress" retry loop to apply throttling sooner
- Skip the "penalty too small" bypass to ensure some delay is applied
- Add refault-based delay to the overage-based delay
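
The three behaviour changes listed above can be summarised as a small
decision function. This is a userspace sketch, not the kernel code. The
names decide_over_high() and the HIGH_* actions are hypothetical,
MODEL_HZ = 1000 is assumed, and MODEL_MAX_DELAY stands in for
MEMCG_MAX_HIGH_DELAY_JIFFIES (assumed here to be two seconds).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_HZ	1000UL			/* assumed tick rate */
#define MODEL_MAX_DELAY	(2UL * MODEL_HZ)	/* assumed clamp */

enum high_action { HIGH_SKIP_SLEEP, HIGH_RETRY_RECLAIM, HIGH_SLEEP };

struct high_decision {
	enum high_action action;
	unsigned long delay_ticks;	/* only meaningful for HIGH_SLEEP */
};

/*
 * `can_retry` models the kernel's "reclaim made progress or retries
 * remain" condition as a single flag.
 */
struct high_decision decide_over_high(unsigned long overage_penalty,
				      unsigned long refault_penalty,
				      bool can_retry)
{
	struct high_decision d = { HIGH_SLEEP, 0 };
	unsigned long total = overage_penalty + refault_penalty;

	/* Refault delay is added before the overall clamp. */
	if (total > MODEL_MAX_DELAY)
		total = MODEL_MAX_DELAY;

	if (!refault_penalty && total <= MODEL_HZ / 100)
		d.action = HIGH_SKIP_SLEEP;	/* penalty too small */
	else if (!refault_penalty && can_retry)
		d.action = HIGH_RETRY_RECLAIM;	/* prefer more reclaim */
	else
		d.delay_ticks = total;		/* throttle now */

	return d;
}

/* Worked examples under the assumptions above. */
void decide_over_high_examples(void)
{
	/* No recent refault: existing behaviour is unchanged. */
	assert(decide_over_high(5, 0, true).action == HIGH_SKIP_SLEEP);
	assert(decide_over_high(50, 0, true).action == HIGH_RETRY_RECLAIM);
	assert(decide_over_high(50, 0, false).delay_ticks == 50);

	/* Recent refault: both bypasses are skipped, delays are summed. */
	assert(decide_over_high(5, 100, true).action == HIGH_SLEEP);
	assert(decide_over_high(5, 100, true).delay_ticks == 105);
	assert(decide_over_high(3000, 100, false).delay_ticks == MODEL_MAX_DELAY);
}
```

Note that with a nonzero refault penalty even a very small overage
penalty leads to a sleep, which is the intended "apply some delay"
behaviour.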

Results
-------
Before this patch (memory.high triggered, severe thrashing):

    sar -d 1
    Time          DEV       tps     rkB/s    %util
    04:17:42      sda   3242.00  272684.00   89.60
    04:17:43      sda   3412.00  251160.00   91.60
    04:17:44      sda   3185.00  254532.00   88.00
    04:17:45      sda   3230.00  253332.00   88.40
    04:17:46      sda   3416.00  224712.00   92.40
    04:17:47      sda   3613.00  206612.00   94.40

After this patch with MADV_RANDOM (no readahead):

    sar -d 1
    Time          DEV       tps     rkB/s    %util
    04:08:57      sda    512.00    2048.00    5.60
    04:08:58      sda    576.00    2304.00    6.80
    04:08:59      sda    512.00    2048.00    6.80
    04:09:00      sda    536.00    2144.00    4.80
    04:09:01      sda    552.00    2208.00   10.40
    04:09:02      sda    512.00    2048.00    9.20

After this patch (memory.high triggered, thrashing mitigated):

    sar -d 1
    Time          DEV       tps     rkB/s    %util
    04:27:03      sda     40.00    5880.00    0.00
    04:27:04      sda     41.00    6472.00    0.00
    04:27:05      sda     37.00    4716.00    0.00
    04:27:06      sda     48.00    8512.00    0.00
    04:27:07      sda     33.00    4556.00    0.00

The patch reduces disk utilization from ~90% to ~5-10% with MADV_RANDOM
and to ~0% without it, effectively preventing memory.high-induced
thrashing from overwhelming the IO subsystem.

Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>

---
v1 -> v2 : fix compile error when CONFIG_MEMCG is disabled
---
 include/linux/memcontrol.h | 26 ++++++++++++++++++++++++
 mm/memcontrol.c            | 41 +++++++++++++++++++++++++++++++++++---
 mm/workingset.c            |  4 ++++
 3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..98d4268457c0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -321,6 +321,9 @@ struct mem_cgroup {
 	spinlock_t event_list_lock;
 #endif /* CONFIG_MEMCG_V1 */
 
+	/* Timestamp of most recent refault, for thrashing detection */
+	u64 last_refault;
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
@@ -1038,6 +1041,20 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
 }
 
 extern int mem_cgroup_init(void);
+
+static inline void mem_cgroup_update_last_refault(struct mem_cgroup *memcg)
+{
+	if (memcg)
+		WRITE_ONCE(memcg->last_refault, jiffies);
+}
+
+static inline unsigned long mem_cgroup_get_last_refault(struct mem_cgroup *memcg)
+{
+	if (memcg)
+		return READ_ONCE(memcg->last_refault);
+
+	return 0;
+}
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
@@ -1433,6 +1450,15 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
 }
 
 static inline int mem_cgroup_init(void) { return 0; }
+
+static inline void mem_cgroup_update_last_refault(struct mem_cgroup *memcg)
+{
+}
+
+static inline unsigned long mem_cgroup_get_last_refault(struct mem_cgroup *memcg)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG */
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..04f3a2511cbb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2226,6 +2226,38 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
+/*
+ * Check if a refault occurred recently, indicating active thrashing.
+ * Returns additional penalty jiffies based on refault recency.
+ *
+ * We use timestamp rather than refault counters because:
+ * 1. Counter aggregation is periodic and expensive to flush
+ * 2. Readahead makes counter-to-IO correlation unreliable
+ * 3. Timestamp gives us recency which directly reflects thrashing intensity
+ */
+static unsigned long calculate_refault(struct mem_cgroup *memcg)
+{
+	unsigned long last_refault = mem_cgroup_get_last_refault(memcg);
+	unsigned long now = jiffies;
+	long diff;
+
+	/*
+	 * Only care about refaults within the last second. The closer
+	 * the refault is to now, the higher the penalty:
+	 *
+	 *   diff = 1 tick   -> penalty = HZ      (capped to HZ/10 = 100ms)
+	 *   diff = HZ/10    -> penalty = 10 ticks = 10ms
+	 *   diff = HZ/2     -> penalty = 2 ticks  = 2ms
+	 *   diff >= HZ      -> penalty = 0        (too old, not thrashing)
+	 */
+	if (last_refault && time_before(now, last_refault + HZ)) {
+		diff = max((long)now - (long)last_refault, 1L);
+		/* Cap at 100ms to avoid excessive delays */
+		return min(HZ / diff, HZ / 10);
+	}
+	return 0;
+}
+
 /*
  * Reclaims memory over the high limit. Called directly from
  * try_charge() (context permitting), as well as from the userland
@@ -2233,6 +2265,7 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
  */
 void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 {
+	unsigned long refault_penalty;
 	unsigned long penalty_jiffies;
 	unsigned long pflags;
 	unsigned long nr_reclaimed;
@@ -2279,12 +2312,14 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
 						swap_find_max_overage(memcg));
 
+	refault_penalty = calculate_refault(memcg);
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
 	 * extremely slowly.
 	 */
-	penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+	penalty_jiffies = min(penalty_jiffies + refault_penalty, MEMCG_MAX_HIGH_DELAY_JIFFIES);
 
 	/*
 	 * Don't sleep if the amount of jiffies this memcg owes us is so low
@@ -2292,7 +2327,7 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * go only a small amount over their memory.high value and maybe haven't
 	 * been aggressively reclaimed enough yet.
 	 */
-	if (penalty_jiffies <= HZ / 100)
+	if (!refault_penalty && penalty_jiffies <= HZ / 100)
 		goto out;
 
 	/*
@@ -2300,7 +2335,7 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * memory.high, we want to encourage that rather than doing allocator
 	 * throttling.
 	 */
-	if (nr_reclaimed || nr_retries--) {
+	if (!refault_penalty && (nr_reclaimed || nr_retries--)) {
 		in_retry = true;
 		goto retry_reclaim;
 	}
diff --git a/mm/workingset.c b/mm/workingset.c
index e9f05634747a..597fcab497f4 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -297,6 +297,8 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 	if (lruvec != folio_lruvec(folio))
 		goto unlock;
 
+	mem_cgroup_update_last_refault(folio_memcg(folio));
+
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
 
 	if (!recent)
@@ -561,6 +563,8 @@ void workingset_refault(struct folio *folio, void *shadow)
 	pgdat = folio_pgdat(folio);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
+	mem_cgroup_update_last_refault(memcg);
+
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
 	if (!workingset_test_recent(shadow, file, &workingset, true))
-- 
2.43.0


* Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
@ 2025-12-30 14:29 kernel test robot
  0 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2025-12-30 14:29 UTC (permalink / raw)
  To: oe-kbuild; +Cc: lkp, Dan Carpenter

BCC: lkp@intel.com
CC: oe-kbuild-all@lists.linux.dev
In-Reply-To: <20251229033957.296257-1-jiayuan.chen@linux.dev>
References: <20251229033957.296257-1-jiayuan.chen@linux.dev>
TO: Jiayuan Chen <jiayuan.chen@linux.dev>
TO: linux-mm@kvack.org
CC: Jiayuan Chen <jiayuan.chen@shopee.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Michal Hocko <mhocko@kernel.org>
CC: Roman Gushchin <roman.gushchin@linux.dev>
CC: Shakeel Butt <shakeel.butt@linux.dev>
CC: Muchun Song <muchun.song@linux.dev>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux Memory Management List <linux-mm@kvack.org>
CC: David Hildenbrand <david@kernel.org>
CC: Qi Zheng <zhengqi.arch@bytedance.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
CC: Axel Rasmussen <axelrasmussen@google.com>
CC: Yuanchu Xie <yuanchu@google.com>
CC: Wei Xu <weixugc@google.com>
CC: cgroups@vger.kernel.org
CC: linux-kernel@vger.kernel.org

Hi Jiayuan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Jiayuan-Chen/mm-memcg-scale-memory-high-penalty-based-on-refault-recency/20251229-115015
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251229033957.296257-1-jiayuan.chen%40linux.dev
patch subject: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
:::::: branch date: 35 hours ago
:::::: commit date: 35 hours ago
config: arm64-randconfig-r072-20251230 (https://download.01.org/0day-ci/archive/20251230/202512302204.0Hly34Fm-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 8.5.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <error27@gmail.com>
| Closes: https://lore.kernel.org/r/202512302204.0Hly34Fm-lkp@intel.com/

smatch warnings:
mm/msync.c:90 __do_sys_msync() warn: comparison of a potentially tagged address (__do_sys_msync, -2, __UNIQUE_ID_x__781)

vim +90 mm/msync.c

^1da177e4c3f415 Linus Torvalds     2005-04-16  17  
^1da177e4c3f415 Linus Torvalds     2005-04-16  18  /*
^1da177e4c3f415 Linus Torvalds     2005-04-16  19   * MS_SYNC syncs the entire file - including mappings.
^1da177e4c3f415 Linus Torvalds     2005-04-16  20   *
204ec841fbea3e5 Peter Zijlstra     2006-09-25  21   * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
204ec841fbea3e5 Peter Zijlstra     2006-09-25  22   * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
204ec841fbea3e5 Peter Zijlstra     2006-09-25  23   * Now it doesn't do anything, since dirty pages are properly tracked.
204ec841fbea3e5 Peter Zijlstra     2006-09-25  24   *
204ec841fbea3e5 Peter Zijlstra     2006-09-25  25   * The application may now run fsync() to
^1da177e4c3f415 Linus Torvalds     2005-04-16  26   * write out the dirty pages and wait on the writeout and check the result.
^1da177e4c3f415 Linus Torvalds     2005-04-16  27   * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
^1da177e4c3f415 Linus Torvalds     2005-04-16  28   * async writeout immediately.
16538c40776b8be Amos Waterland     2006-03-24  29   * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
^1da177e4c3f415 Linus Torvalds     2005-04-16  30   * applications.
^1da177e4c3f415 Linus Torvalds     2005-04-16  31   */
6a6160a7b5c27b3 Heiko Carstens     2009-01-14  32  SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
^1da177e4c3f415 Linus Torvalds     2005-04-16  33  {
^1da177e4c3f415 Linus Torvalds     2005-04-16  34  	unsigned long end;
204ec841fbea3e5 Peter Zijlstra     2006-09-25  35  	struct mm_struct *mm = current->mm;
^1da177e4c3f415 Linus Torvalds     2005-04-16  36  	struct vm_area_struct *vma;
676758bdb7bfca8 Andrew Morton      2006-03-24  37  	int unmapped_error = 0;
676758bdb7bfca8 Andrew Morton      2006-03-24  38  	int error = -EINVAL;
^1da177e4c3f415 Linus Torvalds     2005-04-16  39  
057d3389108eda8 Andrey Konovalov   2019-09-25  40  	start = untagged_addr(start);
057d3389108eda8 Andrey Konovalov   2019-09-25  41  
^1da177e4c3f415 Linus Torvalds     2005-04-16  42  	if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
^1da177e4c3f415 Linus Torvalds     2005-04-16  43  		goto out;
b0d61c7e56815b0 Alexander Kuleshov 2015-11-05  44  	if (offset_in_page(start))
^1da177e4c3f415 Linus Torvalds     2005-04-16  45  		goto out;
^1da177e4c3f415 Linus Torvalds     2005-04-16  46  	if ((flags & MS_ASYNC) && (flags & MS_SYNC))
^1da177e4c3f415 Linus Torvalds     2005-04-16  47  		goto out;
^1da177e4c3f415 Linus Torvalds     2005-04-16  48  	error = -ENOMEM;
^1da177e4c3f415 Linus Torvalds     2005-04-16  49  	len = (len + ~PAGE_MASK) & PAGE_MASK;
^1da177e4c3f415 Linus Torvalds     2005-04-16  50  	end = start + len;
^1da177e4c3f415 Linus Torvalds     2005-04-16  51  	if (end < start)
^1da177e4c3f415 Linus Torvalds     2005-04-16  52  		goto out;
^1da177e4c3f415 Linus Torvalds     2005-04-16  53  	error = 0;
^1da177e4c3f415 Linus Torvalds     2005-04-16  54  	if (end == start)
^1da177e4c3f415 Linus Torvalds     2005-04-16  55  		goto out;
^1da177e4c3f415 Linus Torvalds     2005-04-16  56  	/*
^1da177e4c3f415 Linus Torvalds     2005-04-16  57  	 * If the interval [start,end) covers some unmapped address ranges,
f6899bc03cbadc6 Nikita Ermakov     2021-04-29  58  	 * just ignore them, but return -ENOMEM at the end. Besides, if the
f6899bc03cbadc6 Nikita Ermakov     2021-04-29  59  	 * flag is MS_ASYNC (w/o MS_INVALIDATE) the result would be -ENOMEM
f6899bc03cbadc6 Nikita Ermakov     2021-04-29  60  	 * anyway and there is nothing left to do, so return immediately.
^1da177e4c3f415 Linus Torvalds     2005-04-16  61  	 */
d8ed45c5dcd455f Michel Lespinasse  2020-06-08  62  	mmap_read_lock(mm);
204ec841fbea3e5 Peter Zijlstra     2006-09-25  63  	vma = find_vma(mm, start);
204ec841fbea3e5 Peter Zijlstra     2006-09-25  64  	for (;;) {
9c50823eebf7c25 Andrew Morton      2006-03-24  65  		struct file *file;
7fc34a62ca4434a Matthew Wilcox     2014-06-04  66  		loff_t fstart, fend;
9c50823eebf7c25 Andrew Morton      2006-03-24  67  
204ec841fbea3e5 Peter Zijlstra     2006-09-25  68  		/* Still start < end. */
204ec841fbea3e5 Peter Zijlstra     2006-09-25  69  		error = -ENOMEM;
204ec841fbea3e5 Peter Zijlstra     2006-09-25  70  		if (!vma)
204ec841fbea3e5 Peter Zijlstra     2006-09-25  71  			goto out_unlock;
^1da177e4c3f415 Linus Torvalds     2005-04-16  72  		/* Here start < vma->vm_end. */
^1da177e4c3f415 Linus Torvalds     2005-04-16  73  		if (start < vma->vm_start) {
f6899bc03cbadc6 Nikita Ermakov     2021-04-29  74  			if (flags == MS_ASYNC)
f6899bc03cbadc6 Nikita Ermakov     2021-04-29  75  				goto out_unlock;
^1da177e4c3f415 Linus Torvalds     2005-04-16  76  			start = vma->vm_start;
204ec841fbea3e5 Peter Zijlstra     2006-09-25  77  			if (start >= end)
9c50823eebf7c25 Andrew Morton      2006-03-24  78  				goto out_unlock;
204ec841fbea3e5 Peter Zijlstra     2006-09-25  79  			unmapped_error = -ENOMEM;
^1da177e4c3f415 Linus Torvalds     2005-04-16  80  		}
204ec841fbea3e5 Peter Zijlstra     2006-09-25  81  		/* Here vma->vm_start <= start < vma->vm_end. */
204ec841fbea3e5 Peter Zijlstra     2006-09-25  82  		if ((flags & MS_INVALIDATE) &&
204ec841fbea3e5 Peter Zijlstra     2006-09-25  83  				(vma->vm_flags & VM_LOCKED)) {
204ec841fbea3e5 Peter Zijlstra     2006-09-25  84  			error = -EBUSY;
9c50823eebf7c25 Andrew Morton      2006-03-24  85  			goto out_unlock;
9c50823eebf7c25 Andrew Morton      2006-03-24  86  		}
9c50823eebf7c25 Andrew Morton      2006-03-24  87  		file = vma->vm_file;
496a8e68654a5f4 Namjae Jeon        2014-07-02  88  		fstart = (start - vma->vm_start) +
496a8e68654a5f4 Namjae Jeon        2014-07-02  89  			 ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
7fc34a62ca4434a Matthew Wilcox     2014-06-04 @90  		fend = fstart + (min(end, vma->vm_end) - start) - 1;

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Thread overview: 6+ messages
2025-12-29  3:39 [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency Jiayuan Chen
2025-12-29 10:42 ` Markus Elfring
2025-12-30 11:37 ` Michal Koutný
2026-01-05 17:08 ` Shakeel Butt
2026-01-06  3:14   ` Jiayuan Chen
  -- strict thread matches above, loose matches on Subject: below --
2025-12-30 14:29 kernel test robot
