public inbox for linux-mm@kvack.org
* [PATCH v3] mm/vmpressure: skip socket pressure for costly order reclaim
@ 2026-04-06 17:44 JP Kobryn (Meta)
  2026-04-06 17:54 ` Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: JP Kobryn (Meta) @ 2026-04-06 17:44 UTC (permalink / raw)
  To: linux-mm, willy, hannes, akpm, david, ljs, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, kasong, qi.zheng, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, riel, kuba, edumazet
  Cc: netdev, linux-kernel, kernel-team

When reclaim is triggered by high order allocations on a fragmented system,
vmpressure() can report poor reclaim efficiency even though the system has
plenty of free memory. This is because many pages are scanned, but few are
found to actually reclaim - the pages are actively in use and don't need to
be freed. The resulting scan:reclaim ratio causes vmpressure() to assert
socket pressure, throttling TCP throughput unnecessarily.

Costly order allocations (above PAGE_ALLOC_COSTLY_ORDER) rely heavily on
compaction to succeed, so poor reclaim efficiency at these orders does not
necessarily indicate memory pressure. The kernel already treats this order
as the boundary where reclaim is no longer expected to succeed and
compaction may take over.

Make vmpressure() order-aware through an additional parameter sourced from
scan_control at existing call sites. Socket pressure is now only asserted
when order <= PAGE_ALLOC_COSTLY_ORDER.

Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always
uses order 0, which passes the filter unconditionally. Similarly,
vmpressure_prio() now passes order 0 internally when calling vmpressure(),
ensuring critical pressure from low reclaim priority is not suppressed by
the order filter.

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Jakub Kicinski <kuba@kernel.org>
---
v3
 - update changelog to justify patch beyond just networking
 - update changelog to expand scope of vmpressure beyond kswapd

v2
 - dropped extern specifier from vmpressure decl
 - added comment to explain rationale of adjusted conditional

v1: https://lore.kernel.org/all/20260401203752.643259-1-jp.kobryn@linux.dev/

 include/linux/vmpressure.h |  9 +++++----
 mm/vmpressure.c            | 15 ++++++++++++---
 mm/vmscan.c                |  8 ++++----
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 6a2f51ebbfd35..faecd55224017 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -30,8 +30,8 @@ struct vmpressure {
 struct mem_cgroup;
 
 #ifdef CONFIG_MEMCG
-extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
-		       unsigned long scanned, unsigned long reclaimed);
+void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
+		unsigned long scanned, unsigned long reclaimed);
 extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
 
 extern void vmpressure_init(struct vmpressure *vmpr);
@@ -44,8 +44,9 @@ extern int vmpressure_register_event(struct mem_cgroup *memcg,
 extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
 					struct eventfd_ctx *eventfd);
 #else
-static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
-			      unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg,
+			      bool tree, unsigned long scanned,
+			      unsigned long reclaimed) {}
 static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
 				   int prio) {}
 #endif /* CONFIG_MEMCG */
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 3fbb86996c4d2..f053554e58264 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -218,6 +218,7 @@ static void vmpressure_work_fn(struct work_struct *work)
 /**
  * vmpressure() - Account memory pressure through scanned/reclaimed ratio
  * @gfp:	reclaimer's gfp mask
+ * @order:	allocation order being reclaimed for
  * @memcg:	cgroup memory controller handle
  * @tree:	legacy subtree mode
  * @scanned:	number of pages scanned
@@ -236,7 +237,7 @@ static void vmpressure_work_fn(struct work_struct *work)
  *
  * This function does not return any value.
  */
-void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		unsigned long scanned, unsigned long reclaimed)
 {
 	struct vmpressure *vmpr;
@@ -307,7 +308,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
 
 		level = vmpressure_calc_level(scanned, reclaimed);
 
-		if (level > VMPRESSURE_LOW) {
+		/*
+		 * Once we go above COSTLY_ORDER, reclaim relies heavily on
+		 * compaction to make progress. Reclaim efficiency was never a
+		 * great proxy for pressure to begin with, but it's outright
+		 * misleading with these high orders. Don't throttle sockets
+		 * because somebody is attempting something crazy like an order-7
+		 * and predictably struggling.
+		 */
+		if (level > VMPRESSURE_LOW && order <= PAGE_ALLOC_COSTLY_ORDER) {
 			/*
 			 * Let the socket buffer allocator know that
 			 * we are having trouble reclaiming LRU pages.
@@ -348,7 +357,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
 	 * to the vmpressure() basically means that we signal 'critical'
 	 * level.
 	 */
-	vmpressure(gfp, memcg, true, vmpressure_win, 0);
+	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
 }
 
 #define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a8c8fcccbfc9..1342323a0b41f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5071,8 +5071,8 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
 
 	if (!sc->proactive)
-		vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
-			   sc->nr_reclaimed - reclaimed);
+		vmpressure(sc->gfp_mask, sc->order, memcg, false,
+			   sc->nr_scanned - scanned, sc->nr_reclaimed - reclaimed);
 
 	flush_reclaim_state(sc);
 
@@ -6175,7 +6175,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		/* Record the group's reclaim efficiency */
 		if (!sc->proactive)
-			vmpressure(sc->gfp_mask, memcg, false,
+			vmpressure(sc->gfp_mask, sc->order, memcg, false,
 				   sc->nr_scanned - scanned,
 				   sc->nr_reclaimed - reclaimed);
 
@@ -6220,7 +6220,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 	/* Record the subtree's reclaim efficiency */
 	if (!sc->proactive)
-		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+		vmpressure(sc->gfp_mask, sc->order, sc->target_mem_cgroup, true,
 			   sc->nr_scanned - nr_scanned, nr_node_reclaimed);
 
 	if (nr_node_reclaimed)
-- 
2.52.0




* Re: [PATCH v3] mm/vmpressure: skip socket pressure for costly order reclaim
  2026-04-06 17:44 [PATCH v3] mm/vmpressure: skip socket pressure for costly order reclaim JP Kobryn (Meta)
@ 2026-04-06 17:54 ` Andrew Morton
  2026-04-06 19:07   ` JP Kobryn (Meta)
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2026-04-06 17:54 UTC (permalink / raw)
  To: JP Kobryn (Meta)
  Cc: linux-mm, willy, hannes, david, ljs, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, kasong, qi.zheng, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, riel, kuba, edumazet, netdev,
	linux-kernel, kernel-team

On Mon,  6 Apr 2026 10:44:25 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:

> When reclaim is triggered by high order allocations on a fragmented system,
> vmpressure() can report poor reclaim efficiency even though the system has
> plenty of free memory. This is because many pages are scanned, but few are
> found to actually reclaim - the pages are actively in use and don't need to
> be freed. The resulting scan:reclaim ratio causes vmpressure() to assert
> socket pressure, throttling TCP throughput unnecessarily.
> 
> Costly order allocations (above PAGE_ALLOC_COSTLY_ORDER) rely heavily on
> compaction to succeed, so poor reclaim efficiency at these orders does not
> necessarily indicate memory pressure. The kernel already treats this order
> as the boundary where reclaim is no longer expected to succeed and
> compaction may take over.
> 
> Make vmpressure() order-aware through an additional parameter sourced from
> scan_control at existing call sites. Socket pressure is now only asserted
> when order <= PAGE_ALLOC_COSTLY_ORDER.
> 
> Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always
> uses order 0, which passes the filter unconditionally. Similarly,
> vmpressure_prio() now passes order 0 internally when calling vmpressure(),
> ensuring critical pressure from low reclaim priority is not suppressed by
> the order filter.

Thanks.  I'd prefer to park this until after next -rc1.  I could be
argued with, but....

What I'm not understanding from the above is how beneficial this patch
is.  Some description of observed before-and-after behavior, preferably
with impressive measurements?





* Re: [PATCH v3] mm/vmpressure: skip socket pressure for costly order reclaim
  2026-04-06 17:54 ` Andrew Morton
@ 2026-04-06 19:07   ` JP Kobryn (Meta)
  2026-04-06 19:16     ` Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: JP Kobryn (Meta) @ 2026-04-06 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, willy, hannes, david, ljs, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, kasong, qi.zheng, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, riel, kuba, edumazet, netdev,
	linux-kernel, kernel-team

On 4/6/26 10:54 AM, Andrew Morton wrote:
> On Mon,  6 Apr 2026 10:44:25 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:
> 
>> When reclaim is triggered by high order allocations on a fragmented system,
>> vmpressure() can report poor reclaim efficiency even though the system has
>> plenty of free memory. This is because many pages are scanned, but few are
>> found to actually reclaim - the pages are actively in use and don't need to
>> be freed. The resulting scan:reclaim ratio causes vmpressure() to assert
>> socket pressure, throttling TCP throughput unnecessarily.
>>
>> Costly order allocations (above PAGE_ALLOC_COSTLY_ORDER) rely heavily on
>> compaction to succeed, so poor reclaim efficiency at these orders does not
>> necessarily indicate memory pressure. The kernel already treats this order
>> as the boundary where reclaim is no longer expected to succeed and
>> compaction may take over.
>>
>> Make vmpressure() order-aware through an additional parameter sourced from
>> scan_control at existing call sites. Socket pressure is now only asserted
>> when order <= PAGE_ALLOC_COSTLY_ORDER.
>>
>> Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always
>> uses order 0, which passes the filter unconditionally. Similarly,
>> vmpressure_prio() now passes order 0 internally when calling vmpressure(),
>> ensuring critical pressure from low reclaim priority is not suppressed by
>> the order filter.
> 
> Thanks.  I'd prefer to park this until after next -rc1.  I could be
> argued with, but....
> 
> What I'm not understanding from the above is how beneficial this patch
> is.  Some description of observed before-and-after behavior, preferably
> with impressive measurements?

Let me know if this data helps, and if you'd like this added to the
changelog.

On one affected host with impacted net throughput, the memory state at
the time showed ~15GB available, zero cgroup pressure, and the following
buddyinfo state:

Order FreePages
0:    133,970
1:    29,230
2:    17,351
3:    18,984
7+:   0

Using bpf, it was found that 94% of vmpressure calls on this host were
from order-7 kswapd reclaim.

TCP minimum recv window is rcv_ssthresh:19712.

Before patch:
723 out of 3,843 (19%) TCP connections stuck at minimum recv window

After live-patching and ~30min elapsed:
0 out of 3,470 TCP connections stuck at minimum recv window



* Re: [PATCH v3] mm/vmpressure: skip socket pressure for costly order reclaim
  2026-04-06 19:07   ` JP Kobryn (Meta)
@ 2026-04-06 19:16     ` Andrew Morton
  0 siblings, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2026-04-06 19:16 UTC (permalink / raw)
  To: JP Kobryn (Meta)
  Cc: linux-mm, willy, hannes, david, ljs, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, kasong, qi.zheng, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, riel, kuba, edumazet, netdev,
	linux-kernel, kernel-team

On Mon, 6 Apr 2026 12:07:57 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:

> Let me know if this data helps, and if you'd like this added to the
> changelog.
> 
> On one affected host with impacted net throughput, the memory state at
> the time showed ~15GB available, zero cgroup pressure, and the following
> buddyinfo state:
> 
> Order FreePages
> 0:    133,970
> 1:    29,230
> 2:    17,351
> 3:    18,984
> 7+:   0
> 
> Using bpf, it was found that 94% of vmpressure calls on this host were
> from order-7 kswapd reclaim.
> 
> TCP minimum recv window is rcv_ssthresh:19712.
> 
> Before patch:
> 723 out of 3,843 (19%) TCP connections stuck at minimum recv window
> 
> After live-patching and ~30min elapsed:
> 0 out of 3,470 TCP connections stuck at minimum recv window

Well I'm impressed ;)

Yes please, it's a useful thing to include.  When people look at a
contribution, question #1 is "why should I spend time on this".  A
clear (and tasty) description of userspace benefit is the ideal way of
answering that.


