public inbox for linux-kernel@vger.kernel.org
* [PATCH v2] mm/vmpressure: skip socket pressure for costly order reclaim
@ 2026-04-02 23:25 JP Kobryn (Meta)
  2026-04-03  0:45 ` Johannes Weiner
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: JP Kobryn (Meta) @ 2026-04-02 23:25 UTC
  To: linux-mm, willy, hannes, akpm, david, ljs, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, kasong, qi.zheng, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, riel, kuba, edumazet
  Cc: netdev, linux-kernel, kernel-team

When kswapd reclaims at high order due to fragmentation, vmpressure() can
report poor reclaim efficiency even though the system has plenty of free
memory. This is because kswapd scans many pages but finds little to reclaim
- the pages are actively in use and don't need to be freed. The resulting
scan:reclaim ratio triggers socket pressure, throttling TCP throughput
unnecessarily.

Net allocations do not exceed order 3 (PAGE_ALLOC_COSTLY_ORDER), so high
order reclaim difficulty should not trigger socket pressure. The kernel
already treats this order as the boundary where reclaim is no longer
expected to succeed and compaction may take over.

Make vmpressure() order-aware through an additional parameter sourced from
scan_control at existing call sites. Socket pressure is now only asserted
when order <= PAGE_ALLOC_COSTLY_ORDER.

Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always
uses order 0, which passes the filter unconditionally. Similarly,
vmpressure_prio() now passes order 0 internally when calling vmpressure(),
ensuring critical pressure from low reclaim priority is not suppressed by
the order filter.

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
---
v2
 - dropped extern specifier from vmpressure decl
 - added comment to explain rationale of adjusted conditional

v1: https://lore.kernel.org/all/20260401203752.643259-1-jp.kobryn@linux.dev/
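
For reference, the scan:reclaim ratio described in the commit message can be
illustrated with a small user-space sketch. It is modeled on
vmpressure_calc_level() in mm/vmpressure.c; the helper names
(pressure_percent, pressure_to_level) are hypothetical, and the 60/95
thresholds are assumed to match the kernel's medium and critical levels, so
check them against your tree.

```c
/* User-space illustration only; not the kernel implementation. */
enum pressure_level { LEVEL_LOW, LEVEL_MEDIUM, LEVEL_CRITICAL };

/* Assumed thresholds (percent); verify against mm/vmpressure.c. */
static const unsigned long level_med = 60;
static const unsigned long level_critical = 95;

/* Map a scanned/reclaimed pair to a pressure percentage in [0, 100]. */
static unsigned long pressure_percent(unsigned long scanned,
				      unsigned long reclaimed)
{
	unsigned long scale = scanned + reclaimed;
	unsigned long pressure;

	/* Everything scanned was reclaimed (or nothing was scanned). */
	if (reclaimed >= scanned)
		return 0;

	/* Scale up before dividing to limit integer rounding loss. */
	pressure = scale - (reclaimed * scale / scanned);
	return pressure * 100 / scale;
}

/* Bucket the percentage into the three pressure levels. */
static enum pressure_level pressure_to_level(unsigned long pressure)
{
	if (pressure >= level_critical)
		return LEVEL_CRITICAL;
	if (pressure >= level_med)
		return LEVEL_MEDIUM;
	return LEVEL_LOW;
}
```

With these assumed thresholds, scanning 512 pages and reclaiming 8 gives a
pressure of 98 (critical), while reclaiming 256 of 512 gives 50 (low). The
first case is roughly what high-order reclaim on a fragmented system with
plenty of free memory can look like.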

 include/linux/vmpressure.h |  9 +++++----
 mm/vmpressure.c            | 15 ++++++++++++---
 mm/vmscan.c                |  8 ++++----
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 6a2f51ebbfd35..faecd55224017 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -30,8 +30,8 @@ struct vmpressure {
 struct mem_cgroup;
 
 #ifdef CONFIG_MEMCG
-extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
-		       unsigned long scanned, unsigned long reclaimed);
+void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
+		unsigned long scanned, unsigned long reclaimed);
 extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
 
 extern void vmpressure_init(struct vmpressure *vmpr);
@@ -44,8 +44,9 @@ extern int vmpressure_register_event(struct mem_cgroup *memcg,
 extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
 					struct eventfd_ctx *eventfd);
 #else
-static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
-			      unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg,
+			      bool tree, unsigned long scanned,
+			      unsigned long reclaimed) {}
 static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
 				   int prio) {}
 #endif /* CONFIG_MEMCG */
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 3fbb86996c4d2..f053554e58264 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -218,6 +218,7 @@ static void vmpressure_work_fn(struct work_struct *work)
 /**
  * vmpressure() - Account memory pressure through scanned/reclaimed ratio
  * @gfp:	reclaimer's gfp mask
+ * @order:	allocation order being reclaimed for
  * @memcg:	cgroup memory controller handle
  * @tree:	legacy subtree mode
  * @scanned:	number of pages scanned
@@ -236,7 +237,7 @@ static void vmpressure_work_fn(struct work_struct *work)
  *
  * This function does not return any value.
  */
-void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		unsigned long scanned, unsigned long reclaimed)
 {
 	struct vmpressure *vmpr;
@@ -307,7 +308,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
 
 		level = vmpressure_calc_level(scanned, reclaimed);
 
-		if (level > VMPRESSURE_LOW) {
+		/*
+		 * Once we go above COSTLY_ORDER, reclaim relies heavily on
+		 * compaction to make progress. Reclaim efficiency was never a
+		 * great proxy for pressure to begin with, but it's outright
+		 * misleading with these high orders. Don't throttle sockets
+		 * because somebody is attempting something crazy like an order-7
+		 * and predictably struggling.
+		 */
+		if (level > VMPRESSURE_LOW && order <= PAGE_ALLOC_COSTLY_ORDER) {
 			/*
 			 * Let the socket buffer allocator know that
 			 * we are having trouble reclaiming LRU pages.
@@ -348,7 +357,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
 	 * to the vmpressure() basically means that we signal 'critical'
 	 * level.
 	 */
-	vmpressure(gfp, memcg, true, vmpressure_win, 0);
+	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
 }
 
 #define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a8c8fcccbfc9..1342323a0b41f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5071,8 +5071,8 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
 
 	if (!sc->proactive)
-		vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
-			   sc->nr_reclaimed - reclaimed);
+		vmpressure(sc->gfp_mask, sc->order, memcg, false,
+			   sc->nr_scanned - scanned, sc->nr_reclaimed - reclaimed);
 
 	flush_reclaim_state(sc);
 
@@ -6175,7 +6175,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		/* Record the group's reclaim efficiency */
 		if (!sc->proactive)
-			vmpressure(sc->gfp_mask, memcg, false,
+			vmpressure(sc->gfp_mask, sc->order, memcg, false,
 				   sc->nr_scanned - scanned,
 				   sc->nr_reclaimed - reclaimed);
 
@@ -6220,7 +6220,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 	/* Record the subtree's reclaim efficiency */
 	if (!sc->proactive)
-		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+		vmpressure(sc->gfp_mask, sc->order, sc->target_mem_cgroup, true,
 			   sc->nr_scanned - nr_scanned, nr_node_reclaimed);
 
 	if (nr_node_reclaimed)
-- 
2.52.0


* Re: [PATCH v2] mm/vmpressure: skip socket pressure for costly order reclaim
  2026-04-02 23:25 [PATCH v2] mm/vmpressure: skip socket pressure for costly order reclaim JP Kobryn (Meta)
@ 2026-04-03  0:45 ` Johannes Weiner
  2026-04-03  1:03 ` Shakeel Butt
  2026-04-03 20:49 ` Jakub Kicinski
  2 siblings, 0 replies; 6+ messages in thread
From: Johannes Weiner @ 2026-04-03  0:45 UTC
  To: JP Kobryn (Meta)
  Cc: linux-mm, willy, akpm, david, ljs, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, kasong, qi.zheng, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, riel, kuba, edumazet, netdev,
	linux-kernel, kernel-team

On Thu, Apr 02, 2026 at 04:25:11PM -0700, JP Kobryn (Meta) wrote:
> When kswapd reclaims at high order due to fragmentation, vmpressure() can
> report poor reclaim efficiency even though the system has plenty of free
> memory. This is because kswapd scans many pages but finds little to reclaim
> - the pages are actively in use and don't need to be freed. The resulting
> scan:reclaim ratio triggers socket pressure, throttling TCP throughput
> unnecessarily.
> 
> Net allocations do not exceed order 3 (PAGE_ALLOC_COSTLY_ORDER), so high
> order reclaim difficulty should not trigger socket pressure. The kernel
> already treats this order as the boundary where reclaim is no longer
> expected to succeed and compaction may take over.
> 
> Make vmpressure() order-aware through an additional parameter sourced from
> scan_control at existing call sites. Socket pressure is now only asserted
> when order <= PAGE_ALLOC_COSTLY_ORDER.
> 
> Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always
> uses order 0, which passes the filter unconditionally. Similarly,
> vmpressure_prio() now passes order 0 internally when calling vmpressure(),
> ensuring critical pressure from low reclaim priority is not suppressed by
> the order filter.
> 
> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

* Re: [PATCH v2] mm/vmpressure: skip socket pressure for costly order reclaim
  2026-04-02 23:25 [PATCH v2] mm/vmpressure: skip socket pressure for costly order reclaim JP Kobryn (Meta)
  2026-04-03  0:45 ` Johannes Weiner
@ 2026-04-03  1:03 ` Shakeel Butt
  2026-04-03  1:11   ` Rik van Riel
  2026-04-06 17:34   ` JP Kobryn (Meta)
  2026-04-03 20:49 ` Jakub Kicinski
  2 siblings, 2 replies; 6+ messages in thread
From: Shakeel Butt @ 2026-04-03  1:03 UTC
  To: JP Kobryn (Meta)
  Cc: linux-mm, willy, hannes, akpm, david, ljs, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, kasong, qi.zheng, baohua, axelrasmussen,
	yuanchu, weixugc, riel, kuba, edumazet, netdev, linux-kernel,
	kernel-team

On Thu, Apr 02, 2026 at 04:25:11PM -0700, JP Kobryn (Meta) wrote:
> When kswapd reclaims at high order due to fragmentation,

* kswapd is woken up for the higher order reclaim request

But this can be direct reclaim as well.

> vmpressure() can
> report poor reclaim efficiency even though the system has plenty of free
> memory. This is because kswapd scans many pages but finds little to reclaim
> - the pages are actively in use and don't need to be freed. The resulting
> scan:reclaim ratio triggers socket pressure, throttling TCP throughput
> unnecessarily.
> 
> Net allocations do not exceed order 3 (PAGE_ALLOC_COSTLY_ORDER),

Net not doing costly order allocations is irrelevant here. IIUC you want all
costly order allocations (like THPs) to not raise vmpressure as those don't
necessarily represent memory pressure.

> so high
> order reclaim difficulty should not trigger socket pressure. The kernel
> already treats this order as the boundary where reclaim is no longer
> expected to succeed and compaction may take over.
> 
> Make vmpressure() order-aware through an additional parameter sourced from
> scan_control at existing call sites. Socket pressure is now only asserted
> when order <= PAGE_ALLOC_COSTLY_ORDER.
> 
> Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always
> uses order 0, which passes the filter unconditionally. Similarly,
> vmpressure_prio() now passes order 0 internally when calling vmpressure(),
> ensuring critical pressure from low reclaim priority is not suppressed by
> the order filter.
> 
> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>

The patch looks good. I think we can ask Andrew to just adjust the commit
message and then you don't need to resend.

Moving the networking stack away from vmpressure has been in my plan for a long
time, and this tells me I should get to it sooner.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


* Re: [PATCH v2] mm/vmpressure: skip socket pressure for costly order reclaim
  2026-04-03  1:03 ` Shakeel Butt
@ 2026-04-03  1:11   ` Rik van Riel
  2026-04-06 17:34   ` JP Kobryn (Meta)
  1 sibling, 0 replies; 6+ messages in thread
From: Rik van Riel @ 2026-04-03  1:11 UTC
  To: Shakeel Butt, JP Kobryn (Meta)
  Cc: linux-mm, willy, hannes, akpm, david, ljs, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, kasong, qi.zheng, baohua, axelrasmussen,
	yuanchu, weixugc, kuba, edumazet, netdev, linux-kernel,
	kernel-team

On Thu, 2026-04-02 at 18:03 -0700, Shakeel Butt wrote:
> 
> Net not doing costly order allocations is irrelevant here. IIUC you want all
> costly order allocations (like THPs) to not raise vmpressure as those don't
> necessarily represent memory pressure.
> 
It sure will be nice to not have THP enabled=always,
defrag=defer result in smaller network buffers, and
reduced network performance!
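
A minimal sketch of the gate this patch adds, applied to the THP case above.
The function name socket_pressure_allowed() and the enum are hypothetical
stand-ins for the check inside vmpressure(); COSTLY_ORDER mirrors
PAGE_ALLOC_COSTLY_ORDER (3, per the commit message), and order 9 assumes
PMD-sized THP on x86-64 with 4 KiB pages.

```c
#include <stdbool.h>

/* Mirrors PAGE_ALLOC_COSTLY_ORDER as stated in the commit message. */
#define COSTLY_ORDER 3

/* Assumption: PMD-sized THP on x86-64 with 4 KiB base pages (2 MiB). */
#define THP_ORDER_EXAMPLE 9

enum vmp_level { VMP_LOW, VMP_MEDIUM, VMP_CRITICAL };

/*
 * Socket pressure is asserted only when reclaim efficiency is poor AND
 * the reclaim was for an order at or below the costly-order boundary.
 */
static bool socket_pressure_allowed(enum vmp_level level, int order)
{
	return level > VMP_LOW && order <= COSTLY_ORDER;
}
```

Under these assumptions, socket_pressure_allowed(VMP_CRITICAL,
THP_ORDER_EXAMPLE) is false, so struggling THP-order reclaim no longer
throttles sockets, while socket_pressure_allowed(VMP_MEDIUM, 0) stays true.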

Reviewed-by: Rik van Riel <riel@surriel.com>


-- 
All Rights Reversed.

* Re: [PATCH v2] mm/vmpressure: skip socket pressure for costly order reclaim
  2026-04-02 23:25 [PATCH v2] mm/vmpressure: skip socket pressure for costly order reclaim JP Kobryn (Meta)
  2026-04-03  0:45 ` Johannes Weiner
  2026-04-03  1:03 ` Shakeel Butt
@ 2026-04-03 20:49 ` Jakub Kicinski
  2 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2026-04-03 20:49 UTC
  To: JP Kobryn (Meta)
  Cc: linux-mm, willy, hannes, akpm, david, ljs, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, kasong, qi.zheng, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, riel, edumazet, netdev,
	linux-kernel, kernel-team

On Thu,  2 Apr 2026 16:25:11 -0700 JP Kobryn (Meta) wrote:
> When kswapd reclaims at high order due to fragmentation, vmpressure() can
> report poor reclaim efficiency even though the system has plenty of free
> memory. This is because kswapd scans many pages but finds little to reclaim
> - the pages are actively in use and don't need to be freed. The resulting
> scan:reclaim ratio triggers socket pressure, throttling TCP throughput
> unnecessarily.

Acked-by: Jakub Kicinski <kuba@kernel.org>

FWIW

* Re: [PATCH v2] mm/vmpressure: skip socket pressure for costly order reclaim
  2026-04-03  1:03 ` Shakeel Butt
  2026-04-03  1:11   ` Rik van Riel
@ 2026-04-06 17:34   ` JP Kobryn (Meta)
  1 sibling, 0 replies; 6+ messages in thread
From: JP Kobryn (Meta) @ 2026-04-06 17:34 UTC
  To: Shakeel Butt
  Cc: linux-mm, willy, hannes, akpm, david, ljs, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, kasong, qi.zheng, baohua, axelrasmussen,
	yuanchu, weixugc, riel, kuba, edumazet, netdev, linux-kernel,
	kernel-team

On 4/2/26 6:03 PM, Shakeel Butt wrote:
> On Thu, Apr 02, 2026 at 04:25:11PM -0700, JP Kobryn (Meta) wrote:
>> When kswapd reclaims at high order due to fragmentation,
> 
> * kswapd is woken up for the higher order reclaim request
> 
> But this can be direct reclaim as well.

Good call.

> 
>> vmpressure() can
>> report poor reclaim efficiency even though the system has plenty of free
>> memory. This is because kswapd scans many pages but finds little to reclaim
>> - the pages are actively in use and don't need to be freed. The resulting
>> scan:reclaim ratio triggers socket pressure, throttling TCP throughput
>> unnecessarily.
>>
>> Net allocations do not exceed order 3 (PAGE_ALLOC_COSTLY_ORDER),
> 
> Net not doing costly order allocations is irrelevant here. IIUC you want all
> costly order allocations (like THPs) to not raise vmpressure as those don't
> necessarily represent memory pressure.

The supporting context I included was based on the investigation that
led to the patch. But as you and Rik both noted, the patch has
greater implications.

> 
>> so high
>> order reclaim difficulty should not trigger socket pressure. The kernel
>> already treats this order as the boundary where reclaim is no longer
>> expected to succeed and compaction may take over.
>>
>> Make vmpressure() order-aware through an additional parameter sourced from
>> scan_control at existing call sites. Socket pressure is now only asserted
>> when order <= PAGE_ALLOC_COSTLY_ORDER.
>>
>> Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always
>> uses order 0, which passes the filter unconditionally. Similarly,
>> vmpressure_prio() now passes order 0 internally when calling vmpressure(),
>> ensuring critical pressure from low reclaim priority is not suppressed by
>> the order filter.
>>
>> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
> 
> The patch looks good. I think we can ask Andrew to just adjust the commit
> message and then you don't need to resend.

It's no problem for me. I'll send a v3 with an updated commit message.

