Re: [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Brendan Jackman <jackmanb@google.com>, Zi Yan <ziy@nvidia.com>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <liam@infradead.org>,
	Mike Rapoport <rppt@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator
Date: Thu, 2 Jul 2026 12:20:48 +0200	[thread overview]
Message-ID: <e63acfb5-c9d6-4e2a-b44c-417ad3a15b8b@kernel.org> (raw)
In-Reply-To: <akV_VwtcvUN7Mq16@cmpxchg.org>

On 7/1/26 22:57, Johannes Weiner wrote:
> On Wed, Jul 01, 2026 at 08:02:55PM +0200, Vlastimil Babka (SUSE) wrote:
>> On 6/26/26 20:21, Johannes Weiner wrote:
>> > The compaction capturing code assumes the allocation request order
>> > and compaction target order are the same. That won't be true once
>> > defrag_mode promotes sub-block allocations to pageblock-order
>> > compaction: compaction targets the larger order, capture should
>> > remain at the original allocation order.
>> 
>> Well I guess you could also try to capture the whole-pageblock page and then
>> deal with it as with whole pageblock stealing? But this works too and
>> perhaps it's simpler.
> 
> You mean capture the whole block in *page and then break off the chunk
> we need? That's pretty far upstack. We don't hold the zone->lock at
> that point to expand()... I think this is indeed a bit simpler.

Ah right.

> 
>> > @@ -2808,35 +2808,22 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
>> >  		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
>> >  		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
>> >  	};
>> > -	struct capture_control capc = {
>> > -		.cc = &cc,
>> > -		.page = NULL,
>> > -	};
>> >  
>> > -	/*
>> > -	 * Make sure the structs are really initialized before we expose the
>> > -	 * capture control, in case we are interrupted and the interrupt handler
>> > -	 * frees a page.
>> > -	 */
>> > +	/* See the comment in __alloc_pages_direct_compact() */
>> >  	barrier();
>> > -	WRITE_ONCE(current->capture_control, &capc);
>> > +	WRITE_ONCE(capc->cc, &cc);
>> >  
>> > -	ret = compact_zone(&cc, &capc);
>> > +	ret = compact_zone(&cc, capc);
>> > +
>> > +	WRITE_ONCE(capc->cc, NULL);
>> 
>> I wonder if it makes sense to continue having capc->cc and this whole dance
>> in two functions. AFAICS (after patch 4/4) we access only capc->cc->zone and
>> capc->cc->migratetype. migratetype is stable in the whole
>> try_to_compact_pages(), could be part of capc. Order can be added by this
>> patch (with no semantic change to it) and not the next one. Zone varies, but
>> could be also in capc and set by try_to_compact_pages() before every call to
>> compact_zone_order(). Then compact_zone_order() doesn't have to set up any
>> capc fields anymore?
> 
> You mean something like this?

Almost, see below.

> 
> That could work as well, but let me know if that's what you thought.
> 
> The only thing I don't like so much is that it leaks that
> READ_ONCE(capc.page) out of the function where we have those nice IRQ
> preemption comments and the current->capture_control barriering.

My variant doesn't have it, but it does change capc->zone on an "armed"
capc. But I think it's quite safe; an IRQ can only see a zone that's
compatible.
 
>> > @@ -718,7 +718,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
>> >  {
>> >  	struct capture_control *capc = current->capture_control;
>> >  
>> > -	return unlikely(capc) &&
>> > +	return unlikely(capc && capc->cc) &&
>> >  		!(current->flags & PF_KTHREAD) &&
>> >  		!capc->page &&
>> >  		capc->cc->zone == zone ? capc : NULL;
>> > @@ -4146,23 +4146,42 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>> >  	struct page *page = NULL;
>> >  	unsigned long pflags;
>> >  	unsigned int noreclaim_flag;
>> > +	struct capture_control capc = {
>> > +		.page = NULL,
>> 
>> You didn't set .cc to NULL explicitly...
>> 
>> > +	};
>> >  
>> >  	if (!order)
>> >  		return NULL;
>> >  
>> > +	/*
>> > +	 * Make sure the structs are really initialized before we expose the
>> > +	 * capture control, in case we are interrupted and the interrupt handler
>> > +	 * frees a page.
>> > +	 */
>> > +	barrier();
>> 
>> So either an implicit { } NULL / zero initialization + barrier() is enough
>> (I hope so) and we don't need to set NULL / zero in every field explicitly.
>> Or not and then we should set every field and not just page.
> 
> That partial capc = { .page = NULL, } will 0 the other fields. I don't
> see a correctness issue. Or were you talking about readability?

I mean if we trust we don't need to be explicit about what we set to 0/NULL,
then we don't have to set any of the fields that are 0 or NULL. So in the
quoted above there could be just = { }, if we trust that the implicit
initialization does also happen before the barrier().
Otherwise there would have to be { .page = NULL, .cc = NULL, }

Now my version (only compile tested):

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index f29ef0653546..66a2f70e9e01 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -58,6 +58,7 @@ enum compact_result {
 };
 
 struct alloc_context; /* in mm/internal.h */
+struct capture_control; /* in mm/internal.h */
 
 /*
  * Number of free order-0 pages that should be available above given watermark
@@ -92,7 +93,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 		unsigned int order, unsigned int alloc_flags,
 		const struct alloc_context *ac, enum compact_priority prio,
-		struct page **page);
+		struct capture_control *capc);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern bool compaction_suitable(struct zone *zone, int order,
 				unsigned long watermark, int highest_zoneidx);
diff --git a/mm/compaction.c b/mm/compaction.c
index b7878e4a2c4b..902573797f99 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2792,7 +2792,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 static enum compact_result compact_zone_order(struct zone *zone, int order,
 		gfp_t gfp_mask, enum compact_priority prio,
 		unsigned int alloc_flags, int highest_zoneidx,
-		struct page **capture)
+		struct capture_control *capc)
 {
 	enum compact_result ret;
 	struct compact_control cc = {
@@ -2809,36 +2809,8 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
 		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
 	};
-	struct capture_control capc = {
-		.cc = &cc,
-		.page = NULL,
-	};
-
-	/*
-	 * Make sure the structs are really initialized before we expose the
-	 * capture control, in case we are interrupted and the interrupt handler
-	 * frees a page.
-	 */
-	barrier();
-	WRITE_ONCE(current->capture_control, &capc);
 
-	ret = compact_zone(&cc, &capc);
-
-	/*
-	 * Make sure we hide capture control first before we read the captured
-	 * page pointer, otherwise an interrupt could free and capture a page
-	 * and we would leak it.
-	 */
-	WRITE_ONCE(current->capture_control, NULL);
-	*capture = READ_ONCE(capc.page);
-	/*
-	 * Technically, it is also possible that compaction is skipped but
-	 * the page is still captured out of luck(IRQ came and freed the page).
-	 * Returning COMPACT_SUCCESS in such cases helps in properly accounting
-	 * the COMPACT[STALL|FAIL] when compaction is skipped.
-	 */
-	if (*capture)
-		ret = COMPACT_SUCCESS;
+	ret = compact_zone(&cc, capc);
 
 	return ret;
 }
@@ -2850,13 +2822,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
  * @alloc_flags: The allocation flags of the current allocation
  * @ac: The context of current allocation
  * @prio: Determines how hard direct compaction should try to succeed
- * @capture: Pointer to free page created by compaction will be stored here
+ * @capc: Free page capture bypassing the freelist
  *
  * This is the main entry point for direct page compaction.
  */
 enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum compact_priority prio, struct page **capture)
+		enum compact_priority prio, struct capture_control *capc)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -2883,8 +2855,17 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			continue;
 		}
 
+		WRITE_ONCE(capc->zone, zone);
+
 		status = compact_zone_order(zone, order, gfp_mask, prio,
-				alloc_flags, ac->highest_zoneidx, capture);
+				alloc_flags, ac->highest_zoneidx, capc);
+
+		WRITE_ONCE(capc->zone, NULL);
+
+		/* Stop if a page has been captured */
+		if (READ_ONCE(capc->page))
+			status = COMPACT_SUCCESS;
+
 		rc = max(status, rc);
 
 		/* The allocation should succeed, stop compacting */
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a2..6d3402001b93 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1059,7 +1059,9 @@ struct compact_control {
  * immediately when one is created during the free path.
  */
 struct capture_control {
-	struct compact_control *cc;
+	struct zone *zone;
+	int migratetype;
+	int order;
 	struct page *page;
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb422505c6ef..f7f5245e7ca1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -721,14 +721,14 @@ static inline struct capture_control *task_capc(struct zone *zone)
 	return unlikely(capc) &&
 		!(current->flags & PF_KTHREAD) &&
 		!capc->page &&
-		capc->cc->zone == zone ? capc : NULL;
+		capc->zone == zone ? capc : NULL;
 }
 
 static inline bool
 compaction_capture(struct capture_control *capc, struct page *page,
 		   int order, int migratetype)
 {
-	if (!capc || order != capc->cc->order)
+	if (!capc || order != capc->order)
 		return false;
 
 	/* Do not accidentally pollute CMA or isolated regions*/
@@ -744,12 +744,12 @@ compaction_capture(struct capture_control *capc, struct page *page,
 	 * have trouble finding a high-order free page.
 	 */
 	if (order < pageblock_order && migratetype == MIGRATE_MOVABLE &&
-	    capc->cc->migratetype != MIGRATE_MOVABLE)
+	    capc->migratetype != MIGRATE_MOVABLE)
 		return false;
 
-	if (migratetype != capc->cc->migratetype)
-		trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
-					    capc->cc->migratetype, migratetype);
+	if (migratetype != capc->migratetype)
+		trace_mm_page_alloc_extfrag(page, capc->order, order,
+					    capc->migratetype, migratetype);
 
 	capc->page = page;
 	return true;
@@ -4146,6 +4146,12 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	unsigned long pflags;
 	unsigned int noreclaim_flag;
+	struct capture_control capc = {
+		.zone = NULL,
+		.migratetype = ac->migratetype,
+		.order = order,
+		.page = NULL,
+	};
 
 	if (!order)
 		return NULL;
@@ -4155,8 +4161,33 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	fs_reclaim_acquire(gfp_mask);
 	noreclaim_flag = memalloc_noreclaim_save();
 
+	/*
+	 * Make sure the structs are really initialized before we expose the
+	 * capture control, in case we are interrupted and the interrupt handler
+	 * frees a page.
+	 */
+	barrier();
+	WRITE_ONCE(current->capture_control, capc);
+
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
-								prio, &page);
+							       prio, &capc);
+
+	/*
+	 * Make sure we hide capture control first before we read the captured
+	 * page pointer, otherwise an interrupt could free and capture a page
+	 * and we would leak it.
+	 */
+	WRITE_ONCE(current->capture_control, NULL);
+	page = READ_ONCE(capc.page);
+
+	/*
+	 * Technically, it is also possible that compaction is skipped but
+	 * the page is still captured out of luck(IRQ came and freed the page).
+	 * Returning COMPACT_SUCCESS in such cases helps in properly accounting
+	 * the COMPACT[STALL|FAIL] when compaction is skipped.
+	 */
+	if (page)
+		*compact_result = COMPACT_SUCCESS;
 
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(gfp_mask);

next prev parent reply	other threads:[~2026-07-02 10:20 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
2026-06-26 18:21 ` [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction Johannes Weiner
2026-07-01 13:45   ` Vlastimil Babka (SUSE)
2026-07-02  0:39   ` Shakeel Butt
2026-06-26 18:21 ` [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests Johannes Weiner
2026-07-01 14:19   ` Vlastimil Babka (SUSE)
2026-07-01 15:28     ` Johannes Weiner
2026-07-01 18:14       ` Vlastimil Babka (SUSE)
2026-07-01 21:11         ` Johannes Weiner
2026-07-02  9:28           ` Vlastimil Babka (SUSE)
2026-07-02  1:47   ` Zi Yan
2026-06-26 18:21 ` [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
2026-07-01 18:02   ` Vlastimil Babka (SUSE)
2026-07-01 20:57     ` Johannes Weiner
2026-07-02 10:20       ` Vlastimil Babka (SUSE) [this message]
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
2026-06-26 18:29   ` Zi Yan
2026-06-26 18:43     ` Johannes Weiner
2026-07-01 18:06   ` Vlastimil Babka (SUSE)
2026-07-01 21:02     ` Johannes Weiner
2026-07-02  0:31 ` [PATCH 0/4] mm: fix reclaim storms " Shakeel Butt

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:f29ef065354 dfblob:66a2f70e9e0 dfblob:b7878e4a2c4
dfblob:902573797f9 dfblob:181e79f1d6a dfblob:6d3402001b9
dfblob:cb422505c6e dfblob:f7f5245e7ca )
 OR (
bs:"Re: [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e63acfb5-c9d6-4e2a-b44c-417ad3a15b8b@kernel.org \
    --to=vbabka@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@suse.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox