From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Brendan Jackman <jackmanb@google.com>, Zi Yan <ziy@nvidia.com>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <liam@infradead.org>,
Mike Rapoport <rppt@kernel.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator
Date: Thu, 2 Jul 2026 12:20:48 +0200 [thread overview]
Message-ID: <e63acfb5-c9d6-4e2a-b44c-417ad3a15b8b@kernel.org> (raw)
In-Reply-To: <akV_VwtcvUN7Mq16@cmpxchg.org>
On 7/1/26 22:57, Johannes Weiner wrote:
> On Wed, Jul 01, 2026 at 08:02:55PM +0200, Vlastimil Babka (SUSE) wrote:
>> On 6/26/26 20:21, Johannes Weiner wrote:
>> > The compaction capturing code assumes the allocation request order
>> > and compaction target order are the same. That won't be true once
>> > defrag_mode promotes sub-block allocations to pageblock-order
>> > compaction: compaction targets the larger order, capture should
>> > remain at the original allocation order.
>>
>> Well I guess you could also try to capture the whole-pageblock page and then
>> deal with it as with whole pageblock stealing? But this works too and
>> perhaps it's simpler.
>
> You mean capture the whole block in *page and then break off the chunk
> we need? That's pretty far upstack. We don't hold the zone->lock at
> that point to expand()... I think this is indeed a bit simpler.
Ah right.
>
>> > @@ -2808,35 +2808,22 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
>> > .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
>> > .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
>> > };
>> > - struct capture_control capc = {
>> > - .cc = &cc,
>> > - .page = NULL,
>> > - };
>> >
>> > - /*
>> > - * Make sure the structs are really initialized before we expose the
>> > - * capture control, in case we are interrupted and the interrupt handler
>> > - * frees a page.
>> > - */
>> > + /* See the comment in __alloc_pages_direct_compact() */
>> > barrier();
>> > - WRITE_ONCE(current->capture_control, &capc);
>> > + WRITE_ONCE(capc->cc, &cc);
>> >
>> > - ret = compact_zone(&cc, &capc);
>> > + ret = compact_zone(&cc, capc);
>> > +
>> > + WRITE_ONCE(capc->cc, NULL);
>>
>> I wonder if it makes sense to continue having capc->cc and this whole dance
>> in two functions. AFAICS (after patch 4/4) we access only capc->cc->zone and
>> capc->cc->migratetype. migratetype is stable in the whole
>> try_to_compact_pages(), could be part of capc. Order can be added by this
>> patch (with no semantic change to it) and not the next one. Zone varies, but
>> could be also in capc and set by try_to_compact_pages() before every call to
>> compact_zone_order(). Then compact_zone_order() doesn't have to set up any
>> capc fields anymore?
>
> You mean something like this?
Almost, see below.
>
> That could work as well, but let me know if that's what you thought.
>
> The only thing I don't like so much is that it leaks that
> READ_ONCE(capc.page) out of the function where we have those nice IRQ
> preemption comments and the current->capture_control barriering.
My variant doesn't have it, but it does change capc->zone on an "armed"
capc. But I think it's quite safe; an IRQ can only see a zone that's
compatible.
>> > @@ -718,7 +718,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
>> > {
>> > struct capture_control *capc = current->capture_control;
>> >
>> > - return unlikely(capc) &&
>> > + return unlikely(capc && capc->cc) &&
>> > !(current->flags & PF_KTHREAD) &&
>> > !capc->page &&
>> > capc->cc->zone == zone ? capc : NULL;
>> > @@ -4146,23 +4146,42 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>> > struct page *page = NULL;
>> > unsigned long pflags;
>> > unsigned int noreclaim_flag;
>> > + struct capture_control capc = {
>> > + .page = NULL,
>>
>> You didn't set .cc to NULL explicitly...
>>
>> > + };
>> >
>> > if (!order)
>> > return NULL;
>> >
>> > + /*
>> > + * Make sure the structs are really initialized before we expose the
>> > + * capture control, in case we are interrupted and the interrupt handler
>> > + * frees a page.
>> > + */
>> > + barrier();
>>
>> So either an implicit { } NULL / zero initialization + barrier() is enough
>> (I hope so) and we don't need to set NULL / zero in every field explicitly.
>> Or not and then we should set every field and not just page.
>
> That partial capc = { .page = NULL, } will 0 the other fields. I don't
> see a correctness issue. Or were you talking about readability?
I mean if we trust we don't need to be explicit about what we set to 0/NULL,
then we don't have to set any of the fields that are 0 or NULL. So in the
quoted above there could be just = { }, if we trust that the implicit
initialization does also happen before the barrier().
Otherwise there would have to be { .page = NULL, .cc = NULL, }
Now my version (only compile tested):
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index f29ef0653546..66a2f70e9e01 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -58,6 +58,7 @@ enum compact_result {
};
struct alloc_context; /* in mm/internal.h */
+struct capture_control; /* in mm/internal.h */
/*
* Number of free order-0 pages that should be available above given watermark
@@ -92,7 +93,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
const struct alloc_context *ac, enum compact_priority prio,
- struct page **page);
+ struct capture_control *capc);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
unsigned long watermark, int highest_zoneidx);
diff --git a/mm/compaction.c b/mm/compaction.c
index b7878e4a2c4b..902573797f99 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2792,7 +2792,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum compact_priority prio,
unsigned int alloc_flags, int highest_zoneidx,
- struct page **capture)
+ struct capture_control *capc)
{
enum compact_result ret;
struct compact_control cc = {
@@ -2809,36 +2809,8 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
};
- struct capture_control capc = {
- .cc = &cc,
- .page = NULL,
- };
-
- /*
- * Make sure the structs are really initialized before we expose the
- * capture control, in case we are interrupted and the interrupt handler
- * frees a page.
- */
- barrier();
- WRITE_ONCE(current->capture_control, &capc);
- ret = compact_zone(&cc, &capc);
-
- /*
- * Make sure we hide capture control first before we read the captured
- * page pointer, otherwise an interrupt could free and capture a page
- * and we would leak it.
- */
- WRITE_ONCE(current->capture_control, NULL);
- *capture = READ_ONCE(capc.page);
- /*
- * Technically, it is also possible that compaction is skipped but
- * the page is still captured out of luck(IRQ came and freed the page).
- * Returning COMPACT_SUCCESS in such cases helps in properly accounting
- * the COMPACT[STALL|FAIL] when compaction is skipped.
- */
- if (*capture)
- ret = COMPACT_SUCCESS;
+ ret = compact_zone(&cc, capc);
return ret;
}
@@ -2850,13 +2822,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
* @alloc_flags: The allocation flags of the current allocation
* @ac: The context of current allocation
* @prio: Determines how hard direct compaction should try to succeed
- * @capture: Pointer to free page created by compaction will be stored here
+ * @capc: Free page capture bypassing the freelist
*
* This is the main entry point for direct page compaction.
*/
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
- enum compact_priority prio, struct page **capture)
+ enum compact_priority prio, struct capture_control *capc)
{
struct zoneref *z;
struct zone *zone;
@@ -2883,8 +2855,17 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
continue;
}
+ WRITE_ONCE(capc->zone, zone);
+
status = compact_zone_order(zone, order, gfp_mask, prio,
- alloc_flags, ac->highest_zoneidx, capture);
+ alloc_flags, ac->highest_zoneidx, capc);
+
+ WRITE_ONCE(capc->zone, NULL);
+
+ /* Stop if a page has been captured */
+ if (READ_ONCE(capc->page))
+ status = COMPACT_SUCCESS;
+
rc = max(status, rc);
/* The allocation should succeed, stop compacting */
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a2..6d3402001b93 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1059,7 +1059,9 @@ struct compact_control {
* immediately when one is created during the free path.
*/
struct capture_control {
- struct compact_control *cc;
+ struct zone *zone;
+ int migratetype;
+ int order;
struct page *page;
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb422505c6ef..f7f5245e7ca1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -721,14 +721,14 @@ static inline struct capture_control *task_capc(struct zone *zone)
return unlikely(capc) &&
!(current->flags & PF_KTHREAD) &&
!capc->page &&
- capc->cc->zone == zone ? capc : NULL;
+ capc->zone == zone ? capc : NULL;
}
static inline bool
compaction_capture(struct capture_control *capc, struct page *page,
int order, int migratetype)
{
- if (!capc || order != capc->cc->order)
+ if (!capc || order != capc->order)
return false;
/* Do not accidentally pollute CMA or isolated regions*/
@@ -744,12 +744,12 @@ compaction_capture(struct capture_control *capc, struct page *page,
* have trouble finding a high-order free page.
*/
if (order < pageblock_order && migratetype == MIGRATE_MOVABLE &&
- capc->cc->migratetype != MIGRATE_MOVABLE)
+ capc->migratetype != MIGRATE_MOVABLE)
return false;
- if (migratetype != capc->cc->migratetype)
- trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
- capc->cc->migratetype, migratetype);
+ if (migratetype != capc->migratetype)
+ trace_mm_page_alloc_extfrag(page, capc->order, order,
+ capc->migratetype, migratetype);
capc->page = page;
return true;
@@ -4146,6 +4146,12 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
unsigned int noreclaim_flag;
+ struct capture_control capc = {
+ .zone = NULL,
+ .migratetype = ac->migratetype,
+ .order = order,
+ .page = NULL,
+ };
if (!order)
return NULL;
@@ -4155,8 +4161,33 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();
+ /*
+ * Make sure the structs are really initialized before we expose the
+ * capture control, in case we are interrupted and the interrupt handler
+ * frees a page.
+ */
+ barrier();
+ WRITE_ONCE(current->capture_control, capc);
+
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &page);
+ prio, &capc);
+
+ /*
+ * Make sure we hide capture control first before we read the captured
+ * page pointer, otherwise an interrupt could free and capture a page
+ * and we would leak it.
+ */
+ WRITE_ONCE(current->capture_control, NULL);
+ page = READ_ONCE(capc.page);
+
+ /*
+ * Technically, it is also possible that compaction is skipped but
+ * the page is still captured out of luck(IRQ came and freed the page).
+ * Returning COMPACT_SUCCESS in such cases helps in properly accounting
+ * the COMPACT[STALL|FAIL] when compaction is skipped.
+ */
+ if (page)
+ *compact_result = COMPACT_SUCCESS;
memalloc_noreclaim_restore(noreclaim_flag);
fs_reclaim_release(gfp_mask);
next prev parent reply other threads:[~2026-07-02 10:20 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
2026-06-26 18:21 ` [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction Johannes Weiner
2026-07-01 13:45 ` Vlastimil Babka (SUSE)
2026-07-02 0:39 ` Shakeel Butt
2026-06-26 18:21 ` [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests Johannes Weiner
2026-07-01 14:19 ` Vlastimil Babka (SUSE)
2026-07-01 15:28 ` Johannes Weiner
2026-07-01 18:14 ` Vlastimil Babka (SUSE)
2026-07-01 21:11 ` Johannes Weiner
2026-07-02 9:28 ` Vlastimil Babka (SUSE)
2026-07-02 1:47 ` Zi Yan
2026-06-26 18:21 ` [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
2026-07-01 18:02 ` Vlastimil Babka (SUSE)
2026-07-01 20:57 ` Johannes Weiner
2026-07-02 10:20 ` Vlastimil Babka (SUSE) [this message]
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
2026-06-26 18:29 ` Zi Yan
2026-06-26 18:43 ` Johannes Weiner
2026-07-01 18:06 ` Vlastimil Babka (SUSE)
2026-07-01 21:02 ` Johannes Weiner
2026-07-02 0:31 ` [PATCH 0/4] mm: fix reclaim storms " Shakeel Butt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e63acfb5-c9d6-4e2a-b44c-417ad3a15b8b@kernel.org \
--to=vbabka@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=david@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=jackmanb@google.com \
--cc=liam@infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@suse.com \
--cc=rppt@kernel.org \
--cc=surenb@google.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox