From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2EC6F387566 for ; Thu, 2 Jul 2026 10:20:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782987654; cv=none; b=LWW66dB/PeXznf0PNY4w3w/lr5JdC90kQxbJOhnT06AinX9Im+tIuAozk09s1OpAEiBezJB6YqqTtmym4/+p71pMQm7goZEiQ5KYs3eCiUJzjE9tpbPvgVXIxthfZO02Moe9wE7UVbjlDDvyjDiLUMpQVzpkXPu0f2TDVCNa5yc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782987654; c=relaxed/simple; bh=5P3HTRjxlNcmWnhu0MxzTsS+PDMvkhI4hT55rDSLPTE=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=TLI30y2/xsrOejRc21VuewQt4HYHPn9pL9V1trGckljKo4de6nSJOmM80hghdvhQ/vg3SlqLy67ZbjX2AFcol9lxu0lV9JxZ4mxCnubvLb1itqo7qklPsamGAUo/kGfBVMYJyJekWyIkmErL19ITPs2UFR7jRJPolxt06WNqwOI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=b3KtrptK; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="b3KtrptK" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 842DF1F000E9; Thu, 2 Jul 2026 10:20:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1782987652; bh=B4oqZrw7KNeqKYrC3Tgr54YUlOEFUPsjbJgAUkFdOXA=; h=Date:Subject:To:Cc:References:From:In-Reply-To; b=b3KtrptKdOg4qNS2It/zQeaTcZCiNNofw2uWfyDohvOoCuBKKdU9f3tO0FBWR+MSq ZsDnRsxdp55VByYgauLbV6d+li6/17T86j0P0vRHncwekxj6UBZtiCCVcQHcX+wOjJ m/GdHP8Itp46442FsXs7FI2p/xyVhwvm4jKS5g28c+7hURVWajsHZDwY+K7RfkqYYx +bdfNVS1z6JljUMHznP9yQYjNmxeyapFU5X9zY9LSSxi+H+JiyFeQFYSAB0TsRqes5 tyIS/5I8qJfthaAoWvtiaUcnEwqMO1ZeX97O0Sgxy42dsoeWzA59Gaj7dANI/GjzDK OHiSDLq5IQ4Gg== Message-ID: Date: Thu, 2 Jul 2026 12:20:48 +0200 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator Content-Language: en-US To: Johannes Weiner Cc: Andrew Morton , Suren Baghdasaryan , Michal Hocko , Brendan Jackman , Zi Yan , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Mike Rapoport , linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20260626182215.1107966-1-hannes@cmpxchg.org> <20260626182215.1107966-4-hannes@cmpxchg.org> From: "Vlastimil Babka (SUSE)" Autocrypt: addr=vbabka@kernel.org; keydata= xsFNBFZdmxYBEADsw/SiUSjB0dM+vSh95UkgcHjzEVBlby/Fg+g42O7LAEkCYXi/vvq31JTB KxRWDHX0R2tgpFDXHnzZcQywawu8eSq0LxzxFNYMvtB7sV1pxYwej2qx9B75qW2plBs+7+YB 87tMFA+u+L4Z5xAzIimfLD5EKC56kJ1CsXlM8S/LHcmdD9Ctkn3trYDNnat0eoAcfPIP2OZ+ 9oe9IF/R28zmh0ifLXyJQQz5ofdj4bPf8ecEW0rhcqHfTD8k4yK0xxt3xW+6Exqp9n9bydiy tcSAw/TahjW6yrA+6JhSBv1v2tIm+itQc073zjSX8OFL51qQVzRFr7H2UQG33lw2QrvHRXqD Ot7ViKam7v0Ho9wEWiQOOZlHItOOXFphWb2yq3nzrKe45oWoSgkxKb97MVsQ+q2SYjJRBBH4 8qKhphADYxkIP6yut/eaj9ImvRUZZRi0DTc8xfnvHGTjKbJzC2xpFcY0DQbZzuwsIZ8OPJCc LM4S7mT25NE5kUTG/TKQCk922vRdGVMoLA7dIQrgXnRXtyT61sg8PG4wcfOnuWf8577aXP1x 6mzw3/jh3F+oSBHb/GcLC7mvWreJifUL2gEdssGfXhGWBo6zLS3qhgtwjay0Jl+kza1lo+Cv BB2T79D4WGdDuVa4eOrQ02TxqGN7G0Biz5ZLRSFzQSQwLn8fbwARAQABzSNWbGFzdGltaWwg QmFia2EgPHZiYWJrYUBrZXJuZWwub3JnPsLBsAQTAQoAWhYhBKlA1DSZLC6OmRA9UCJPp+fM gqZkBQJqFFy6GxSAAAAAAAQADm1hbnUyLDIuNSsxLjEyLDIsMgIbAwUJGtCBUAULCQgHAwUV CgkICwUWAgMBAAIeBQIXgAAKCRAiT6fnzIKmZJIUEADFx/tREzUImHrEwVHeSvDFmA7tJysI UVrlvrM09E7GIuzphzv7jYmo8n3ANpCczLEVr4G0syYQdTigaZgv3+FQDIIzhKih1IHhu1Ei XHlywNWKnQxxQEUNi5Mwx43wQz5XVw9F1A7gtKBKNtfogO511hAbrzagrYajyQacEJ/+sfhZ 9Da8ltHIXD8pcYaHUfQgEusCgmEd9+KrUwrTbckFKmYq5chuE6yJ4J0EmWknL096jIE6CnzF FRslQ3B1UKDjxVsm1ZHfir5NeWszLkTvGFsddFaWTgh8UycESG6VQzKXjjewXu2pG7YQYRpj QKm1W5X2TkwWkXRBZTmfmbhxIUMh3+zf5wQ463rSmDN/8v81tdqBtAW6rH/kzg1GvkaTHXn0 507yEHFzBksk2viAuIxxr7km8+/KARYLIdGtx30EG8cKzAUZOK6WqxtNCsXUJNrVE8CWrCaD icoNu7Fs1c5hmPHdSTnU48ce67449DdnO4neLSNhRiGlMHJgfJUmgrxu/hcYeOZ3haWmEQ2w uW1Mh01OHi8QZHCEyAbABrPs9GUgccc/4eYXX9hIgxfSkYzn8f+8NuIFPWl/0uTvjgqU29FQ SbzOLxHq9439Ox40G5mS5eZXRGxITYR+6TXvRGI6P/264jvflnr/pDGUttaikU+0W+1uxgKH cmYbEc7ATQRbGTU1AQgAn0H6UrFiWcovkh6EXVcl+SeqyO6JHOPm+e9Wu0Vw+VIUvXZVUVVQ La1PQDUi6j00ChlcR66g9/V0sPIcSutacPKfdKYOBvzd4rlhL8rfrdEsQw5ApZxrA8kYZVMh FmBRKAa6wos25moTlMKpCWzTH84+WO5+ziCTsTUZASAToz3RdunTD+vQcHj0GqNTPAHK63sf bAB2I0BslZkXkY1RLb/YhuA6E7JyEd2pilZOrIuBGl/5q2qSakgnAVFWFBR/DO27JuAksYnq +aH8vI0xGvwn75KqSk4UzAkDzWSmO4ZHuahKtQgZNsMYV+PGayRBX9b9zbldzopoLBdqHc4n jQARAQABwsF8BBgBCgAmAhsMFiEEqUDUNJksLo6ZED1QIk+n58yCpmQFAmfIHFQFCRYU6J8A CgkQIk+n58yCpmS2PA//bqN1LfcotmArgElsa+0EGZSQlYgK48pm8WAeTXTngudP9IJ4SuKY HR5RNjHcBeqN+Me0zxRqYzRb8nGanHEkDyf4Im8DQM8d6vbyU+FcPmG4skud4kgS1zMHnlVd SXfSIwKC/hKgdHG8aBV7545Lz9X6Iohea+94wneD0aw/hqF+QWewGZhWJriWAZtvEkzNjQOi 4U9F/trLten/x7bpphDSnDMKJtITbtzATT1Dq7o7VpIUK1nCTQALMuMjKCdi8OdU/+V+R3O4 0PXWvX8qrvqYapVbZ+9KqT74FsuB0Ya9uXwgBF2Q6cRuETZk5vqaqKxzqoQZCO8AOz/58j6O 2RHNy/mZEN+7tJ5Tsq42zVJ4jxsT8b9YplavCMsnBgDeRWhcbYhCyttoL7nYISyWg4kQYZ/P wIV3OuNv2f8iKYsxNsRuClOAF82+gvqOy1/1pprFjy8uo2pkoOrb63aOP3vO5VHnRKgra6dq NcaZ+c6J4H+nEJGi2SkHAUJz5oBzuThvPudLvPA/SK8sKoM01IRxSihev/S/5WLazXB1PGem OCbvzC1IjWJJraxiDJ5IygokapUa2RP7+WBR22skQ3SSl6G107QgWKSyTOGWEaRmV53vxQLV jXuCmzSSasTL60zq5yGrT4/DYQVSNEUiUbG4pYekxJujNeEDkUlky0Y= In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit On 7/1/26 22:57, Johannes Weiner wrote: > On Wed, Jul 01, 2026 at 08:02:55PM +0200, Vlastimil Babka (SUSE) wrote: >> On 6/26/26 20:21, Johannes Weiner wrote: >> > The compaction capturing code assumes the allocation request order >> > and compaction target order are the same. That won't be true once >> > defrag_mode promotes sub-block allocations to pageblock-order >> > compaction: compaction targets the larger order, capture should >> > remain at the original allocation order. >> >> Well I guess you could also try to capture the whole-pageblock page and then >> deal with it as with whole pageblock stealing? But this works too and >> perhaps it's simpler. > > You mean capture the whole block in *page and then break off the chunk > we need? That's pretty far upstack. We don't hold the zone->lock at > that point to expand()... I think this is indeed a bit simpler. Ah right. > >> > @@ -2808,35 +2808,22 @@ static enum compact_result compact_zone_order(struct zone *zone, int order, >> > .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY), >> > .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY) >> > }; >> > - struct capture_control capc = { >> > - .cc = &cc, >> > - .page = NULL, >> > - }; >> > >> > - /* >> > - * Make sure the structs are really initialized before we expose the >> > - * capture control, in case we are interrupted and the interrupt handler >> > - * frees a page. >> > - */ >> > + /* See the comment in __alloc_pages_direct_compact() */ >> > barrier(); >> > - WRITE_ONCE(current->capture_control, &capc); >> > + WRITE_ONCE(capc->cc, &cc); >> > >> > - ret = compact_zone(&cc, &capc); >> > + ret = compact_zone(&cc, capc); >> > + >> > + WRITE_ONCE(capc->cc, NULL); >> >> I wonder if it makes sense to continue having capc->cc and this whole dance >> in two functions. AFAICS (after patch 4/4) we access only capc->cc->zone and >> capc->cc->migratetype. migratetype is stable in the whole >> try_to_compact_pages(), could be part of capc. Order can be added by this >> patch (with no semantic change to it) and not the next one. Zone varies, but >> could be also in capc and set by try_to_compact_pages() before every call to >> compact_zone_order(). Then compact_zone_order() doesn't have to set up any >> capc fields anymore? > > You mean something like this? Almost, see below. > > That could work as well, but let me know if that's what you thought. > > The only thing I don't like so much is that it leaks that > READ_ONCE(capc.page) out of the function where we have those nice IRQ > preemption comments and the current->capture_control barriering. My variant doesn't have it, but it does change capc->zone on an "armed" capc. But I think it's quite safe; an IRQ can only see a zone that's compatible. >> > @@ -718,7 +718,7 @@ static inline struct capture_control *task_capc(struct zone *zone) >> > { >> > struct capture_control *capc = current->capture_control; >> > >> > - return unlikely(capc) && >> > + return unlikely(capc && capc->cc) && >> > !(current->flags & PF_KTHREAD) && >> > !capc->page && >> > capc->cc->zone == zone ? capc : NULL; >> > @@ -4146,23 +4146,42 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, >> > struct page *page = NULL; >> > unsigned long pflags; >> > unsigned int noreclaim_flag; >> > + struct capture_control capc = { >> > + .page = NULL, >> >> You didn't set .cc to NULL explicitly... >> >> > + }; >> > >> > if (!order) >> > return NULL; >> > >> > + /* >> > + * Make sure the structs are really initialized before we expose the >> > + * capture control, in case we are interrupted and the interrupt handler >> > + * frees a page. >> > + */ >> > + barrier(); >> >> So either an implicit { } NULL / zero initialization + barrier() is enough >> (I hope so) and we don't need to set NULL / zero in every field explicitly. >> Or not and then we should set every field and not just page. > > That partial capc = { .page = NULL, } will 0 the other fields. I don't > see a correctness issue. Or were you talking about readability? I mean if we trust we don't need to be explicit about what we set to 0/NULL, then we don't have to set any of the fields that are 0 or NULL. So in the quoted above there could be just = { }, if we trust that the implicit initialization does also happen before the barrier(). Otherwise there would have to be { .page = NULL, .cc = NULL, } Now my version (only compile tested): diff --git a/include/linux/compaction.h b/include/linux/compaction.h index f29ef0653546..66a2f70e9e01 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -58,6 +58,7 @@ enum compact_result { }; struct alloc_context; /* in mm/internal.h */ +struct capture_control; /* in mm/internal.h */ /* * Number of free order-0 pages that should be available above given watermark @@ -92,7 +93,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order); extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, enum compact_priority prio, - struct page **page); + struct capture_control *capc); extern void reset_isolation_suitable(pg_data_t *pgdat); extern bool compaction_suitable(struct zone *zone, int order, unsigned long watermark, int highest_zoneidx); diff --git a/mm/compaction.c b/mm/compaction.c index b7878e4a2c4b..902573797f99 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2792,7 +2792,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) static enum compact_result compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, enum compact_priority prio, unsigned int alloc_flags, int highest_zoneidx, - struct page **capture) + struct capture_control *capc) { enum compact_result ret; struct compact_control cc = { @@ -2809,36 +2809,8 @@ static enum compact_result compact_zone_order(struct zone *zone, int order, .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY), .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY) }; - struct capture_control capc = { - .cc = &cc, - .page = NULL, - }; - - /* - * Make sure the structs are really initialized before we expose the - * capture control, in case we are interrupted and the interrupt handler - * frees a page. - */ - barrier(); - WRITE_ONCE(current->capture_control, &capc); - ret = compact_zone(&cc, &capc); - - /* - * Make sure we hide capture control first before we read the captured - * page pointer, otherwise an interrupt could free and capture a page - * and we would leak it. - */ - WRITE_ONCE(current->capture_control, NULL); - *capture = READ_ONCE(capc.page); - /* - * Technically, it is also possible that compaction is skipped but - * the page is still captured out of luck(IRQ came and freed the page). - * Returning COMPACT_SUCCESS in such cases helps in properly accounting - * the COMPACT[STALL|FAIL] when compaction is skipped. - */ - if (*capture) - ret = COMPACT_SUCCESS; + ret = compact_zone(&cc, capc); return ret; } @@ -2850,13 +2822,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order, * @alloc_flags: The allocation flags of the current allocation * @ac: The context of current allocation * @prio: Determines how hard direct compaction should try to succeed - * @capture: Pointer to free page created by compaction will be stored here + * @capc: Free page capture bypassing the freelist * * This is the main entry point for direct page compaction. */ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, - enum compact_priority prio, struct page **capture) + enum compact_priority prio, struct capture_control *capc) { struct zoneref *z; struct zone *zone; @@ -2883,8 +2855,17 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, continue; } + WRITE_ONCE(capc->zone, zone); + status = compact_zone_order(zone, order, gfp_mask, prio, - alloc_flags, ac->highest_zoneidx, capture); + alloc_flags, ac->highest_zoneidx, capc); + + WRITE_ONCE(capc->zone, NULL); + + /* Stop if a page has been captured */ + if (READ_ONCE(capc->page)) + status = COMPACT_SUCCESS; + rc = max(status, rc); /* The allocation should succeed, stop compacting */ diff --git a/mm/internal.h b/mm/internal.h index 181e79f1d6a2..6d3402001b93 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1059,7 +1059,9 @@ struct compact_control { * immediately when one is created during the free path. */ struct capture_control { - struct compact_control *cc; + struct zone *zone; + int migratetype; + int order; struct page *page; }; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cb422505c6ef..f7f5245e7ca1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -721,14 +721,14 @@ static inline struct capture_control *task_capc(struct zone *zone) return unlikely(capc) && !(current->flags & PF_KTHREAD) && !capc->page && - capc->cc->zone == zone ? capc : NULL; + capc->zone == zone ? capc : NULL; } static inline bool compaction_capture(struct capture_control *capc, struct page *page, int order, int migratetype) { - if (!capc || order != capc->cc->order) + if (!capc || order != capc->order) return false; /* Do not accidentally pollute CMA or isolated regions*/ @@ -744,12 +744,12 @@ compaction_capture(struct capture_control *capc, struct page *page, * have trouble finding a high-order free page. */ if (order < pageblock_order && migratetype == MIGRATE_MOVABLE && - capc->cc->migratetype != MIGRATE_MOVABLE) + capc->migratetype != MIGRATE_MOVABLE) return false; - if (migratetype != capc->cc->migratetype) - trace_mm_page_alloc_extfrag(page, capc->cc->order, order, - capc->cc->migratetype, migratetype); + if (migratetype != capc->migratetype) + trace_mm_page_alloc_extfrag(page, capc->order, order, + capc->migratetype, migratetype); capc->page = page; return true; @@ -4146,6 +4146,12 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, struct page *page = NULL; unsigned long pflags; unsigned int noreclaim_flag; + struct capture_control capc = { + .zone = NULL, + .migratetype = ac->migratetype, + .order = order, + .page = NULL, + }; if (!order) return NULL; @@ -4155,8 +4161,33 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, fs_reclaim_acquire(gfp_mask); noreclaim_flag = memalloc_noreclaim_save(); + /* + * Make sure the structs are really initialized before we expose the + * capture control, in case we are interrupted and the interrupt handler + * frees a page. + */ + barrier(); + WRITE_ONCE(current->capture_control, capc); + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, - prio, &page); + prio, &capc); + + /* + * Make sure we hide capture control first before we read the captured + * page pointer, otherwise an interrupt could free and capture a page + * and we would leak it. + */ + WRITE_ONCE(current->capture_control, NULL); + page = READ_ONCE(capc.page); + + /* + * Technically, it is also possible that compaction is skipped but + * the page is still captured out of luck(IRQ came and freed the page). + * Returning COMPACT_SUCCESS in such cases helps in properly accounting + * the COMPACT[STALL|FAIL] when compaction is skipped. + */ + if (page) + *compact_result = COMPACT_SUCCESS; memalloc_noreclaim_restore(noreclaim_flag); fs_reclaim_release(gfp_mask);