From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Brendan Jackman <jackmanb@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
Muchun Song <muchun.song@linux.dev>,
Oscar Salvador <osalvador@suse.de>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <liam@infradead.org>,
Mike Rapoport <rppt@kernel.org>,
Matthew Brost <matthew.brost@intel.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Rakie Kim <rakie.kim@sk.com>, Byungchul Park <byungchul@sk.com>,
Ying Huang <ying.huang@linux.alibaba.com>,
Alistair Popple <apopple@nvidia.com>, Hao Li <hao.li@linux.dev>,
Christoph Lameter <cl@gentwo.org>,
David Rientjes <rientjes@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Clark Williams <clrkwllms@kernel.org>,
Steven Rostedt <rostedt@goodmis.org>
Cc: "Harry Yoo (Oracle)" <harry@kernel.org>,
Gregory Price <gourry@gourry.net>,
Alexei Starovoitov <ast@kernel.org>,
Matthew Wilcox <willy@infradead.org>, Hao Ge <hao.ge@linux.dev>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-rt-devel@lists.linux.dev
Subject: Re: [PATCH v3 05/16] mm/page_alloc: unify __alloc_frozen_pages[_nolock]_noprof()
Date: Tue, 30 Jun 2026 18:16:20 +0200
Message-ID: <611bd3dc-95d4-45e0-ae5a-158c6cf1472f@kernel.org>
In-Reply-To: <20260629-alloc-trylock-v3-5-57bef0eadbc2@google.com>
On 6/29/26 15:11, Brendan Jackman wrote:
> Currently the core allocator code is controlled by ALLOC_NOLOCK, but the
> main entry point function is significantly different from the normal
Let's name it explicitly: alloc_frozen_pages_nolock_noprof().
> __alloc_frozen_pages_nolock(), this is tiring when reading the code.
You mean __alloc_frozen_pages_noprof()?
>
> Plumb the ALLOC_NOLOCK control one layer up in the call stack: create
> an alloc_flags argument to __alloc_frozen_pages_nolock() (which is only
Again, this should be __alloc_frozen_pages_noprof().
> exposed to mm/) and then turn the nolock variant into a thin wrapper
> that just sets that flag (as well as handling NUMA_NO_NODE, similar to
> how some of the wrappers in gfp.h do).
>
> Rationale that this doesn't change anything:
>
> 1. Simple bits: A bunch of the nolock-specific handling is just moved to
> the new alloc_order_allowed(), alloc_trylock_allowed() and
> gfp_trylock.
These should be alloc_nolock_allowed() and gfp_nolock.
> 2. __alloc_frozen_pages_noprof() has some extra logic that wasn't
> previously in the nolock variant:
>
> a. Application of gfp_allowed_mask; this only affects early boot, and
> only flags that affect the slowpath get changed here.
As discussed in reply to Harry, I'd mention that the flags excluded by
GFP_BOOT_MASK are not usable by _nolock() anyway.
> b. Application of current_gfp_context() - also only affects the
> slowpath
>
> 3. The slowpath itself: this is now just explicitly skipped under
> !ALLOC_TRYLOCK.
ALLOC_NOLOCK.
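To illustrate point 3 with the corrected flag name, here is a minimal userspace sketch of the control flow in the hunk below, where the slowpath is skipped whenever ALLOC_NOLOCK is set. The helpers fastpath(), slowpath() and alloc_sketch(), the struct alloc_trace, and the bit value of ALLOC_NOLOCK are hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in bit value, assumed for this sketch only. */
#define ALLOC_NOLOCK 0x400u

/* Records whether the (hypothetical) slowpath was entered. */
struct alloc_trace {
	bool slowpath_called;
};

/* Hypothetical stand-in for the first allocation attempt. */
static void *fastpath(bool succeed)
{
	static int page;

	return succeed ? &page : NULL;
}

/* Hypothetical stand-in for the slowpath; always succeeds here. */
static void *slowpath(struct alloc_trace *t)
{
	static int page;

	t->slowpath_called = true;
	return &page;
}

/* The first attempt is the only attempt when ALLOC_NOLOCK is set. */
static void *alloc_sketch(unsigned int alloc_flags, bool fastpath_ok,
			  struct alloc_trace *t)
{
	void *page = fastpath(fastpath_ok);

	if (page || (alloc_flags & ALLOC_NOLOCK))
		return page;

	return slowpath(t);
}
```

With ALLOC_NOLOCK set and a failing first attempt, alloc_sketch() returns NULL without touching the slowpath; without the flag, the same failure falls through to slowpath().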
>
> Ulterior motive: adding an alloc_flags arg to the allocator's
> mm-internal entrypoint can later be used to do more allocation
> customisation without needing to create new GFP flags.
>
> While adding this flag to a bunch of places, create ALLOC_DEFAULT to
> avoid a mysterious literal 0 in most places.
> alloc_frozen_pages_noprof()
> is defined above the alloc flags so just leave that as a slightly messy
> exception instead of trying to fully reorder mm/internal.h for that one
> case.
This no longer applies in v3?
> No functional change intended.
>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
> mm/hugetlb.c | 3 +-
> mm/mempolicy.c | 10 ++--
> mm/page_alloc.c | 178 +++++++++++++++++++++++++++++---------------------------
> mm/page_alloc.h | 6 +-
> mm/slub.c | 6 +-
> 5 files changed, 108 insertions(+), 95 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f7925624c4d2e..dfcfcfa4715bf 100644
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a3ba63c7f9199..8d409d075e3e9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5222,7 +5222,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> }
> nr_account++;
>
> - prep_new_page(page, 0, gfp, 0);
> + prep_new_page(page, 0, gfp, ALLOC_DEFAULT);
> set_page_refcounted(page);
> page_array[nr_populated++] = page;
> }
> @@ -5271,24 +5271,98 @@ void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
> }
> }
>
> -/*
> - * This is the 'heart' of the zoned buddy allocator.
> - */
> -struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> - int preferred_nid, nodemask_t *nodemask)
> +static inline bool alloc_order_allowed(gfp_t gfp, unsigned int order,
> + unsigned int alloc_flags)
> {
> - struct page *page;
> - unsigned int fastpath_alloc_flags = ALLOC_WMARK_LOW;
> - gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
> - struct alloc_context ac = { };
> + if (alloc_flags & ALLOC_NOLOCK)
> + return pcp_allowed_order(order);
>
> /*
> * There are several places where we assume that the order value is sane
> * so bail out early if the request is out of bound.
> */
> - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> + return !(WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp));
> +}
> +
> +static inline bool alloc_trylock_allowed(void)
alloc_nolock_allowed()
> +{
> + /*
> + * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
> + * unsafe in NMI. If spin_trylock() is called from hard IRQ the current
> + * task may be waiting for one rt_spin_lock, but rt_spin_trylock() will
> + * mark the task as the owner of another rt_spin_lock which will
> + * confuse PI logic, so return immediately if called from hard IRQ or
> + * NMI.
> + *
> + * Note, irqs_disabled() case is ok. This function can be called
> + * from raw_spin_lock_irqsave region.
> + */
> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> + return false;
> +
> + /* On UP, spin_trylock() always succeeds even when it is locked */
> + if (!IS_ENABLED(CONFIG_SMP) && in_nmi())
> + return false;
> +
> + /* Bailout, since _deferred_grow_zone() needs to take a lock */
> + if (deferred_pages_enabled())
> + return false;
> +
> + return true;
> +}
> +
> +/*
> + * GFP flags to set for ALLOC_NOLOCK i.e. alloc_pages_nolock().
> + *
> + * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> + * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> + * is not safe in arbitrary context.
> + *
> + * These two are the conditions for gfpflags_allow_spinning() being true.
> + *
> + * Specify __GFP_NOWARN since failing alloc_pages_nolock() is not a reason
> + * to warn. Also warn would trigger printk() which is unsafe from
> + * various contexts. We cannot use printk_deferred_enter() to mitigate,
> + * since the running context is unknown.
> + *
> + * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> + * is safe in any context. Also zeroing the page is mandatory for
> + * BPF use cases.
> + *
> + * Though __GFP_NOMEMALLOC is not checked in the code path below,
> + * specify it here to highlight that alloc_pages_nolock()
> + * doesn't want to deplete reserves.
> + */
> +static const gfp_t gfp_nolock = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC |
> + __GFP_COMP;
> +
> +/*
> + * This is the 'heart' of the zoned buddy allocator.
> + */
> +struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> + int preferred_nid, nodemask_t *nodemask, unsigned int alloc_flags)
> +{
> + struct page *page;
> + gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
> + struct alloc_context ac = { };
> + unsigned int fastpath_alloc_flags = alloc_flags;
> +
> + /* Other flags could be supported later if needed. */
> + if (WARN_ON(alloc_flags & ~ALLOC_NOLOCK))
> return NULL;
>
> + if (!alloc_order_allowed(gfp, order, alloc_flags))
> + return NULL;
> +
> + if (alloc_flags & ALLOC_NOLOCK) {
> + VM_WARN_ON_ONCE(gfp & ~__GFP_ACCOUNT);
> + if (!alloc_trylock_allowed())
> + return NULL;
> + gfp |= gfp_nolock;
I think we could add
	fastpath_alloc_flags |= ALLOC_WMARK_MIN;
here to make it explicit, even though it's a no-op (the value is 0) and
alloc_frozen_pages_nolock_noprof() didn't do it.
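A minimal userspace sketch of how the suggested branch could look, assuming stand-in values for the flags (only ALLOC_WMARK_MIN being 0 is taken from the comment above; the function name fastpath_flags_sketch() and the ALLOC_NOLOCK bit value are hypothetical):

```c
#include <assert.h>

/* Stand-in values for illustration; the real ones live in mm/page_alloc.h. */
#define ALLOC_WMARK_MIN	0x0u	/* watermark index 0, so OR-ing it sets no bit */
#define ALLOC_WMARK_LOW	0x1u
#define ALLOC_NOLOCK	0x400u	/* hypothetical bit value */

/* The explicit assignment is only harmless while this holds. */
static_assert(ALLOC_WMARK_MIN == 0, "ALLOC_WMARK_MIN is expected to be 0");

/* Hypothetical helper mirroring the if/else in the hunk. */
static unsigned int fastpath_flags_sketch(unsigned int alloc_flags)
{
	unsigned int fastpath_alloc_flags = alloc_flags;

	if (alloc_flags & ALLOC_NOLOCK)
		fastpath_alloc_flags |= ALLOC_WMARK_MIN; /* no-op, documents intent */
	else
		fastpath_alloc_flags |= ALLOC_WMARK_LOW;

	return fastpath_alloc_flags;
}
```

The nolock branch leaves the flags unchanged, so behaviour matches the hunk as posted; the extra line only makes the watermark choice visible next to the ALLOC_WMARK_LOW case.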
> + } else {
> + fastpath_alloc_flags |= ALLOC_WMARK_LOW;
> + }
> +
> gfp &= gfp_allowed_mask;
> /*
> * Apply scoped allocation constraints. This is mainly about GFP_NOFS
> @@ -5310,9 +5384,9 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> fastpath_alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);
> fastpath_alloc_flags |= alloc_flags_nonblocking(gfp, order) & ALLOC_HIGHATOMIC;
>
> - /* First allocation attempt */
> + /* First allocation attempt (or, for nolock, only attempt) */
> page = get_page_from_freelist(alloc_gfp, order, fastpath_alloc_flags, &ac);
> - if (likely(page))
> + if (likely(page) || (alloc_flags & ALLOC_NOLOCK))
> goto out;
>
> alloc_gfp = gfp;
> @@ -5329,7 +5403,8 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> out:
> if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT) && page &&
> unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
> - free_frozen_pages(page, order);
> + __free_frozen_pages(page, order,
> + alloc_flags & ALLOC_NOLOCK ? FPI_TRYLOCK : 0);
> page = NULL;
> }
>
> @@ -5345,7 +5420,8 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> {
> struct page *page;
>
> - page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
> + page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask,
> + ALLOC_DEFAULT);
> if (page)
> set_page_refcounted(page);
> return page;
> @@ -7875,80 +7951,10 @@ static bool __free_unaccepted(struct page *page)
>
> struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
> {
> - /*
> - * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> - * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> - * is not safe in arbitrary context.
> - *
> - * These two are the conditions for gfpflags_allow_spinning() being true.
> - *
> - * Specify __GFP_NOWARN since failing alloc_pages_nolock() is not a reason
> - * to warn. Also warn would trigger printk() which is unsafe from
> - * various contexts. We cannot use printk_deferred_enter() to mitigate,
> - * since the running context is unknown.
> - *
> - * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> - * is safe in any context. Also zeroing the page is mandatory for
> - * BPF use cases.
> - *
> - * Though __GFP_NOMEMALLOC is not checked in the code path below,
> - * specify it here to highlight that alloc_pages_nolock()
> - * doesn't want to deplete reserves.
> - */
> - gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
> - | gfp_flags;
> - unsigned int alloc_flags = ALLOC_NOLOCK;
> - struct alloc_context ac = { };
> - struct page *page;
> -
> - VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
> - /*
> - * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
> - * unsafe in NMI. If spin_trylock() is called from hard IRQ the current
> - * task may be waiting for one rt_spin_lock, but rt_spin_trylock() will
> - * mark the task as the owner of another rt_spin_lock which will
> - * confuse PI logic, so return immediately if called from hard IRQ or
> - * NMI.
> - *
> - * Note, irqs_disabled() case is ok. This function can be called
> - * from raw_spin_lock_irqsave region.
> - */
> - if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> - return NULL;
> -
> - /* On UP, spin_trylock() always succeeds even when it is locked */
> - if (!IS_ENABLED(CONFIG_SMP) && in_nmi())
> - return NULL;
> -
> - if (!pcp_allowed_order(order))
> - return NULL;
> -
> - /* Bailout, since _deferred_grow_zone() needs to take a lock */
> - if (deferred_pages_enabled())
> - return NULL;
> -
> if (nid == NUMA_NO_NODE)
> nid = numa_node_id();
>
> - prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> - &alloc_gfp, &alloc_flags);
> -
> - /*
> - * Best effort allocation from percpu free list.
> - * If it's empty attempt to spin_trylock zone->lock.
> - */
> - page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> -
> - /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> -
> - if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
> - unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
> - __free_frozen_pages(page, order, FPI_TRYLOCK);
> - page = NULL;
> - }
> - trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> - kmsan_alloc_page(page, order, alloc_gfp);
> - return page;
> + return __alloc_frozen_pages_noprof(gfp_flags, order, nid, NULL, ALLOC_NOLOCK);
> }
> /**
> * alloc_pages_nolock - opportunistic reentrant allocation from any context
> diff --git a/mm/page_alloc.h b/mm/page_alloc.h
> index 3250d44f96457..e16f905f859a7 100644
> --- a/mm/page_alloc.h
> +++ b/mm/page_alloc.h
> @@ -11,6 +11,7 @@
> #include <linux/nodemask.h>
> #include <linux/types.h>
>
> +#define ALLOC_DEFAULT 0
> /* The ALLOC_WMARK bits are used as an index to zone->watermark */
> #define ALLOC_WMARK_MIN WMARK_MIN
> #define ALLOC_WMARK_LOW WMARK_LOW
> @@ -219,7 +220,7 @@ extern bool free_pages_prepare(struct page *page, unsigned int order);
> extern int user_min_free_kbytes;
>
> struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order, int nid,
> - nodemask_t *nodemask);
> + nodemask_t *nodemask, unsigned int alloc_flags);
> #define __alloc_frozen_pages(...) \
> alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
> void free_frozen_pages(struct page *page, unsigned int order);
> @@ -230,7 +231,8 @@ struct page *alloc_frozen_pages_noprof(gfp_t, unsigned int order);
> #else
> static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order)
> {
> - return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL);
> + return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL,
> + 0 /* ALLOC_DEFAULT */);
Can use ALLOC_DEFAULT now.
> }
> #endif
>