Linux-mm Archive on lore.kernel.org
From: Johannes Weiner <hannes@cmpxchg.org>
To: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>, Zi Yan <ziy@nvidia.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Joshua Hahn <joshua.hahnjy@gmail.com>,
	Rakie Kim <rakie.kim@sk.com>, Byungchul Park <byungchul@sk.com>,
	Alistair Popple <apopple@nvidia.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Neha Gholkar <nehagholkar@gmail.com>
Subject: Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
Date: Wed, 1 Jul 2026 11:49:16 -0400
Message-ID: <akU2_Mt_JDCxmnc7@cmpxchg.org>
In-Reply-To: <akUzUqQG185XHqgL@gourry-fedora-PF4VCD3F>

On Wed, Jul 01, 2026 at 11:33:38AM -0400, Gregory Price wrote:
> Something I found while seeing if I could make ZONE_NORMAL nodes more
> reliably hotpluggable:
> 
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index dd2717ff94bf..9ceeb56574ef 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -682,7 +682,15 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>  	 * we won't be able to do that under the lock.
>  	 */
>  	if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
> -		page = alloc_pages(gfp_nested_mask(alloc_flags),
> +		/*
> +		 * The stack depot pool is a global, never-freed allocation.
> +		 * Use alloc_pages_node() on the CPU-local node instead of
> +		 * alloc_pages() so the pool does not inherit a transient task's
> +		 * NUMA mempolicy (e.g. MPOL_BIND to a CPU-less/bound node), which
> +		 * would strand this long-lived page on that node forever.
> +		 */
> +		page = alloc_pages_node(numa_node_id(),
> +				   gfp_nested_mask(alloc_flags),
>  				   DEPOT_POOL_ORDER);
>  		if (page)
>  			prealloc = page_address(page);
> 
> This is a global, permanently allocated resource that inherits a task
> mempolicy's placement because that task *happened* to be the first one
> to touch it.
> 
> There are many alloc_pages() calls (155 instances kernel-wide) that
> inherit a task mempolicy when that's probably not what we want.
> 
> alloc_pages() is called in: net/, lib/, kexec_core/, drivers/, arch/
> 
> You can imagine a task calling `set_mempolicy(INTERLEAVE, ALL)`, and the
> result is that a bunch of random driver memory gets spread all over the
> place along with the task's heap.  Is that really what the caller wanted,
> or did they just want userland data spread about?
> 
> But at this point it's a 20-year-old interface, not much we can do about
> it without making *someone* sad :[
> 
> I considered proposing MPOL_F_MOVABLE_ONLY to mean (roughly) "userland
> memory only" - and then slowly trying to migrate numactl to make this
> the default.

Hm. Kernel allocations that are totally incidental like the stackdepot
example above should not follow task policy. But there are kernel
allocations (kernel stack, inodes, pipes) that are directly allocated
on behalf of a process, and so probably SHOULD follow task policy.

That's an annotation problem that I think we have solved already,
because cgroups need the same distinction for what allocations to
charge to the current task's cgroup context.

We could rename __GFP_ACCOUNT / SLAB_ACCOUNT to __GFP_TASKPOLICY /
SLAB_TASKPOLICY or something, and have mempolicy follow it too.

There is still the whole "changing 20-year-old behavior" aspect, but I
think the polarity works in our favor: big important allocations have
already been following the policy correctly. The behavior changes
primarily for smaller, random allocations.
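
To make the proposed polarity concrete, here is a minimal userspace
sketch, not kernel code. It assumes a hypothetical flag,
TOY_GFP_TASKPOLICY, standing in for the renamed __GFP_ACCOUNT idea;
the toy_* types and toy_pick_node() are invented for illustration and
do not exist in the kernel. Allocations carrying the flag follow the
task's policy; everything else lands on the local node.

```c
#include <assert.h>

/* Hypothetical flag, modeled on the __GFP_ACCOUNT rename floated above. */
#define TOY_GFP_TASKPOLICY 0x1u

/* Toy stand-ins for mempolicy modes; not the kernel's MPOL_* values. */
enum toy_policy { TOY_MPOL_DEFAULT, TOY_MPOL_BIND, TOY_MPOL_INTERLEAVE };

struct toy_task {
	enum toy_policy policy;
	int bind_node;          /* target node for TOY_MPOL_BIND */
	unsigned int il_next;   /* round-robin cursor for interleave */
	int nr_nodes;           /* number of nodes in the toy system */
};

/*
 * Pick a node for an allocation made while @task runs on @local_node.
 * Only allocations tagged TOY_GFP_TASKPOLICY consult the task policy.
 */
static int toy_pick_node(struct toy_task *task, unsigned int gfp,
			 int local_node)
{
	assert(task->nr_nodes > 0);
	assert(local_node >= 0 && local_node < task->nr_nodes);

	/* Incidental allocation (e.g. a global pool): ignore the policy. */
	if (!(gfp & TOY_GFP_TASKPOLICY))
		return local_node;

	switch (task->policy) {
	case TOY_MPOL_BIND:
		return task->bind_node;
	case TOY_MPOL_INTERLEAVE:
		return (int)(task->il_next++ % (unsigned int)task->nr_nodes);
	default:
		return local_node;
	}
}
```

With a task bound to node 3 and running on node 1, an untagged
allocation would get node 1 in this model, while a tagged one would
get node 3. Whether the real allocator should key off the accounting
annotation in this way is exactly the open question in this thread.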

