Linux-mm Archive on lore.kernel.org

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 3/3] memcg: bail out proactive reclaim when memcg is dying
From: Jiayuan Chen @ 2026-06-24 14:41 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-mm, yingfu.zhou, Jiayuan Chen, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel
In-Reply-To: <20260624135839.2596358-1-usama.arif@linux.dev>


On 6/24/26 9:58 PM, Usama Arif wrote:
> On Tue, 23 Jun 2026 14:27:56 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
>> From: Jiayuan Chen <jiayuan.chen@shopee.com>
>>
>> Proactive reclaim via memory.reclaim can run for a long time - swap I/O
>> or thrashing again dominating the latency - and delays cgroup removal in
>> the same way.
>>
>> Mitigate this by stopping the reclaim once memcg_is_dying().
>>
>> Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
>> Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
>> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
>> ---
>>   mm/vmscan.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 8190c4abec84..1162b7f76655 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -7922,6 +7922,9 @@ int user_proactive_reclaim(char *buf,
>>   		if (memcg) {
>>   			unsigned int reclaim_options;
>>   
>> +			if (memcg_is_dying(memcg))
>> +				break;
>> +
> This exits the reclaim loop with nr_reclaimed < nr_to_reclaim, but the
> function then returns 0 and memory_reclaim() reports a successful write.
> I think you want to return -EAGAIN here?


You are right that an error should be returned instead of 0.


But since memcg is being deleted, I'm reconsidering the appropriateerror 
code.

-EAGAIN, -ENOENT, -EINTR are possible candidates




^ permalink raw reply

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Usama Arif @ 2026-06-24 14:43 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Usama Arif, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260623180124.868655-5-joshua.hahnjy@gmail.com>

On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Now with all of the memcg_stock handling logic replicated in
> page_counter_stock, switch memcg to use the page_counter_stock for the
> memory (and for cgroup v1 users, memsw) page_counters.
> 
> There are a few details that have changed:
> 
> First, the old special-casing for the !allow_spinning check to avoid
> refilling and flushing of the old stock is removed. This special casing
> was important previously, because refilling the stock could do a lot of
> extra work by evicting one of 7 random victim memcgs in the percpu
> memcg_stock slots. In the new per-counter design, refilling stock just
> adds pages to the counter's own local cache without affecting other memcgs,
> so the original reason for the special case no longer applies.
> 
> Also, we can now fail during page_counter_alloc_stock(), if there is
> not enough memory to allocate a percpu page_counter_stock. This failure
> is rare and nonfatal; the system can continue to operate, with the page
> counter working without stock and falling back to walking the hierarchy.
> 
> drain_all_stock and memcg_hotplug_cpu_dead also now use the page_counter
> stock drain variant, which uses remote atomic_xchg to retrieve stock
> across CPUs, instead of scheduling asynchronous work.
> 
> Finally, as a side-effect of separating the per-memcg stock to per-
> page_counter, the memsw and memory page_counters have independent stock.
> This means that the reported memsw may transiently be lower than memory
> usage if the stock for memory and memsw page_counters go out of sync.
> 
> Note that obj_stock is untouched by this change.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
>  mm/memcontrol.c | 87 +++++++++++++++++++++++--------------------------
>  1 file changed, 41 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 306658fd55512..846800917af49 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2269,39 +2269,36 @@ static void schedule_drain_work(int cpu, struct work_struct *work)
>  		queue_work_on(cpu, memcg_wq, work);
>  }
>  
> +static void memcg_drain_stock(struct mem_cgroup *memcg, int cpu)
> +{
> +	page_counter_drain_stock(&memcg->memory, cpu);
> +	if (do_memsw_account())
> +		page_counter_drain_stock(&memcg->memsw, cpu);
> +}
> +
>  /*
>   * Drains all per-CPU charge caches for given root_memcg resp. subtree
>   * of the hierarchy under it.
>   */
>  void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
> +	struct mem_cgroup *memcg;
>  	int cpu, curcpu;
>  
>  	/* If someone's already draining, avoid adding running more workers. */
>  	if (!mutex_trylock(&percpu_charge_mutex))
>  		return;
> -	/*
> -	 * Notify other cpus that system-wide "drain" is running
> -	 * We do not care about races with the cpu hotplug because cpu down
> -	 * as well as workers from this path always operate on the local
> -	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
> -	 */
> +
> +	for_each_mem_cgroup_tree(memcg, root_memcg) {
> +		for_each_online_cpu(cpu)
> +			memcg_drain_stock(memcg, cpu);
> +	}
> +
>  	migrate_disable();
>  	curcpu = smp_processor_id();
>  	for_each_online_cpu(cpu) {
> -		struct memcg_stock_pcp *memcg_st = &per_cpu(memcg_stock, cpu);
>  		struct obj_stock_pcp *obj_st = &per_cpu(obj_stock, cpu);
>  
> -		if (!test_bit(FLUSHING_CACHED_CHARGE, &memcg_st->flags) &&
> -		    is_memcg_drain_needed(memcg_st, root_memcg) &&
> -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
> -				      &memcg_st->flags)) {
> -			if (cpu == curcpu)
> -				drain_local_memcg_stock(&memcg_st->work);
> -			else
> -				schedule_drain_work(cpu, &memcg_st->work);
> -		}
> -
>  		if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) &&
>  		    obj_stock_flush_required(obj_st, root_memcg) &&
>  		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
> @@ -2318,9 +2315,13 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
>  
>  static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  {
> +	struct mem_cgroup *memcg;
> +
>  	/* no need for the local lock */
>  	drain_obj_stock(&per_cpu(obj_stock, cpu));
> -	drain_stock_fully(&per_cpu(memcg_stock, cpu));
> +
> +	for_each_mem_cgroup(memcg)
> +		memcg_drain_stock(memcg, cpu);
>  
>  	return 0;
>  }
> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  			    unsigned int nr_pages)
>  {
> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
>  	int nr_retries = MAX_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem_over_limit;
>  	struct page_counter *counter;
> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	bool raised_max_event = false;
>  	unsigned long pflags;
>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> +	unsigned long nr_charged = 0;
>  
>  retry:
> -	if (consume_stock(memcg, nr_pages))
> -		return 0;
> -
> -	if (!allow_spinning)
> -		/* Avoid the refill and flush of the older stock */
> -		batch = nr_pages;
> -
>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
>  	if (do_memsw_account() &&
> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
> +					   &counter, NULL)) {
>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
>  		goto reclaim;
>  	}
>  
> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> -		goto done_restock;
> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
> +					  &counter, &nr_charged)) {
> +		if (!nr_charged)
> +			return 0;
> +		goto handle_high;
> +	}
>  
>  	if (do_memsw_account())
> -		page_counter_uncharge(&memcg->memsw, batch);
> +		page_counter_uncharge(&memcg->memsw, nr_pages);

This needs a transactional rollback. page_counter_try_charge_stock() can
succeed by consuming memsw stock and charging 0 new pages, but the
memory-failure path unconditionally uncharges nr_pages from memsw.
That turns a failed allocation into a real memsw usage decrement.


>  	mem_over_limit = mem_cgroup_from_counter(counter, memory);
>  
>  reclaim:
> -	if (batch > nr_pages) {
> -		batch = nr_pages;
> -		goto retry;
> -	}
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2731,10 +2725,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	return 0;
>  
> -done_restock:
> -	if (batch > nr_pages)
> -		refill_stock(memcg, batch - nr_pages);
> -
> +handle_high:
>  	/*
>  	 * If the hierarchy is above the normal consumption range, schedule
>  	 * reclaim on returning to userland.  We can perform reclaim here
> @@ -2771,7 +2762,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  			 * and distribute reclaim work and delay penalties
>  			 * based on how much each task is actually allocating.
>  			 */
> -			current->memcg_nr_pages_over_high += batch;
> +			current->memcg_nr_pages_over_high += nr_charged;
>  			set_notify_resume(current);
>  			break;
>  		}
> @@ -3076,7 +3067,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
>  	account_kmem_nmi_safe(memcg, -nr_pages);
>  	memcg1_account_kmem(memcg, -nr_pages);
>  	if (!mem_cgroup_is_root(memcg))
> -		refill_stock(memcg, nr_pages);
> +		memcg_uncharge(memcg, nr_pages);
>  
>  	css_put(&memcg->css);
>  }
> @@ -4080,6 +4071,8 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>  
>  static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
> +	page_counter_free_stock(&memcg->memory);
> +	page_counter_free_stock(&memcg->memsw);
>  	lru_gen_exit_memcg(memcg);
>  	memcg_wb_domain_exit(memcg);
>  	__mem_cgroup_free(memcg);
> @@ -4247,6 +4240,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	refcount_set(&memcg->id.ref, 1);
>  	css_get(css);
>  
> +	/* failure is nonfatal, charges fall back to direct hierarchy */
> +	page_counter_alloc_stock(&memcg->memory, MEMCG_CHARGE_BATCH);
> +	if (do_memsw_account())
> +		page_counter_alloc_stock(&memcg->memsw, MEMCG_CHARGE_BATCH);
> +
>  	/*
>  	 * Ensure mem_cgroup_from_private_id() works once we're fully online.
>  	 *
> @@ -5502,7 +5500,7 @@ void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages)
>  
>  	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
>  
> -	refill_stock(memcg, nr_pages);
> +	page_counter_uncharge(&memcg->memory, nr_pages);
>  }
>  
>  void mem_cgroup_flush_workqueue(void)
> @@ -5555,12 +5553,9 @@ int __init mem_cgroup_init(void)
>  	memcg_wq = alloc_workqueue("memcg", WQ_PERCPU, 0);
>  	WARN_ON(!memcg_wq);
>  
> -	for_each_possible_cpu(cpu) {
> -		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
> -			  drain_local_memcg_stock);
> +	for_each_possible_cpu(cpu)
>  		INIT_WORK(&per_cpu_ptr(&obj_stock, cpu)->work,
>  			  drain_local_obj_stock);
> -	}
>  
>  	memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids);
>  	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
> -- 
> 2.53.0-Meta
> 
> 


^ permalink raw reply

* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
From: Rik van Riel @ 2026-06-24 14:44 UTC (permalink / raw)
  To: Pratyush Yadav, Breno Leitao
  Cc: nao.horiguchi, linmiaohe, david, lance.yang, akpm, baoquan.he,
	rppt, kexec, linux-mm, rneu, caggio, kas
In-Reply-To: <2vxzse6ckqfg.fsf@kernel.org>

On Wed, 2026-06-24 at 15:40 +0200, Pratyush Yadav wrote:
> 
> Also, what happens on cold reboot? If the HW does not remember bad
> pages, won't the kernel be in the same position? How does it know the
> bad pages on a cold boot?

Some modern server hardware will simply unmap known
bad pages from the physical page map, so they will
not be exposed to the OS after a cold reboot.

The hardware keeps a log of uncorrectable memory
errors somewhere in memory, for example in the SEL.

> 
> 
> > 
> > This PoC
> > ========
> > 
> >   * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
> >     HandOver) to carry the poison list between kernels.
> > 
> >   * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
> >     chokepoint for every poison/unpoison event - and records each
> >     poisoned PFN into a vmalloc array that KHO preserves across the
> >     kexec, described by a small versioned "hwpoison" subtree.
> 
> More of an implementation detail, but with vmalloc array, what if you
> have too many poisoned pages?
> > 

If a very large amount of memory is broken, you
should probably just repair the hardware.

Page poisoning is good for localized memory
failures, but not for failures that extend across
much of a memory chip.

> 
> 
> 
-- 
All Rights Reversed.


^ permalink raw reply

* Re: [PATCH v2] Respect mempolicy when calculating surplus huge pages.
From: Usama Arif @ 2026-06-24 14:45 UTC (permalink / raw)
  To: Charles Haithcock
  Cc: Usama Arif, muchun.song, osalvador, akpm, david, linux-mm,
	linux-kernel, arozansk
In-Reply-To: <20260623184548.1245488-1-chaithco@redhat.com>

On Tue, 23 Jun 2026 12:45:42 -0600 Charles Haithcock <chaithco@redhat.com> wrote:

> Presently, when calculating how many huge pages are needed when
> reserving surplus huge pages, the global count of free huge pages
> are used. When reserving with a mempolicy, the global count of free huge
> pages is used even if some/all of those free huge pages are on numa
> nodes outside of the mempolicy.
> 
> Reserving surplus huge pages is ultimately best effort even without a
> mempolicy. Restrictions from cpusets and mempolicies further complicate
> calculating correct numbers of surplus huge pages to reserve and
> maintaining which nodes those reservations belong to (see the comment in
> `hugetlb_acct_memory`).
> 
> However, we can do a little better when reserving surplus huge pages
> with a mempolicy. This patch changes how to calculate the necessary
> amount of surplus huge pages to reserve by considering the max of either
> the amount of free huge pages on nodes in the mempolicy or the global
> amount of free huge pages. We may still attempt to reserve huge pages
> outside the mempolicy, however, we end up being more likely to reserve
> from nodes in the mempolicy.
> 
> Signed-off-by: Charles Haithcock <chaithco@redhat.com>
> ---
> 
> - v1: Modified `needed` calculation to use `allowed_mems_nr(h)` in order
>   to consider free hugetlb pages in our mempolicy.
> - v2: Folded in Joshua Hahn's recommendation [1] to further modify 
>   `needed` calculation to take the max of either the available hugetlb 
>   pages in the mempolicy or the globally available hugetlb pages. Allows
>   allocations to prioritize nodes in the mempolicy but can still fall
>   back to offnode allocations. Also added selftests to check only for
>   the edgecase which caused this to initially be reported and sanity
>   checks.
> 
> [1] https://lore.kernel.org/all/20260602152022.2673803-1-joshua.hahnjy@gmail.com/
> 
>  mm/hugetlb.c                                  |  42 +-
>  tools/testing/selftests/mm/Makefile           |   3 +
>  .../selftests/mm/hugetlb_surplus_mempolicy.c  | 472 ++++++++++++++++++
>  tools/testing/selftests/mm/run_vmtests.sh     |   1 +
>  4 files changed, 498 insertions(+), 20 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/hugetlb_surplus_mempolicy.c
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24bf49be0..bd97f0f434 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2255,6 +2255,23 @@ static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
>  	return NULL;
>  }
>  
> +static unsigned int allowed_mems_nr(struct hstate *h)
> +{
> +	int node;
> +	unsigned int nr = 0;
> +	nodemask_t *mbind_nodemask;
> +	unsigned int *array = h->free_huge_pages_node;
> +	gfp_t gfp_mask = htlb_alloc_mask(h);
> +
> +	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
> +	for_each_node_mask(node, cpuset_current_mems_allowed) {
> +		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
> +			nr += array[node];
> +	}
> +
> +	return nr;
> +}
> +
>  /*
>   * Increase the hugetlb pool such that it can accommodate a reservation
>   * of size 'delta'.
> @@ -2277,7 +2294,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  		alloc_nodemask = cpuset_current_mems_allowed;
>  
>  	lockdep_assert_held(&hugetlb_lock);
> -	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
> +	needed = max((long) (delta - allowed_mems_nr(h)),
> +		(long) ((h->resv_huge_pages + delta) - h->free_huge_pages));
>  	if (needed <= 0) {
>  		h->resv_huge_pages += delta;
>  		return 0;
> @@ -2311,8 +2329,9 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  	 * because either resv_huge_pages or free_huge_pages may have changed.
>  	 */
>  	spin_lock_irq(&hugetlb_lock);
> -	needed = (h->resv_huge_pages + delta) -
> -			(h->free_huge_pages + allocated);
> +	needed = max((long) ((delta - allowed_mems_nr(h)) - allocated),
> +			(long) ((h->resv_huge_pages + delta) -
> +				(h->free_huge_pages + allocated)));
>  	if (needed > 0) {
>  		if (alloc_ok)
>  			goto retry;
> @@ -4513,23 +4532,6 @@ static int __init hugepage_alloc_threads_setup(char *s)
>  }
>  __setup("hugepage_alloc_threads=", hugepage_alloc_threads_setup);
>  
> -static unsigned int allowed_mems_nr(struct hstate *h)
> -{
> -	int node;
> -	unsigned int nr = 0;
> -	nodemask_t *mbind_nodemask;
> -	unsigned int *array = h->free_huge_pages_node;
> -	gfp_t gfp_mask = htlb_alloc_mask(h);
> -
> -	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
> -	for_each_node_mask(node, cpuset_current_mems_allowed) {
> -		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
> -			nr += array[node];
> -	}
> -
> -	return nr;
> -}
> -
>  void hugetlb_report_meminfo(struct seq_file *m)
>  {
>  	struct hstate *h;
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index cd24596cdd..40de0938f3 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
>  TEST_GEN_FILES += merge
>  TEST_GEN_FILES += rmap
>  TEST_GEN_FILES += folio_split_race_test
> +TEST_GEN_FILES += hugetlb_surplus_mempolicy
>  
>  ifneq ($(ARCH),arm64)
>  TEST_GEN_FILES += soft-dirty
> @@ -260,6 +261,8 @@ $(OUTPUT)/migration: LDLIBS += -lnuma
>  
>  $(OUTPUT)/rmap: LDLIBS += -lnuma
>  
> +$(OUTPUT)/hugetlb_surplus_mempolicy: LDLIBS += -lnuma
> +
>  local_config.mk local_config.h: check_config.sh
>  	CC="$(CC)" CFLAGS="$(CFLAGS)" ./check_config.sh
>  
> diff --git a/tools/testing/selftests/mm/hugetlb_surplus_mempolicy.c b/tools/testing/selftests/mm/hugetlb_surplus_mempolicy.c
> new file mode 100644
> index 0000000000..0a77b01693
> --- /dev/null
> +++ b/tools/testing/selftests/mm/hugetlb_surplus_mempolicy.c
> @@ -0,0 +1,472 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * hugetlb_surplus_mempolicy
> + *
> + * Reserving surplus hugepages within mempolicies is quite tricky due to
> + * the transient nature of cpusets and mempolicies. As such, these tests
> + * do not cover all edge cases, but rather focus on what the kernel can
> + * currently do to reserve surplus hugepages in the presence of cpusets
> + * and mempolicies to help check for regressions in this behavior.
> + */
> +
> +#define _GNU_SOURCE
> +#include <errno.h>
> +#include <numa.h>
> +#include <pthread.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#include "vm_util.h"
> +#include "kselftest.h"
> +
> +#define HPSIZE_BYTES default_huge_page_size()
> +#define HPSIZE_KB default_huge_page_size() >> 10
> +#define GLOBAL_SYS_HP_PATH "/sys/kernel/mm/hugepages/hugepages-%lukB/%s"
> +#define NODE_SYS_HP_PATH "/sys/devices/system/node/node%u/hugepages/hugepages-%lukB/%s"
> +
> +struct bitmask **nodemasks;
> +int *nodeids;
> +
> +pthread_t *threads;
> +struct thread_args {
> +	struct bitmask *my_nodemask;
> +	int to_reserve;
> +};
> +struct thread_args* per_thread_args;
> +pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
> +pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
> +int wake_cond = 0;
> +
> +char *nr_overcommit_hugepages_path;
> +char *g_free_hugepages_path;
> +char *g_nr_hugepages_path;
> +char *g_resv_hugepages_path;
> +char *g_surplus_hugepages_path;
> +char *n0_free_hugepages_path;
> +char *n0_nr_hugepages_path;
> +char *n0_surplus_hugepages_path;
> +char *n1_free_hugepages_path;
> +char *n1_nr_hugepages_path;
> +char *n1_surplus_hugepages_path;
> +
> +unsigned long g_free_hugepages, g_nr_hugepages;
> +unsigned long g_resv_hugepages, g_surplus_hugepages;
> +unsigned long n0_free_hugepages, n0_nr_hugepages, n0_surplus_hugepages;
> +unsigned long int n1_free_hugepages, n1_nr_hugepages, n1_surplus_hugepages;
> +unsigned long int orig_n0_nr_hugepages, orig_n1_nr_hugepages;
> +unsigned long int orig_nr_overcommit_hugepages;
> +
> +
> +/* setup_paths
> + *
> + * Helper function to create strings for the various hugetlb page sysfs
> + * paths. The strings are used to read from and write to the sysfs files.
> + */
> +static void setup_paths(void) {
> +	asprintf(&nr_overcommit_hugepages_path,
> +			"/proc/sys/vm/nr_overcommit_hugepages");
> +	asprintf(&g_free_hugepages_path, GLOBAL_SYS_HP_PATH,
> +			HPSIZE_KB, "free_hugepages");
> +	asprintf(&g_nr_hugepages_path, GLOBAL_SYS_HP_PATH,
> +			HPSIZE_KB, "nr_hugepages");
> +	asprintf(&g_resv_hugepages_path, GLOBAL_SYS_HP_PATH,
> +			HPSIZE_KB, "resv_hugepages");
> +	asprintf(&g_surplus_hugepages_path, GLOBAL_SYS_HP_PATH,
> +			HPSIZE_KB, "surplus_hugepages");
> +	asprintf(&n0_free_hugepages_path, NODE_SYS_HP_PATH, nodeids[0],
> +			HPSIZE_KB, "free_hugepages");
> +	asprintf(&n0_nr_hugepages_path, NODE_SYS_HP_PATH, nodeids[0],
> +			HPSIZE_KB, "nr_hugepages");
> +	asprintf(&n0_surplus_hugepages_path, NODE_SYS_HP_PATH, nodeids[0],
> +			HPSIZE_KB, "surplus_hugepages");
> +	asprintf(&n1_free_hugepages_path, NODE_SYS_HP_PATH, nodeids[1],
> +			HPSIZE_KB, "free_hugepages");
> +	asprintf(&n1_nr_hugepages_path, NODE_SYS_HP_PATH, nodeids[1],
> +			HPSIZE_KB, "nr_hugepages");
> +	asprintf(&n1_surplus_hugepages_path, NODE_SYS_HP_PATH, nodeids[1],
> +			HPSIZE_KB, "surplus_hugepages");
> +}
> +
> +/* get_hugepage_stats
> + *
> + * Helper function to simply grab a bunch of the hugetlb page metrics in sysfs
> + */
> +static void get_hugepage_stats(void) {
> +	read_sysfs(g_free_hugepages_path, &g_free_hugepages);
> +	read_sysfs(g_nr_hugepages_path, &g_nr_hugepages);
> +	read_sysfs(g_resv_hugepages_path, &g_resv_hugepages);
> +	read_sysfs(g_surplus_hugepages_path, &g_surplus_hugepages);
> +	read_sysfs(n0_free_hugepages_path, &n0_free_hugepages);
> +	read_sysfs(n0_nr_hugepages_path, &n0_nr_hugepages);
> +	read_sysfs(n0_surplus_hugepages_path, &n0_surplus_hugepages);
> +	read_sysfs(n1_free_hugepages_path, &n1_free_hugepages);
> +	read_sysfs(n1_nr_hugepages_path, &n1_nr_hugepages);
> +	read_sysfs(n1_surplus_hugepages_path, &n1_surplus_hugepages);
> +}
> +
> +/* save_hugepage_configs
> + *
> + * Helper function to save the current state of the hugepage configs so this
> + * test suite doesn't clobber configs needed for other tests.
> + */
> +static void save_hugepage_configs(void) {
> +	read_sysfs(n0_nr_hugepages_path, &orig_n0_nr_hugepages);
> +	read_sysfs(n1_nr_hugepages_path, &orig_n1_nr_hugepages);
> +	read_sysfs(nr_overcommit_hugepages_path, &orig_nr_overcommit_hugepages);
> +}
> +
> +/* restore_hugepage_configs
> + *
> + * Helper function to restore the state of hugepage configs before this test
> + * was ran.
> + */
> +static void restore_hugepage_configs(void) {
> +	write_sysfs(n0_nr_hugepages_path, orig_n0_nr_hugepages);
> +	write_sysfs(n1_nr_hugepages_path, orig_n1_nr_hugepages);
> +	write_sysfs(nr_overcommit_hugepages_path, orig_nr_overcommit_hugepages);
> +}
> +
> +/* reset_hugepages
> + *
> + * Helper function to reset static hugetlb page reservations to 0.
> + * Used to get back to a clear state between tests.
> + */
> +static void reset_hugepages(void) {
> +	write_sysfs(nr_overcommit_hugepages_path, 0);
> +	write_sysfs(g_nr_hugepages_path, 0);
> +	write_sysfs(n0_nr_hugepages_path, 0);
> +	write_sysfs(n1_nr_hugepages_path, 0);
> +}
> +
> +/* can_run
> + *
> + * Does sanity checking first to make sure the tests can even run.
> + */
> +static void check_requirements(void) {
> +        if (geteuid() != 0)
> +                ksft_exit_skip("Please run the test as root.\n");
> +
> +	if (numa_available() == -1)
> +		ksft_exit_skip("Numa is unavailable.\n");
> +
> +	if (numa_num_configured_nodes() < 2)
> +		ksft_exit_skip("Not enough nodes to test.\n");
> +
> +	if (numa_num_task_nodes() < 2)
> +		ksft_exit_skip("Current mempolicy is too restrictive.\n");
> +}
> +
> +static void cleanup(char* err_msg) {
> +	free(per_thread_args);
> +	free(threads);
> +	free(nodeids);
> +	free(nodemasks);
> +	free(nr_overcommit_hugepages_path);
> +	free(g_free_hugepages_path);
> +	free(g_nr_hugepages_path);
> +	free(g_resv_hugepages_path);
> +	free(g_surplus_hugepages_path);
> +	free(n0_free_hugepages_path);
> +	free(n0_nr_hugepages_path);
> +	free(n0_surplus_hugepages_path);
> +	free(n1_free_hugepages_path);
> +	free(n1_nr_hugepages_path);
> +	free(n1_surplus_hugepages_path);
> +	if (err_msg)
> +		ksft_exit_fail_msg(err_msg);
> +}
> +
> +/* setup_node_info
> + *
> + * Creates the bitmasks used to isolate test runners and their hugetlb page
> + * reservations.
> + */
> +static void setup_node_info(void) {
> +	int i;
> +	int ith_nodemask = 0;
> +
> +	nodeids = calloc(2, sizeof(int));
> +	nodemasks = calloc(2, sizeof(struct bitmask *));
> +
> +	if (!nodemasks || !nodeids)
> +		cleanup("setup_node_info: calloc.");
> +
> +	/* Walk the nodes available to us. Create two bitmasks, one of the
> +	 * index of the first node available to us, and the second of the next
> +	 * node available to us. */
> +	for (i = 0; i < numa_num_task_nodes(); i++) {
> +		if (numa_bitmask_isbitset(numa_get_mems_allowed(), i)) {
> +			nodeids[ith_nodemask] = i;
> +			nodemasks[ith_nodemask++] = numa_bitmask_setbit(
> +					numa_allocate_nodemask(), i);
> +		}
> +	}
> +	if (ith_nodemask != 2 || !nodemasks[0] || !nodemasks[1])
> +		cleanup("Failed to create nodemasks.");
> +}
> +
> +/* setup_threads
> + *
> + * Helper function to setup space for threads.
> + */
> +static void setup_threads(void) {
> +	per_thread_args = calloc(2, sizeof(per_thread_args));

Should we do calloc(2, sizeof(*per_thread_args)) here?

> +	if (!per_thread_args)
> +		cleanup("calloc thread args.");
> +
> +	threads = calloc(2, sizeof(pthread_t));
> +	if (!threads) {
> +		cleanup("calloc threads.");
> +	}
> +}
> +
> +/* reserve_hugepage
> + *
> + * Helper function to reserve a hugetlb page
> + */
> +static unsigned long* reserve_hugepage(void) {
> +	return (unsigned long *) mmap(NULL, HPSIZE_BYTES, PROT_READ | PROT_WRITE,
> +		MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
> +}
> +
> +/* thread_work
> + *
> + * Test runners. Performs the work of reserving and freeing hugetlb pages.
> + */
> +static void *thread_work(void *arg) {
> +	struct thread_args* t_args = (struct thread_args*) arg;
> +	unsigned long **hugepages;
> +	int i;
> +
> +	hugepages = (unsigned long **) calloc(t_args->to_reserve,
> +						sizeof(unsigned long **));
> +
> +	/* Reserve hugetlb pages on my node */
> +	if (t_args->my_nodemask)
> +		numa_bind(t_args->my_nodemask);
> +	for (i = 0; i < t_args->to_reserve; i++) {
> +		hugepages[i] = reserve_hugepage();
> +		/* Tests may purposefully try to overallocate, so just
> +		 * fall through rather than error out*/
> +		if (hugepages[i] == MAP_FAILED) {
> +			t_args->to_reserve = i;
> +			break;
> +		}
> +	}
> +
> +	/* Go to sleep until main thread wakes us up */
> +	pthread_mutex_lock(&mutex);
> +	while(!wake_cond) {
> +		pthread_cond_wait(&cond, &mutex);
> +	}
> +	pthread_mutex_unlock(&mutex);
> +
> +	/* Try to free those hugetlb pages */
> +	for (i = 0; i < t_args->to_reserve; i++) {
> +		if (munmap(hugepages[i], HPSIZE_BYTES) < 0)
> +			ksft_perror("munmap() failed! Check for leaked hugetlb pages!\n");
> +	}
> +	free(hugepages);
> +	return NULL;
> +}
> +
> +/* wake_children
> + *
> + * Helper function to wake children threads.
> + */
> +static void wake_children(void) {
> +	pthread_mutex_lock(&mutex);
> +	wake_cond = 1;
> +	pthread_cond_broadcast(&cond);
> +	pthread_mutex_unlock(&mutex);
> +}
> +
> +/* test1
> + *
> + * Sanity checking, attempt to reserve a surplus hugetlb page anywhere.
> + */
> +static void test1(void) {
> +	reset_hugepages();
> +
> +	write_sysfs(nr_overcommit_hugepages_path, 1);
> +	per_thread_args[0].my_nodemask = NULL;
> +	per_thread_args[0].to_reserve = 1;
> +
> +	pthread_create(&threads[0], NULL, thread_work, &per_thread_args[0]);
> +
> +	usleep(500000);
> +
> +	get_hugepage_stats();
> +	ksft_test_result((g_free_hugepages == 1 && g_nr_hugepages == 1 &&
> +			 g_resv_hugepages == 1 && g_surplus_hugepages == 1) &&
> +			 ((n0_free_hugepages == 1 && n0_nr_hugepages == 1 &&
> +			 n0_surplus_hugepages == 1 && n1_free_hugepages == 0 &&
> +			 n1_nr_hugepages == 0 && n1_surplus_hugepages == 0) ||
> +			 (n0_free_hugepages == 0 && n0_nr_hugepages == 0 &&
> +			 n0_surplus_hugepages == 0 && n1_free_hugepages == 1 &&
> +			 n1_nr_hugepages == 1 && n1_surplus_hugepages == 1)),
> +			 "Reserve 1 surplus hugepage anywhere\n");
> +
> +	wake_children();
> +	pthread_join(threads[0], NULL);
> +	wake_cond = 0;
> +	reset_hugepages();
> +}
> +
> +/* test2
> + *
> + * Sanity checking, attempt to reserve a surplus hugetlb page with
> + * a mempolicy.
> + */
> +static void test2(void) {
> +	reset_hugepages();
> +
> +	write_sysfs(nr_overcommit_hugepages_path, 1);
> +	per_thread_args[0].my_nodemask = nodemasks[0];
> +	per_thread_args[0].to_reserve = 1;
> +
> +	pthread_create(&threads[0], NULL, thread_work, &per_thread_args[0]);
> +
> +	usleep(500000);
> +
> +	get_hugepage_stats();
> +	ksft_test_result(g_free_hugepages == 1 && g_nr_hugepages == 1 &&
> +			 g_resv_hugepages == 1 && g_surplus_hugepages == 1 &&
> +			 n0_free_hugepages == 1 && n0_nr_hugepages == 1 &&
> +			 n0_surplus_hugepages == 1 && n1_free_hugepages == 0 &&
> +			 n1_nr_hugepages == 0 && n1_surplus_hugepages == 0,
> +			 "Reserve 1 surplus hugepage on node0\n");
> +
> +	wake_children();
> +	pthread_join(threads[0], NULL);
> +	wake_cond = 0;
> +	reset_hugepages();
> +}
> +
> +/* test3
> + *
> + * Set a static hugepage and reserve off node
> + */
> +static void test3(void) {
> +	reset_hugepages();
> +
> +	write_sysfs(nr_overcommit_hugepages_path, 1);
> +	write_sysfs(n0_nr_hugepages_path, 1);
> +
> +	per_thread_args[0].my_nodemask = nodemasks[0];
> +	per_thread_args[0].to_reserve = 0;
> +	per_thread_args[1].my_nodemask = nodemasks[1];
> +	per_thread_args[1].to_reserve = 1;
> +
> +	pthread_create(&threads[0], NULL, thread_work, &per_thread_args[0]);
> +	pthread_create(&threads[1], NULL, thread_work, &per_thread_args[1]);
> +
> +	usleep(500000);
> +
> +	get_hugepage_stats();
> +	ksft_test_result(g_free_hugepages == 2 && g_nr_hugepages == 2 &&
> +			 g_resv_hugepages == 1 && g_surplus_hugepages == 1 &&
> +			 n0_free_hugepages == 1 && n0_nr_hugepages == 1 &&
> +			 n0_surplus_hugepages == 0 && n1_free_hugepages == 1 &&
> +			 n1_nr_hugepages == 1 && n1_surplus_hugepages == 1,
> +			 "Set 1 static hugepage on node0, reserve surplus hugepage on node 1\n");
> +
> +	wake_children();
> +	pthread_join(threads[0], NULL);
> +	pthread_join(threads[1], NULL);
> +	wake_cond = 0;
> +	reset_hugepages();
> +}
> +
> +/* test4
> + *
> + * Reserve static hugepage on node0, reserve surplus hugepage on node1
> + */
> +static void test4(void) {
> +	reset_hugepages();
> +
> +	write_sysfs(nr_overcommit_hugepages_path, 1);
> +	write_sysfs(n0_nr_hugepages_path, 1);
> +
> +	per_thread_args[0].my_nodemask = nodemasks[0];
> +	per_thread_args[0].to_reserve = 1;
> +	per_thread_args[1].my_nodemask = nodemasks[1];
> +	per_thread_args[1].to_reserve = 1;
> +
> +	pthread_create(&threads[0], NULL, thread_work, &per_thread_args[0]);
> +	pthread_create(&threads[1], NULL, thread_work, &per_thread_args[1]);
> +
> +	usleep(500000);
> +
> +	get_hugepage_stats();
> +	ksft_test_result(g_free_hugepages == 2 && g_nr_hugepages == 2 &&
> +			 g_resv_hugepages == 2 && g_surplus_hugepages == 1 &&
> +			 n0_free_hugepages == 1 && n0_nr_hugepages == 1 &&
> +			 n0_surplus_hugepages == 0 && n1_free_hugepages == 1 &&
> +			 n1_nr_hugepages == 1 && n1_surplus_hugepages == 1,
> +			 "Reserve 1 static hugepage on node0, reserve surplus hugepage on node 1\n");
> +
> +	wake_children();
> +	pthread_join(threads[0], NULL);
> +	pthread_join(threads[1], NULL);
> +	wake_cond = 0;
> +	reset_hugepages();
> +}
> +
> +/* test5
> + *
> + * Reserve static hugepage on node0, reserve surplus hugepage on node1 and
> + * fail to over allocate another.
> + */
> +static void test5(void) {
> +	reset_hugepages();
> +
> +	write_sysfs(nr_overcommit_hugepages_path, 1);
> +	write_sysfs(n0_nr_hugepages_path, 1);
> +
> +	per_thread_args[0].my_nodemask = nodemasks[0];
> +	per_thread_args[0].to_reserve = 1;
> +	per_thread_args[1].my_nodemask = nodemasks[1];
> +	per_thread_args[1].to_reserve = 2;
> +
> +	pthread_create(&threads[0], NULL, thread_work, &per_thread_args[0]);
> +	pthread_create(&threads[1], NULL, thread_work, &per_thread_args[1]);
> +
> +	usleep(500000);
> +
> +	get_hugepage_stats();
> +	ksft_test_result(g_free_hugepages == 2 && g_nr_hugepages == 2 &&
> +			 g_resv_hugepages == 2 && g_surplus_hugepages == 1 &&
> +			 n0_free_hugepages == 1 && n0_nr_hugepages == 1 &&
> +			 n0_surplus_hugepages == 0 && n1_free_hugepages == 1 &&
> +			 n1_nr_hugepages == 1 && n1_surplus_hugepages == 1,
> +			 "Intentionally overallocate and fail due to nr_overcommit_hugepages limit.\n");
> +
> +	wake_children();
> +	pthread_join(threads[0], NULL);
> +	pthread_join(threads[1], NULL);
> +	wake_cond = 0;
> +	reset_hugepages();
> +
> +}
> +
> +int main(void) {
> +	ksft_print_header();
> +	ksft_set_plan(5);
> +
> +	check_requirements();
> +	setup_threads();
> +	setup_node_info();
> +	setup_paths();
> +	save_hugepage_configs();
> +
> +	test1();
> +	test2();
> +	test3();
> +	test4();
> +	test5();
> +
> +	restore_hugepage_configs();
> +	ksft_finished();
> +}
> diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
> index c17b133a81..cd368ce590 100755
> --- a/tools/testing/selftests/mm/run_vmtests.sh
> +++ b/tools/testing/selftests/mm/run_vmtests.sh
> @@ -297,6 +297,7 @@ CATEGORY="hugetlb" run_test ./hugepage-mremap
>  CATEGORY="hugetlb" run_test ./hugepage-vmemmap
>  CATEGORY="hugetlb" run_test ./hugetlb-madvise
>  CATEGORY="hugetlb" run_test ./hugetlb_dio
> +CATEGORY="hugetlb" run_test ./hugetlb_surplus_mempolicy
>  
>  if [ "${HAVE_HUGEPAGES}" = "1" ]; then
>  	nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
> -- 
> 2.54.0
> 
> 


^ permalink raw reply

* Re: [RFC PATCH v2 1/3] mm/huge_memory: make persistent huge zero folio read-only
From: Xueyuan Chen @ 2026-06-24 14:49 UTC (permalink / raw)
  To: david
  Cc: xueyuan.chen21, dave.hansen, akpm, linux-mm, linux-kernel,
	linux-arm-kernel, x86, catalin.marinas, will, tglx, mingo, bp,
	dave.hansen, luto, peterz, hpa, ljs, liam, vbabka, rppt, surenb,
	mhocko, ziy, baolin.wang, npache, ryan.roberts, dev.jain, baohua,
	lance.yang, yang, jannh
In-Reply-To: <f679fc68-5ff6-4715-9a07-9eb3074b2d1e@kernel.org>


On Fri, Jun 19, 2026 at 01:09:09PM +0200, David Hildenbrand (Arm) wrote:

Hi, David

[...]

>Having a new direct-map specific function with clear semantics might indeed
>avoid even messing with that.
>
>So, yeah, given that we have
>
>	set_direct_map_invalid_noflush
>	set_direct_map_default_noflush
>	set_direct_map_valid_noflush
>
>Let's add a
>
>	set_direct_map_ro()
>
>Or (my preference)
>
>	change_direct_map_ro()
>
>But given the existing naming scheme ... maybe just set_direct_map_ro() and
>we'll clean this up another day.
>
>
>Now, should there also be a "_noflush" in there, or who is supposed to flush the
>TLB (or don't we flush at all, because it's used early during boot so far)?
>
>In any case, for this function we should add excessive documentation and define
>clear semantics.
>

After talking with Lance, we're leaning toward a direct-map helper:

  set_direct_map_ro_noflush(const void *addr, unsigned long nr_pages)

Contract should be pretty explicit: direct/linear map only, no alias
handling, no TLB flush. If a caller needs stale entries gone, it has
to flush itself.

Huge zero folio only uses this early during boot so far, so noflush
should be OK. Will document that.

Thanks,
Xueyuan


^ permalink raw reply

* [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple

The dax kmem driver onlines memory during probe using the system
default policy, with no atomic control for the state of an entire
region at runtime - only by toggling individual memory blocks.

Offlining and removing a whole region therefore races with other
userland controllers that interfere between the offline and remove
steps. This was discussed in the LPC2025 device memory sessions [1].

This series adds a sysfs "state" attribute for atomic whole-device
hotplug control, plus the mm and dax plumbing to support it.

Transitions are atomic across every range of the device. The state
names mirror the per-block memoryX/state ABI with one modification:

  - "unplugged":      memory blocks are not present
  - "online":         online as system RAM, zone chosen by the kernel
  - "online_kernel":  online in ZONE_NORMAL
  - "online_movable": online in ZONE_MOVABLE

"offline" (blocks present but offline) is reportable for backward
compatibility but is not writable because it entices the race condition
we are trying to solve (offlining all the memory blocks in one atomic
and unplugging them in another atomic).

mm preparation:
  1. mm/memory: add memory_block_aligned_range() helper.
  2. mm/memory_hotplug: pass online_type to online_memory_block().
  3. mm/memory_hotplug: export mhp_get_default_online_type().
  4. mm/memory_hotplug: add __add_memory_driver_managed() so a driver can
     select the online policy.  The override is restricted to in-tree
     modules via EXPORT_SYMBOL_FOR_MODULES().
  5. mm/memory_hotplug: add offline_and_remove_memory_ranges() for atomic,
     all-or-nothing offline+remove of several ranges under a single
     lock_device_hotplug().

dax/kmem feature:
  6. Plumb online_type through the dax device creation path.
  7. Extract hotplug/hotremove into helper functions.
  8. Add the "hotplug" sysfs attribute.
  9. selftests/dax: regression test for the attribute.

DAX Kmem probe still creates the memory blocks by default, even when
the default policy is "offline" to preserve backwards compatibility.

Unplug (atomic offline+remove of the whole device) is the new
capability provided by the attribute.

I downgraded a BUG() to a WARN() when unbind is called while the device
is not unplugged.  The old per-block toggling pattern is still used by
userland tools and disconnects the 'hotplug' value from the real region
state; until per-block control is deprecated or restricted in some way,
WARN() flags that tools should move to the new atomic pattern.

Changes since v4:
  - renamed 'dax/hotplug' -> 'dax/state'
  - refactored the work into a shared offline_and_remove_memory_ranges
  - reworked MMOP_ helpers to re-use code
  - fixed cached system default online_type regression
  - nits

Gregory Price (9):
  mm/memory: add memory_block_aligned_range() helper
  mm/memory_hotplug: pass online_type to online_memory_block() via arg
  mm/memory_hotplug: export mhp_get_default_online_type
  mm/memory_hotplug: add __add_memory_driver_managed() with online_type
    arg
  mm/memory_hotplug: add offline_and_remove_memory_ranges()
  dax: plumb hotplug online_type through dax
  dax/kmem: extract hotplug/hotremove helper functions
  dax/kmem: add sysfs interface for atomic whole-device hotplug
  selftests/dax: add dax/kmem hotplug sysfs regression test

 Documentation/ABI/testing/sysfs-bus-dax       |  26 +
 drivers/base/memory.c                         |   9 +
 drivers/dax/bus.c                             |   3 +
 drivers/dax/bus.h                             |   9 +
 drivers/dax/cxl.c                             |   1 +
 drivers/dax/dax-private.h                     |   4 +
 drivers/dax/hmem/hmem.c                       |   1 +
 drivers/dax/kmem.c                            | 475 ++++++++++++++----
 drivers/dax/pmem.c                            |   1 +
 include/linux/memory.h                        |  22 +
 include/linux/memory_hotplug.h                |  13 +
 mm/memory_hotplug.c                           | 162 ++++--
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/dax/Makefile          |   6 +
 tools/testing/selftests/dax/config            |   4 +
 .../testing/selftests/dax/dax-kmem-hotplug.sh | 207 ++++++++
 tools/testing/selftests/dax/settings          |   1 +
 17 files changed, 806 insertions(+), 139 deletions(-)
 create mode 100644 tools/testing/selftests/dax/Makefile
 create mode 100644 tools/testing/selftests/dax/config
 create mode 100755 tools/testing/selftests/dax/dax-kmem-hotplug.sh
 create mode 100644 tools/testing/selftests/dax/settings

-- 
2.54.0

^ permalink raw reply

* [PATCH v5 1/9] mm/memory: add memory_block_aligned_range() helper
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

Memory hotplug operations require ranges aligned to memory block
boundaries.  This is a generic operation for hotplug.

Add memory_block_aligned_range() as a common helper in <linux/memory.h>
that aligns the start address up and end address down to memory block
boundaries.

Update dax/kmem to use this helper.

Signed-off-by: Gregory Price <gourry@gourry.net>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
 drivers/dax/kmem.c     |  4 +---
 include/linux/memory.h | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a18e2b968e4d..592171ec10f4 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -33,9 +33,7 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
 	struct dev_dax_range *dax_range = &dev_dax->ranges[i];
 	struct range *range = &dax_range->range;
 
-	/* memory-block align the hotplug range */
-	r->start = ALIGN(range->start, memory_block_size_bytes());
-	r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
+	*r = memory_block_aligned_range(range);
 	if (r->start >= r->end) {
 		r->start = range->start;
 		r->end = range->end;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 463dc02f6cff..9f5ef0309f77 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -20,6 +20,7 @@
 #include <linux/compiler.h>
 #include <linux/mutex.h>
 #include <linux/memory_hotplug.h>
+#include <linux/range.h>
 
 #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
 
@@ -100,6 +101,27 @@ int arch_get_memory_phys_device(unsigned long start_pfn);
 unsigned long memory_block_size_bytes(void);
 int set_memory_block_size_order(unsigned int order);
 
+/**
+ * memory_block_aligned_range - align a physical address range to memory blocks
+ * @range: the input range to align
+ *
+ * Aligns the start address up and the end address down to memory block
+ * boundaries. This is required for memory hotplug operations which must
+ * operate on memory-block aligned ranges.
+ *
+ * Returns the aligned range. Callers should check that the returned
+ * range is valid (aligned.start < aligned.end) before using it.
+ */
+static inline struct range memory_block_aligned_range(const struct range *range)
+{
+	struct range aligned;
+
+	aligned.start = ALIGN(range->start, memory_block_size_bytes());
+	aligned.end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
+
+	return aligned;
+}
+
 struct memory_notify {
 	unsigned long start_pfn;
 	unsigned long nr_pages;
-- 
2.54.0



^ permalink raw reply related

* [PATCH v5 2/9] mm/memory_hotplug: pass online_type to online_memory_block() via arg
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

Modify online_memory_block() to accept the online type through its arg
parameter rather than calling mhp_get_default_online_type() internally.

This prepares for allowing callers to specify explicit online types.

Update the caller in add_memory_resource() to pass the default online
type via a local variable.

No functional change.

Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 mm/memory_hotplug.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7ac19fab2263..6833208cc17c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1337,7 +1337,9 @@ static int check_hotplug_memory_range(u64 start, u64 size)
 
 static int online_memory_block(struct memory_block *mem, void *arg)
 {
-	mem->online_type = mhp_get_default_online_type();
+	enum mmop *online_type = arg;
+
+	mem->online_type = *online_type;
 	return device_online(&mem->dev);
 }
 
@@ -1494,6 +1496,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
 int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	enum mmop online_type = mhp_get_default_online_type();
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
 	struct memory_group *group = NULL;
 	u64 start, size;
@@ -1582,7 +1585,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
 	/* online pages if requested */
 	if (mhp_get_default_online_type() != MMOP_OFFLINE)
-		walk_memory_blocks(start, size, NULL, online_memory_block);
+		walk_memory_blocks(start, size, &online_type,
+				   online_memory_block);
 
 	return ret;
 error:
-- 
2.54.0



^ permalink raw reply related

* [PATCH v5 3/9] mm/memory_hotplug: export mhp_get_default_online_type
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

Drivers which may pass hotplug policy down to DAX need MMOP_ symbols
and the mhp_get_default_online_type function for hotplug use cases.

Some drivers (cxl) co-mingle their hotplug and devdax use-cases into
the same driver code, and chose the dax_kmem path as the default driver
path - making it difficult to require hotplug as a predicate to building
the overall driver (it may break other non-hotplug use-cases).

Export mhp_get_default_online_type function to allow these drivers to
build when hotplug is disabled and still use the DAX use case.

In the built-out case we simply return MMOP_OFFLINE as it's
non-destructive.  The internal function can never return -1 either,
so we choose this to allow for defining the function with 'enum mmop'.

Signed-off-by: Gregory Price <gourry@gourry.net>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
 include/linux/memory_hotplug.h | 2 ++
 mm/memory_hotplug.c            | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7c9d66729c60..f059025f8f8b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -316,6 +316,8 @@ extern struct zone *zone_for_pfn_range(enum mmop online_type,
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
+#else
+static inline enum mmop mhp_get_default_online_type(void) { return MMOP_OFFLINE; }
 #endif /* CONFIG_MEMORY_HOTPLUG */

 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6833208cc17c..494257054095 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -239,6 +239,7 @@ enum mmop mhp_get_default_online_type(void)

 	return mhp_default_online_type;
 }
+EXPORT_SYMBOL_GPL(mhp_get_default_online_type);

 void mhp_set_default_online_type(enum mmop online_type)
 {
-- 
2.54.0

^ permalink raw reply related

* [PATCH v5 4/9] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

Existing callers of add_memory_driver_managed cannot select the
preferred online type (ZONE_NORMAL vs ZONE_MOVABLE), requiring it to
hot-add memory as offline blocks, and then follow up by onlining each
memory block individually.

Most drivers prefer the system default, but the CXL driver wants to
plumb a preferred policy through the dax kmem driver.

Refactor APIs to add a new interface which allows the dax kmem module
to select a preferred policy.

Overriding the configured auto-online policy is only safe for known
in-tree modules, where we know the override reflects a different,
user-requested policy.  We do not want arbitrary out-of-tree drivers
silently overriding the system-wide onlining policy, so restrict the
new interface to the kmem module using EXPORT_SYMBOL_FOR_MODULES()
rather than a plain EXPORT_SYMBOL_GPL().  Other in-tree modules (e.g.
cxl_core) can be added to the allowed list as the need arises.

Refactor add_memory_driver_managed, extract __add_memory_driver_managed
- Add proper kernel-doc for add_memory_driver_managed while refactoring
- New helper accepts an explicit online_type.
- New helper validates online_type is between OFFLINE and ONLINE_MOVABLE

Refactor: add_memory_resource, extract __add_memory_resource
- new helper accepts an explicit online_type

Original APIs now explicitly pass the system-default to new helpers.

No functional change for existing users.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/memory_hotplug.h |  3 ++
 mm/memory_hotplug.c            | 61 +++++++++++++++++++++++++++++-----
 2 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f059025f8f8b..d3edeb80aadb 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -294,6 +294,9 @@ extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
 extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
 extern int add_memory_resource(int nid, struct resource *resource,
 			       mhp_t mhp_flags);
+int __add_memory_driver_managed(int nid, u64 start, u64 size,
+				const char *resource_name, mhp_t mhp_flags,
+				enum mmop online_type);
 extern int add_memory_driver_managed(int nid, u64 start, u64 size,
 				     const char *resource_name,
 				     mhp_t mhp_flags);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 494257054095..a66346def504 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1494,10 +1494,10 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
  *
  * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
  */
-int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
+static int __add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags,
+				 enum mmop online_type)
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
-	enum mmop online_type = mhp_get_default_online_type();
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
 	struct memory_group *group = NULL;
 	u64 start, size;
@@ -1585,7 +1585,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		merge_system_ram_resource(res);
 
 	/* online pages if requested */
-	if (mhp_get_default_online_type() != MMOP_OFFLINE)
+	if (online_type != MMOP_OFFLINE)
 		walk_memory_blocks(start, size, &online_type,
 				   online_memory_block);
 
@@ -1603,7 +1603,13 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	return ret;
 }
 
-/* requires device_hotplug_lock, see add_memory_resource() */
+int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
+{
+	return __add_memory_resource(nid, res, mhp_flags,
+				     mhp_get_default_online_type());
+}
+
+/* requires device_hotplug_lock, see __add_memory_resource() */
 int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
 {
 	struct resource *res;
@@ -1631,7 +1637,15 @@ int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
 }
 EXPORT_SYMBOL_GPL(add_memory);
 
-/*
+/**
+ * __add_memory_driver_managed - add driver-managed memory with explicit online_type
+ * @nid: NUMA node ID where the memory will be added
+ * @start: Start physical address of the memory range
+ * @size: Size of the memory range in bytes
+ * @resource_name: Resource name in format "System RAM ($DRIVER)"
+ * @mhp_flags: Memory hotplug flags
+ * @online_type: Auto-Online behavior (offline, online, kernel, movable)
+ *
  * Add special, driver-managed memory to the system as system RAM. Such
  * memory is not exposed via the raw firmware-provided memmap as system
  * RAM, instead, it is detected and added by a driver - during cold boot,
@@ -1639,6 +1653,7 @@ EXPORT_SYMBOL_GPL(add_memory);
  *
  * Reasons why this memory should not be used for the initial memmap of a
  * kexec kernel or for placing kexec images:
+ *
  * - The booting kernel is in charge of determining how this memory will be
  *   used (e.g., use persistent memory as system RAM)
  * - Coordination with a hypervisor is required before this memory
@@ -1651,9 +1666,12 @@ EXPORT_SYMBOL_GPL(add_memory);
  *
  * The resource_name (visible via /proc/iomem) has to have the format
  * "System RAM ($DRIVER)".
+ *
+ * Return: 0 on success, negative error code on failure.
  */
-int add_memory_driver_managed(int nid, u64 start, u64 size,
-			      const char *resource_name, mhp_t mhp_flags)
+int __add_memory_driver_managed(int nid, u64 start, u64 size,
+		const char *resource_name, mhp_t mhp_flags,
+		enum mmop online_type)
 {
 	struct resource *res;
 	int rc;
@@ -1663,6 +1681,9 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
 	    resource_name[strlen(resource_name) - 1] != ')')
 		return -EINVAL;
 
+	if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
+		return -EINVAL;
+
 	lock_device_hotplug();
 
 	res = register_memory_resource(start, size, resource_name);
@@ -1671,7 +1692,7 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
 		goto out_unlock;
 	}
 
-	rc = add_memory_resource(nid, res, mhp_flags);
+	rc = __add_memory_resource(nid, res, mhp_flags, online_type);
 	if (rc < 0)
 		release_memory_resource(res);
 
@@ -1679,6 +1700,30 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
 	unlock_device_hotplug();
 	return rc;
 }
+EXPORT_SYMBOL_FOR_MODULES(__add_memory_driver_managed, "kmem");
+
+/**
+ * add_memory_driver_managed - add driver-managed memory
+ * @nid: NUMA node ID where the memory will be added
+ * @start: Start physical address of the memory range
+ * @size: Size of the memory range in bytes
+ * @resource_name: Resource name in format "System RAM ($DRIVER)"
+ * @mhp_flags: Memory hotplug flags
+ *
+ * Add driver-managed memory with the system default online type set by
+ * build config or kernel boot parameter.
+ *
+ * See __add_memory_driver_managed for more details.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int add_memory_driver_managed(int nid, u64 start, u64 size,
+			      const char *resource_name, mhp_t mhp_flags)
+{
+	return __add_memory_driver_managed(nid, start, size, resource_name,
+			mhp_flags,
+			mhp_get_default_online_type());
+}
 EXPORT_SYMBOL_GPL(add_memory_driver_managed);
 
 /*
-- 
2.54.0



^ permalink raw reply related

* [PATCH v5 5/9] mm/memory_hotplug: offline_and_remove_memory_ranges()
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

offline_and_remove_memory() handles a single contiguous range.

Callers that manage a device composed of several ranges (dax/kmem)
currently have to call it in a loop, which gives up atomicity.

In addition to pushing rollback logic into the driver, the lack
of atomicity creates a race condition between system daemons trying
to manage the same resource:

   - Manager 1:  Offlines memory blocks.    Removes device.
                                        ^^^^
   - Manager 2:  Detects offline memory blocks, re-onlines them.

Add offline_and_remove_memory_ranges(), which takes an array of ranges
and processes them as one operation under a single lock_device_hotplug():

  - Phase 1 offlines every block of every range.
  - Phase 2 removes the ranges only if all ranges are offline.
  - If any offline fails, the whole operation is reverted.

This gives callers all-or-nothing semantics for the offline step, so a
failed or interrupted unplug leaves the device in a consistent state.

This also resolves the battling managers race - the second manager's
operation simply fails when the block is destroyed / cannot be onlined.

offline_and_remove_memory() becomes a thin wrapper that passes its single
range to the new helper, so the offline/rollback logic lives in one place.

Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/memory_hotplug.h |  7 +++
 mm/memory_hotplug.c            | 94 ++++++++++++++++++++++++----------
 2 files changed, 74 insertions(+), 27 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d3edeb80aadb..7f1da7c428dc 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -267,6 +267,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 extern int remove_memory(u64 start, u64 size);
 extern void __remove_memory(u64 start, u64 size);
 extern int offline_and_remove_memory(u64 start, u64 size);
+int offline_and_remove_memory_ranges(const struct range *ranges, int nr_ranges);
 
 #else
 static inline void try_offline_node(int nid) {}
@@ -283,6 +284,12 @@ static inline int remove_memory(u64 start, u64 size)
 }
 
 static inline void __remove_memory(u64 start, u64 size) {}
+
+static inline int offline_and_remove_memory_ranges(const struct range *ranges,
+						   int nr_ranges)
+{
+	return -EBUSY;
+}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a66346def504..7d56e0c6ede0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2429,58 +2429,98 @@ static int try_reonline_memory_block(struct memory_block *mem, void *arg)
  */
 int offline_and_remove_memory(u64 start, u64 size)
 {
-	const unsigned long mb_count = size / memory_block_size_bytes();
+	struct range range = { .start = start, .end = start + size - 1 };
+
+	return offline_and_remove_memory_ranges(&range, 1);
+}
+EXPORT_SYMBOL_GPL(offline_and_remove_memory);
+
+/**
+ * offline_and_remove_memory_ranges - offline and remove multiple memory ranges
+ * @ranges: array of physical address ranges to offline and remove
+ * @nr_ranges: number of entries in @ranges
+ *
+ * Offline and remove several memory ranges as one operation, serialized
+ * against other hotplug operations by a single lock_device_hotplug().
+ *
+ * This offlines all ranges before removing any of them.  If offlining any
+ * range fails, the entire process is reverted and nothing is removed.
+ * This provides a fully atomic semantic for unplugging an entire device.
+ *
+ * Each range must be memory-block aligned in start and size.
+ *
+ * Return: 0 on success, negative errno otherwise.  On failure no range has
+ * been removed.
+ */
+int offline_and_remove_memory_ranges(const struct range *ranges, int nr_ranges)
+{
+	unsigned long mb_total = 0;
 	uint8_t *online_types, *tmp;
-	int rc;
+	int i, rc = 0;
 
-	if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
-	    !IS_ALIGNED(size, memory_block_size_bytes()) || !size)
+	if (!ranges || nr_ranges <= 0)
 		return -EINVAL;
 
+	for (i = 0; i < nr_ranges; i++) {
+		u64 start = ranges[i].start;
+		u64 size = range_len(&ranges[i]);
+
+		if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
+		    !IS_ALIGNED(size, memory_block_size_bytes()) || !size)
+			return -EINVAL;
+		mb_total += size / memory_block_size_bytes();
+	}
+
 	/*
-	 * We'll remember the old online type of each memory block, so we can
-	 * try to revert whatever we did when offlining one memory block fails
-	 * after offlining some others succeeded.
+	 * Remember the old online type of every memory block across all ranges,
+	 * so we can revert if offlining a later block fails.  All entries start
+	 * as MMOP_OFFLINE so blocks we never touched are skipped on rollback.
 	 */
-	online_types = kmalloc_array(mb_count, sizeof(*online_types),
+	online_types = kmalloc_array(mb_total, sizeof(*online_types),
 				     GFP_KERNEL);
 	if (!online_types)
 		return -ENOMEM;
-	/*
-	 * Initialize all states to MMOP_OFFLINE, so when we abort processing in
-	 * try_offline_memory_block(), we'll skip all unprocessed blocks in
-	 * try_reonline_memory_block().
-	 */
-	memset(online_types, MMOP_OFFLINE, mb_count);
+	memset(online_types, MMOP_OFFLINE, mb_total);
 
 	lock_device_hotplug();
 
+	/* Phase 1: offline every block in every range. */
 	tmp = online_types;
-	rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block);
+	for (i = 0; i < nr_ranges; i++) {
+		rc = walk_memory_blocks(ranges[i].start, range_len(&ranges[i]),
+					&tmp, try_offline_memory_block);
+		if (rc)
+			break;
+	}
 
 	/*
-	 * In case we succeeded to offline all memory, remove it.
-	 * This cannot fail as it cannot get onlined in the meantime.
+	 * Phase 2: Remove each range. This essentially cannot fail as we hold
+	 * the hotplug lock . WARN if that assumption is ever broken.
 	 */
 	if (!rc) {
-		rc = try_remove_memory(start, size);
-		if (rc)
-			pr_err("%s: Failed to remove memory: %d", __func__, rc);
+		for (i = 0; i < nr_ranges; i++) {
+			rc = try_remove_memory(ranges[i].start,
+					       range_len(&ranges[i]));
+			if (WARN_ON_ONCE(rc)) {
+				pr_err("%s: Failed to remove memory: %d",
+				       __func__, rc);
+				break;
+			}
+		}
 	}
 
-	/*
-	 * Rollback what we did. While memory onlining might theoretically fail
-	 * (nacked by a notifier), it barely ever happens.
-	 */
+	/* On fail: roll back. Blocks that were already offline are skipped */
 	if (rc) {
 		tmp = online_types;
-		walk_memory_blocks(start, size, &tmp,
-				   try_reonline_memory_block);
+		for (i = 0; i < nr_ranges; i++)
+			walk_memory_blocks(ranges[i].start,
+					   range_len(&ranges[i]), &tmp,
+					   try_reonline_memory_block);
 	}
 	unlock_device_hotplug();
 
 	kfree(online_types);
 	return rc;
 }
-EXPORT_SYMBOL_GPL(offline_and_remove_memory);
+EXPORT_SYMBOL_GPL(offline_and_remove_memory_ranges);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
-- 
2.54.0



^ permalink raw reply related

* [PATCH v5 6/9] dax: plumb hotplug online_type through dax
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

There is no way for drivers leveraging dax_kmem to plumb through a
preferred auto-online policy - the system default policy is forced.

Add 'enum mmop' field to DAX device creation path to allow drivers
to specify an auto-online policy when using the kmem driver.

Capturing the system default would otherwise break the ABI, because
the system default can change - but we would be statically assigning
the value at device creation time.

To resolve this we add DAX_ONLINE_DEFAULT, which defaults devices to
the current behavior, while providing a clean way to override it.

No behavioural change for existing callers (still the system default).

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/dax/bus.c         |  3 +++
 drivers/dax/bus.h         |  9 +++++++++
 drivers/dax/cxl.c         |  1 +
 drivers/dax/dax-private.h |  4 ++++
 drivers/dax/hmem/hmem.c   |  1 +
 drivers/dax/kmem.c        | 11 +++++++++--
 drivers/dax/pmem.c        |  1 +
 7 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 492573b47f66..4a03b323b003 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2017-2018 Intel Corporation. All rights reserved. */
 #include <linux/memremap.h>
+#include <linux/memory_hotplug.h>
 #include <linux/device.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
@@ -394,6 +395,7 @@ static ssize_t create_store(struct device *dev, struct device_attribute *attr,
 			.size = 0,
 			.id = -1,
 			.memmap_on_memory = false,
+			.online_type = DAX_ONLINE_DEFAULT,
 		};
 		struct dev_dax *dev_dax = __devm_create_dev_dax(&data);
 
@@ -1527,6 +1529,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 	ida_init(&dev_dax->ida);
 
 	dev_dax->memmap_on_memory = data->memmap_on_memory;
+	dev_dax->online_type = data->online_type;
 
 	inode = dax_inode(dax_dev);
 	dev->devt = inode->i_rdev;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 5909171a4428..f3c9dae5de6b 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -3,6 +3,7 @@
 #ifndef __DAX_BUS_H__
 #define __DAX_BUS_H__
 #include <linux/device.h>
+#include <linux/memory_hotplug.h>
 #include <linux/platform_device.h>
 #include <linux/range.h>
 #include <linux/workqueue.h>
@@ -16,6 +17,13 @@ struct dax_region;
 #define IORESOURCE_DAX_STATIC BIT(0)
 #define IORESOURCE_DAX_KMEM BIT(1)
 
+/*
+ * online_type sentinel: the device was created without an explicit online
+ * policy, so the system default is resolved when the kmem driver binds,
+ * (not at device-creation time, which would freeze a stale policy).
+ */
+#define DAX_ONLINE_DEFAULT	(-1)
+
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
 		unsigned long flags);
@@ -26,6 +34,7 @@ struct dev_dax_data {
 	resource_size_t size;
 	int id;
 	bool memmap_on_memory;
+	enum mmop online_type;
 };
 
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 3ab39b77843d..1a7ec6212213 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -27,6 +27,7 @@ static int cxl_dax_region_probe(struct device *dev)
 		.id = -1,
 		.size = range_len(&cxlr_dax->hpa_range),
 		.memmap_on_memory = true,
+		.online_type = DAX_ONLINE_DEFAULT,
 	};
 
 	return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 81e4af49e39c..ccd77965fe3e 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -8,6 +8,7 @@
 #include <linux/device.h>
 #include <linux/cdev.h>
 #include <linux/idr.h>
+#include <linux/memory_hotplug.h>
 
 /* private routines between core files */
 struct dax_device;
@@ -79,6 +80,8 @@ struct dev_dax_range {
  * @dev: device core
  * @pgmap: pgmap for memmap setup / lifetime (driver owned)
  * @memmap_on_memory: allow kmem to put the memmap in the memory
+ * @online_type: MMOP_* online type for memory hotplug, or DAX_ONLINE_DEFAULT
+ *		 to resolve the system default policy when kmem binds
  * @nr_range: size of @ranges
  * @ranges: range tuples of memory used
  */
@@ -95,6 +98,7 @@ struct dev_dax {
 	struct device dev;
 	struct dev_pagemap *pgmap;
 	bool memmap_on_memory;
+	enum mmop online_type;
 	int nr_range;
 	struct dev_dax_range *ranges;
 };
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index af21f66bf872..2de3bc925172 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -37,6 +37,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
 		.id = -1,
 		.size = region_idle ? 0 : range_len(&mri->range),
 		.memmap_on_memory = false,
+		.online_type = DAX_ONLINE_DEFAULT,
 	};
 
 	return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 592171ec10f4..0a184c0878dd 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -72,6 +72,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	int i, rc, mapped = 0;
 	mhp_t mhp_flags;
 	int numa_node;
+	int online_type;
 	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
 
 	/*
@@ -132,6 +133,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		goto err_reg_mgid;
 	data->mgid = rc;
 
+	/* Resolve system default at bind time in case it changed */
+	online_type = dev_dax->online_type;
+	if (online_type == DAX_ONLINE_DEFAULT)
+		online_type = mhp_get_default_online_type();
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct resource *res;
 		struct range range;
@@ -172,8 +178,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		 * Ensure that future kexec'd kernels will not treat
 		 * this as RAM automatically.
 		 */
-		rc = add_memory_driver_managed(data->mgid, range.start,
-				range_len(&range), kmem_name, mhp_flags);
+		rc = __add_memory_driver_managed(data->mgid, range.start,
+				range_len(&range), kmem_name, mhp_flags,
+				online_type);
 
 		if (rc) {
 			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index bee93066a849..e7adace69195 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -63,6 +63,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
 		.pgmap = &pgmap,
 		.size = range_len(&range),
 		.memmap_on_memory = false,
+		.online_type = DAX_ONLINE_DEFAULT,
 	};
 
 	return devm_create_dev_dax(&data);
-- 
2.54.0



^ permalink raw reply related

* [PATCH v5 7/9] dax/kmem: extract hotplug/hotremove helper functions
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

Refactor kmem _probe() _remove() by extracting init, cleanup, hotplug,
and hot-remove logic into separate helper functions:

  - dax_kmem_init_resources: inits IO_RESOURCE w/ request_mem_region
  - dax_kmem_cleanup_resources: cleans up initialized IO_RESOURCE
  - dax_kmem_do_hotplug: handles memory region reservation and adding
  - dax_kmem_do_hotremove: handles memory removal and resource cleanup

This is a pure refactoring with no functional change. The helpers will
enable future extensions to support more granular control over memory
hotplug operations.

We need to split hotplug/hotunplug and init/cleanup in order to have the
resources available for hot-add.  Otherwise, when probe occurs, the dax
devices are never added to sysfs because the resources are never
registered.

Detatching hotunplug/cleanup allows us to re-use the hotunplug code
without destroying the underlying resources.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/dax/kmem.c | 316 ++++++++++++++++++++++++++++++---------------
 1 file changed, 214 insertions(+), 102 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 0a184c0878dd..a45e50def537 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -63,14 +63,195 @@ static void kmem_put_memory_types(void)
 	mt_put_memory_types(&kmem_memory_types);
 }
 
+/**
+ * dax_kmem_do_hotplug - hotplug memory for dax kmem device
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Hotplugs all ranges in the dev_dax region as system memory.
+ *
+ * Returns the number of successfully mapped ranges, or negative error.
+ */
+static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
+			       struct dax_kmem_data *data,
+			       int online_type)
+{
+	struct device *dev = &dev_dax->dev;
+	int i, rc, onlined = 0;
+	mhp_t mhp_flags;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct range range;
+
+		rc = dax_kmem_range(dev_dax, i, &range);
+		if (rc)
+			continue;
+
+		mhp_flags = MHP_NID_IS_MGID;
+		if (dev_dax->memmap_on_memory)
+			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+
+		/*
+		 * Ensure that future kexec'd kernels will not treat
+		 * this as RAM automatically.
+		 */
+		rc = __add_memory_driver_managed(data->mgid, range.start,
+				range_len(&range), kmem_name, mhp_flags,
+				online_type);
+
+		if (rc) {
+			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
+				 i, range.start, range.end);
+			/*
+			 * Release the reservation for the range that failed to
+			 * add so a later hotremove does not try to remove memory
+			 * that was never added.
+			 */
+			if (data->res[i]) {
+				remove_resource(data->res[i]);
+				kfree(data->res[i]);
+				data->res[i] = NULL;
+			}
+			if (onlined)
+				continue;
+			return rc;
+		}
+		onlined++;
+	}
+
+	return onlined;
+}
+
+/**
+ * dax_kmem_init_resources - create memory regions for dax kmem
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Initializes all the resources for the DAX
+ *
+ * Returns the number of successfully mapped ranges, or negative error.
+ */
+static int dax_kmem_init_resources(struct dev_dax *dev_dax,
+				   struct dax_kmem_data *data)
+{
+	struct device *dev = &dev_dax->dev;
+	int i, rc, mapped = 0;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct resource *res;
+		struct range range;
+
+		rc = dax_kmem_range(dev_dax, i, &range);
+		if (rc)
+			continue;
+
+		/* Skip ranges already added */
+		if (data->res[i])
+			continue;
+
+		/* Region is permanently reserved if hotremove fails. */
+		res = request_mem_region(range.start, range_len(&range),
+					 data->res_name);
+		if (!res) {
+			dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
+				 i, range.start, range.end);
+			/*
+			 * Once some memory has been onlined we can't
+			 * assume that it can be un-onlined safely.
+			 */
+			if (mapped)
+				continue;
+			return -EBUSY;
+		}
+		data->res[i] = res;
+		/*
+		 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
+		 * so that add_memory() can add a child resource.  Do not
+		 * inherit flags from the parent since it may set new flags
+		 * unknown to us that will break add_memory() below.
+		 */
+		res->flags = IORESOURCE_SYSTEM_RAM;
+		mapped++;
+	}
+	return mapped;
+}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/**
+ * dax_kmem_do_hotremove - hot-remove memory for dax kmem device
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Removes all ranges in the dev_dax region.
+ *
+ * Returns the number of successfully removed ranges.
+ */
+static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
+				 struct dax_kmem_data *data)
+{
+	struct device *dev = &dev_dax->dev;
+	int i, success = 0;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct range range;
+		int rc;
+
+		rc = dax_kmem_range(dev_dax, i, &range);
+		if (rc)
+			continue;
+
+		/* range was never added during probe, count as removed */
+		if (!data->res[i]) {
+			success++;
+			continue;
+		}
+
+		rc = remove_memory(range.start, range_len(&range));
+		if (rc == 0) {
+			/* Release the resource for the successfully removed range */
+			remove_resource(data->res[i]);
+			kfree(data->res[i]);
+			data->res[i] = NULL;
+			success++;
+			continue;
+		}
+		any_hotremove_failed = true;
+		dev_err(dev, "mapping%d: %#llx-%#llx hotremove failed\n",
+			i, range.start, range.end);
+	}
+
+	return success;
+}
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+
+/**
+ * dax_kmem_cleanup_resources - remove the dax memory resources
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Removes all resources in the dev_dax region.
+ */
+static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
+				       struct dax_kmem_data *data)
+{
+	int i;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		if (!data->res[i])
+			continue;
+		remove_resource(data->res[i]);
+		kfree(data->res[i]);
+		data->res[i] = NULL;
+	}
+}
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
 	unsigned long total_len = 0, orig_len = 0;
 	struct dax_kmem_data *data;
 	struct memory_dev_type *mtype;
-	int i, rc, mapped = 0;
-	mhp_t mhp_flags;
+	int i, rc;
 	int numa_node;
 	int online_type;
 	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
@@ -133,73 +314,27 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		goto err_reg_mgid;
 	data->mgid = rc;
 
+	dev_set_drvdata(dev, data);
+
+	rc = dax_kmem_init_resources(dev_dax, data);
+	if (rc < 0)
+		goto err_resources;
+
 	/* Resolve system default at bind time in case it changed */
 	online_type = dev_dax->online_type;
 	if (online_type == DAX_ONLINE_DEFAULT)
 		online_type = mhp_get_default_online_type();
 
-	for (i = 0; i < dev_dax->nr_range; i++) {
-		struct resource *res;
-		struct range range;
-
-		rc = dax_kmem_range(dev_dax, i, &range);
-		if (rc)
-			continue;
-
-		/* Region is permanently reserved if hotremove fails. */
-		res = request_mem_region(range.start, range_len(&range), data->res_name);
-		if (!res) {
-			dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
-					i, range.start, range.end);
-			/*
-			 * Once some memory has been onlined we can't
-			 * assume that it can be un-onlined safely.
-			 */
-			if (mapped)
-				continue;
-			rc = -EBUSY;
-			goto err_request_mem;
-		}
-		data->res[i] = res;
-
-		/*
-		 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
-		 * so that add_memory() can add a child resource.  Do not
-		 * inherit flags from the parent since it may set new flags
-		 * unknown to us that will break add_memory() below.
-		 */
-		res->flags = IORESOURCE_SYSTEM_RAM;
-
-		mhp_flags = MHP_NID_IS_MGID;
-		if (dev_dax->memmap_on_memory)
-			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
-
-		/*
-		 * Ensure that future kexec'd kernels will not treat
-		 * this as RAM automatically.
-		 */
-		rc = __add_memory_driver_managed(data->mgid, range.start,
-				range_len(&range), kmem_name, mhp_flags,
-				online_type);
-
-		if (rc) {
-			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
-					i, range.start, range.end);
-			remove_resource(res);
-			kfree(res);
-			data->res[i] = NULL;
-			if (mapped)
-				continue;
-			goto err_request_mem;
-		}
-		mapped++;
-	}
-
-	dev_set_drvdata(dev, data);
+	rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
+	if (rc < 0)
+		goto err_hotplug;
 
 	return 0;
 
-err_request_mem:
+err_hotplug:
+	dax_kmem_cleanup_resources(dev_dax, data);
+err_resources:
+	dev_set_drvdata(dev, NULL);
 	memory_group_unregister(data->mgid);
 err_reg_mgid:
 	kfree(data->res_name);
@@ -213,7 +348,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
-	int i, success = 0;
+	int success;
 	int node = dev_dax->target_node;
 	struct device *dev = &dev_dax->dev;
 	struct dax_kmem_data *data = dev_get_drvdata(dev);
@@ -224,48 +359,25 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 	 * there is no way to hotremove this memory until reboot because device
 	 * unbind will succeed even if we return failure.
 	 */
-	for (i = 0; i < dev_dax->nr_range; i++) {
-		struct range range;
-		int rc;
-
-		rc = dax_kmem_range(dev_dax, i, &range);
-		if (rc)
-			continue;
-
-		/* range was never added during probe */
-		if (!data->res[i]) {
-			success++;
-			continue;
-		}
-
-		rc = remove_memory(range.start, range_len(&range));
-		if (rc == 0) {
-			remove_resource(data->res[i]);
-			kfree(data->res[i]);
-			data->res[i] = NULL;
-			success++;
-			continue;
-		}
-		any_hotremove_failed = true;
-		dev_err(dev,
-			"mapping%d: %#llx-%#llx cannot be hotremoved until the next reboot\n",
-				i, range.start, range.end);
+	success = dax_kmem_do_hotremove(dev_dax, data);
+	if (success < dev_dax->nr_range) {
+		dev_err(dev, "Hotplug regions stuck online until reboot\n");
+		return;
 	}
 
-	if (success >= dev_dax->nr_range) {
-		memory_group_unregister(data->mgid);
-		kfree(data->res_name);
-		kfree(data);
-		dev_set_drvdata(dev, NULL);
-		/*
-		 * Clear the memtype association on successful unplug.
-		 * If not, we have memory blocks left which can be
-		 * offlined/onlined later. We need to keep memory_dev_type
-		 * for that. This implies this reference will be around
-		 * till next reboot.
-		 */
-		clear_node_memory_type(node, NULL);
-	}
+	dax_kmem_cleanup_resources(dev_dax, data);
+	memory_group_unregister(data->mgid);
+	kfree(data->res_name);
+	kfree(data);
+	dev_set_drvdata(dev, NULL);
+	/*
+	 * Clear the memtype association on successful unplug.
+	 * If not, we have memory blocks left which can be
+	 * offlined/onlined later. We need to keep memory_dev_type
+	 * for that. This implies this reference will be around
+	 * till next reboot.
+	 */
+	clear_node_memory_type(node, NULL);
 }
 #else
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
-- 
2.54.0



^ permalink raw reply related

* [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple,
	Hannes Reinecke
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

There is no atomic mechanism to offline and remove an entire
multi-block DAX kmem device.  This is presently done in two steps:
    1. offline all
    2. remove all).

This creates a race condition where another entity operates directly
on the memory blocks and can cause hot-unplug to fail / unbind to
deadlock.

Add a new 'state' sysfs attribute that enables an atomic whole-device
hotplug operation across its entire memory region.

daxX.Y/state mirrors the per-block memoryX/state ABI:
  - [offline, online, online_kernel, online_movable]
  - "unplugged" - is added specifically for dax0.0/state

The valid writable states include:
  - "unplugged":      memory blocks are not present
  - "online":         memory is online, zone chosen by the kernel
  - "online_kernel":  memory is online in ZONE_NORMAL
  - "online_movable": memory is online in ZONE_MOVABLE

Valid transitions:
  - unplugged                -> online[_kernel|_movable]
  - online[_kernel|_movable] -> unplugged
  - offline                  -> unplugged

A device can only be onlined from "unplugged", so it must be returned
there before being onlined into a different state.

For backwards compatibility the memory blocks are always created at
probe - existing tools expect them to be present after kmem binds.

"offline" is therefore a reportable state but is not writable: it only
arises from the legacy auto_online_blocks=offline policy.  Onlining
such a device through this attribute requires unplugging it first in
an effort to get drivers creating DAX devices to set a default.

Unplug is atomic across the whole device: dax_kmem_do_hotremove()
collects every added range and offlines/removes them in one operation.
Either the operation succeeds or is entirely rolled back.

Unbind Note:
  We used to call remove_memory() during unbind, which would fire a
  BUG() if any of the memory blocks were online at that time.  We lift
  this into a WARN in the cleanup routine and don't attempt hotremove
  if ->state is not DAX_KMEM_UNPLUGGED or MMOP_OFFLINE.

  An offline dax device memory is removed on unbind as before.

  If online at unbind, the resources are leaked (as before), but now
  we prevent deadlock if a memory region is impossible to hotremove.

Suggested-by: Hannes Reinecke <hare@suse.de>
Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 Documentation/ABI/testing/sysfs-bus-dax |  26 +++
 drivers/base/memory.c                   |   9 +
 drivers/dax/kmem.c                      | 224 ++++++++++++++++++++----
 include/linux/memory_hotplug.h          |   1 +
 4 files changed, 224 insertions(+), 36 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
index b34266bfae49..2dcad1e9dad0 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -151,3 +151,29 @@ Description:
 		memmap_on_memory parameter for memory_hotplug. This is
 		typically set on the kernel command line -
 		memory_hotplug.memmap_on_memory set to 'true' or 'force'."
+
+What:		/sys/bus/dax/devices/daxX.Y/state
+Date:		June, 2026
+KernelVersion:	v6.21
+Contact:	nvdimm@lists.linux.dev
+Description:
+		(RW) Controls the state of the memory region.
+		Applies to all memory blocks associated with the device.
+		Only applies to dax_kmem devices.
+
+		Reading returns the current state; the writable states mirror
+		the per-block /sys/devices/system/memory/memoryX/state ABI::
+
+		  "unplugged": memory blocks are not present
+		  "online": memory is online, zone chosen by the kernel
+		  "online_kernel": memory is online in ZONE_NORMAL
+		  "online_movable": memory is online in ZONE_MOVABLE
+
+		"offline" (memory blocks are present but offline) may also be
+		reported - this happens when the device is bound while the
+		auto_online_blocks policy is "offline".  It cannot be written,
+		as it's not useful and creates device destruction races.
+
+		A device can only be onlined from the "unplugged" state, so a
+		device must be returned to "unplugged" before it can be onlined
+		into a different state.
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index b318344426fa..3a2f69d3af7b 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -46,6 +46,15 @@ int mhp_online_type_from_str(const char *str)
 	}
 	return -EINVAL;
 }
+EXPORT_SYMBOL_GPL(mhp_online_type_from_str);
+
+const char *mhp_online_type_to_str(int online_type)
+{
+	if (online_type < 0 || online_type >= (int)ARRAY_SIZE(online_type_to_str))
+		return NULL;
+	return online_type_to_str[online_type];
+}
+EXPORT_SYMBOL_GPL(mhp_online_type_to_str);
 
 #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
 
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a45e50def537..340486586d82 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -42,9 +42,15 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
 	return 0;
 }
 
+#define DAX_KMEM_UNPLUGGED	(-1)
+
 struct dax_kmem_data {
 	const char *res_name;
 	int mgid;
+	int numa_node;
+	struct dev_dax *dev_dax;
+	int state;
+	struct mutex lock; /* protects hotplug state transitions */
 	struct resource *res[];
 };
 
@@ -63,12 +69,22 @@ static void kmem_put_memory_types(void)
 	mt_put_memory_types(&kmem_memory_types);
 }
 
+/* True for the online states a kmem dax device can hold. */
+static bool dax_kmem_state_is_online(int state)
+{
+	return state == MMOP_ONLINE ||
+	       state == MMOP_ONLINE_KERNEL ||
+	       state == MMOP_ONLINE_MOVABLE;
+}
+
 /**
  * dax_kmem_do_hotplug - hotplug memory for dax kmem device
  * @dev_dax: the dev_dax instance
  * @data: the dax_kmem_data structure with resource tracking
+ * @online_type: the online policy to use for the memory blocks
  *
- * Hotplugs all ranges in the dev_dax region as system memory.
+ * Hotplugs all ranges in the dev_dax region as system memory with the
+ * provided online policy (offline, online, online_movable, online_kernel).
  *
  * Returns the number of successfully mapped ranges, or negative error.
  */
@@ -77,9 +93,15 @@ static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
 			       int online_type)
 {
 	struct device *dev = &dev_dax->dev;
-	int i, rc, onlined = 0;
+	int i, rc, added = 0;
 	mhp_t mhp_flags;
 
+	if (dax_kmem_state_is_online(data->state))
+		return -EINVAL;
+
+	if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
+		return -EINVAL;
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;
 
@@ -112,14 +134,14 @@ static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
 				kfree(data->res[i]);
 				data->res[i] = NULL;
 			}
-			if (onlined)
+			if (added)
 				continue;
 			return rc;
 		}
-		onlined++;
+		added++;
 	}
 
-	return onlined;
+	return added;
 }
 
 /**
@@ -182,45 +204,64 @@ static int dax_kmem_init_resources(struct dev_dax *dev_dax,
  * @dev_dax: the dev_dax instance
  * @data: the dax_kmem_data structure with resource tracking
  *
- * Removes all ranges in the dev_dax region.
+ * Offlines and removes every currently-added range in the dev_dax region
+ * atomically: either all ranges are offlined and removed, or none are and
+ * the device is returned to its prior state.
  *
- * Returns the number of successfully removed ranges.
+ * Returns 0 on success, or a negative errno on failure.
  */
 static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
 				 struct dax_kmem_data *data)
 {
 	struct device *dev = &dev_dax->dev;
-	int i, success = 0;
+	struct range *ranges;
+	int i, nr_ranges = 0, rc;
+
+	ranges = kmalloc_array(dev_dax->nr_range, sizeof(*ranges), GFP_KERNEL);
+	if (!ranges)
+		return -ENOMEM;
 
+	/* Collect the ranges that were actually added during probe. */
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;
-		int rc;
 
-		rc = dax_kmem_range(dev_dax, i, &range);
-		if (rc)
+		if (!data->res[i])
 			continue;
-
-		/* range was never added during probe, count as removed */
-		if (!data->res[i]) {
-			success++;
+		if (dax_kmem_range(dev_dax, i, &range))
 			continue;
-		}
+		ranges[nr_ranges++] = range;
+	}
 
-		rc = remove_memory(range.start, range_len(&range));
-		if (rc == 0) {
-			/* Release the resource for the successfully removed range */
-			remove_resource(data->res[i]);
-			kfree(data->res[i]);
-			data->res[i] = NULL;
-			success++;
-			continue;
-		}
+	/* Nothing added means nothing to remove. */
+	if (!nr_ranges) {
+		kfree(ranges);
+		return 0;
+	}
+
+	rc = offline_and_remove_memory_ranges(ranges, nr_ranges);
+	kfree(ranges);
+	if (rc) {
 		any_hotremove_failed = true;
-		dev_err(dev, "mapping%d: %#llx-%#llx hotremove failed\n",
-			i, range.start, range.end);
+		dev_err(dev, "hotremove failed, device left online: %d\n", rc);
+		return rc;
 	}
 
-	return success;
+	/* All ranges removed; release the reserved resources. */
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		if (!data->res[i])
+			continue;
+		remove_resource(data->res[i]);
+		kfree(data->res[i]);
+		data->res[i] = NULL;
+	}
+
+	return 0;
+}
+#else
+static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
+				 struct dax_kmem_data *data)
+{
+	return -EBUSY;
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
@@ -236,6 +277,18 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
 {
 	int i;
 
+	/*
+	 * If the device unbind occurs before memory is hotremoved, we can never
+	 * remove the memory (requires reboot).  Attempting an offline operation
+	 * here may cause deadlock and a failure to finish the unbind.
+	 *
+	 * Note: This leaks the resources.
+	 */
+	if (WARN(((data->state != DAX_KMEM_UNPLUGGED) &&
+		  (data->state != MMOP_OFFLINE)),
+		 "Hotplug memory regions stuck online until reboot"))
+		return;
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		if (!data->res[i])
 			continue;
@@ -245,6 +298,85 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
 	}
 }
 
+static int dax_kmem_parse_state(const char *buf)
+{
+	int online_type;
+
+	/* "unplugged" is kmem-specific - the rest map to MMOP_ */
+	if (sysfs_streq(buf, "unplugged"))
+		return DAX_KMEM_UNPLUGGED;
+
+	online_type = mhp_online_type_from_str(buf);
+	/* Disallow "offline": it's not useful and creates race conditions */
+	if (online_type == MMOP_OFFLINE)
+		return -EINVAL;
+	return online_type;
+}
+
+static ssize_t state_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct dax_kmem_data *data = dev_get_drvdata(dev);
+	const char *state_str;
+
+	if (!data)
+		return -ENXIO;
+
+	if (data->state == DAX_KMEM_UNPLUGGED)
+		state_str = "unplugged";
+	else
+		state_str = mhp_online_type_to_str(data->state);
+
+	return sysfs_emit(buf, "%s\n", state_str ?: "unknown");
+}
+
+static ssize_t state_store(struct device *dev, struct device_attribute *attr,
+			     const char *buf, size_t len)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_kmem_data *data = dev_get_drvdata(dev);
+	int online_type;
+	int rc;
+
+	if (!data)
+		return -ENXIO;
+
+	online_type = dax_kmem_parse_state(buf);
+	if (online_type < DAX_KMEM_UNPLUGGED)
+		return online_type;
+
+	guard(mutex)(&data->lock);
+
+	/* Already in requested state */
+	if (data->state == online_type)
+		return len;
+
+	if (online_type == DAX_KMEM_UNPLUGGED) {
+		rc = dax_kmem_do_hotremove(dev_dax, data);
+		if (rc)
+			return rc;
+		data->state = DAX_KMEM_UNPLUGGED;
+		return len;
+	}
+
+	/* Onlining is only allowed from the unplugged state. */
+	if (data->state != DAX_KMEM_UNPLUGGED)
+		return -EBUSY;
+
+	/* Re-acquire resources if previously unplugged, otherwise no-op */
+	rc = dax_kmem_init_resources(dev_dax, data);
+	if (rc < 0)
+		return rc;
+
+	rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
+	if (rc < 0)
+		return rc;
+
+	data->state = online_type;
+	return len;
+}
+static DEVICE_ATTR_RW(state);
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
@@ -313,6 +445,10 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	if (rc < 0)
 		goto err_reg_mgid;
 	data->mgid = rc;
+	data->numa_node = numa_node;
+	data->dev_dax = dev_dax;
+	data->state = DAX_KMEM_UNPLUGGED;
+	mutex_init(&data->lock);
 
 	dev_set_drvdata(dev, data);
 
@@ -325,9 +461,15 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	if (online_type == DAX_ONLINE_DEFAULT)
 		online_type = mhp_get_default_online_type();
 
+	/* Always create blocks for backward compatibility, even if offline */
 	rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
 	if (rc < 0)
 		goto err_hotplug;
+	data->state = online_type;
+
+	rc = device_create_file(dev, &dev_attr_state);
+	if (rc)
+		dev_warn(dev, "failed to create state sysfs entry\n");
 
 	return 0;
 
@@ -348,20 +490,26 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
-	int success;
 	int node = dev_dax->target_node;
 	struct device *dev = &dev_dax->dev;
 	struct dax_kmem_data *data = dev_get_drvdata(dev);
 
+	device_remove_file(dev, &dev_attr_state);
 	/*
-	 * We have one shot for removing memory, if some memory blocks were not
-	 * offline prior to calling this function remove_memory() will fail, and
-	 * there is no way to hotremove this memory until reboot because device
-	 * unbind will succeed even if we return failure.
+	 * Online memory cannot safely be removed (offlining during unbind can
+	 * deadlock a task as unbind cannot be interrupted).  Unfortunately we
+	 * have to leak all of [resources, memory group, @data, memtype], until
+	 * the next reboot - and the memory will stay online until then.
+	 *
+	 * offline blocks are removed on unbind, but may leak on failure.
 	 */
-	success = dax_kmem_do_hotremove(dev_dax, data);
-	if (success < dev_dax->nr_range) {
-		dev_err(dev, "Hotplug regions stuck online until reboot\n");
+	if (dax_kmem_state_is_online(data->state)) {
+		dev_warn(dev, "Hotplug regions stuck online until reboot\n");
+		any_hotremove_failed = true;
+		return;
+	} else if (data->state == MMOP_OFFLINE &&
+	    dax_kmem_do_hotremove(dev_dax, data)) {
+		dev_warn(dev, "Unplug failed, resources leaked until reboot\n");
 		return;
 	}
 
@@ -382,6 +530,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 #else
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
+	struct device *dev = &dev_dax->dev;
+
+	device_remove_file(dev, &dev_attr_state);
+
 	/*
 	 * Without hotremove purposely leak the request_mem_region() for the
 	 * device-dax range and return '0' to ->remove() attempts. The removal
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7f1da7c428dc..46c796570692 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -127,6 +127,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size,
 extern u64 max_mem_size;
 
 extern int mhp_online_type_from_str(const char *str);
+const char *mhp_online_type_to_str(int online_type);
 
 /* If movable_node boot option specified */
 extern bool movable_node_enabled;
-- 
2.54.0



^ permalink raw reply related

* [PATCH v5 9/9] selftests/dax: add dax/kmem hotplug sysfs regression test
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, gourry, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

Add a kselftest for the dax/kmem whole-device "state" sysfs attribute
(/sys/bus/dax/devices/daxX.Y/state), which transitions a kmem-backed
dax device between "unplugged", "online" and "online_movable".

The kselftest also includes a test to demonstrate the force-unbind
does not deadlock - but this is a destructive test.  The dax device
can never be rebound after doing this.

Provisioning a devdax device and binding it to kmem needs daxctl/ndctl
out of scope for an in-tree selftest, so the test discovers an already
kmem-bound dax device and SKIPs when none are present or the memory
cannot be freed to reach a known baseline.

When a device is available it validates the interface contract:
  - online / online_movable actually add memory (MemTotal grows),
  - online is idempotent,
  - switching between online types without unplug is rejected,
  - unplug removes memory and the reported state is "unplugged"
  - invalid input is rejected.

One specific regression test:
    online -> unplug -> online_movable -> unplug

Re-online must re-reserve per-range resources so subsequent unplug
actually offlines and removes instead of silently reporting success
while the memory stays online.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/dax/Makefile          |   6 +
 tools/testing/selftests/dax/config            |   4 +
 .../testing/selftests/dax/dax-kmem-hotplug.sh | 207 ++++++++++++++++++
 tools/testing/selftests/dax/settings          |   1 +
 5 files changed, 219 insertions(+)
 create mode 100644 tools/testing/selftests/dax/Makefile
 create mode 100644 tools/testing/selftests/dax/config
 create mode 100755 tools/testing/selftests/dax/dax-kmem-hotplug.sh
 create mode 100644 tools/testing/selftests/dax/settings

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e41..8c2b4f97619c 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -14,6 +14,7 @@ TARGETS += core
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
 TARGETS += damon
+TARGETS += dax
 TARGETS += devices/error_logs
 TARGETS += devices/probe
 TARGETS += dmabuf-heaps
diff --git a/tools/testing/selftests/dax/Makefile b/tools/testing/selftests/dax/Makefile
new file mode 100644
index 000000000000..25a4f3d73a5b
--- /dev/null
+++ b/tools/testing/selftests/dax/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+TEST_PROGS := dax-kmem-hotplug.sh
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dax/config b/tools/testing/selftests/dax/config
new file mode 100644
index 000000000000..4c9aaeb6ceb4
--- /dev/null
+++ b/tools/testing/selftests/dax/config
@@ -0,0 +1,4 @@
+CONFIG_DEV_DAX=m
+CONFIG_DEV_DAX_KMEM=m
+CONFIG_MEMORY_HOTPLUG=y
+CONFIG_MEMORY_HOTREMOVE=y
diff --git a/tools/testing/selftests/dax/dax-kmem-hotplug.sh b/tools/testing/selftests/dax/dax-kmem-hotplug.sh
new file mode 100755
index 000000000000..803bbd5a6409
--- /dev/null
+++ b/tools/testing/selftests/dax/dax-kmem-hotplug.sh
@@ -0,0 +1,207 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Exercise the dax/kmem "state" sysfs attribute:
+#   /sys/bus/dax/devices/daxX.Y/state  ->  unplugged | online | online_movable
+#
+# The test needs a dax device already bound to the kmem driver.
+# If no suitable device is found the tests SKIP.
+#
+# A dax device can be provisioned with the memmap= boot param, e.g.:
+#   memmap=2G!4G
+#
+# then, in the booted system:
+#
+#   ndctl create-namespace -m devdax -e namespace0.0 -f
+#   daxctl reconfigure-device -N -m system-ram dax0.0   # bind kmem
+#   ./dax-kmem-hotplug.sh
+
+# shellcheck disable=SC1091
+DIR="$(dirname "$(readlink -f "$0")")"
+. "$DIR"/../kselftest/ktap_helpers.sh
+
+DAX_BASE=/sys/bus/dax/devices
+
+memtotal_kb() { awk '/^MemTotal:/ {print $2}' /proc/meminfo; }
+get_state() { cat "$HP" 2>/dev/null; }
+# set_state STATE -- write a state to the state attribute; returns the
+# write's exit status (0 = accepted by the kernel)
+set_state() { echo "$1" > "$HP" 2>/dev/null; }
+
+find_kmem_dax() {
+	local d drv
+	for d in "$DAX_BASE"/dax*; do
+		[ -e "$d/state" ] || continue
+		drv=$(readlink "$d/driver" 2>/dev/null)
+		[ "$(basename "${drv:-}")" = kmem ] || continue
+		basename "$d"
+		return 0
+	done
+	return 1
+}
+
+ktap_print_header
+
+if [ "$UID" != 0 ]; then
+	ktap_skip_all "must be run as root"
+	exit "$KSFT_SKIP"
+fi
+
+DAX=$(find_kmem_dax)
+if [ -z "$DAX" ]; then
+	ktap_skip_all "no kmem-bound dax device with a state attribute"
+	exit "$KSFT_SKIP"
+fi
+HP=$DAX_BASE/$DAX/state
+ORIG=$(get_state)
+
+# A failure to reach the baseline is environmental (memory in use), not an
+# interface failure, so skip rather than fail.
+set_state unplugged; rc=$?
+if [ "$rc" != 0 ] || [ "$(get_state)" != unplugged ]; then
+	ktap_skip_all "$DAX: cannot reach 'unplugged' baseline (memory in use?)"
+	[ -n "$ORIG" ] && set_state "$ORIG"
+	exit "$KSFT_SKIP"
+fi
+mt_unplugged=$(memtotal_kb)
+
+DRV=/sys/bus/dax/drivers/kmem
+AOB=/sys/devices/system/memory/auto_online_blocks
+
+ktap_print_msg "using $DAX (initial state was: $ORIG)"
+ktap_set_plan 11
+
+set_state online; rc=$?
+mt_online=$(memtotal_kb)
+if [ "$rc" = 0 ] && [ "$(get_state)" = online ] && [ "$mt_online" -gt "$mt_unplugged" ]; then
+	ktap_test_pass "online: state=online, MemTotal $mt_unplugged -> $mt_online kB"
+else
+	ktap_test_fail "online: rc=$rc state=$(get_state) MemTotal $mt_unplugged -> $mt_online"
+fi
+
+set_state online; rc=$?
+if [ "$rc" = 0 ] && [ "$(get_state)" = online ]; then
+	ktap_test_pass "online idempotent"
+else
+	ktap_test_fail "online idempotent: rc=$rc state=$(get_state)"
+fi
+
+set_state online_movable; rc=$?
+if [ "$rc" != 0 ] && [ "$(get_state)" = online ]; then
+	ktap_test_pass "reject online_movable without intervening unplug"
+else
+	ktap_test_fail "online->online_movable not rejected: rc=$rc state=$(get_state)"
+fi
+
+set_state unplugged; rc=$?
+mt=$(memtotal_kb)
+if [ "$rc" = 0 ] && [ "$(get_state)" = unplugged ] && [ "$mt" -lt "$mt_online" ]; then
+	ktap_test_pass "unplug from online: MemTotal $mt_online -> $mt kB"
+else
+	ktap_test_fail "unplug from online: rc=$rc state=$(get_state) MemTotal $mt_online -> $mt"
+fi
+
+set_state online_movable; rc=$?
+mt_movable=$(memtotal_kb)
+if [ "$rc" = 0 ] && [ "$(get_state)" = online_movable ] && [ "$mt_movable" -gt "$mt_unplugged" ]; then
+	ktap_test_pass "online_movable after unplug: MemTotal $mt_unplugged -> $mt_movable kB"
+else
+	ktap_test_fail "online_movable after unplug: rc=$rc state=$(get_state) MemTotal=$mt_movable"
+fi
+
+# The online -> unplug -> online_movable -> unplug cycle once regressed:
+# a re-online failed to re-reserve the per-range resources, so the final unplug
+# reported success while leaving the memory online.  Assert it is really freed.
+set_state unplugged; rc=$?
+mt=$(memtotal_kb)
+if [ "$rc" != 0 ]; then
+	ktap_test_skip "unplug from movable not accepted (memory in use?) rc=$rc"
+elif [ "$(get_state)" = unplugged ] && [ "$mt" -lt "$mt_movable" ]; then
+	ktap_test_pass "unplug from online_movable removed memory: $mt_movable -> $mt kB"
+else
+	ktap_test_fail "unplug from movable reported success but memory remained: state=$(get_state) MemTotal $mt_movable -> $mt"
+fi
+
+set_state online_kernel; rc=$?
+mt=$(memtotal_kb)
+if [ "$rc" = 0 ] && [ "$(get_state)" = online_kernel ] && [ "$mt" -gt "$mt_unplugged" ]; then
+	ktap_test_pass "online_kernel: MemTotal $mt_unplugged -> $mt kB"
+else
+	ktap_test_fail "online_kernel: rc=$rc state=$(get_state) MemTotal=$mt"
+fi
+set_state unplugged
+
+before=$(get_state)
+set_state bogus_state; rc=$?
+if [ "$rc" != 0 ] && [ "$(get_state)" = "$before" ]; then
+	ktap_test_pass "reject invalid state string"
+else
+	ktap_test_fail "invalid state not rejected: rc=$rc state=$(get_state)"
+fi
+
+# Run several online/unplug cycles and require that each one adds/removes memory
+set_state unplugged
+cycle_ok=1; fail_i=0
+for i in 1 2 3; do
+	if ! set_state online; then cycle_ok=0; fail_i=$i; break; fi
+	on=$(memtotal_kb)
+	if ! set_state unplugged; then cycle_ok=0; fail_i=$i; break; fi
+	off=$(memtotal_kb)
+	if [ "$on" -le "$mt_unplugged" ] || [ "$off" -ge "$on" ]; then
+		cycle_ok=0; fail_i=$i; break
+	fi
+done
+if [ "$cycle_ok" = 1 ]; then
+	ktap_test_pass "online/unplug cycle re-acquires resources (3x: memory added and freed each time)"
+else
+	ktap_test_fail "online/unplug cycle regressed at iteration $fail_i (on=$on off=$off baseline=$mt_unplugged)"
+fi
+
+# change system default online policy while the device is unbound, and show
+# the new system default policy is utilized across bindings.
+set_state unplugged
+if [ -w "$AOB" ] && [ -w "$DRV/unbind" ] && [ -w "$DRV/bind" ]; then
+	orig_aob=$(cat "$AOB")
+	echo "$DAX" > "$DRV/unbind" 2>/dev/null
+	echo offline > "$AOB" 2>/dev/null
+	echo "$DAX" > "$DRV/bind" 2>/dev/null
+	sleep 1
+	st=$(get_state)
+	echo "$orig_aob" > "$AOB" 2>/dev/null		# restore system policy
+	if [ "$st" = offline ]; then
+		ktap_test_pass "online policy resolved at bind: auto_online_blocks=offline -> state=offline"
+	else
+		ktap_test_fail "bind-time policy not honored: state=$st (expected offline)"
+	fi
+	set_state unplugged 2>/dev/null
+else
+	ktap_test_skip "auto_online_blocks or driver bind/unbind not writable"
+fi
+
+[ -n "$ORIG" ] && set_state "$ORIG"
+
+# DESTRUCTIVE: unbinding the driver while memory is online causes the resources
+# to leak - but the unbind should not deadlock.  Instead the driver leaks it
+# with a single "suck online" warning. This leaves the memory online and the
+# device unbound until reboot, so it runs last.
+set_state unplugged; set_state online
+if [ "$(get_state)" = online ] && [ -w "$DRV/unbind" ]; then
+	mt_on=$(memtotal_kb)
+	dmesg -C 2>/dev/null
+	echo "$DAX" > "$DRV/unbind" 2>/dev/null
+	mt_after=$(memtotal_kb)
+	# The leaked "System RAM (kmem)" regions stay in the iomem tree; reading
+	# their names dereferences res_name, which a buggy unbind already freed.
+	# Walk /proc/iomem to provoke that use-after-free (caught by KASAN).
+	cat /proc/iomem > /dev/null 2>&1
+	splat=$(dmesg 2>/dev/null | grep -ciE "KASAN|BUG:|use-after-free|general protection|Oops|refcount_t")
+	if [ "$splat" = 0 ] && [ "$mt_after" -ge "$mt_on" ]; then
+		ktap_test_pass "unbind while online: memory left online, no UAF/oops (MemTotal $mt_on -> $mt_after kB)"
+	else
+		ktap_test_fail "unbind while online regressed: splat=$splat MemTotal $mt_on -> $mt_after kB"
+	fi
+else
+	ktap_test_skip "could not online device for unbind-while-online test"
+fi
+
+ktap_finished
diff --git a/tools/testing/selftests/dax/settings b/tools/testing/selftests/dax/settings
new file mode 100644
index 000000000000..ba4d85f74cd6
--- /dev/null
+++ b/tools/testing/selftests/dax/settings
@@ -0,0 +1 @@
+timeout=90
-- 
2.54.0



^ permalink raw reply related

* Re: [RFC PATCH] mm: bypass swap readahead for zswap
From: David Hildenbrand (Arm) @ 2026-06-24 14:58 UTC (permalink / raw)
  To: Alexandre Ghiti, akpm, hannes, yosry, nphamcs
  Cc: chengming.zhou, ljs, liam, vbabka, rppt, surenb, mhocko, kasong,
	chrisl, baohua, usama.arif, linux-mm, linux-kernel
In-Reply-To: <20260624075700.751467-1-alex@ghiti.fr>

On 6/24/26 09:55, Alexandre Ghiti wrote:
> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
> 
> zswap is the same kind of in-memory, synchronous backend as zram, not a
> swap device flagged SWP_SYNCHRONOUS_IO so it still goes through
> swapin_readahead().
> 
> Here are the results from bypassing readahead for zswap too: it was
> measured with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker
> off, on Sapphire Rapids and 3 iterations.
> 
>   768M memcg (sustained swap thrash):
>     metric                 mm-new    + bypass    delta
>     build time (s)          405.0       341.7    -15.6%
>     zswap-in (GB)            79.5        53.0     -33%
>     zswap-out (GB)          144.8       115.6     -20%
>     swap readahead (pages)  6.79M       0.45M     -93%
>     swap_ra hit (%)          72.1        89.9     +18pp
> 
>   1G memcg (light pressure, build not memory-bound):
>     metric                 mm-new    + bypass    delta
>     build time (s)          177.7       176.0    ~same (no regression)
>     zswap-in (GB)            10.2         7.5     -26%
>     zswap-out (GB)           27.7        25.1      -9%
>     swap readahead (pages)  1.07M       0.08M     -93%
>     swap_ra hit (%)          68.6        87.2     +19pp
> 
> The gain is from no longer prefetching pages that are pointless for an
> in-memory backend: readahead inflates anon residency and thrashes the
> page cache (file pages get evicted and re-read), lengthens each fault by
> synchronously (de)compressing a cluster of neighbours, and adds
> compression traffic when those extra pages are reclaimed.
> 
> Bypassing swap readahead for zswap therefore makes sense.
> 
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---

[...]

>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..5aa1ea9eb48a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (folio)
>  		swap_update_readahead(folio, vma, vmf->address);
>  	if (!folio) {
> -		/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
> -		if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +		/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
> +		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
> +		    zswap_present_test(entry))

This should really be abstracted into a reasonably-named helper that can live in
swap code.

-- 
Cheers,

David


^ permalink raw reply

* Re: [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio()
From: David Hildenbrand (Arm) @ 2026-06-24 15:02 UTC (permalink / raw)
  To: Barry Song (Xiaomi), akpm, linux-mm
  Cc: baoquan.he, chrisl, jp.kobryn, kasong, liam, linux-kernel, ljs,
	mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park
In-Reply-To: <20260623231635.43086-2-baohua@kernel.org>

On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> We always unconditionally drain the LRU before retrying anon folio
> reuse in wp_can_reuse_anon_folio(). Instead, assume !LRU anon folios
> are in lru_cache, and use the refcount to avoid many unnecessary LRU
> drains.
> 
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..f6848f4234a6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4193,12 +4193,18 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
>  	 */
>  	if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
>  		return false;
> -	if (!folio_test_lru(folio))
> +	if (!folio_test_lru(folio)) {
> +		/*
> +		 * Assume folio is on lru_cache and holds a cache reference.
> +		 */
> +		if (folio_ref_count(folio) > 2 + folio_test_swapcache(folio))
> +			return false;

I'm not keen on making this function even uglier, so no, not like that.

We have the earlier "folio_ref_count(folio) > 3" check.

In which scenarios can you trigger this such that we would care?

If the answer is "I don't know" there is no reason for a change.

-- 
Cheers,

David


^ permalink raw reply

* Re: [PATCH v2 02/13] mm/page_alloc: some renames to clarify alloc_flags scopes
From: Suren Baghdasaryan @ 2026-06-24 15:03 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Zi Yan, Muchun Song, Oscar Salvador, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Harry Yoo (Oracle), Gregory Price,
	Alexei Starovoitov, Matthew Wilcox, linux-mm, linux-kernel,
	linux-rt-devel
In-Reply-To: <20260622-alloc-trylock-v2-2-31f31367d420@google.com>

On Mon, Jun 22, 2026 at 3:01 AM Brendan Jackman <jackmanb@google.com> wrote:
>
> It's pretty confusing that:
>
> - The slowpath and fastpath have a totally distinct set of alloc_flags.
>
> - gfp_to_alloc_flags() sounds generic but it only influences the
>   slowpath.
>
> - prepare_alloc_pages() is generic in that it sets up the
>   alloc_context, but the alloc_flags it generates are only used for the
>   fastpath.

I understand you want to clarify the usage but this particular point
seems to be an implementation detail. IOW, if tomorrow
__alloc_frozen_pages_noprof() is changed to use alloc_flags when
calling __alloc_pages_slowpath(), would we be renaming it back? So, I
would suggest keeping alloc_flags as is in prepare_alloc_pages() and
its callers. The rest LGTM.

>
> Rename some variables to highlight which alloc_flags are
> fastpath-specific. Rename gfp_to_alloc_flags() to highlight that it's
> slowpath-specific.
>
> gfp_to_alloc_flags_cma()'s current name is actually fine, but rename it
> anyway, just for consistency.
>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>  mm/page_alloc.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6c4eb6908bd95..bc05d75a41627 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3771,8 +3771,8 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
>  }
>
>  /* Must be called after current_gfp_context() which can change gfp_mask */
> -static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
> -                                                 unsigned int alloc_flags)
> +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
> +                                          unsigned int alloc_flags)
>  {
>  #ifdef CONFIG_CMA
>         if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> @@ -4471,7 +4471,7 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
>  }
>
>  static inline unsigned int
> -gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
> +slowpath_alloc_flags(gfp_t gfp_mask, unsigned int order)
>  {
>         unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
>
> @@ -4508,7 +4508,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
>         } else if (unlikely(rt_or_dl_task(current)) && in_task())
>                 alloc_flags |= ALLOC_MIN_RESERVE;
>
> -       alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
> +       alloc_flags = cma_alloc_flags(gfp_mask, alloc_flags);
>
>         if (defrag_mode)
>                 alloc_flags |= ALLOC_NOFRAGMENT;
> @@ -4774,7 +4774,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>          * kswapd needs to be woken up, and to avoid the cost of setting up
>          * alloc_flags precisely. So we do that now.
>          */
> -       alloc_flags = gfp_to_alloc_flags(gfp_mask, order);
> +       alloc_flags = slowpath_alloc_flags(gfp_mask, order);
>
>         /*
>          * We need to recalculate the starting point for the zonelist iterator
> @@ -4815,7 +4815,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>
>         reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
>         if (reserve_flags)
> -               alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags) |
> +               alloc_flags = cma_alloc_flags(gfp_mask, reserve_flags) |
>                                           (alloc_flags & ALLOC_KSWAPD);
>
>         /*
> @@ -5017,7 +5017,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>                 int preferred_nid, nodemask_t *nodemask,
>                 struct alloc_context *ac, gfp_t *alloc_gfp,
> -               unsigned int *alloc_flags)
> +               unsigned int *fastpath_alloc_flags)
>  {
>         ac->highest_zoneidx = gfp_zone(gfp_mask);
>         ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> @@ -5033,7 +5033,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>                 if (in_task() && !ac->nodemask)
>                         ac->nodemask = &cpuset_current_mems_allowed;
>                 else
> -                       *alloc_flags |= ALLOC_CPUSET;
> +                       *fastpath_alloc_flags |= ALLOC_CPUSET;
>         }
>
>         might_alloc(gfp_mask);
> @@ -5042,11 +5042,11 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>          * Don't invoke should_fail logic, since it may call
>          * get_random_u32() and printk() which need to spin_lock.
>          */
> -       if (!(*alloc_flags & ALLOC_NOLOCK) &&
> +       if (!(*fastpath_alloc_flags & ALLOC_NOLOCK) &&
>             should_fail_alloc_page(gfp_mask, order))
>                 return false;
>
> -       *alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
> +       *fastpath_alloc_flags = cma_alloc_flags(gfp_mask, *fastpath_alloc_flags);
>
>         /* Dirty zone balancing only done in the fast path */
>         ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
> @@ -5260,7 +5260,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
>                 int preferred_nid, nodemask_t *nodemask)
>  {
>         struct page *page;
> -       unsigned int alloc_flags = ALLOC_WMARK_LOW;
> +       unsigned int fastpath_alloc_flags = ALLOC_WMARK_LOW;
>         gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
>         struct alloc_context ac = { };
>
> @@ -5282,17 +5282,17 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
>         gfp = current_gfp_context(gfp);
>         alloc_gfp = gfp;
>         if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
> -                       &alloc_gfp, &alloc_flags))
> +                       &alloc_gfp, &fastpath_alloc_flags))
>                 return NULL;
>
>         /*
>          * Forbid the first pass from falling back to types that fragment
>          * memory until all local zones are considered.
>          */
> -       alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);
> +       fastpath_alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);
>
>         /* First allocation attempt */
> -       page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> +       page = get_page_from_freelist(alloc_gfp, order, fastpath_alloc_flags, &ac);
>         if (likely(page))
>                 goto out;
>
>
> --
> 2.54.0
>


^ permalink raw reply

* [PATCH] tools/mm: add thp_swap_allocator_test binary to .gitignore
From: Zenghui Yu @ 2026-06-24 15:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: akpm, Zenghui Yu

Tell git to ignore the generated binary for thp_swap_allocator_test.c.

Signed-off-by: Zenghui Yu <zenghui.yu@linux.dev>
---
 tools/mm/.gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/mm/.gitignore b/tools/mm/.gitignore
index 922879f93fc8..1446a659e540 100644
--- a/tools/mm/.gitignore
+++ b/tools/mm/.gitignore
@@ -2,3 +2,4 @@
 slabinfo
 page-types
 page_owner_sort
+thp_swap_allocator_test
-- 
2.53.0



^ permalink raw reply related

* Re: [PATCH v2 2/4] mm: drop stale folio_ref_count()==1 check in do_swap_page reuse logic
From: David Hildenbrand (Arm) @ 2026-06-24 15:07 UTC (permalink / raw)
  To: Barry Song (Xiaomi), akpm, linux-mm
  Cc: baoquan.he, chrisl, jp.kobryn, kasong, liam, linux-kernel, ljs,
	mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park
In-Reply-To: <20260623231635.43086-3-baohua@kernel.org>

On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> The "we just allocated them without exposing them to the swapcache"
> case no longer exists, as Kairui has routed synchronous I/O through
> the swapcache as well in his series "unify swapin use swap cache and
> cleanup flags"[1]. As a result, folio_ref_count() should never be 1
> in this path, since at least two references are held (base ref plus
> swapcache). Remove the folio_ref_count()==1 check and update the
> comment accordingly.
> 
> [1] https://lore.kernel.org/all/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com/
> 
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Reviewed-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index f6848f4234a6..abd0adcf65f0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5049,12 +5049,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  
>  	/*
>  	 * Same logic as in do_wp_page(); however, optimize for pages that are

s/Same/Similar/ ?

> -	 * certainly not shared either because we just allocated them without
> -	 * exposing them to the swapcache or because the swap entry indicates
> -	 * exclusivity.
> +	 * certainly not because the swap entry indicates exclusivity.
>  	 */
> -	if (!folio_test_ksm(folio) &&
> -	    (exclusive || folio_ref_count(folio) == 1)) {
> +	if (!folio_test_ksm(folio) && exclusive) {

Hmm, but KSM folios should never have "exclusive" set. So I think you can drop
that as well (was only relevant with folio_ref_count==1 check IIRC).

-- 
Cheers,

David


^ permalink raw reply

* Re: [PATCH v2 3/4] mm: entirely remove lru_add_drain in do_swap_page
From: David Hildenbrand (Arm) @ 2026-06-24 15:10 UTC (permalink / raw)
  To: Barry Song (Xiaomi), akpm, linux-mm
  Cc: baoquan.he, chrisl, jp.kobryn, kasong, liam, linux-kernel, ljs,
	mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park
In-Reply-To: <20260623231635.43086-4-baohua@kernel.org>

On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> We are doing a lot of redundant lru_add_drain() calls in
> do_swap_page(), especially for synchronous I/O devices. For
> example, the test program below currently ends up draining
> lru_cache 100% of the time:
> 
> int main(int argc, char *argv[])
> {
>         int i;
>  #define SIZE 100*1024*1024
> 	while(1) {
> 		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
>                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> 		for (int i = 0; i < SIZE/sizeof(int); i++)
> 			p[i] =  i%64;
> 		madvise((void *)p, SIZE, MADV_PAGEOUT);
> 		for (int i = 0; i < SIZE/sizeof(int); i++)
> 			p[i] =  i%64;
> 		munmap(p, SIZE);
> 	}
> 	return 0;
> }
> 
> Folio reuse now relies primarily on the exclusive hint, making
> lru_cache draining to drop the refcount in lru_cache largely
> irrelevant.

Makes sense, we'll fallback to do_wp_page() where we handle the non-exclusive
either way.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply

* Re: [PATCH v2 11/13] alloc_tag: Move to mm/
From: Lorenzo Stoakes @ 2026-06-24 15:11 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Brendan Jackman, Andrew Morton, Vlastimil Babka, Michal Hocko,
	Johannes Weiner, Zi Yan, Muchun Song, Oscar Salvador,
	David Hildenbrand, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Harry Yoo (Oracle), Gregory Price,
	Alexei Starovoitov, Matthew Wilcox, linux-mm, linux-kernel,
	linux-rt-devel
In-Reply-To: <CAJuCfpHqvLVnycuufO6Kf3S_RSFjyDRUnTGDyu3VW0kQ90tiHQ@mail.gmail.com>

On Tue, Jun 23, 2026 at 04:48:04PM -0700, Suren Baghdasaryan wrote:
> On Tue, Jun 23, 2026 at 10:29 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Mon, Jun 22, 2026 at 10:01:38AM +0000, Brendan Jackman wrote:
> > > This is logically mm code. Moving to mm/ allows access to mm/internal.h
> > >
> > > Signed-off-by: Brendan Jackman <jackmanb@google.com>
> >
> > Sorry to be a pain, but I feel that this change should be dealt with separately
> > perhaps as a pre-requisite to this series.
>
> I know you have an idea for some cleanup. Let's wait for your patch
> and then rebase this series over it. In the meantime I'll start
> reviewing the rest.

Thanks!

Cheers, Lorenzo


^ permalink raw reply

* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Sean Christopherson @ 2026-06-24 15:12 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Binbin Wu, aik, andrew.jones, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgGF+O7r-YHqcLp-ZgoXTCbqjuUhpOdD5eE5w2wu3YYYpw@mail.gmail.com>

On Tue, Jun 23, 2026, Ackerley Tng wrote:
> Binbin Wu <binbin.wu@linux.intel.com> writes:
> 
> > On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> >> From: Sean Christopherson <seanjc@google.com>
> >>
> >> When memory attributes become trackable in guest_memfd, the concept of
> >> having private memory is no longer dependent on
> >> CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
> >>
> >> With this, on x86, kvm_arch_has_private_mem() is defined if some CoCo
> >> platform support (or the testing CONFIG_KVM_SW_PROTECTED_VM) is compiled
> >> in.
> >>
> >> Signed-off-by: Sean Christopherson <seanjc@google.com>
> >> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >
> > Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> >
> > One nit below.
> >
> >> ---
> >>  arch/x86/include/asm/kvm_host.h | 4 +++-
> >>  include/linux/kvm_host.h        | 2 +-
> >>  2 files changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> >>  		       int tdp_max_root_level, int tdp_huge_page_level);
> >>
> >>
> >> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> >> +#if defined(CONFIG_KVM_SW_PROTECTED_VM) ||	\
> >> +	defined(CONFIG_KVM_INTEL_TDX) ||	\
> >> +	defined(CONFIG_KVM_AMD_SEV)
> >
> > Nit:
> > Vertically align the defined(XXX) statements for better readability?
> >
> 
> Sean had this aligned with spaces, and checkpatch complained about

checkpatch is a tool, it is neither omniscient nor authoritative.  And for things
like this, the *entire* purpose for rules/guildlines like "no tabs after spaces"
is to help ensure the code is easier to read, e.g. doesn't end up with wonky
formatting when viewed in certain editors or whatever.  So, ignore checkpatch if
it complains about formatting that is visually superior to what makes checkpatch
happy.

> having no spaces before tabs, so I switched it to tabs instead since I
> don't think alignment like that is officially documented either way.

This exact case may not be "officially" documented, but the general gist is in
Documentation/process/maintainer-tip.rst:

  When splitting function declarations or function calls, then please align
  the first argument in the second line with the first argument in the first
  line::

And there is lots and lots of prior art on-list (from me and others) that is more
or less as good as official documentation.

> Either way is fine :)

Please restore the alignment.


^ permalink raw reply

* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
From: Pratyush Yadav @ 2026-06-24 15:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Pratyush Yadav, Breno Leitao, nao.horiguchi, linmiaohe, david,
	lance.yang, akpm, baoquan.he, rppt, kexec, linux-mm, rneu, caggio,
	kas
In-Reply-To: <da460b26bef42152bf23d395f845503ad8eedc4d.camel@surriel.com>

On Wed, Jun 24 2026, Rik van Riel wrote:

> On Wed, 2026-06-24 at 15:40 +0200, Pratyush Yadav wrote:
>> 
>> Also, what happens on cold reboot? If the HW does not remember bad
>> pages, won't the kernel be in the same position? How does it know the
>> bad pages on a cold boot?
>
> Some modern server hardware will simply unmap known
> bad pages from the physical page map, so they will
> not be exposed to the OS after a cold reboot.
>
> The hardware keeps a log of uncorrectable memory
> errors somewhere in memory, for example in the SEL.
>
>> 
>> 
>> > 
>> > This PoC
>> > ========
>> > 
>> >   * Makes hardware-poisoned pages survive a kexec, using KHO (Kexec
>> >     HandOver) to carry the poison list between kernels.
>> > 
>> >   * Producer: hooks num_poisoned_pages_inc()/_sub() - the single
>> >     chokepoint for every poison/unpoison event - and records each
>> >     poisoned PFN into a vmalloc array that KHO preserves across the
>> >     kexec, described by a small versioned "hwpoison" subtree.
>> 
>> More of an implementation detail, but with vmalloc array, what if you
>> have too many poisoned pages?
>> > 
>
> If a very large amount of memory is broken, you
> should probably just repair the hardware.

"large" is relative. On a 2 TiB system, if you have 0.5% of pages
poisoned (I have no idea if that number is realistic), you have 10 GiB
of memory poisoned, or around 2.6 million pages. To store all their
PFNs, you need around 20 MiB of memory.

While not too large, it isn't trivial either.

I think static data structures like vmalloc are likely not the way to go
here especially when we have better things like KHO block or the KHO
radix tree.

Between those two, what is more efficient largely depends on how many
pages you'd typically see poisoned and what their locations tend to be.
That I think we can dive deeper into when we take a closer look at the
patches.

>
> Page poisoning is good for localized memory
> failures, but not for failures that extend across
> much of a memory chip.


-- 
Regards,
Pratyush Yadav


^ permalink raw reply

* Re: [PATCH] fs/super: skip non-memcg-aware nr_cached_objects in memcg slab shrink
From: Usama Arif @ 2026-06-24 15:18 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: brauner, jack, linux-fsdevel, linux-kernel, Al Viro, linux-mm,
	hughd, boris, clm, dsterba, linux-btrfs, cem, linux-xfs, hannes,
	riel, kernel-team
In-Reply-To: <ajrHa_JGxQvCMf8t@linux.dev>

On 23/06/2026 19:08, Shakeel Butt wrote:
> On Tue, Jun 09, 2026 at 05:30:47AM -0700, Usama Arif wrote:
>> The super_block shrinker is registered with SHRINKER_MEMCG_AWARE because its
>> dentry and inode LRUs are memcg-aware (via list_lru). But the optional
>> ->nr_cached_objects() hooks that the shrinker also drives are not memcg-aware:
>> btrfs extent maps and xfs inode reclaim operate on filesystem-global
>> state, and shmem's unused-huge shrinker walks a per-superblock shrinklist.
>> None of them filter by sc->memcg.
> 
> I see the underlying objects whose count is returned by ->nr_cached_objects()
> hook is memcg charged for shmem and xfs but not for btrfs. Do you envision
> there might be a rare scenario where we have a lot of memory charged to a memcg
> consumed by objects which ->nr_cached_objects() tracks and that memory becomes
> unreclaimable due to this patch?

Hello!

Thanks for the review.

For XFS, xfs_inode is SLAB_ACCOUNT, so a memcg can have memory charged in
XFS inodes sitting in XFS' internal reclaim state. But the current callback
does not target that memcg: xfs_fs_free_cached_objects() calls
xfs_reclaim_inodes_nr(), which walks the mount and reclaims
XFS_ICI_RECLAIM_TAG inodes without checking sc->memcg. So non-root memcg
reclaim was only getting an opportunistic mount-wide reclaim pass here; it
could reclaim the target memcg's inodes by chance, but it could just as well
reclaim inodes charged to other memcgs. This patch removes that cross-memcg
side effect, not a correct memcg-targeted reclaim path.

Those XFS inodes are still reclaimed by XFS' own reclaim worker, which is
queued when reclaimable inodes are tagged and requeues while reclaimable
inodes remain. With the default xfssyncd_centisecs value, that is about every
5 seconds. The root/global superblock shrinker path also continues to call
the XFS callbacks. This will keep the memory reclaimable.

For shmem, shrinklist_len counts inodes whose tail large folio could be                                                                                                                                                                                                                                                                                                                                               
split — splitting itself doesn't free anything; it just lets normal page                                                                                                                                                                                                                                                                                                                                              
LRU reclaim the truncated tail. The folios are on the memcg-aware LRU and                                                                                                                                                                                                                                                                                
will be aged/reclaimed there independently, so skipping the split from                                                                                                                                                                                                                                                                                   
memcg context just delays the split?

or btrfs extent maps, as you note, these objects are not memcg-charged.
> 
>>
>> The mismatch shows up under memcg-heavy slab reclaim. shrink_slab_memcg()
>> calls do_shrink_slab() once per (memcg, NUMA node) pair for every memcg
>> whose bit is set in the per-superblock shrinker bitmap, which on a busy
>> host means hundreds of calls per reclaim pass. Each scan queues the same
>> global shrinker work item that's already kicked from the root path.
>>
>> Because btrfs/xfs global count is typically non-zero on any in-use filesystem,
>> the returned total stays positive even if a memcg's own dentry/inode LRUs
>> are empty. shrink_slab_memcg() therefore never clears the SB shrinker bit
>> in the memcg bitmap, so subsequent reclaim passes from the same memcg
>> re-enter super_cache_count() and pay for the global counter walk again.
> 
> What is the main concern? Is it the amount of CPU wasted or are we over
> reclaiming or reclaiming from unrelated memcgs?
> 

The primary concern is wasted CPU. On the busy hosts where I saw this,
hundreds of memcgs all repeating that walk per reclaim pass is what made
it visible in profiles.

Another concern is misattribution. Reclaim from memcg X ends up
reclaiming another memcg Y, which could have an affect on Y and
won't provide proper isolation.

>>
>> Restrict ->nr_cached_objects() to the global shrink path (sc->memcg NULL
>> or root). The memcg-aware dentry/inode LRUs keep being counted and
>> scanned per memcg as before; only the global fs-specific hooks are skipped.
>> The root/global shrink path still drives those hooks; only their
>> invocation from non-root memcg slab reclaim is removed.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> 
> I am fine with the stopgap but it would be nice to have proper memcg awareness
> in xfs and shmem callbacks. For btrfs, I am not sure if it makes sense to memcg
> charge btrfs_extent_map objects but at least to decision to skip memcg reclaim
> will be inside the fs callbacks i.e. nr_cached_objects.
> 

Agreed that this is the right long term approach. If the preference is to move this down to
the fs own callbacks instead of fs/super.c, I can do that in the revision as well.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox