Linux-mm Archive on lore.kernel.org
* Re: [PATCH v4] mm: assert exclusive nid/zonenum bits at the page/folio access sites
From: David Hildenbrand (Arm) @ 2026-06-25  6:46 UTC
  To: Hui Zhu, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Kairui Song, Qi Zheng, Shakeel Butt, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-mm, linux-kernel
  Cc: Hui Zhu
In-Reply-To: <20260625053958.918738-1-hui.zhu@linux.dev>

On 6/25/26 07:39, Hui Zhu wrote:
> From: Hui Zhu <zhuhui@kylinos.cn>
> 
> KCSAN reports a data race between page_to_nid()/folio_pgdat() reading
> page->flags and folio_trylock()/folio_lock() concurrently doing
> test_and_set_bit_lock(PG_locked, ...) on the same word, e.g.:
> 
>   BUG: KCSAN: data-race in __lruvec_stat_mod_folio / shmem_get_folio_gfp
> 
> The node id and zone id occupy fixed bit-ranges of page->flags that
> are set once at page init and never modified afterwards, so they can
> never overlap with the low PG_locked/PG_waiters bits touched by the
> folio lock path.
> 
> ASSERT_EXCLUSIVE_BITS(mdf.f, ...) inside memdesc_nid()/memdesc_zonenum()
> checks a by-value copy of the flags word, not the actual shared
> page->flags/folio->flags being modified concurrently, so it doesn't
> reliably assert anything about the real race.

Is that the case? I thought the existing ASSERT_EXCLUSIVE_BITS() reliably worked
before?

Maybe the compiler optimizing out the local copy sorted that out for us.

> Move the assertion to
> page_to_nid(), folio_nid(), page_zonenum() and folio_zonenum(), where
> flags is dereferenced directly from the page/folio.
> 
> On CONFIG_NUMA=n, NODES_MASK is 0 and the old memdesc_nid() body
> folded to a constant, so page->flags/folio->flags was never actually
> read. ASSERT_EXCLUSIVE_BITS() is a real runtime check that can't be
> folded away, so doing it unconditionally would add a pointless read
> of page->flags/folio->flags and a check that can never fire. Keep
> page_to_nid()/folio_nid() as plain "return 0" static inline stubs
> under CONFIG_NUMA=n instead.
> 
> Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
> ---
> Changelog:
> v4:
> According to the comments of Andrew and Sashiko, set
> page_to_nid()/folio_nid() as static inline stubs returning 0
> under CONFIG_NUMA=n.
> v3:
> According to the comments of Andrew and Sashiko, move
> ASSERT_EXCLUSIVE_BITS out of memdesc_nid()/memdesc_zonenum()
> into the page/folio call sites.
> v2:
> According to the comments of David, remove useless comments and use
> ASSERT_EXCLUSIVE_BITS() in memdesc_nid() instead of data_race() in
> page_to_nid().
> 
>  include/linux/mm.h     | 9 +++++++++
>  include/linux/mmzone.h | 3 ++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 485df9c2dbdd..56b39194605a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2294,15 +2294,24 @@ static inline int memdesc_nid(memdesc_flags_t mdf)
>  }
>  #endif
>  
> +#ifdef CONFIG_NUMA
>  static inline int page_to_nid(const struct page *page)
>  {
> +	ASSERT_EXCLUSIVE_BITS(PF_POISONED_CHECK(page)->flags,
> +			      NODES_MASK << NODES_PGSHIFT);

Performing the PF_POISONED_CHECK() twice is a bit odd. One time is sufficient,
maybe simply before both statements separately?
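
One possible reading of that suggestion, as a userspace sketch only: run the
poison check once, keep the checked pointer, and use it for both statements.
Everything below is a hypothetical stand-in (the struct, the counter, the
no-op ASSERT_EXCLUSIVE_BITS() stub, and the NODES_* values are assumptions,
not the kernel definitions).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these are the kernel definitions. */
struct page { uint64_t flags; };

#define NODES_PGSHIFT	54			/* assumed bit position */
#define NODES_MASK	UINT64_C(0x3ff)		/* assumed 10-bit node id */

static int poison_checks;	/* counts how often the check ran */

/* Stand-in for PF_POISONED_CHECK(): validate, then hand the pointer back. */
static const struct page *pf_poisoned_check(const struct page *page)
{
	poison_checks++;
	assert(page != NULL);
	return page;
}

/* No-op stand-in for the KCSAN assertion; it only evaluates its arguments. */
#define ASSERT_EXCLUSIVE_BITS(var, mask) ((void)((var) & (mask)))

static int memdesc_nid(uint64_t flags)
{
	return (int)((flags >> NODES_PGSHIFT) & NODES_MASK);
}

static int page_to_nid(const struct page *page)
{
	/* Run the poison check once and reuse the checked pointer. */
	const struct page *checked = pf_poisoned_check(page);

	ASSERT_EXCLUSIVE_BITS(checked->flags, NODES_MASK << NODES_PGSHIFT);
	return memdesc_nid(checked->flags);
}
```

With this shape, one page_to_nid() call bumps the counter exactly once, while
the node id is still read from the same flags word.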

>  	return memdesc_nid(PF_POISONED_CHECK(page)->flags);
>  }
>  
>  static inline int folio_nid(const struct folio *folio)
>  {
> +	ASSERT_EXCLUSIVE_BITS(folio->flags,
> +			      NODES_MASK << NODES_PGSHIFT);
>  	return memdesc_nid(folio->flags);
>  }
> +#else
> +#define page_to_nid(page) (0)
> +#define folio_nid(folio) (0)
> +#endif
>  

LGTM

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* Re: [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug
From: Gregory Price @ 2026-06-25  6:43 UTC
  To: Hannes Reinecke
  Cc: linux-mm, nvdimm, linux-kernel, linux-cxl, driver-core,
	linux-kselftest, kernel-team, david, osalvador, gregkh, rafael,
	dakr, djbw, vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka,
	rppt, surenb, mhocko, shuah, alison.schofield,
	Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <8e42587a-d614-4259-ae6b-5bca1479b425@suse.de>

>
> Why do we need to treat the 'unbind' call as a given thing?
> If we know that we cannot handle online memory during unbind,
> can't we just disallow unbind in that case?

No.  Unbind is a violent operation - unbinds cannot fail, and a
straight, uncoordinated unbind is essentially a `--force` flag:
the admin accepts the risks.

To your point, the admin either does the nice thing or they
muck up the system.

But we should still try to do something sane to defend the kernel;
in this case we should try to prevent that task from becoming
deadlocked.  The only way to do that is to leak the resources.

I'm making a small modification to this code to reinstate the
legacy behavior when "state!=UNPLUGGED".

~Gregory



* Re: [PATCH] mm/page_vma_mapped: guard check_pmd() with CONFIG_TRANSPARENT_HUGEPAGE
From: Wei Yang @ 2026-06-25  6:41 UTC
  To: Andrew Morton
  Cc: Wei Yang, david, ljs, riel, liam, vbabka, harry, jannh, willy,
	linux-mm, linux-kernel, lance.yang, balbirs
In-Reply-To: <20260624215943.332f1bdcc99a3b85b7df0822@linux-foundation.org>

+cc Balbir

On Wed, Jun 24, 2026 at 09:59:43PM -0700, Andrew Morton wrote:
>On Thu, 25 Jun 2026 03:46:29 +0000 Wei Yang <richard.weiyang@gmail.com> wrote:
>
>> >Sashiko had an off-topic complaint about the surrounding code:
>> >	https://lore.kernel.org/oe-kbuild-all/202606240042.ffPsEXVc-lkp@intel.com/
>> 
>> I see this robot reply, but I do not see the Sashiko comment.
>> 
>> How can I view Sashiko's comment?
>
>Oops, sorry.
>
>You can go to https://sashiko.dev/ and search for the email subject. 
>
>Or append your Message-ID to "https://sashiko.dev/#/patchset":
>
>	https://sashiko.dev/#/patchset/20260624082359.2869-1-richard.weiyang@gmail.com
>

Got it, thanks.

This one mentioned two things:

  a. page_vma_mapped_walk() return without check
  b. whether __split_huge_pmd_locked() would split device-private pmd

For a., it is being fixed at [1].

For b., to be honest, I am not 100% sure. If a device-private pmd could be
file backed, then this looks like a bug.

Balbir,

Would you mind taking a look at the second comment raised by Sashiko?

[1]: https://lore.kernel.org/linux-mm/20260624065353.1622-1-richard.weiyang@gmail.com/

-- 
Wei Yang
Help you, Help me



* Re: [PATCH] fs/proc: fix KPF_KSM reported for all anonymous pages
From: David Hildenbrand (Arm) @ 2026-06-25  6:39 UTC
  To: Jinjiang Tu, akpm, ziy, luizcap, willy, linmiaohe, svetly.todorov,
	xu.xin16, chengming.zhou, linux-fsdevel, linux-mm
  Cc: wangkefeng.wang, sunnanyong
In-Reply-To: <601fb5dd-18e1-4a6c-bc99-dc2a655240e2@huawei.com>

On 6/23/26 03:37, Jinjiang Tu wrote:
> 
> On 2026/6/22 19:45, David Hildenbrand (Arm) wrote:
>> On 6/22/26 11:15, Jinjiang Tu wrote:
>>> Reading /proc/kpageflags for any anonymous page returns KPF_KSM set, even
>>> when KSM is not in use. As a result, tools misclassify all anonymous pages
>>> as KSM merged.
>>>
>>> In stable_page_flags(), if the page is anonymous, a (mapping &
>>> FOLIO_MAPPING_KSM) check is used to identify whether it is a KSM page.
>>> However, FOLIO_MAPPING_KSM is FOLIO_MAPPING_ANON | FOLIO_MAPPING_ANON_KSM,
>>> so the (mapping & FOLIO_MAPPING_KSM) check returns true for all anonymous pages.
>>>
>>> To fix it, use FOLIO_MAPPING_ANON_KSM instead.
>>>
>>> Fixes: dee3d0bef2b0 ("proc: rewrite stable_page_flags()")
>> Right,
>>
>> #define PAGE_MAPPING_KSM       (PAGE_MAPPING_ANON | PAGE_MAPPING_ANON_KSM)
>>
>> Which we later renamed to FOLIO_MAPPING_KSM.
>>
>>
>> Before switching to manual flag checks, PageKsm() translated to folio_test_ksm()
>> that checked whether the values actually matched:
>>
>> ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_KSM;
>>
>>
>> This only affects /proc/kpageflags (well, and hwpoison inject on a weird testing
>> interface), so it's not really that relevant for real workloads (debugging and
>> testing).
>>
>> So not sure whether we should CC:stable. Likely not.
> 
> /proc/kpageflags is generally used only for analysis and is unlikely to be
> used in production environments. I found this issue while analyzing which
> stacks allocated the pfns that are KSM-merged. So I think it's unnecessary
> to CC:stable.
> 
>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>> ---
>>>   fs/proc/page.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/proc/page.c b/fs/proc/page.c
>>> index f9b2c2c906cd..cef8ded97610 100644
>>> --- a/fs/proc/page.c
>>> +++ b/fs/proc/page.c
>>> @@ -173,7 +173,7 @@ u64 stable_page_flags(const struct page *page)
>>>           u |= 1 << KPF_MMAP;
>>>       if (is_anon) {
>>>           u |= 1 << KPF_ANON;
>>> -        if (mapping & FOLIO_MAPPING_KSM)
>>> +        if (mapping & FOLIO_MAPPING_ANON_KSM)
>>
>> Wonder whether we should just do
>>
>>         if ((mapping & FOLIO_MAPPING_FLAGS) == FOLIO_MAPPING_KSM)
>>
>> To match what we have in folio_test_ksm.
>>
>> (although I doubt we would reuse this flag for other purposes, likely
>> it's more future proof to check it like that)
> 
> Both are OK. The surrounding check has already tested FOLIO_MAPPING_ANON:
> 
>     if (is_anon) {
>         if (mapping & FOLIO_MAPPING_ANON_KSM)
>     }
> 
> So it's equivalent to do
>     if ((mapping & FOLIO_MAPPING_FLAGS) == FOLIO_MAPPING_KSM)
> or
>     if (mapping & FOLIO_MAPPING_ANON_KSM)

As I said, matching precisely what we have in folio_test_ksm() is clearer. We
don't have any users of FOLIO_MAPPING_ANON_KSM outside of page-flags.h for a
reason :)
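
To make the difference concrete, here is a small userspace sketch of the two
checks. The numeric flag values are assumptions made for illustration (two low
tag bits in the mapping pointer), and the helper names are hypothetical; only
the shape of the two comparisons is taken from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values standing in for the low pointer-tag bits discussed above. */
#define FOLIO_MAPPING_ANON	0x1UL
#define FOLIO_MAPPING_ANON_KSM	0x2UL
#define FOLIO_MAPPING_KSM	(FOLIO_MAPPING_ANON | FOLIO_MAPPING_ANON_KSM)
#define FOLIO_MAPPING_FLAGS	(FOLIO_MAPPING_ANON | FOLIO_MAPPING_ANON_KSM)

/* The KSM pattern contains the ANON bit, which is what trips the old check. */
static_assert((FOLIO_MAPPING_KSM & FOLIO_MAPPING_ANON) != 0,
	      "KSM pattern includes the ANON bit");

/* Old form: true for any mapping that has at least one of the two bits set. */
static bool ksm_check_buggy(unsigned long mapping)
{
	return (mapping & FOLIO_MAPPING_KSM) != 0;
}

/* Exact form: both tag bits must match the KSM pattern. */
static bool ksm_check_exact(unsigned long mapping)
{
	return (mapping & FOLIO_MAPPING_FLAGS) == FOLIO_MAPPING_KSM;
}
```

A plain anonymous mapping (only the ANON bit set) passes the old check but not
the exact one, which is the misclassification described in the report.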

-- 
Cheers,

David



* Re: [PATCH] fs/proc: fix KPF_KSM reported for all anonymous pages
From: David Hildenbrand (Arm) @ 2026-06-25  6:38 UTC
  To: Andrew Morton, Jinjiang Tu
  Cc: ziy, luizcap, willy, linmiaohe, svetly.todorov, xu.xin16,
	chengming.zhou, linux-fsdevel, linux-mm, wangkefeng.wang,
	sunnanyong
In-Reply-To: <20260624180051.134da6553b4f0c4c2785f730@linux-foundation.org>

On 6/25/26 03:00, Andrew Morton wrote:
> On Tue, 23 Jun 2026 09:37:57 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:
> 
>>> This only affects /proc/kpageflags (well, and hwpoison inject on a weird testing
>>> interface), so it's not really that relevant for real workloads (debugging and
>>> testing).
>>>
>>> So not sure whether we should CC:stable. Likely not.
>>
>> /proc/kpageflags is generally used only for analysis and is unlikely to be
>> used in production environments. I found this issue while analyzing which
>> stacks allocated the pfns that are KSM-merged. So I think it's unnecessary
>> to CC:stable.
> 
> Well, it's a bug.  The fix is super-simple so I think it's reasonable
> to feed it back to users of earlier kernels.

Definitely doesn't hurt.

-- 
Cheers,

David



* Re: [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
From: Dev Jain @ 2026-06-25  6:37 UTC
  To: Wen Jiang, linux-mm, linux-arm-kernel, catalin.marinas, will,
	akpm, urezki
  Cc: baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang,
	Ard Biesheuvel
In-Reply-To: <20260618084726.1070022-1-jiangwen6@xiaomi.com>



On 18/06/26 2:17 pm, Wen Jiang wrote:
> This patchset accelerates ioremap, vmalloc, and vmap when the memory
> is physically fully or partially contiguous. Two techniques are used:
> 
> 1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
>    segments
> 2. Use batched mappings wherever possible in both vmalloc and ARM64
>    layers
> 
> Besides accelerating the mapping path, this also enables large
> mappings (PMD and cont-PTE) for vmap, which are currently not
> supported.
> 
> Patches 1-2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
> CONT-PTE regions instead of just one.
> 
> Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
> mapping logic between the ioremap and vmalloc/vmap paths, handling both
> CONT_PTE and regular PTE mappings. This prepares for the next patch.
> 
> Patch 4 extends the page table walk path to support page shifts other
> than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
> mappings. The function is renamed from vmap_small_pages_range_noflush()
> to vmap_pages_range_noflush_walk().
> 
> Patches 5-6 add huge vmap support for contiguous pages, including
> support for non-compound pages with pfn alignment verification.
> 
> On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
> the performance CPUfreq policy enabled, benchmark results:
> 
> * ioremap(1 MB): 1.35x faster (3407 ns -> 2526 ns)
> * vmalloc(1 MB) mapping time (excluding allocation) with
>   VM_ALLOW_HUGE_VMAP: 1.42x faster (5.00 us -> 3.53 us)
> * vmap(100MB) with order-8 pages: 8.3x faster (1235 us -> 149 us)
> 
> Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.
> 

I am still a little nervous about doing vmap-huge by default.

We can play set_memory_* games on a vmap huge mapping partially, thus
forcing a pgtable split, and not all arches can handle a kernel pgtable
split.

For arm64, we can handle that with BBML2_NOABORT, but interestingly, in
change_memory_common, arch/arm64/mm/pageattr.c:

	area = find_vm_area((void *)addr);
	if (!area ||
	    ((unsigned long)kasan_reset_tag((void *)end) >
	     (unsigned long)kasan_reset_tag(area->addr) + area->size) ||
	    ((area->flags & (VM_ALLOC | VM_ALLOW_HUGE_VMAP)) != VM_ALLOC))
		return -EINVAL;

Even before my change fcf8dda8cc48, we were bailing out on

!(area->flags & VM_ALLOC)

So on arm64 we haven't been supporting set_memory_* for vmap memory at all, because
it has VM_MAP set and not VM_ALLOC. However, the comment above this code contradicts
that, so I am not sure whether this was intentional:

"Let's restrict ourselves to mappings created by vmalloc (or vmap)."


So either there is no user in the kernel doing vmap + set_memory_* (an LLM scan
suggests so), or it is not fatal for set_memory_* to fail.

But even if no one does it now, technically the API allows it.
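
As a quick illustration of why vmap areas are rejected, here is a userspace
sketch of just the flag test from the snippet above. The flag values are
assumptions (only their distinctness matters here), and the helper name is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed flag values; only their distinctness matters for this sketch. */
#define VM_ALLOC		0x002UL
#define VM_MAP			0x004UL
#define VM_ALLOW_HUGE_VMAP	0x400UL

static_assert((VM_ALLOC & VM_MAP) == 0, "flags must be distinct bits");
static_assert((VM_ALLOC & VM_ALLOW_HUGE_VMAP) == 0, "flags must be distinct bits");

/* Mirrors the flag test quoted above: only plain VM_ALLOC areas pass. */
static bool area_flags_accepted(unsigned long flags)
{
	return (flags & (VM_ALLOC | VM_ALLOW_HUGE_VMAP)) == VM_ALLOC;
}
```

An area with only VM_MAP set fails the test, and so does a VM_ALLOC area that
also carries VM_ALLOW_HUGE_VMAP; in both cases the caller would see -EINVAL.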

> 




* Re: [RFC Patch 0/3] mm/mm_init: fix and cleanup mirrored_kernelcore
From: David Hildenbrand (Arm) @ 2026-06-25  6:36 UTC
  To: Wei Yang, rppt, akpm, izumi.taku; +Cc: linux-mm, linux-kernel, yuan1.liu
In-Reply-To: <20260623092351.13031-1-richard.weiyang@gmail.com>

On 6/23/26 11:23, Wei Yang wrote:
> When reviewing the patch set "mm/memory_hotplug: optimize zone contiguous check
> when changing pfn range" [1], we noticed that mirrored_kernelcore introduces a
> special case that has to be handled.
> 
> During discussion and testing, it turned out that the current mirrored_kernelcore
> implementation breaks other memmap_init() behavior:
> 
>   * it disturbs defer_init()
>   * some Movable zone ranges are initialized twice
> 
> The reason is that Zone Movable and Zone Normal overlap, and the current logic
> doesn't handle that well.
> 
> As Yuan shows in [2], physically overlapping zones do not seem to exist in
> practice. If we remove the possibility of overlapping zones, the problem is
> solved. This could also benefit the zone contiguous optimization, IIUC.
> 
> Patch [1]: remove the overlapped zone, which fixes the above two problems
> Patch [2]: remove overlap_memmap_init() as there is no overlapped zone, but
>            record the current double-initialization problem for reference
> Patch [3]: remove the absent pages calculation for mirrored_kernelcore
> 
> [1]: https://lore.kernel.org/all/20260520093457.3719960-1-yuan1.liu@intel.com/T/#u
> [2]: https://lore.kernel.org/all/20260520093457.3719960-1-yuan1.liu@intel.com/T/#mb0bd07ffd562a5b029f0f07751e580ca339c5b51
> 

IIUC, Mike will send an alternative.

-- 
Cheers,

David



* Re: [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock
From: David Hildenbrand (Arm) @ 2026-06-25  6:32 UTC
  To: Rik van Riel, linux-kernel
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Dmitry Ilvokhin,
	Borislav Petkov, Dave Hansen, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>

On 6/25/26 03:50, Rik van Riel wrote:
> Sometimes processes can get stuck with the mmap_lock held for
> a long time. This slows down, and can even prevent system monitoring
> tools from assessing and logging the situation, because they themselves
> end up getting stuck on the mmap_lock.
> 
> However, with the introduction of per-VMA locks, we can improve the
> reliability of system monitoring, and generally speed up __access_remote_vm
> under mmap_lock contention, by adding a fast path that does not require
> the process-wide mmap_lock.
> 
> This fast path is only compiled in and used when it is safe to do so,
> meaning a kernel with per-VMA locks, RCU page table freeing, the VMA
> is not hugetlbfs, iomap, pfnmap, etc...
> 
> v2:
>  - simplify the code, which should be ok because these copies are < PAGE_SIZE
>  - clean up the code
>  - fix locking wrt tlb_remove_table_sync_one()
>  - hopefully address all the other comments

You mean, ignoring my comments about not reimplementing GUP entirely?

NAK

-- 
Cheers,

David



* Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting
From: Harry Yoo @ 2026-06-25  6:32 UTC
  To: Qi Zheng, akpm, david, kasong, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, hannes, muchun.song, peiyang_he,
	mhocko, roman.gushchin, ljs
  Cc: linux-mm, linux-kernel, Qi Zheng, stable
In-Reply-To: <f18bf1b1-ccf7-4d77-9389-07311d2d1613@linux.dev>


On 6/25/26 3:11 PM, Qi Zheng wrote:
> On 6/25/26 12:16 PM, Harry Yoo wrote:
>>
> [...]
> 
>>
>>> So lock_batch_lruvec() can be implemented like this:
>>>
>>> #ifdef CONFIG_MEMCG
>>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>> {
>>>      struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>      struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>
>>>      rcu_read_lock();
>>>
>>>      /*
>>>       * The memcg can be NULL when the memory controller is disabled.
>>>       * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>       */
>>>      if (!memcg || !css_is_dying(&memcg->css))
>>>          goto lock;
>>>
>>>      do {
>>>          memcg = parent_mem_cgroup(memcg);
>>>      } while (memcg && css_is_dying(&memcg->css));
>>>      lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>
>>> lock:
>>>      spin_lock_irq(&lruvec->lru_lock);
>>>
>>>      return lruvec;
>>> }
>>> #else
>>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>> {
>>>      lruvec_lock_irq(lruvec);
>>>
>>>      return lruvec;
>>> }
>>> #endif
>>>
>>> Does this make sense?
>>
>> Yes, looks good to me!
> 
> OK, this sync method makes more sense as it doesn't require adding a
> new lrugen->reparente. I'll go with this method and update v3.

Thanks!

Just one thing to clarify...

So, when we check something that's updated _before_ the grace period
(CSS_DYING), RCU is sufficient.

But in folio_lruvec_lock*(), that is not the case because reparenting
is performed in the RCU work, under the lruvec lock. So the check needs
to be done under RCU and the lruvec lock.

This is quite subtle :D

> Hi Barry and Baolin, what do you think? Since the sync method has been
> changed, I will temporarily drop your previous Reviewed-by tags in v3. ;)

And hopefully Peiyang would kindly double-check that the issue still
does not reproduce with v3 on the machine :)

-- 
Cheers,
Harry / Hyeonggon



* Re: [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug
From: Hannes Reinecke @ 2026-06-25  6:17 UTC
  To: Gregory Price, linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
	ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-9-gourry@gourry.net>

On 6/24/26 4:57 PM, Gregory Price wrote:
> There is no atomic mechanism to offline and remove an entire
> multi-block DAX kmem device.  This is presently done in two steps:
>      1. offline all
>      2. remove all
> 
> This creates a race condition where another entity operates directly
> on the memory blocks and can cause hot-unplug to fail / unbind to
> deadlock.
> 
> Add a new 'state' sysfs attribute that enables an atomic whole-device
> hotplug operation across its entire memory region.
> 
> daxX.Y/state mirrors the per-block memoryX/state ABI:
>    - [offline, online, online_kernel, online_movable]
>    - "unplugged" - is added specifically for dax0.0/state
> 
> The valid writable states include:
>    - "unplugged":      memory blocks are not present
>    - "online":         memory is online, zone chosen by the kernel
>    - "online_kernel":  memory is online in ZONE_NORMAL
>    - "online_movable": memory is online in ZONE_MOVABLE
> 
> Valid transitions:
>    - unplugged                -> online[_kernel|_movable]
>    - online[_kernel|_movable] -> unplugged
>    - offline                  -> unplugged
> 
> A device can only be onlined from "unplugged", so it must be returned
> there before being onlined into a different state.
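
The transition rules quoted above can be sketched as a small userspace
predicate. The enum and function names are hypothetical (the patch's own
definitions are not shown in this thread); only the allowed transitions are
taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state names; the patch's own enum is not shown in this thread. */
enum dax_state {
	ST_UNPLUGGED,
	ST_OFFLINE,
	ST_ONLINE,
	ST_ONLINE_KERNEL,
	ST_ONLINE_MOVABLE,
	ST_COUNT
};

static bool is_online(enum dax_state s)
{
	return s == ST_ONLINE || s == ST_ONLINE_KERNEL || s == ST_ONLINE_MOVABLE;
}

static bool transition_allowed(enum dax_state from, enum dax_state to)
{
	assert(from < ST_COUNT && to < ST_COUNT);

	/* unplugged -> online[_kernel|_movable] */
	if (from == ST_UNPLUGGED)
		return is_online(to);

	/* online[_kernel|_movable] and offline may only return to unplugged */
	return to == ST_UNPLUGGED;
}
```

Under these rules "offline" is never a valid target, and switching between two
online variants requires a round trip through "unplugged".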
> 
> For backwards compatibility the memory blocks are always created at
> probe - existing tools expect them to be present after kmem binds.
> 
> "offline" is therefore a reportable state but is not writable: it only
> arises from the legacy auto_online_blocks=offline policy.  Onlining
> such a device through this attribute requires unplugging it first in
> an effort to get drivers creating DAX devices to set a default.
> 
> Unplug is atomic across the whole device: dax_kmem_do_hotremove()
> collects every added range and offlines/removes them in one operation.
> Either the operation succeeds or is entirely rolled back.
> 
> Unbind Note:
>    We used to call remove_memory() during unbind, which would fire a
>    BUG() if any of the memory blocks were online at that time.  We lift
>    this into a WARN in the cleanup routine and don't attempt hotremove
>    if ->state is not DAX_KMEM_UNPLUGGED or MMOP_OFFLINE.
> 
>    An offline dax device memory is removed on unbind as before.
> 
>    If online at unbind, the resources are leaked (as before), but now
>    we prevent deadlock if a memory region is impossible to hotremove.
> 
> Suggested-by: Hannes Reinecke <hare@suse.de>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>   Documentation/ABI/testing/sysfs-bus-dax |  26 +++
>   drivers/base/memory.c                   |   9 +
>   drivers/dax/kmem.c                      | 224 ++++++++++++++++++++----
>   include/linux/memory_hotplug.h          |   1 +
>   4 files changed, 224 insertions(+), 36 deletions(-)
> 
That looks good, but one question remains:

Why do we need to treat the 'unbind' call as a given thing?
If we know that we cannot handle online memory during unbind,
can't we just disallow unbind in that case?
I don't think it's too much to ask from an admin to offline
the memory first, _especially_ as now we have a simple knob
to do that ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting
From: Qi Zheng @ 2026-06-25  6:11 UTC
  To: Harry Yoo, akpm, david, kasong, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, hannes, muchun.song, peiyang_he,
	mhocko, roman.gushchin, ljs
  Cc: linux-mm, linux-kernel, Qi Zheng, stable
In-Reply-To: <b5c85cea-5daa-4690-ac41-a6f5aebd1555@kernel.org>



On 6/25/26 12:16 PM, Harry Yoo wrote:
> 

[...]

> 
>> So lock_batch_lruvec() can be implemented like this:
>>
>> #ifdef CONFIG_MEMCG
>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>> {
>>      struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>      struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>
>>      rcu_read_lock();
>>
>>      /*
>>       * The memcg can be NULL when the memory controller is disabled.
>>       * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>       */
>>      if (!memcg || !css_is_dying(&memcg->css))
>>          goto lock;
>>
>>      do {
>>          memcg = parent_mem_cgroup(memcg);
>>      } while (memcg && css_is_dying(&memcg->css));
>>      lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>
>> lock:
>>      spin_lock_irq(&lruvec->lru_lock);
>>
>>      return lruvec;
>> }
>> #else
>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>> {
>>      lruvec_lock_irq(lruvec);
>>
>>      return lruvec;
>> }
>> #endif
>>
>> Does this make sense?
> 
> Yes, looks good to me!

OK, this sync method makes more sense as it doesn't require adding a
new lrugen->reparente. I'll go with this method and update v3.
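
For readers following along, the ancestor walk in the quoted
lock_batch_lruvec() can be modelled in userspace as below. This is only a
sketch: struct mock_memcg and batch_target() are hypothetical stand-ins, and
the RCU and lru_lock parts are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg: just a parent link and a dying flag. */
struct mock_memcg {
	struct mock_memcg *parent;
	bool dying;
};

/*
 * Return the cgroup whose lruvec should be locked: the cgroup itself if it is
 * live, otherwise its nearest ancestor that is not dying. NULL means no live
 * ancestor was found; the quoted code hands that NULL to mem_cgroup_lruvec().
 */
static struct mock_memcg *batch_target(struct mock_memcg *memcg)
{
	if (!memcg || !memcg->dying)
		return memcg;

	do {
		memcg = memcg->parent;
	} while (memcg && memcg->dying);

	/* Postcondition: the result is either NULL or a live cgroup. */
	assert(!memcg || !memcg->dying);
	return memcg;
}
```

A dying leaf under a dying parent resolves to the first live ancestor, while a
live cgroup resolves to itself without walking at all.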

Hi Barry and Baolin, what do you think? Since the sync method has been
changed, I will temporarily drop your previous Reviewed-by tags in v3. ;)

Thanks,
Qi

> 




* [PATCH v1 1/2] eventfd: luo: luo support for preserving eventfd
From: Chenghao Duan @ 2026-06-25  5:49 UTC
  To: viro, brauner, jack, linux-fsdevel, pasha.tatashin, linux-kernel,
	rppt, pratyush, kexec, linux-mm
  Cc: jianghaoran, duanchenghao
In-Reply-To: <20260625054946.73445-1-duanchenghao@kylinos.cn>

This patch adds support for preserving eventfd file descriptors across
kexec live updates using the Live Update Orchestrator (LUO) framework.
Userspace applications using eventfd for event notification can now
maintain their state across kernel updates.

Preserved State:
The following properties of the eventfd are preserved across kexec:
- Counter Value: The current 64-bit counter value, including any pending
  events that have been signaled but not yet consumed by readers.
- File Flags: The creation flags (EFD_SEMAPHORE, EFD_CLOEXEC, EFD_NONBLOCK)
  are preserved.

Non-Preserved State:
- File Descriptor Number: The eventfd will be assigned a new fd number
  in the target process after restore.
- Wait Queue State: Any processes blocked on read() operations will be
  woken up and need to re-establish their blocking state.
- All other internal state is reset to default.
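
As background on what the counter value and EFD_SEMAPHORE mean to userspace,
here is a small sketch using the existing eventfd(2) interface through the
libc eventfd_read()/eventfd_write() wrappers. It does not touch LUO at all,
and the helper name eventfd_pending_sum() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

static_assert(sizeof(eventfd_t) == sizeof(uint64_t),
	      "the eventfd counter is 64 bits wide");

/*
 * Signal an eventfd twice without reading in between, then read once.
 * Returns the value read, or 0 on any error.
 */
static uint64_t eventfd_pending_sum(uint64_t first, uint64_t second, int flags)
{
	uint64_t value = 0;
	int fd = eventfd(0, flags);

	if (fd < 0)
		return 0;

	/* Both writes accumulate in the counter until a reader consumes them. */
	if (eventfd_write(fd, first) == 0 && eventfd_write(fd, second) == 0) {
		eventfd_t got;

		if (eventfd_read(fd, &got) == 0)
			value = got;
	}

	close(fd);
	return value;
}
```

Without EFD_SEMAPHORE a single read returns the accumulated count and resets
it; with EFD_SEMAPHORE each read returns 1 and decrements the counter, which is
why both the count and the flags matter when the state is carried over.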

Changes:
- fs/eventfd.c: Add eventfd_luo_get_state() to safely read eventfd state
  (count and flags), and eventfd_create() helper function.
- fs/eventfd_luo.c: New file implementing LUO file operations:
  preserve, freeze, unpreserve, retrieve, and finish callbacks.
- include/linux/eventfd.h: Export new functions.
- include/linux/kho/abi/eventfd.h: Define the ABI contract with
  eventfd_luo_ser structure for serialization.

Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
---
 fs/Makefile                     |   1 +
 fs/eventfd.c                    |  40 +++++
 fs/eventfd_luo.c                | 250 ++++++++++++++++++++++++++++++++
 include/linux/eventfd.h         |   2 +
 include/linux/kho/abi/eventfd.h |  39 +++++
 kernel/liveupdate/Kconfig       |  16 ++
 6 files changed, 348 insertions(+)
 create mode 100644 fs/eventfd_luo.c
 create mode 100644 include/linux/kho/abi/eventfd.h

diff --git a/fs/Makefile b/fs/Makefile
index 89a8a9d207d1..36d568e6cfc7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-y				+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
+obj-$(CONFIG_LIVEUPDATE_EVENTFD)+= eventfd_luo.o
 obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 9d33a02757d5..9b76cf06135a 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -376,6 +376,40 @@ struct eventfd_ctx *eventfd_ctx_fileget(struct file *file)
 }
 EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
 
+/**
+ * eventfd_luo_get_state - Get eventfd state (count and flags) for LUO
+ * @file: Eventfd file
+ * @count: Output parameter for count value
+ * @flags: Output parameter for flags value
+ *
+ * This function is exported for use by LUO to safely read eventfd state.
+ * Since struct eventfd_ctx is defined in this file, we can access its
+ * members directly here. The function uses the wait queue lock to ensure
+ * atomic access to count.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+int eventfd_luo_get_state(struct file *file, __u64 *count, unsigned int *flags)
+{
+	struct eventfd_ctx *ctx;
+	unsigned long irq_flags;
+
+	ctx = eventfd_ctx_fileget(file);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	/* Read count with lock (flags don't need lock) */
+	spin_lock_irqsave(&ctx->wqh.lock, irq_flags);
+	*count = ctx->count;
+	spin_unlock_irqrestore(&ctx->wqh.lock, irq_flags);
+
+	*flags = ctx->flags;
+
+	eventfd_ctx_put(ctx);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_luo_get_state);
+
 static int do_eventfd(unsigned int count, int flags)
 {
 	struct eventfd_ctx *ctx __free(kfree) = NULL;
@@ -411,6 +445,12 @@ static int do_eventfd(unsigned int count, int flags)
 	return fd_publish(fdf);
 }
 
+int eventfd_create(__u64 count, unsigned int flags)
+{
+	return do_eventfd(count, flags);
+}
+EXPORT_SYMBOL_GPL(eventfd_create);
+
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
 	return do_eventfd(count, flags);
diff --git a/fs/eventfd_luo.c b/fs/eventfd_luo.c
new file mode 100644
index 000000000000..781d90635c52
--- /dev/null
+++ b/fs/eventfd_luo.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2026 KylinSoft Corporation.
+ * Author: Chenghao Duan <duanchenghao@kylinos.cn>
+ */
+
+/**
+ * DOC: Eventfd Preservation via LUO
+ *
+ * Overview
+ * ========
+ *
+ * Event file descriptors (eventfd) can be preserved over a kexec using the Live
+ * Update Orchestrator (LUO) file preservation. This allows userspace applications
+ * that use eventfd for event notification to maintain their state across kernel
+ * updates.
+ *
+ * Eventfd is a simple notification mechanism that uses a 64-bit counter for
+ * signaling events between userspace processes or between userspace and kernel.
+ * The preservation ensures that pending events and configuration are not lost
+ * during kexec.
+ *
+ * The preservation is not intended to be transparent. Only select properties of
+ * the eventfd are preserved. All others are reset to default. The preserved
+ * properties are described below.
+ *
+ * Preserved Properties
+ * ====================
+ *
+ * The following properties of the eventfd are preserved across kexec:
+ *
+ * Counter Value
+ *   The current 64-bit counter value is preserved. This includes any pending
+ *   events that have been signaled but not yet consumed by readers.
+ *
+ * File Flags
+ *   The creation flags (EFD_SEMAPHORE, EFD_CLOEXEC, EFD_NONBLOCK) are preserved.
+ *   These control the behavior of read/write operations and file descriptor
+ *   inheritance.
+ *
+ * Non-Preserved Properties
+ * ========================
+ *
+ * All properties which are not preserved must be assumed to be reset to
+ * default. This section describes some of those properties which may be more of
+ * note.
+ *
+ * File Descriptor Number
+ *   The file descriptor number itself is not preserved. After restore, the
+ *   eventfd will be assigned a new file descriptor number in the target process.
+ *
+ * Wait Queue State
+ *   Any processes currently blocked on read() operations will be woken up and
+ *   need to re-establish their blocking state if desired.
+ *
+ * File Position
+ *   Eventfd files don't have a traditional file position, but any internal
+ *   state related to the file descriptor is reset.
+ */
+
+#include <linux/err.h>
+#include <linux/file.h>
+#include <linux/io.h>
+#include <linux/kexec_handover.h>
+#include <linux/kho/abi/eventfd.h>
+#include <linux/liveupdate.h>
+#include <linux/module.h>
+#include <linux/eventfd.h>
+#include <linux/anon_inodes.h>
+#include <linux/idr.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/kref.h>
+#include <linux/fdtable.h>
+
+static int eventfd_luo_preserve(struct liveupdate_file_op_args *args)
+{
+	struct eventfd_luo_ser *ser;
+	u64 count;
+	unsigned int flags;
+	int err = 0;
+
+	/* Get eventfd state safely */
+	err = eventfd_luo_get_state(args->file, &count, &flags);
+	if (err) {
+		pr_err("Failed to get eventfd state: %d\n", err);
+		return err;
+	}
+
+	ser = kho_alloc_preserve(sizeof(*ser));
+	if (IS_ERR(ser)) {
+		err = PTR_ERR(ser);
+		pr_err("Failed to allocate preserve memory: %d\n", err);
+		return err;
+	}
+
+	/* Save eventfd state */
+	ser->count = count;
+	ser->flags = flags;
+
+	pr_debug("Preserved eventfd: count=%llu, flags=0x%x\n",
+		ser->count, ser->flags);
+
+	/* Return physical address of serialization structure */
+	args->serialized_data = virt_to_phys(ser);
+
+	return 0;
+}
+
+static int eventfd_luo_freeze(struct liveupdate_file_op_args *args)
+{
+	struct eventfd_luo_ser *ser;
+	u64 count;
+	unsigned int flags;
+	int err;
+
+	if (WARN_ON_ONCE(!args->serialized_data))
+		return -EINVAL;
+
+	ser = phys_to_virt(args->serialized_data);
+
+	/* Get current state and update if changed */
+	err = eventfd_luo_get_state(args->file, &count, &flags);
+	if (err)
+		return err;
+
+	if (ser->count != count) {
+		pr_debug("WARNING: Count changed during preserve->freeze! old=%llu, new=%llu\n",
+			 ser->count, count);
+	}
+
+	ser->count = count;
+
+	return 0;
+}
+
+static void eventfd_luo_unpreserve(struct liveupdate_file_op_args *args)
+{
+	struct eventfd_luo_ser *ser;
+
+	if (WARN_ON_ONCE(!args->serialized_data))
+		return;
+
+	ser = phys_to_virt(args->serialized_data);
+	kho_unpreserve_free(ser);
+}
+
+static int eventfd_luo_retrieve(struct liveupdate_file_op_args *args)
+{
+	struct eventfd_luo_ser *ser;
+	struct eventfd_ctx *ctx;
+	struct file *file = NULL;
+	int eventfd;
+
+	if (!args->serialized_data)
+		return -EINVAL;
+	ser = phys_to_virt(args->serialized_data);
+
+	/* Create a new eventfd with the preserved count and flags */
+	eventfd = eventfd_create(ser->count, ser->flags);
+	if (eventfd < 0) {
+		pr_err("Failed to create eventfd: %d\n", eventfd);
+		return eventfd;
+	}
+
+	file = fget(eventfd);
+	if (!file) {
+		pr_err("Failed to get file from fd\n");
+		close_fd(eventfd);
+		return -EBADF;
+	}
+
+	close_fd(eventfd);
+
+	/* Verify the created file has correct internal state */
+	ctx = eventfd_ctx_fileget(file);
+	if (IS_ERR(ctx)) {
+		pr_err("Failed to get context from file\n");
+		fput(file);
+		return PTR_ERR(ctx);
+	}
+
+	eventfd_ctx_put(ctx);
+
+	args->file = file;
+	return 0;
+}
+
+static void eventfd_luo_finish(struct liveupdate_file_op_args *args)
+{
+	struct eventfd_luo_ser *ser;
+
+	if (args->retrieve_status)
+		return;
+
+	if (!args->serialized_data)
+		return;
+
+	ser = phys_to_virt(args->serialized_data);
+	if (!ser)
+		return;
+
+	kho_restore_free(ser);
+}
+
+static bool eventfd_luo_can_preserve(struct liveupdate_file_handler *handler,
+				     struct file *file)
+{
+	struct eventfd_ctx *ctx;
+
+	if (!file->f_op)
+		return false;
+
+	/* Try to get eventfd context - this will fail if not an eventfd */
+	ctx = eventfd_ctx_fileget(file);
+	if (IS_ERR(ctx))
+		return false;
+
+	eventfd_ctx_put(ctx);
+	return true;
+}
+
+static const struct liveupdate_file_ops eventfd_luo_file_ops = {
+	.preserve = eventfd_luo_preserve,
+	.unpreserve = eventfd_luo_unpreserve,
+	.freeze = eventfd_luo_freeze,
+	.retrieve = eventfd_luo_retrieve,
+	.finish = eventfd_luo_finish,
+	.can_preserve = eventfd_luo_can_preserve,
+	.owner = THIS_MODULE,
+};
+
+static struct liveupdate_file_handler eventfd_luo_handler = {
+	.ops = &eventfd_luo_file_ops,
+	.compatible = EVENTFD_LUO_FH_COMPATIBLE,
+};
+
+static int __init eventfd_luo_init(void)
+{
+	int err = liveupdate_register_file_handler(&eventfd_luo_handler);
+
+	if (err && err != -EOPNOTSUPP) {
+		pr_err("Could not register eventfd LUO handler: %pe\n",
+		       ERR_PTR(err));
+		return err;
+	}
+
+	return 0;
+}
+late_initcall(eventfd_luo_init);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index e32bee4345fb..703e1a126c4d 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -35,6 +35,8 @@ void eventfd_ctx_put(struct eventfd_ctx *ctx);
 struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
+int eventfd_luo_get_state(struct file *file, __u64 *count, unsigned int *flags);
+int eventfd_create(__u64 count, unsigned int flags);
 void eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
 				  __u64 *cnt);
diff --git a/include/linux/kho/abi/eventfd.h b/include/linux/kho/abi/eventfd.h
new file mode 100644
index 000000000000..148beac6bcc7
--- /dev/null
+++ b/include/linux/kho/abi/eventfd.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 KylinSoft Corporation.
+ * Author: Chenghao Duan <duanchenghao@kylinos.cn>
+ */
+
+#ifndef _LINUX_KHO_ABI_EVENTFD_H
+#define _LINUX_KHO_ABI_EVENTFD_H
+
+#include <linux/types.h>
+
+/*
+ * Eventfd Live Update ABI
+ *
+ * This header defines the ABI for preserving eventfd state across kexec.
+ *
+ * The state is serialized into a packed structure `struct eventfd_luo_ser`
+ * which is handed over to the next kernel via the KHO mechanism.
+ *
+ */
+
+/**
+ * struct eventfd_luo_ser - Serialized state of an eventfd
+ * @count: The current counter value
+ * @flags: File flags (EFD_SEMAPHORE, EFD_CLOEXEC, EFD_NONBLOCK)
+ *
+ * This structure contains the minimal state needed to restore an eventfd
+ * after kexec. The count represents the current value of the event counter,
+ * and flags represent the file creation flags.
+ */
+struct eventfd_luo_ser {
+	__u64 count;
+	unsigned int  flags;
+} __packed;
+
+/* The compatibility string for eventfd file handler */
+#define EVENTFD_LUO_FH_COMPATIBLE	"eventfd-v1"
+
+#endif /* _LINUX_KHO_ABI_EVENTFD_H */
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index c13af38ba23a..1361b5733f41 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -86,4 +86,20 @@ config LIVEUPDATE_MEMFD
 
 	  If unsure, say N.
 
+config LIVEUPDATE_EVENTFD
+	bool "Eventfd Live Update Orchestrator support"
+	depends on EVENTFD
+	depends on LIVEUPDATE
+	help
+	  Enable Live Update Orchestrator support for eventfd file descriptors.
+	  This allows eventfd files to be preserved and restored across kexec
+	  operations, maintaining their counter values and flags.
+
+	  Eventfd files are commonly used for event notification between
+	  userspace processes or between userspace and kernel. With this
+	  option enabled, eventfd state can be handed over to a new kernel
+	  during live update operations.
+
+	  If unsure, say N.
+
 endmenu
-- 
2.25.1



^ permalink raw reply related

* [PATCH v1 2/2] selftests: liveupdate: Add selftest for eventfd LUO
From: Chenghao Duan @ 2026-06-25  5:49 UTC (permalink / raw)
  To: viro, brauner, jack, linux-fsdevel, pasha.tatashin, linux-kernel,
	rppt, pratyush, kexec, linux-mm
  Cc: jianghaoran, duanchenghao
In-Reply-To: <20260625054946.73445-1-duanchenghao@kylinos.cn>

This test verifies the Live Update Orchestrator (LUO) support for
preserving eventfd file descriptors across kexec. It creates multiple
LUO sessions, each preserving different eventfd types, and verifies
their state after kexec.

The test covers six different eventfd configurations:
1. **Empty eventfd** - Zero count, default flags
   - Verifies flag preservation without behavioral testing
2. **Default eventfd** - Initial count + write, verifies count preservation
   - Tests basic counter value retention across kexec
3. **Semaphore eventfd** - EFD_SEMAPHORE flag, multiple reads
   - Verifies semaphore behavior (returns 1 per read)
4. **Non-blocking eventfd** - EFD_NONBLOCK flag
   - Tests O_NONBLOCK flag preservation
5. **Large-count eventfd** - UINT_MAX count value
   - Tests handling of maximum counter values
6. **Modified-after-preserve** - Count changed during handover
   - Verifies freeze callback captures final state

The test validates the following sequence:

Stage 1 (pre-kexec):
- Creates state file for inter-stage communication
- Creates multiple LUO sessions
- Preserves eventfds with different configurations
- Triggers kexec reboot

Stage 2 (post-kexec):
- Retrieves preserved eventfd sessions
- Verifies flags and counter values for each type
- Tests semaphore read behavior
- Finalizes all sessions

Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
---
 tools/testing/selftests/liveupdate/Makefile   |   1 +
 tools/testing/selftests/liveupdate/config     |   2 +
 .../selftests/liveupdate/luo_test_eventfd.c   | 376 ++++++++++++++++++
 3 files changed, 379 insertions(+)
 create mode 100644 tools/testing/selftests/liveupdate/luo_test_eventfd.c

diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile
index 30689d22cb02..e092e38fa6f6 100644
--- a/tools/testing/selftests/liveupdate/Makefile
+++ b/tools/testing/selftests/liveupdate/Makefile
@@ -8,6 +8,7 @@ TEST_GEN_PROGS_EXTENDED += luo_kexec_simple
 TEST_GEN_PROGS_EXTENDED += luo_multi_session
 TEST_GEN_PROGS_EXTENDED += luo_stress_sessions
 TEST_GEN_PROGS_EXTENDED += luo_stress_files
+TEST_GEN_PROGS_EXTENDED += luo_test_eventfd
 
 TEST_FILES += do_kexec.sh
 
diff --git a/tools/testing/selftests/liveupdate/config b/tools/testing/selftests/liveupdate/config
index 91d03f9a6a39..d388bd755245 100644
--- a/tools/testing/selftests/liveupdate/config
+++ b/tools/testing/selftests/liveupdate/config
@@ -9,3 +9,5 @@ CONFIG_LIVEUPDATE_TEST=y
 CONFIG_MEMFD_CREATE=y
 CONFIG_TMPFS=y
 CONFIG_SHMEM=y
+CONFIG_EVENTFD=y
+CONFIG_LIVEUPDATE_EVENTFD=y
diff --git a/tools/testing/selftests/liveupdate/luo_test_eventfd.c b/tools/testing/selftests/liveupdate/luo_test_eventfd.c
new file mode 100644
index 000000000000..94ef3bc66ad9
--- /dev/null
+++ b/tools/testing/selftests/liveupdate/luo_test_eventfd.c
@@ -0,0 +1,376 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2026 KylinSoft Corporation.
+ * Author: Chenghao Duan <duanchenghao@kylinos.cn>
+ */
+
+/*
+ * Multi-session kexec selftest for eventfd LUO support.
+ *
+ * Modeled after luo_multi_session.c.
+ * It creates multiple LUO sessions, each preserving different eventfd types,
+ * and verifies them after kexec.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/eventfd.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+
+#include "luo_test_utils.h"
+
+/* Session names */
+#define SESSION_EMPTY     "eventfd-empty"
+#define SESSION_DEFAULT   "eventfd-default"
+#define SESSION_SEM       "eventfd-sem"
+#define SESSION_NONBLOCK  "eventfd-nonblock"
+#define SESSION_LARGE     "eventfd-large"
+#define SESSION_MODIFIED  "eventfd-modified-after-preserve"
+
+/* Tokens */
+#define TOKEN_DEFAULT       0xCAFEBABE
+#define TOKEN_SEM           0xDEADBEEF
+#define TOKEN_NONBLOCK      0xFEEDBEEF
+#define TOKEN_LARGE         0xBEEFCAFE
+#define TOKEN_MODIFIED      0xABCD1234
+
+/* Counts */
+#define COUNT_INITIAL   42
+#define COUNT_WRITE     10
+#define COUNT_EXPECTED  (COUNT_INITIAL + COUNT_WRITE) /* 52 */
+#define COUNT_SEM       5
+#define COUNT_NONBLOCK  100
+#define COUNT_LARGE     ((unsigned int)-1) /* UINT_MAX */
+#define COUNT_MODIFIED  25
+#define COUNT_MODIFY_DELTA  99
+
+/* State tracking */
+#define STATE_SESSION_NAME "eventfd_multi_state"
+#define STATE_TOKEN        997
+
+/* Eventfd verification modes */
+enum verify_mode {
+	VERIFY_FLAGS,        /* Only verify flags, no behavior testing */
+	VERIFY_ONCE,         /* Read once and verify count */
+	VERIFY_SEMAPHORE,    /* Read multiple times, verify returns 1 each time */
+	VERIFY_NONBLOCK,     /* Read once in nonblock mode */
+	VERIFY_LARGE         /* Read once, verify large count */
+};
+
+/* Test session configuration */
+struct test_session_config {
+	const char *session_name;
+	unsigned long token;
+	unsigned int count;
+	unsigned int flags;
+	enum verify_mode verify_mode;
+	const char *desc;
+};
+
+/* Test session configurations */
+static const struct test_session_config test_configs[] = {
+	{SESSION_EMPTY, 0, 0, 0, VERIFY_FLAGS, "Empty"},
+	{SESSION_DEFAULT, TOKEN_DEFAULT, COUNT_EXPECTED, 0, VERIFY_ONCE, "Default"},
+	{SESSION_SEM, TOKEN_SEM, COUNT_SEM, EFD_SEMAPHORE, VERIFY_SEMAPHORE, "Semaphore"},
+	{SESSION_NONBLOCK, TOKEN_NONBLOCK, COUNT_NONBLOCK, EFD_NONBLOCK, VERIFY_NONBLOCK, "Nonblock"},
+	{SESSION_LARGE, TOKEN_LARGE, COUNT_LARGE, 0, VERIFY_LARGE, "Large-count"},
+	{SESSION_MODIFIED, TOKEN_MODIFIED, COUNT_MODIFIED + COUNT_MODIFY_DELTA, 0, VERIFY_ONCE, "Modified after preserve (kexec handover)"},
+};
+#define NUM_TEST_SESSIONS ARRAY_SIZE(test_configs)
+
+static int verify_eventfd_flags(int efd, unsigned int expected_flags, const char *desc)
+{
+	int actual_flags = fcntl(efd, F_GETFL);
+
+	if (actual_flags < 0)
+		return -errno;
+
+	int expected_fd_flags = expected_flags & EFD_NONBLOCK;
+	int actual_fd_flags = actual_flags & O_NONBLOCK;
+
+	if (actual_fd_flags != expected_fd_flags) {
+		ksft_print_msg("%s: flag mismatch - expected 0x%x, got 0x%x\n",
+			       desc, expected_fd_flags, actual_fd_flags);
+		return -EINVAL;
+	}
+
+	ksft_print_msg("  %s eventfd flags OK (0x%x)\n", desc, expected_fd_flags);
+	return actual_flags; /* Return actual flags for further use */
+}
+
+static int ensure_nonblock(int efd, int current_flags)
+{
+	return fcntl(efd, F_SETFL, current_flags | O_NONBLOCK);
+}
+
+static int verify_count_once(int efd, unsigned int expected_count, const char *desc)
+{
+	uint64_t val;
+
+	if (read(efd, &val, sizeof(val)) != sizeof(val))
+		return -errno;
+
+	if (val != (uint64_t)expected_count) {
+		ksft_print_msg("%s: expected %u got %llu\n",
+			       desc, expected_count, (unsigned long long)val);
+		return -EINVAL;
+	}
+
+	ksft_print_msg("  %s eventfd OK: %u\n", desc, expected_count);
+	return 0;
+}
+
+static int verify_semaphore_behavior(int efd, unsigned int expected_count, const char *desc)
+{
+	uint64_t val;
+
+	/* Read expected_count times, each should return 1 */
+	for (unsigned int i = 0; i < expected_count; i++) {
+		if (read(efd, &val, sizeof(val)) != sizeof(val))
+			return -errno;
+
+		if (val != 1) {
+			ksft_print_msg("%s: expected 1, got %llu at read %u\n",
+				       desc, (unsigned long long)val, i + 1);
+			return -EINVAL;
+		}
+		ksft_print_msg("  %s eventfd OK: %u at read %u\n", desc, (unsigned int)val, i + 1);
+	}
+
+	/* Next read should return EAGAIN (no more events) */
+	if (read(efd, &val, sizeof(val)) >= 0 || errno != EAGAIN) {
+		ksft_print_msg("%s: expected EAGAIN after %u reads\n",
+			       desc, expected_count);
+		return -EINVAL;
+	}
+
+	ksft_print_msg("  %s eventfd OK (%u reads)\n", desc, expected_count);
+	return 0;
+}
+
+static int restore_and_verify_eventfd_generic(int session_fd,
+					      unsigned long token,
+					      unsigned int expected_count,
+					      unsigned int expected_flags,
+					      enum verify_mode mode,
+					      const char *desc)
+{
+	struct liveupdate_session_retrieve_fd arg = { .size = sizeof(arg) };
+	int efd, ret = 0, actual_flags;
+
+	arg.token = token;
+	if (ioctl(session_fd, LIVEUPDATE_SESSION_RETRIEVE_FD, &arg) < 0)
+		return -errno;
+	efd = arg.fd;
+
+	switch (mode) {
+	case VERIFY_FLAGS:
+		/* Only verify flags, no behavior testing */
+		ret = verify_eventfd_flags(efd, expected_flags, desc);
+		if (ret > 0)
+			ret = 0;
+		break;
+
+	case VERIFY_SEMAPHORE:
+		/* Verify flags + semaphore behavior */
+		actual_flags = verify_eventfd_flags(efd, expected_flags, desc);
+		if (actual_flags < 0) {
+			ret = actual_flags;
+			goto out;
+		}
+
+		if (ensure_nonblock(efd, actual_flags) < 0) {
+			ret = -errno;
+			goto out;
+		}
+
+		ret = verify_semaphore_behavior(efd, expected_count, desc);
+		break;
+
+	case VERIFY_ONCE:
+	case VERIFY_NONBLOCK:
+	case VERIFY_LARGE:
+		/* Verify flags + count behavior */
+		actual_flags = verify_eventfd_flags(efd, expected_flags, desc);
+		if (actual_flags < 0) {
+			ret = actual_flags;
+			goto out;
+		}
+
+		ret = verify_count_once(efd, expected_count, desc);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+out:
+	close(efd);
+	return ret;
+}
+
+static int create_and_preserve_eventfd_keep_fd(int session_fd,
+					       unsigned long token,
+					       unsigned int count,
+					       unsigned int flags,
+					       const char *desc,
+					       int *efd_out)
+{
+	struct liveupdate_session_preserve_fd arg = { .size = sizeof(arg) };
+	int efd = eventfd(count, flags);
+
+	if (efd < 0)
+		return -errno;
+
+	arg.fd = efd;
+	arg.token = token;
+	if (ioctl(session_fd, LIVEUPDATE_SESSION_PRESERVE_FD, &arg) < 0) {
+		int ret = -errno;
+
+		close(efd);
+		return ret;
+	}
+
+	if (efd_out)
+		*efd_out = efd;
+	else
+		close(efd);
+	return 0;
+}
+
+static int create_and_preserve_eventfd(int session_fd,
+				       unsigned long token,
+				       unsigned int count,
+				       unsigned int flags,
+				       const char *desc)
+{
+	return create_and_preserve_eventfd_keep_fd(session_fd, token, count,
+						   flags, desc, NULL);
+}
+
+static int create_session_checked(int luo_fd, const char *session_name)
+{
+	int session_fd = luo_create_session(luo_fd, session_name);
+
+	if (session_fd < 0)
+		fail_exit("luo_create_session for '%s'", session_name);
+	return session_fd;
+}
+
+static int retrieve_session_checked(int luo_fd, const char *session_name)
+{
+	int session_fd = luo_retrieve_session(luo_fd, session_name);
+
+	if (session_fd < 0)
+		fail_exit("luo_retrieve_session for '%s'", session_name);
+	return session_fd;
+}
+
+static void finish_session_checked(int session_fd, const char *session_name)
+{
+	if (luo_session_finish(session_fd) < 0)
+		fail_exit("luo_session_finish for '%s'", session_name);
+	close(session_fd);
+}
+
+static int verify_eventfd_config(int session_fd, const struct test_session_config *config)
+{
+	return restore_and_verify_eventfd_generic(session_fd, config->token,
+						 config->count, config->flags,
+						 config->verify_mode, config->desc);
+}
+
+
+static void run_stage_1(int luo_fd)
+{
+	ksft_print_msg("[STAGE 1] Starting pre-kexec setup for multi-eventfd test...\n");
+	ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n");
+	create_state_file(luo_fd, STATE_SESSION_NAME, STATE_TOKEN, 2);
+
+	/* Create all test sessions */
+	for (size_t i = 0; i < NUM_TEST_SESSIONS; i++) {
+		const struct test_session_config *config = &test_configs[i];
+
+		ksft_print_msg("[STAGE 1] Creating session '%s'...\n", config->session_name);
+		int session_fd = create_session_checked(luo_fd, config->session_name);
+
+		/* Special handling for modified session (preserve then modify) */
+		if (config->token == TOKEN_MODIFIED) {
+			int preserved_efd;
+
+			if (create_and_preserve_eventfd_keep_fd(session_fd,
+								config->token,
+								COUNT_MODIFIED,
+								config->flags,
+								"modified-after-preserve",
+								&preserved_efd) < 0)
+				fail_exit("create_and_preserve_eventfd_keep_fd modified");
+
+			/* Now modify the preserved eventfd's count */
+			uint64_t modify_value = COUNT_MODIFY_DELTA;
+
+			if (write(preserved_efd, &modify_value,
+			    sizeof(modify_value)) != sizeof(modify_value))
+				fail_exit("write to preserved eventfd after preserve");
+
+			close(preserved_efd);
+		} else {
+			/* Standard session creation */
+			if (create_and_preserve_eventfd(session_fd, config->token,
+							config->count, config->flags,
+							config->desc) < 0)
+				fail_exit("create_and_preserve_eventfd %s", config->desc);
+		}
+	}
+
+	close(luo_fd);
+	daemonize_and_wait();
+}
+
+static void run_stage_2(int luo_fd, int state_session_fd)
+{
+	int session_fds[NUM_TEST_SESSIONS];
+	int stage;
+
+	ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n");
+
+	restore_and_read_stage(state_session_fd, STATE_TOKEN, &stage);
+	if (stage != 2)
+		fail_exit("Expected stage 2, but state file contains %d", stage);
+
+	ksft_print_msg("[STAGE 2] Retrieving all sessions...\n");
+	for (size_t i = 0; i < NUM_TEST_SESSIONS; i++)
+		session_fds[i] = retrieve_session_checked(luo_fd, test_configs[i].session_name);
+
+	ksft_print_msg("[STAGE 2] Verifying eventfds...\n");
+	for (size_t i = 0; i < NUM_TEST_SESSIONS; i++) {
+		if (verify_eventfd_config(session_fds[i], &test_configs[i]) < 0)
+			fail_exit("verify %s eventfd", test_configs[i].desc);
+	}
+
+	ksft_print_msg("[STAGE 2] All eventfd sessions verified successfully.\n");
+
+	ksft_print_msg("[STAGE 2] Finalizing all sessions...\n");
+	for (size_t i = 0; i < NUM_TEST_SESSIONS; i++)
+		finish_session_checked(session_fds[i], test_configs[i].session_name);
+
+	ksft_print_msg("[STAGE 2] Finalizing state session...\n");
+	if (luo_session_finish(state_session_fd) < 0)
+		fail_exit("luo_session_finish for state session");
+	close(state_session_fd);
+
+	ksft_print_msg("\n--- EVENTFD_LUO TEST PASSED ---\n");
+}
+
+int main(int argc, char *argv[])
+{
+	return luo_test(argc, argv, STATE_SESSION_NAME,
+			run_stage_1, run_stage_2);
+}
-- 
2.25.1



^ permalink raw reply related

* [PATCH v1 0/2] luo support for preserving eventfd
From: Chenghao Duan @ 2026-06-25  5:49 UTC (permalink / raw)
  To: viro, brauner, jack, linux-fsdevel, pasha.tatashin, linux-kernel,
	rppt, pratyush, kexec, linux-mm
  Cc: jianghaoran, duanchenghao

It is my great honor to participate in the development of LiveUpdate.
The current patch implements logic to preserve and retrieve eventfd
states, which I developed by referencing memfd_luo while learning the
LiveUpdate framework.

eventfd serves as a critical notification mechanism between Guest and
Host. During host kernel upgrades, we can preserve the corresponding
eventfd states and restore them after the kernel update completes.

Patch 0001 implements eventfd_luo, while Patch 0002 contains selftest code.
Test procedures:
    1. ./luo_test_eventfd --stage 1
    2. kexec reboot
    3. ./luo_test_eventfd --stage 2

Chenghao Duan (2):
  eventfd: luo: luo support for preserving eventfd
  selftests: liveupdate: Add selftest for eventfd LUO

 fs/Makefile                                   |   1 +
 fs/eventfd.c                                  |  40 ++
 fs/eventfd_luo.c                              | 250 ++++++++++++
 include/linux/eventfd.h                       |   2 +
 include/linux/kho/abi/eventfd.h               |  39 ++
 kernel/liveupdate/Kconfig                     |  16 +
 tools/testing/selftests/liveupdate/Makefile   |   1 +
 tools/testing/selftests/liveupdate/config     |   2 +
 .../selftests/liveupdate/luo_test_eventfd.c   | 376 ++++++++++++++++++
 9 files changed, 727 insertions(+)
 create mode 100644 fs/eventfd_luo.c
 create mode 100644 include/linux/kho/abi/eventfd.h
 create mode 100644 tools/testing/selftests/liveupdate/luo_test_eventfd.c

-- 
2.25.1



^ permalink raw reply

* Re: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
From: kernel test robot @ 2026-06-25  5:45 UTC (permalink / raw)
  To: Dev Jain, akpm, david, ljs
  Cc: llvm, oe-kbuild-all, Dev Jain, riel, liam, vbabka, harry, jannh,
	kas, linux-mm, linux-kernel, ryan.roberts, anshuman.khandual,
	stable
In-Reply-To: <20260625042853.2752898-1-dev.jain@arm.com>

Hi Dev,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-rmap-use-huge_ptep_get-in-try_to_unmap_one/20260625-123050
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260625042853.2752898-1-dev.jain%40arm.com
patch subject: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
config: hexagon-allnoconfig (https://download.01.org/0day-ci/archive/20260625/202606251341.jfIr1D7m-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 6cc609bb250b21b47fc7d394b4019101e9983597)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260625/202606251341.jfIr1D7m-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606251341.jfIr1D7m-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/rmap.c:2100:13: error: call to undeclared function 'huge_ptep_get'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2100 |                         pteval = huge_ptep_get(mm, address, pvmw.pte);
         |                                  ^
>> mm/rmap.c:2100:11: error: assigning to 'pte_t' from incompatible type 'int'
    2100 |                         pteval = huge_ptep_get(mm, address, pvmw.pte);
         |                                ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   2 errors generated.


vim +/huge_ptep_get +2100 mm/rmap.c

  1980	
  1981	/*
  1982	 * @arg: enum ttu_flags will be passed to this argument
  1983	 */
  1984	static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  1985			     unsigned long address, void *arg)
  1986	{
  1987		struct mm_struct *mm = vma->vm_mm;
  1988		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
  1989		bool anon_exclusive, ret = true;
  1990		pte_t pteval;
  1991		struct page *subpage;
  1992		struct mmu_notifier_range range;
  1993		enum ttu_flags flags = (enum ttu_flags)(long)arg;
  1994		unsigned long nr_pages = 1, end_addr;
  1995		unsigned long pfn;
  1996		unsigned long hsz = 0;
  1997		int ptes = 0;
  1998	
  1999		/*
  2000		 * When racing against e.g. zap_pte_range() on another cpu,
  2001		 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
  2002		 * try_to_unmap() may return before folio_mapped() has become false,
  2003		 * if page table locking is skipped: use TTU_SYNC to wait for that.
  2004		 */
  2005		if (flags & TTU_SYNC)
  2006			pvmw.flags = PVMW_SYNC;
  2007	
  2008		/*
  2009		 * For THP, we have to assume the worst case, i.e. pmd, for invalidation.
  2010		 * For hugetlb, it could be much worse if we need to do pud
  2011		 * invalidation in the case of pmd sharing.
  2012		 *
  2013		 * Note that the folio can not be freed in this function as call of
  2014		 * try_to_unmap() must hold a reference on the folio.
  2015		 */
  2016		range.end = vma_address_end(&pvmw);
  2017		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
  2018					address, range.end);
  2019		if (folio_test_hugetlb(folio)) {
  2020			/*
  2021			 * If sharing is possible, start and end will be adjusted
  2022			 * accordingly.
  2023			 */
  2024			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  2025							     &range.end);
  2026	
  2027			/* We need the huge page size for set_huge_pte_at() */
  2028			hsz = huge_page_size(hstate_vma(vma));
  2029		}
  2030		mmu_notifier_invalidate_range_start(&range);
  2031	
  2032		while (page_vma_mapped_walk(&pvmw)) {
  2033			nr_pages = 1;
  2034	
  2035			/*
  2036			 * If the folio is in an mlock()d vma, we must not swap it out.
  2037			 */
  2038			if (!(flags & TTU_IGNORE_MLOCK) &&
  2039			    (vma->vm_flags & VM_LOCKED)) {
  2040				ptes++;
  2041	
  2042				/*
  2043				 * Set 'ret' to indicate the page cannot be unmapped.
  2044				 *
  2045				 * Do not jump to walk_abort immediately as additional
  2046				 * iteration might be required to detect fully mapped
  2047				 * folio and mlock it.
  2048				 */
  2049				ret = false;
  2050	
  2051				/* Only mlock fully mapped pages */
  2052				if (pvmw.pte && ptes != pvmw.nr_pages)
  2053					continue;
  2054	
  2055				/*
  2056				 * All PTEs must be protected by page table lock in
  2057				 * order to mlock the page.
  2058				 *
  2059				 * If page table boundary has been cross, current ptl
  2060				 * only protect part of ptes.
  2061				 */
  2062				if (pvmw.flags & PVMW_PGTABLE_CROSSED)
  2063					goto walk_done;
  2064	
  2065				/* Restore the mlock which got missed */
  2066				mlock_vma_folio(folio, vma);
  2067				goto walk_done;
  2068			}
  2069	
  2070			if (!pvmw.pte) {
  2071				if (folio_test_lazyfree(folio)) {
  2072					if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
  2073						goto walk_done;
  2074					/*
  2075					 * unmap_huge_pmd_locked has either already marked
  2076					 * the folio as swap-backed or decided to retain it
  2077					 * due to GUP or speculative references.
  2078					 */
  2079					goto walk_abort;
  2080				}
  2081	
  2082				if (flags & TTU_SPLIT_HUGE_PMD) {
  2083					/*
  2084					 * We temporarily have to drop the PTL and
  2085					 * restart so we can process the PTE-mapped THP.
  2086					 */
  2087					split_huge_pmd_locked(vma, pvmw.address,
  2088							      pvmw.pmd, false);
  2089					flags &= ~TTU_SPLIT_HUGE_PMD;
  2090					page_vma_mapped_walk_restart(&pvmw);
  2091					continue;
  2092				}
  2093			}
  2094	
  2095			/* Unexpected PMD-mapped THP? */
  2096			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  2097	
  2098			address = pvmw.address;
  2099			if (folio_test_hugetlb(folio)) {
> 2100				pteval = huge_ptep_get(mm, address, pvmw.pte);
  2101			} else {
  2102				/*
  2103				 * Handle PFN swap PTEs, such as device-exclusive ones,
  2104				 * that actually map pages.
  2105				 */
  2106				pteval = ptep_get(pvmw.pte);
  2107			}
  2108			if (likely(pte_present(pteval))) {
  2109				pfn = pte_pfn(pteval);
  2110			} else {
  2111				const softleaf_t entry = softleaf_from_pte(pteval);
  2112	
  2113				pfn = softleaf_to_pfn(entry);
  2114				VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  2115			}
  2116	
  2117			subpage = folio_page(folio, pfn - folio_pfn(folio));
  2118			anon_exclusive = folio_test_anon(folio) &&
  2119					 PageAnonExclusive(subpage);
  2120	
  2121			if (folio_test_hugetlb(folio)) {
  2122				bool anon = folio_test_anon(folio);
  2123	
  2124				/*
  2125				 * The try_to_unmap() is only passed a hugetlb page
  2126				 * in the case where the hugetlb page is poisoned.
  2127				 */
  2128				VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
  2129				/*
  2130				 * huge_pmd_unshare may unmap an entire PMD page.
  2131				 * There is no way of knowing exactly which PMDs may
  2132				 * be cached for this mm, so we must flush them all.
  2133				 * start/end were already adjusted above to cover this
  2134				 * range.
  2135				 */
  2136				flush_cache_range(vma, range.start, range.end);
  2137	
  2138				/*
  2139				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2140				 * held in write mode.  Caller needs to explicitly
  2141				 * do this outside rmap routines.
  2142				 *
  2143				 * We also must hold hugetlb vma_lock in write mode.
  2144				 * Lock order dictates acquiring vma_lock BEFORE
  2145				 * i_mmap_rwsem.  We can only try lock here and fail
  2146				 * if unsuccessful.
  2147				 */
  2148				if (!anon) {
  2149					struct mmu_gather tlb;
  2150	
  2151					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2152					if (!hugetlb_vma_trylock_write(vma))
  2153						goto walk_abort;
  2154	
  2155					tlb_gather_mmu_vma(&tlb, vma);
  2156					if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
  2157						hugetlb_vma_unlock_write(vma);
  2158						huge_pmd_unshare_flush(&tlb, vma);
  2159						tlb_finish_mmu(&tlb);
  2160						/*
  2161						 * The PMD table was unmapped,
  2162						 * consequently unmapping the folio.
  2163						 */
  2164						goto walk_done;
  2165					}
  2166					hugetlb_vma_unlock_write(vma);
  2167					tlb_finish_mmu(&tlb);
  2168				}
  2169				pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
  2170				if (pte_dirty(pteval))
  2171					folio_mark_dirty(folio);
  2172			} else if (likely(pte_present(pteval))) {
  2173				nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
  2174				end_addr = address + nr_pages * PAGE_SIZE;
  2175				flush_cache_range(vma, address, end_addr);
  2176	
  2177				/* Nuke the page table entry. */
  2178				pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
  2179				/*
  2180				 * We clear the PTE but do not flush so potentially
  2181				 * a remote CPU could still be writing to the folio.
  2182				 * If the entry was previously clean then the
  2183				 * architecture must guarantee that a clear->dirty
  2184				 * transition on a cached TLB entry is written through
  2185				 * and traps if the PTE is unmapped.
  2186				 */
  2187				if (should_defer_flush(mm, flags))
  2188					set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
  2189				else
  2190					flush_tlb_range(vma, address, end_addr);
  2191				if (pte_dirty(pteval))
  2192					folio_mark_dirty(folio);
  2193			} else {
  2194				pte_clear(mm, address, pvmw.pte);
  2195			}
  2196	
  2197			/*
  2198			 * Now the pte is cleared. If this pte was uffd-wp armed,
  2199			 * we may want to replace a none pte with a marker pte if
  2200			 * it's file-backed, so we don't lose the tracking info.
  2201			 */
  2202			pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
  2203	
  2204			/* Update high watermark before we lower rss */
  2205			update_hiwater_rss(mm);
  2206	
  2207			if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
  2208				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2209				if (folio_test_hugetlb(folio)) {
  2210					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2211					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2212							hsz);
  2213				} else {
  2214					dec_mm_counter(mm, mm_counter(folio));
  2215					set_pte_at(mm, address, pvmw.pte, pteval);
  2216				}
  2217			} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
  2218				   !userfaultfd_armed(vma)) {
  2219				/*
  2220				 * The guest indicated that the page content is of no
  2221				 * interest anymore. Simply discard the pte, vmscan
  2222				 * will take care of the rest.
  2223				 * A future reference will then fault in a new zero
  2224				 * page. When userfaultfd is active, we must not drop
  2225				 * this page though, as its main user (postcopy
  2226				 * migration) will not expect userfaults on already
  2227				 * copied pages.
  2228				 */
  2229				dec_mm_counter(mm, mm_counter(folio));
  2230			} else if (folio_test_anon(folio)) {
  2231				swp_entry_t entry = page_swap_entry(subpage);
  2232				pte_t swp_pte;
  2233				/*
  2234				 * Store the swap location in the pte.
  2235				 * See handle_pte_fault() ...
  2236				 */
  2237				if (unlikely(folio_test_swapbacked(folio) !=
  2238						folio_test_swapcache(folio))) {
  2239					WARN_ON_ONCE(1);
  2240					goto walk_abort;
  2241				}
  2242	
  2243				/* MADV_FREE page check */
  2244				if (!folio_test_swapbacked(folio)) {
  2245					int ref_count, map_count;
  2246	
  2247					/*
  2248					 * Synchronize with gup_pte_range():
  2249					 * - clear PTE; barrier; read refcount
  2250					 * - inc refcount; barrier; read PTE
  2251					 */
  2252					smp_mb();
  2253	
  2254					ref_count = folio_ref_count(folio);
  2255					map_count = folio_mapcount(folio);
  2256	
  2257					/*
  2258					 * Order reads for page refcount and dirty flag
  2259					 * (see comments in __remove_mapping()).
  2260					 */
  2261					smp_rmb();
  2262	
  2263					if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
  2264						/*
  2265						 * redirtied either using the page table or a previously
  2266						 * obtained GUP reference.
  2267						 */
  2268						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2269						folio_set_swapbacked(folio);
  2270						goto walk_abort;
  2271					} else if (ref_count != 1 + map_count) {
  2272						/*
  2273						 * Additional reference. Could be a GUP reference or any
  2274						 * speculative reference. GUP users must mark the folio
  2275						 * dirty if there was a modification. This folio cannot be
  2276						 * reclaimed right now either way, so act just like nothing
  2277						 * happened.
  2278						 * We'll come back here later and detect if the folio was
  2279						 * dirtied when the additional reference is gone.
  2280						 */
  2281						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2282						goto walk_abort;
  2283					}
  2284					add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
  2285					goto discard;
  2286				}
  2287	
  2288				if (folio_dup_swap(folio, subpage) < 0) {
  2289					set_pte_at(mm, address, pvmw.pte, pteval);
  2290					goto walk_abort;
  2291				}
  2292	
  2293				/*
  2294				 * arch_unmap_one() is expected to be a NOP on
  2295				 * architectures where we could have PFN swap PTEs,
  2296				 * so we'll not check/care.
  2297				 */
  2298				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2299					folio_put_swap(folio, subpage);
  2300					set_pte_at(mm, address, pvmw.pte, pteval);
  2301					goto walk_abort;
  2302				}
  2303	
  2304				/* See folio_try_share_anon_rmap(): clear PTE first. */
  2305				if (anon_exclusive &&
  2306				    folio_try_share_anon_rmap_pte(folio, subpage)) {
  2307					folio_put_swap(folio, subpage);
  2308					set_pte_at(mm, address, pvmw.pte, pteval);
  2309					goto walk_abort;
  2310				}
  2311				if (list_empty(&mm->mmlist)) {
  2312					spin_lock(&mmlist_lock);
  2313					if (list_empty(&mm->mmlist))
  2314						list_add(&mm->mmlist, &init_mm.mmlist);
  2315					spin_unlock(&mmlist_lock);
  2316				}
  2317				dec_mm_counter(mm, MM_ANONPAGES);
  2318				inc_mm_counter(mm, MM_SWAPENTS);
  2319				swp_pte = swp_entry_to_pte(entry);
  2320				if (anon_exclusive)
  2321					swp_pte = pte_swp_mkexclusive(swp_pte);
  2322				if (likely(pte_present(pteval))) {
  2323					if (pte_soft_dirty(pteval))
  2324						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2325					if (pte_uffd_wp(pteval))
  2326						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2327				} else {
  2328					if (pte_swp_soft_dirty(pteval))
  2329						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2330					if (pte_swp_uffd_wp(pteval))
  2331						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2332				}
  2333				set_pte_at(mm, address, pvmw.pte, swp_pte);
  2334			} else {
  2335				/*
  2336				 * This is a locked file-backed folio,
  2337				 * so it cannot be removed from the page
  2338				 * cache and replaced by a new folio before
  2339				 * mmu_notifier_invalidate_range_end, so no
  2340				 * concurrent thread might update its page table
  2341				 * to point at a new folio while a device is
  2342				 * still using this folio.
  2343				 *
  2344				 * See Documentation/mm/mmu_notifier.rst
  2345				 */
  2346				add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
  2347			}
  2348	discard:
  2349			if (unlikely(folio_test_hugetlb(folio))) {
  2350				hugetlb_remove_rmap(folio);
  2351			} else {
  2352				folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
  2353			}
  2354			if (vma->vm_flags & VM_LOCKED)
  2355				mlock_drain_local();
  2356			folio_put_refs(folio, nr_pages);
  2357	
  2358			/*
  2359			 * If we are sure that we batched the entire folio and cleared
  2360			 * all PTEs, we can just optimize and stop right here.
  2361			 */
  2362			if (nr_pages == folio_nr_pages(folio))
  2363				goto walk_done;
  2364			continue;
  2365	walk_abort:
  2366			ret = false;
  2367	walk_done:
  2368			page_vma_mapped_walk_done(&pvmw);
  2369			break;
  2370		}
  2371	
  2372		mmu_notifier_invalidate_range_end(&range);
  2373	
  2374		return ret;
  2375	}
  2376	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
From: kernel test robot @ 2026-06-25  5:45 UTC (permalink / raw)
  To: Dev Jain, akpm, david, ljs
  Cc: oe-kbuild-all, Dev Jain, riel, liam, vbabka, harry, jannh, kas,
	linux-mm, linux-kernel, ryan.roberts, anshuman.khandual, stable
In-Reply-To: <20260625042853.2752898-1-dev.jain@arm.com>

Hi Dev,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-rmap-use-huge_ptep_get-in-try_to_unmap_one/20260625-123050
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260625042853.2752898-1-dev.jain%40arm.com
patch subject: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260625/202606251311.CCKYInqf-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260625/202606251311.CCKYInqf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606251311.CCKYInqf-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:2100:34: error: implicit declaration of function 'huge_ptep_get' [-Werror=implicit-function-declaration]
    2100 |                         pteval = huge_ptep_get(mm, address, pvmw.pte);
         |                                  ^~~~~~~~~~~~~
>> mm/rmap.c:2100:34: error: incompatible types when assigning to type 'pte_t' from type 'int'
   cc1: some warnings being treated as errors
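
The second error follows from the first: with no prototype in scope, the
compiler assumes huge_ptep_get() returns int, and an int cannot be
assigned to pte_t. One common way to keep such a call site building when
a feature is configured out is a stub behind the config guard. The sketch
below is a stand-alone userspace model of that pattern, not the kernel's
headers: pte_t here is a stand-in type, and MODEL_HUGETLB_PAGE and every
model_* helper are hypothetical names chosen for illustration. Whether
the right fix for this patch is a stub, an extra include, or a guard at
the call site is left to the patch author.

```c
#include <stdbool.h>

/* Stand-in for pte_t: a struct, so it cannot be assigned from an int. */
typedef struct { unsigned long pte; } pte_t;

/* Plain PTE read, playing the role ptep_get() has in the listing. */
static inline pte_t model_ptep_get(const pte_t *ptep)
{
	return *ptep;
}

#ifdef MODEL_HUGETLB_PAGE
/* Feature configured in: the accessor is declared and does real work. */
static inline pte_t model_huge_ptep_get(const pte_t *ptep)
{
	return *ptep;
}
#else
/*
 * Feature configured out: a stub with the same signature keeps callers
 * compiling. It is never reached, because model_is_hugetlb() is a
 * constant false in this configuration.
 */
static inline pte_t model_huge_ptep_get(const pte_t *ptep)
{
	(void)ptep;
	return (pte_t){ 0 };
}
#endif

/* Hypothetical predicate standing in for a hugetlb folio test. */
static inline bool model_is_hugetlb(void)
{
#ifdef MODEL_HUGETLB_PAGE
	return true;
#else
	return false;
#endif
}

/* Same shape as the failing call site: pick the accessor by folio type. */
static pte_t model_read_pte(const pte_t *ptep)
{
	if (model_is_hugetlb())
		return model_huge_ptep_get(ptep);
	return model_ptep_get(ptep);
}
```

Because the stub and the real accessor share a return type, the
assignment at the call site type-checks in both configurations.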


vim +/huge_ptep_get +2100 mm/rmap.c

  1980	
  1981	/*
  1982	 * @arg: enum ttu_flags will be passed to this argument
  1983	 */
  1984	static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  1985			     unsigned long address, void *arg)
  1986	{
  1987		struct mm_struct *mm = vma->vm_mm;
  1988		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
  1989		bool anon_exclusive, ret = true;
  1990		pte_t pteval;
  1991		struct page *subpage;
  1992		struct mmu_notifier_range range;
  1993		enum ttu_flags flags = (enum ttu_flags)(long)arg;
  1994		unsigned long nr_pages = 1, end_addr;
  1995		unsigned long pfn;
  1996		unsigned long hsz = 0;
  1997		int ptes = 0;
  1998	
  1999		/*
  2000		 * When racing against e.g. zap_pte_range() on another cpu,
  2001		 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
  2002		 * try_to_unmap() may return before folio_mapped() has become false,
  2003		 * if page table locking is skipped: use TTU_SYNC to wait for that.
  2004		 */
  2005		if (flags & TTU_SYNC)
  2006			pvmw.flags = PVMW_SYNC;
  2007	
  2008		/*
  2009		 * For THP, we have to assume the worse case ie pmd for invalidation.
  2010		 * For hugetlb, it could be much worse if we need to do pud
  2011		 * invalidation in the case of pmd sharing.
  2012		 *
  2013		 * Note that the folio can not be freed in this function as call of
  2014		 * try_to_unmap() must hold a reference on the folio.
  2015		 */
  2016		range.end = vma_address_end(&pvmw);
  2017		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
  2018					address, range.end);
  2019		if (folio_test_hugetlb(folio)) {
  2020			/*
  2021			 * If sharing is possible, start and end will be adjusted
  2022			 * accordingly.
  2023			 */
  2024			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  2025							     &range.end);
  2026	
  2027			/* We need the huge page size for set_huge_pte_at() */
  2028			hsz = huge_page_size(hstate_vma(vma));
  2029		}
  2030		mmu_notifier_invalidate_range_start(&range);
  2031	
  2032		while (page_vma_mapped_walk(&pvmw)) {
  2033			nr_pages = 1;
  2034	
  2035			/*
  2036			 * If the folio is in an mlock()d vma, we must not swap it out.
  2037			 */
  2038			if (!(flags & TTU_IGNORE_MLOCK) &&
  2039			    (vma->vm_flags & VM_LOCKED)) {
  2040				ptes++;
  2041	
  2042				/*
  2043				 * Set 'ret' to indicate the page cannot be unmapped.
  2044				 *
  2045				 * Do not jump to walk_abort immediately as additional
  2046				 * iteration might be required to detect fully mapped
  2047				 * folio an mlock it.
  2048				 */
  2049				ret = false;
  2050	
  2051				/* Only mlock fully mapped pages */
  2052				if (pvmw.pte && ptes != pvmw.nr_pages)
  2053					continue;
  2054	
  2055				/*
  2056				 * All PTEs must be protected by page table lock in
  2057				 * order to mlock the page.
  2058				 *
  2059				 * If page table boundary has been cross, current ptl
  2060				 * only protect part of ptes.
  2061				 */
  2062				if (pvmw.flags & PVMW_PGTABLE_CROSSED)
  2063					goto walk_done;
  2064	
  2065				/* Restore the mlock which got missed */
  2066				mlock_vma_folio(folio, vma);
  2067				goto walk_done;
  2068			}
  2069	
  2070			if (!pvmw.pte) {
  2071				if (folio_test_lazyfree(folio)) {
  2072					if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
  2073						goto walk_done;
  2074					/*
  2075					 * unmap_huge_pmd_locked has either already marked
  2076					 * the folio as swap-backed or decided to retain it
  2077					 * due to GUP or speculative references.
  2078					 */
  2079					goto walk_abort;
  2080				}
  2081	
  2082				if (flags & TTU_SPLIT_HUGE_PMD) {
  2083					/*
  2084					 * We temporarily have to drop the PTL and
  2085					 * restart so we can process the PTE-mapped THP.
  2086					 */
  2087					split_huge_pmd_locked(vma, pvmw.address,
  2088							      pvmw.pmd, false);
  2089					flags &= ~TTU_SPLIT_HUGE_PMD;
  2090					page_vma_mapped_walk_restart(&pvmw);
  2091					continue;
  2092				}
  2093			}
  2094	
  2095			/* Unexpected PMD-mapped THP? */
  2096			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  2097	
  2098			address = pvmw.address;
  2099			if (folio_test_hugetlb(folio)) {
> 2100				pteval = huge_ptep_get(mm, address, pvmw.pte);
  2101			} else {
  2102				/*
  2103				 * Handle PFN swap PTEs, such as device-exclusive ones,
  2104				 * that actually map pages.
  2105				 */
  2106				pteval = ptep_get(pvmw.pte);
  2107			}
  2108			if (likely(pte_present(pteval))) {
  2109				pfn = pte_pfn(pteval);
  2110			} else {
  2111				const softleaf_t entry = softleaf_from_pte(pteval);
  2112	
  2113				pfn = softleaf_to_pfn(entry);
  2114				VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  2115			}
  2116	
  2117			subpage = folio_page(folio, pfn - folio_pfn(folio));
  2118			anon_exclusive = folio_test_anon(folio) &&
  2119					 PageAnonExclusive(subpage);
  2120	
  2121			if (folio_test_hugetlb(folio)) {
  2122				bool anon = folio_test_anon(folio);
  2123	
  2124				/*
  2125				 * The try_to_unmap() is only passed a hugetlb page
  2126				 * in the case where the hugetlb page is poisoned.
  2127				 */
  2128				VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
  2129				/*
  2130				 * huge_pmd_unshare may unmap an entire PMD page.
  2131				 * There is no way of knowing exactly which PMDs may
  2132				 * be cached for this mm, so we must flush them all.
  2133				 * start/end were already adjusted above to cover this
  2134				 * range.
  2135				 */
  2136				flush_cache_range(vma, range.start, range.end);
  2137	
  2138				/*
  2139				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2140				 * held in write mode.  Caller needs to explicitly
  2141				 * do this outside rmap routines.
  2142				 *
  2143				 * We also must hold hugetlb vma_lock in write mode.
  2144				 * Lock order dictates acquiring vma_lock BEFORE
  2145				 * i_mmap_rwsem.  We can only try lock here and fail
  2146				 * if unsuccessful.
  2147				 */
  2148				if (!anon) {
  2149					struct mmu_gather tlb;
  2150	
  2151					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2152					if (!hugetlb_vma_trylock_write(vma))
  2153						goto walk_abort;
  2154	
  2155					tlb_gather_mmu_vma(&tlb, vma);
  2156					if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
  2157						hugetlb_vma_unlock_write(vma);
  2158						huge_pmd_unshare_flush(&tlb, vma);
  2159						tlb_finish_mmu(&tlb);
  2160						/*
  2161						 * The PMD table was unmapped,
  2162						 * consequently unmapping the folio.
  2163						 */
  2164						goto walk_done;
  2165					}
  2166					hugetlb_vma_unlock_write(vma);
  2167					tlb_finish_mmu(&tlb);
  2168				}
  2169				pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
  2170				if (pte_dirty(pteval))
  2171					folio_mark_dirty(folio);
  2172			} else if (likely(pte_present(pteval))) {
  2173				nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
  2174				end_addr = address + nr_pages * PAGE_SIZE;
  2175				flush_cache_range(vma, address, end_addr);
  2176	
  2177				/* Nuke the page table entry. */
  2178				pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
  2179				/*
  2180				 * We clear the PTE but do not flush so potentially
  2181				 * a remote CPU could still be writing to the folio.
  2182				 * If the entry was previously clean then the
  2183				 * architecture must guarantee that a clear->dirty
  2184				 * transition on a cached TLB entry is written through
  2185				 * and traps if the PTE is unmapped.
  2186				 */
  2187				if (should_defer_flush(mm, flags))
  2188					set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
  2189				else
  2190					flush_tlb_range(vma, address, end_addr);
  2191				if (pte_dirty(pteval))
  2192					folio_mark_dirty(folio);
  2193			} else {
  2194				pte_clear(mm, address, pvmw.pte);
  2195			}
  2196	
  2197			/*
  2198			 * Now the pte is cleared. If this pte was uffd-wp armed,
  2199			 * we may want to replace a none pte with a marker pte if
  2200			 * it's file-backed, so we don't lose the tracking info.
  2201			 */
  2202			pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
  2203	
  2204			/* Update high watermark before we lower rss */
  2205			update_hiwater_rss(mm);
  2206	
  2207			if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
  2208				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2209				if (folio_test_hugetlb(folio)) {
  2210					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2211					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2212							hsz);
  2213				} else {
  2214					dec_mm_counter(mm, mm_counter(folio));
  2215					set_pte_at(mm, address, pvmw.pte, pteval);
  2216				}
  2217			} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
  2218				   !userfaultfd_armed(vma)) {
  2219				/*
  2220				 * The guest indicated that the page content is of no
  2221				 * interest anymore. Simply discard the pte, vmscan
  2222				 * will take care of the rest.
  2223				 * A future reference will then fault in a new zero
  2224				 * page. When userfaultfd is active, we must not drop
  2225				 * this page though, as its main user (postcopy
  2226				 * migration) will not expect userfaults on already
  2227				 * copied pages.
  2228				 */
  2229				dec_mm_counter(mm, mm_counter(folio));
  2230			} else if (folio_test_anon(folio)) {
  2231				swp_entry_t entry = page_swap_entry(subpage);
  2232				pte_t swp_pte;
  2233				/*
  2234				 * Store the swap location in the pte.
  2235				 * See handle_pte_fault() ...
  2236				 */
  2237				if (unlikely(folio_test_swapbacked(folio) !=
  2238						folio_test_swapcache(folio))) {
  2239					WARN_ON_ONCE(1);
  2240					goto walk_abort;
  2241				}
  2242	
  2243				/* MADV_FREE page check */
  2244				if (!folio_test_swapbacked(folio)) {
  2245					int ref_count, map_count;
  2246	
  2247					/*
  2248					 * Synchronize with gup_pte_range():
  2249					 * - clear PTE; barrier; read refcount
  2250					 * - inc refcount; barrier; read PTE
  2251					 */
  2252					smp_mb();
  2253	
  2254					ref_count = folio_ref_count(folio);
  2255					map_count = folio_mapcount(folio);
  2256	
  2257					/*
  2258					 * Order reads for page refcount and dirty flag
  2259					 * (see comments in __remove_mapping()).
  2260					 */
  2261					smp_rmb();
  2262	
  2263					if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
  2264						/*
  2265						 * redirtied either using the page table or a previously
  2266						 * obtained GUP reference.
  2267						 */
  2268						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2269						folio_set_swapbacked(folio);
  2270						goto walk_abort;
  2271					} else if (ref_count != 1 + map_count) {
  2272						/*
  2273						 * Additional reference. Could be a GUP reference or any
  2274						 * speculative reference. GUP users must mark the folio
  2275						 * dirty if there was a modification. This folio cannot be
  2276						 * reclaimed right now either way, so act just like nothing
  2277						 * happened.
  2278						 * We'll come back here later and detect if the folio was
  2279						 * dirtied when the additional reference is gone.
  2280						 */
  2281						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2282						goto walk_abort;
  2283					}
  2284					add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
  2285					goto discard;
  2286				}
  2287	
  2288				if (folio_dup_swap(folio, subpage) < 0) {
  2289					set_pte_at(mm, address, pvmw.pte, pteval);
  2290					goto walk_abort;
  2291				}
  2292	
  2293				/*
  2294				 * arch_unmap_one() is expected to be a NOP on
  2295				 * architectures where we could have PFN swap PTEs,
  2296				 * so we'll not check/care.
  2297				 */
  2298				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2299					folio_put_swap(folio, subpage);
  2300					set_pte_at(mm, address, pvmw.pte, pteval);
  2301					goto walk_abort;
  2302				}
  2303	
  2304				/* See folio_try_share_anon_rmap(): clear PTE first. */
  2305				if (anon_exclusive &&
  2306				    folio_try_share_anon_rmap_pte(folio, subpage)) {
  2307					folio_put_swap(folio, subpage);
  2308					set_pte_at(mm, address, pvmw.pte, pteval);
  2309					goto walk_abort;
  2310				}
  2311				if (list_empty(&mm->mmlist)) {
  2312					spin_lock(&mmlist_lock);
  2313					if (list_empty(&mm->mmlist))
  2314						list_add(&mm->mmlist, &init_mm.mmlist);
  2315					spin_unlock(&mmlist_lock);
  2316				}
  2317				dec_mm_counter(mm, MM_ANONPAGES);
  2318				inc_mm_counter(mm, MM_SWAPENTS);
  2319				swp_pte = swp_entry_to_pte(entry);
  2320				if (anon_exclusive)
  2321					swp_pte = pte_swp_mkexclusive(swp_pte);
  2322				if (likely(pte_present(pteval))) {
  2323					if (pte_soft_dirty(pteval))
  2324						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2325					if (pte_uffd_wp(pteval))
  2326						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2327				} else {
  2328					if (pte_swp_soft_dirty(pteval))
  2329						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2330					if (pte_swp_uffd_wp(pteval))
  2331						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2332				}
  2333				set_pte_at(mm, address, pvmw.pte, swp_pte);
  2334			} else {
  2335				/*
  2336				 * This is a locked file-backed folio,
  2337				 * so it cannot be removed from the page
  2338				 * cache and replaced by a new folio before
  2339				 * mmu_notifier_invalidate_range_end, so no
  2340				 * concurrent thread might update its page table
  2341				 * to point at a new folio while a device is
  2342				 * still using this folio.
  2343				 *
  2344				 * See Documentation/mm/mmu_notifier.rst
  2345				 */
  2346				add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
  2347			}
  2348	discard:
  2349			if (unlikely(folio_test_hugetlb(folio))) {
  2350				hugetlb_remove_rmap(folio);
  2351			} else {
  2352				folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
  2353			}
  2354			if (vma->vm_flags & VM_LOCKED)
  2355				mlock_drain_local();
  2356			folio_put_refs(folio, nr_pages);
  2357	
  2358			/*
  2359			 * If we are sure that we batched the entire folio and cleared
  2360			 * all PTEs, we can just optimize and stop right here.
  2361			 */
  2362			if (nr_pages == folio_nr_pages(folio))
  2363				goto walk_done;
  2364			continue;
  2365	walk_abort:
  2366			ret = false;
  2367	walk_done:
  2368			page_vma_mapped_walk_done(&pvmw);
  2369			break;
  2370		}
  2371	
  2372		mmu_notifier_invalidate_range_end(&range);
  2373	
  2374		return ret;
  2375	}
  2376	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock()
From: Harry Yoo @ 2026-06-25  5:40 UTC (permalink / raw)
  To: hu.shengming
  Cc: vbabka, hao.li, cl, rientjes, roman.gushchin, linux-mm,
	linux-kernel, zhang.run, cai.qu
In-Reply-To: <20260624172234202jw3y4yP1YfgOYbPCQdVIw@zte.com.cn>


On 6/24/26 6:22 PM, hu.shengming@zte.com.cn wrote:
> Harry wrote:
>> Currently, k[v]free_rcu() cannot be called in unknown context since
>> it could lead to a deadlock when called in the middle of k[v]free_rcu().
>>
>> Make users' lives easier by introducing kfree_rcu_nolock() variant,
>> now that kfree_rcu_sheaf() is available on PREEMPT_RT and
>> __kfree_rcu_sheaf() handles unknown context.
>>
>> Unlike k[v]free_rcu(), kfree_rcu_nolock() does not fall back to
>> the kvfree_rcu batching when the sheaves path fails, and falls back to
>> defer_kfree_rcu() instead. In most cases, the sheaves path is expected
>> to succeed and it's unnecessary to add complexity to the existing
>> kvfree_rcu batching.
>>
>> Since defer_kfree_rcu() can be called on caches without sheaves, move
>> deferred_work_barrier() and rcu_barrier() outside the branch in
>> kvfree_rcu_barrier_on_cache().
>>
>> Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
> 
> Hi Harry,
> 
> Thanks for the series. These patches fill a clear functional gap in the
> existing free APIs by adding an RCU-deferred free interface for contexts
> where kfree_rcu() cannot safely be used.

Thanks for looking into this, Shengming.

>> ---
>>  include/linux/rcupdate.h | 12 ++++++++++++
>>  mm/slab.h                |  1 +
>>  mm/slab_common.c         | 22 ++++++++++++++++++++--
>>  mm/slub.c                | 23 ++++++++++++++++++++++-
>>  4 files changed, 55 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>> index 807924a94fb0..5a39e6225160 100644
>> --- a/mm/slab_common.c
>> +++ b/mm/slab_common.c
>> @@ -1263,6 +1263,23 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
>>  EXPORT_TRACEPOINT_SYMBOL(kfree);
>>  EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
>>  
>> +void kfree_call_rcu_nolock(struct rcu_head *head, void *ptr)
>> +{
>> +	struct slab *slab;
>> +	struct kmem_cache *s;
>> +
>> +	VM_WARN_ON_ONCE(is_vmalloc_addr(ptr) || !virt_to_slab(ptr));
>> +
>> +	slab = virt_to_slab(ptr);
>> +	s = slab->slab_cache;
>> +
>> +	if (__kfree_rcu_sheaf(s, ptr, /* allow_spin = */ false))
>> +		return;
>> +
> 
> One consistency issue to address here: kfree_rcu_sheaf() only calls
> __kfree_rcu_sheaf() for objects belonging to the local NUMA node. This
> avoids filling a CPU's per-CPU sheaves with objects from remote slabs.
> 
> kfree_call_rcu_nolock() currently skips that check and may therefore
> place remote-node objects into the local CPU's RCU sheaf.

That was intentional, but actually, this is a good point. Thanks.

> Could you add the same local-node check used by kfree_rcu_sheaf()
> before calling __kfree_rcu_sheaf(), and route remote-node objects
> directly to the defer_kfree_rcu() fallback path instead?

Falling back to defer_kfree_rcu() didn't make much sense in v3, as the
object is inserted into a global list, which would cause more trouble
than a NUMA miss.

But once we make the fallback path per-CPU, your suggestion would make
more sense.
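
A rough userspace model of the routing under discussion, assuming the
fallback has already been made per-CPU as described above. Nothing here
is the slab code: the model_* names, the per-CPU counter standing in for
a deferred list, and the sheaf_has_room flag are all hypothetical and
only illustrate the decision order (local-node check first, then the
sheaf attempt, then the fallback).

```c
#include <stdbool.h>

enum model_route { MODEL_ROUTE_SHEAF, MODEL_ROUTE_DEFERRED };

#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU fallback: a counter stands in for each list. */
static unsigned int model_deferred_count[MODEL_NR_CPUS];

/* Stand-in for the sheaf attempt; the flag models success or failure. */
static bool model_try_sheaf(bool sheaf_has_room)
{
	return sheaf_has_room;
}

/*
 * Only objects from the local node are offered to the sheaf. Remote-node
 * objects, and local ones the sheaf cannot take, go to this CPU's
 * deferred list rather than to a single global one.
 */
static enum model_route model_free_rcu_nolock(int cpu, int object_nid,
					      int local_nid,
					      bool sheaf_has_room)
{
	if (object_nid == local_nid && model_try_sheaf(sheaf_has_room))
		return MODEL_ROUTE_SHEAF;

	model_deferred_count[cpu]++;
	return MODEL_ROUTE_DEFERRED;
}
```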

-- 
Cheers,
Harry / Hyeonggon


* [PATCH v4] mm: assert exclusive nid/zonenum bits at the page/folio access sites
From: Hui Zhu @ 2026-06-25  5:39 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	linux-mm, linux-kernel
  Cc: Hui Zhu

From: Hui Zhu <zhuhui@kylinos.cn>

KCSAN reports a data race between page_to_nid()/folio_pgdat() reading
page->flags and folio_trylock()/folio_lock() concurrently doing
test_and_set_bit_lock(PG_locked, ...) on the same word, e.g.:

  BUG: KCSAN: data-race in __lruvec_stat_mod_folio / shmem_get_folio_gfp

The node id and zone id occupy fixed bit-ranges of page->flags that
are set once at page init and never modified afterwards, so they can
never overlap with the low PG_locked/PG_waiters bits touched by the
folio lock path.

ASSERT_EXCLUSIVE_BITS(mdf.f, ...) inside memdesc_nid()/memdesc_zonenum()
checks a by-value copy of the flags word, not the actual shared
page->flags/folio->flags being modified concurrently, so it doesn't
reliably assert anything about the real race. Move the assertion to
page_to_nid(), folio_nid(), page_zonenum() and folio_zonenum(), where
flags is dereferenced directly from the page/folio.
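
The by-value concern can be illustrated with a small userspace model,
assuming the assertion macro watches the address of the expression it is
given. This is a sketch only: MODEL_ASSERT_EXCLUSIVE_BITS is a
hypothetical stand-in that records whether the watched address is the
shared word, and the model_* types, mask and shift are invented for the
example. It says nothing about what the compiler does after inlining in
a real kernel build.

```c
/* Address of the word that other CPUs would modify concurrently. */
static const void *model_shared_word;
/* 1 if the most recent assertion watched that address, else 0. */
static int model_checked_shared;

/* Hypothetical stand-in: note which address the assertion would watch. */
#define MODEL_ASSERT_EXCLUSIVE_BITS(var, mask)                        \
	do {                                                          \
		(void)(mask);                                         \
		model_checked_shared =                                \
			((const void *)&(var) == model_shared_word);  \
	} while (0)

typedef struct { unsigned long f; } model_flags_t;
struct model_page { model_flags_t flags; };

#define MODEL_NODES_MASK  0xffUL
#define MODEL_NODES_SHIFT 8

/* Old placement: the helper gets a copy, so &mdf.f is a local address. */
static int model_nid_by_value(model_flags_t mdf)
{
	MODEL_ASSERT_EXCLUSIVE_BITS(mdf.f,
				    MODEL_NODES_MASK << MODEL_NODES_SHIFT);
	return (int)((mdf.f >> MODEL_NODES_SHIFT) & MODEL_NODES_MASK);
}

/* New placement: the assertion names the shared word itself. */
static int model_page_to_nid(const struct model_page *page)
{
	MODEL_ASSERT_EXCLUSIVE_BITS(page->flags.f,
				    MODEL_NODES_MASK << MODEL_NODES_SHIFT);
	return (int)((page->flags.f >> MODEL_NODES_SHIFT) &
		     MODEL_NODES_MASK);
}
```

Both helpers return the same node id; they differ only in which address
the assertion is pointed at.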

On CONFIG_NUMA=n, NODES_MASK is 0 and the old memdesc_nid() body
folded to a constant, so page->flags/folio->flags was never actually
read. ASSERT_EXCLUSIVE_BITS() is a real runtime check that can't be
folded away, so doing it unconditionally would add a pointless read
of page->flags/folio->flags and a check that can never fire. Keep
page_to_nid()/folio_nid() as plain "return 0" static inline stubs
under CONFIG_NUMA=n instead.

Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
Changelog:
v4:
According to the comments of Andrew and Sashiko, set
page_to_nid()/folio_nid() as static inline stubs returning 0
under CONFIG_NUMA=n.
v3:
According to the comments of Andrew and Sashiko, move
ASSERT_EXCLUSIVE_BITS out of memdesc_nid()/memdesc_zonenum()
into the page/folio call sites.
v2:
According to the comments of David, remove useless comments and use
ASSERT_EXCLUSIVE_BITS() in memdesc_nid() instead of data_race() in
page_to_nid().

 include/linux/mm.h     | 9 +++++++++
 include/linux/mmzone.h | 3 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 485df9c2dbdd..56b39194605a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2294,15 +2294,24 @@ static inline int memdesc_nid(memdesc_flags_t mdf)
 }
 #endif
 
+#ifdef CONFIG_NUMA
 static inline int page_to_nid(const struct page *page)
 {
+	ASSERT_EXCLUSIVE_BITS(PF_POISONED_CHECK(page)->flags,
+			      NODES_MASK << NODES_PGSHIFT);
 	return memdesc_nid(PF_POISONED_CHECK(page)->flags);
 }
 
 static inline int folio_nid(const struct folio *folio)
 {
+	ASSERT_EXCLUSIVE_BITS(folio->flags,
+			      NODES_MASK << NODES_PGSHIFT);
 	return memdesc_nid(folio->flags);
 }
+#else
+#define page_to_nid(page) (0)
+#define folio_nid(folio) (0)
+#endif
 
 #ifdef CONFIG_NUMA_BALANCING
 /* page access time bits needs to hold at least 4 seconds */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca2712187147..56dffa966343 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1274,17 +1274,18 @@ static inline bool zone_is_empty(const struct zone *zone)
 
 static inline enum zone_type memdesc_zonenum(memdesc_flags_t flags)
 {
-	ASSERT_EXCLUSIVE_BITS(flags.f, ZONES_MASK << ZONES_PGSHIFT);
 	return (flags.f >> ZONES_PGSHIFT) & ZONES_MASK;
 }
 
 static inline enum zone_type page_zonenum(const struct page *page)
 {
+	ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
 	return memdesc_zonenum(page->flags);
 }
 
 static inline enum zone_type folio_zonenum(const struct folio *folio)
 {
+	ASSERT_EXCLUSIVE_BITS(folio->flags, ZONES_MASK << ZONES_PGSHIFT);
 	return memdesc_zonenum(folio->flags);
 }
 
-- 
2.43.0
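
For readers following the CONFIG_NUMA=n argument in the changelog, here
is a minimal userspace sketch of the shift-and-mask shape of
memdesc_nid().  The shift value, the lock bit position and all sketch_*
names are assumptions made for illustration, not the kernel's actual
layout.  It shows two things: with a zero mask the result is 0 for any
flags word, and setting a low lock bit does not change the extracted
node id.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout for illustration; real values depend on the config. */
#define SKETCH_NODES_PGSHIFT	54
#define SKETCH_PG_LOCKED	0

/* Same shape as memdesc_nid(): shift the flags word, then mask. */
static int sketch_nid(uint64_t flags, uint64_t nodes_mask)
{
	/* The mask must fit in the bits that remain after the shift. */
	assert(nodes_mask <= (UINT64_MAX >> SKETCH_NODES_PGSHIFT));
	return (int)((flags >> SKETCH_NODES_PGSHIFT) & nodes_mask);
}

/* Model of a locker setting the low lock bit in the same word. */
static uint64_t sketch_set_locked(uint64_t flags)
{
	return flags | (UINT64_C(1) << SKETCH_PG_LOCKED);
}
```

With nodes_mask == 0 the compiler can fold sketch_nid() to a constant,
which is why an unconditional runtime assertion would add a read that
the old code never performed.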




* Re: [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock()
From: Harry Yoo @ 2026-06-25  5:27 UTC (permalink / raw)
  To: XIAO WU, Vlastimil Babka, Andrew Morton, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Alexei Starovoitov, Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf
In-Reply-To: <tencent_12A1049445F2CE248F20CB235AC111353A0A@qq.com>


[-- Attachment #1.1: Type: text/plain, Size: 4556 bytes --]



On 6/22/26 11:56 PM, XIAO WU wrote:
> Hi Harry,
> 
> On Mon, Jun 22, 2026 at 02:28:44PM +0900, Harry Yoo wrote:
>> On 6/21/26 9:29 AM, XIAO WU wrote:
>> > I was able to reproduce this in QEMU with KASAN.  The trigger is as
>> > simple as passing a large (>8KB) kmalloc buffer to the new function.
>>
>> Thanks for taking a look, but this was intentional.
>>
>> I should have documented that only kmalloc_nolock() ->
>> kfree_rcu_nolock() is allowed and kmalloc() -> kfree_rcu_nolock()
>> is not allowed (yet).
>>
>> Since kmalloc_nolock() does not support large kmalloc, the warning
>> is not supposed to trigger. That is why I added only debug warnings.
> 
> Thank you very much for taking the time to explain — I really
> appreciate it, especially since I'm still learning my way around the
> mm/ subsystem.  You are absolutely right that kmalloc_nolock() returns
> NULL for sizes above KMALLOC_MAX_CACHE_SIZE, so a proper caller using
> the kmalloc_nolock() → kfree_rcu_nolock() pairing would never hit this.
> 
> I did notice one small thing that I wanted to gently bring up, though.
> Please forgive me if I'm missing something obvious here.
> 
> When I was reading through the surrounding code to understand the
> pattern better, I noticed that kfree_nolock() — which has the same
> "only for kmalloc_nolock()" constraint (documented in the comment at
> mm/slub.c:6828-6835) — actually does check for a NULL slab:
> 
>   void kfree_nolock(const void *object)
>   {
>       ...
>       slab = virt_to_slab(object);
>       if (unlikely(!slab)) {
>           WARN_ONCE(1, "large_kmalloc is not supported by kfree_nolock()");
>           return;
>       }
>       s = slab->slab_cache;
>       ...
> 
> So kfree_nolock() gracefully returns with a warning even though it too
> expects only kmalloc_nolock() callers.  That pattern seemed really
> sensible to me — it costs almost nothing and prevents a panic if
> someone ever passes the wrong pointer (which they shouldn't, but as you
> mentioned, the constraint isn't documented on kfree_call_rcu_nolock()
> yet).
>
> I also wondered about the difference between WARN_ONCE (used in
> kfree_nolock) and VM_WARN_ON_ONCE (used in kfree_call_rcu_nolock). If
> I understand correctly, VM_WARN_ON_ONCE compiles away entirely on
> production kernels without CONFIG_DEBUG_VM, which would make the
> subsequent NULL dereference completely silent — no warning, just a
> panic.

It would crash without the debug option anyway, but the warnings are
there to make it easier to point out what's gone wrong.

And not testing the code path at least once with a debug option is a
big problem :)

> And since you mentioned that kmalloc() → kfree_rcu_nolock() support is
> planned for the future (the "yet") — wouldn't this code path need the
> NULL check at that point anyway?
> 
> I was thinking something like this would make the function consistent
> with kfree_nolock() and also make it forward-compatible with the
> planned kmalloc() support:
> 
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1266,10 +1266,16 @@ void kfree_call_rcu_nolock(struct rcu_head *head, void *ptr)
>  {
>      struct slab *slab;
>      struct kmem_cache *s;
> 
> -    VM_WARN_ON_ONCE(is_vmalloc_addr(ptr) || !virt_to_slab(ptr));
> -
>      slab = virt_to_slab(ptr);
> +    /*
> +     * kmalloc_nolock() never produces large-kmalloc or vmalloc
> +     * addresses, but be defensive: fall back to defer_kfree_rcu()
> +     * for unsupported pointer types, consistent with kfree_nolock().
> +     */
> +    if (unlikely(!slab))
> +        goto fallback;

Just FYI, virt_to_slab() and virt_to_page()
don't work correctly for vmalloc addresses.

And I don't think silently making it work is good.

> +
>      s = slab->slab_cache;
> 
>      if (__kfree_rcu_sheaf(s, ptr, /* allow_spin = */ false))
>          return;
> 
> +fallback:
>      defer_kfree_rcu(head);
>  }
> 
> Of course, this is just a suggestion — you know this code far better
> than I do.  If you feel the current code is fine as-is with proper
> documentation, I completely understand and won't press the point
> further.
> 
> Either way, thank you again for the explanation, and for working on
> this series — having kfree_rcu_nolock() available for BPF and other
> contexts will be really valuable.
> 
> Thanks,
> XIAO

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
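
The WARN_ONCE vs. VM_WARN_ON_ONCE point discussed above can be sketched
in userspace roughly as follows.  All SKETCH_* and sketch_* names are
hypothetical stand-ins, and the counter stands in for the kernel log.
This is only a model of "a debug-only check compiles away", not the
kernel's macros.

```c
#include <assert.h>

static int sketch_warn_count;	/* stands in for the kernel log */

/* Warn at most once per call site when cond is true. */
#define SKETCH_WARN_ON_ONCE(cond)				\
	do {							\
		static int warned;				\
		if ((cond) && !warned) {			\
			warned = 1;				\
			sketch_warn_count++;			\
		}						\
	} while (0)

#ifdef SKETCH_DEBUG_VM
#define SKETCH_VM_WARN_ON_ONCE(cond)	SKETCH_WARN_ON_ONCE(cond)
#else
/* Type-check cond, but never evaluate it and emit no code. */
#define SKETCH_VM_WARN_ON_ONCE(cond)	((void)sizeof(!(cond)))
#endif

/* Always-on check: warns once, on any build. */
static void sketch_always_checked(const void *slab)
{
	SKETCH_WARN_ON_ONCE(!slab);
}

/* Debug-only check: silent unless SKETCH_DEBUG_VM is defined. */
static void sketch_debug_only_checked(const void *slab)
{
	SKETCH_VM_WARN_ON_ONCE(!slab);
}
```

Built without SKETCH_DEBUG_VM, sketch_debug_only_checked(NULL) records
nothing, while repeated sketch_always_checked(NULL) calls record exactly
one warning.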


* [RFC PATCH v1.1 11/11] mm/damon/sysfs: fix typos in probe_{add,rm}_dirs: s/attr/probe/
From: SeongJae Park @ 2026-06-25  5:07 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, damon, linux-kernel, linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>

damon_sysfs_probe_{add,rm}_dirs() name their damon_sysfs_probe
parameter 'attr'.  Probably a trivial copy-pasta error, but it makes the
code unpleasant to read.  Fix those.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/sysfs.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index f3bb146b204df..36d71f1675426 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -1068,7 +1068,7 @@ static struct damon_sysfs_probe *damon_sysfs_probe_alloc(void)
 	return kzalloc_obj(struct damon_sysfs_probe);
 }
 
-static int damon_sysfs_probe_add_dirs(struct damon_sysfs_probe *attr)
+static int damon_sysfs_probe_add_dirs(struct damon_sysfs_probe *probe)
 {
 	struct damon_sysfs_filters *filters;
 	int err;
@@ -1076,22 +1076,22 @@ static int damon_sysfs_probe_add_dirs(struct damon_sysfs_probe *attr)
 	filters = damon_sysfs_filters_alloc();
 	if (!filters)
 		return -ENOMEM;
-	attr->filters = filters;
+	probe->filters = filters;
 
 	err = kobject_init_and_add(&filters->kobj, &damon_sysfs_filters_ktype,
-			&attr->kobj, "filters");
+			&probe->kobj, "filters");
 	if (err) {
 		kobject_put(&filters->kobj);
-		attr->filters = NULL;
+		probe->filters = NULL;
 	}
 	return err;
 }
 
-static void damon_sysfs_probe_rm_dirs(struct damon_sysfs_probe *attr)
+static void damon_sysfs_probe_rm_dirs(struct damon_sysfs_probe *probe)
 {
-	if (attr->filters) {
-		damon_sysfs_filters_rm_dirs(attr->filters);
-		kobject_put(&attr->filters->kobj);
+	if (probe->filters) {
+		damon_sysfs_filters_rm_dirs(probe->filters);
+		kobject_put(&probe->filters->kobj);
 	}
 }
 
-- 
2.47.3



* [RFC PATCH v1.1 10/11] mm/damon/sysfs: split out filters setup function
From: SeongJae Park @ 2026-06-25  5:07 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, damon, linux-kernel, linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>

damon_sysfs_set_probe() does not only the probe setup but also the
filters setup.  Split out the filters setup for readability.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/sysfs.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index 982d824f63c21..f3bb146b204df 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -1899,16 +1899,11 @@ static int damon_sysfs_set_attrs(struct damon_ctx *ctx,
 	return damon_set_attrs(ctx, &attrs);
 }
 
-static int damon_sysfs_set_probe(struct damon_probe *probe,
-		struct damon_sysfs_probe *sys_probe)
+static int damon_sysfs_set_filters(struct damon_probe *probe,
+		struct damon_sysfs_filters *sys_filters)
 {
-	struct damon_sysfs_filters *sys_filters;
 	int i;
 
-	sys_filters = sys_probe->filters;
-	if (!sys_filters)
-		return 0;
-
 	for (i = 0; i < sys_filters->nr; i++) {
 		struct damon_sysfs_filter *sys_filter =
 			sys_filters->filters_arr[i];
@@ -1935,6 +1930,17 @@ static int damon_sysfs_set_probe(struct damon_probe *probe,
 	return 0;
 }
 
+static int damon_sysfs_set_probe(struct damon_probe *probe,
+		struct damon_sysfs_probe *sys_probe)
+{
+	struct damon_sysfs_filters *sys_filters;
+
+	sys_filters = sys_probe->filters;
+	if (!sys_filters)
+		return 0;
+	return damon_sysfs_set_filters(probe, sys_filters);
+}
+
 static int damon_sysfs_set_probes(struct damon_ctx *ctx,
 		struct damon_sysfs_probes *sys_probes)
 {
-- 
2.47.3



* [RFC PATCH v1.1 09/11] mm/damon/sysfs: split probe setup function out
From: SeongJae Park @ 2026-06-25  5:07 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, damon, linux-kernel, linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>

damon_sysfs_set_probes() is relatively long.  It has two nested loops
for setting up two nested entities, namely probes and filters.  Split
out the probe level setup for readability.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/sysfs.c | 80 ++++++++++++++++++++++++++++--------------------
 1 file changed, 46 insertions(+), 34 deletions(-)

diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index 2e95e3bac774d..982d824f63c21 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -1899,47 +1899,59 @@ static int damon_sysfs_set_attrs(struct damon_ctx *ctx,
 	return damon_set_attrs(ctx, &attrs);
 }
 
-static int damon_sysfs_set_probes(struct damon_ctx *ctx,
-		struct damon_sysfs_probes *sys_probes)
+static int damon_sysfs_set_probe(struct damon_probe *probe,
+		struct damon_sysfs_probe *sys_probe)
 {
+	struct damon_sysfs_filters *sys_filters;
 	int i;
 
-	for (i = 0; i < sys_probes->nr; i++) {
-		struct damon_sysfs_filters *sys_filters =
-			sys_probes->probes_arr[i]->filters;
-		struct damon_probe *c;
-		int j;
+	sys_filters = sys_probe->filters;
+	if (!sys_filters)
+		return 0;
 
-		if (!sys_filters)
-			continue;
-		c = damon_new_probe();
-		if (!c)
+	for (i = 0; i < sys_filters->nr; i++) {
+		struct damon_sysfs_filter *sys_filter =
+			sys_filters->filters_arr[i];
+		struct damon_filter *filter;
+
+		filter = damon_new_filter(sys_filter->type,
+				sys_filter->matching,
+				sys_filter->allow);
+		if (!filter)
 			return -ENOMEM;
-		damon_add_probe(ctx, c);
-
-		for (j = 0; j < sys_filters->nr; j++) {
-			struct damon_sysfs_filter *sys_filter =
-				sys_filters->filters_arr[j];
-			struct damon_filter *filter;
-
-			filter = damon_new_filter(sys_filter->type,
-					sys_filter->matching,
-					sys_filter->allow);
-			if (!filter)
-				return -ENOMEM;
-			if (filter->type == DAMON_FILTER_TYPE_MEMCG) {
-				int err;
-
-				err = damon_sysfs_memcg_path_to_id(
-						sys_filter->path,
-						&filter->memcg_id);
-				if (err) {
-					damon_destroy_filter(filter);
-					return err;
-				}
+		if (filter->type == DAMON_FILTER_TYPE_MEMCG) {
+			int err;
+
+			err = damon_sysfs_memcg_path_to_id(
+					sys_filter->path,
+					&filter->memcg_id);
+			if (err) {
+				damon_destroy_filter(filter);
+				return err;
 			}
-			damon_add_filter(c, filter);
 		}
+		damon_add_filter(probe, filter);
+	}
+	return 0;
+}
+
+static int damon_sysfs_set_probes(struct damon_ctx *ctx,
+		struct damon_sysfs_probes *sys_probes)
+{
+	int i, err;
+
+	for (i = 0; i < sys_probes->nr; i++) {
+		struct damon_sysfs_probe *sys_probe;
+		struct damon_probe *p;
+
+		p = damon_new_probe();
+		if (!p)
+			return -ENOMEM;
+		damon_add_probe(ctx, p);
+		sys_probe = sys_probes->probes_arr[i];
+		err = damon_sysfs_set_probe(p, sys_probe);
+		if (err)
+			return err;
 	}
 	return 0;
 }
-- 
2.47.3



* [RFC PATCH v1.1 08/11] mm/damon/core: reduce range setup in damon_commit_target_regions()
From: SeongJae Park @ 2026-06-25  5:07 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, damon, linux-kernel, linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>

damon_commit_target_regions() calls damon_set_regions() to update the
destination target's monitoring region boundaries.  It passes one range
per source region, even when the source regions are adjacent.
Meanwhile, damon_set_regions() makes the destination regions exactly
the same as the source ranges only when the destination has no regions.
When the destination already has regions, only a few of them are
expanded or shrunk to fit the boundaries of the disjoint ranges in the
source.  Hence the boundaries between adjacent source ranges mean
nothing in common cases.  When there are many regions, setting up such
adjacent ranges is only a waste of time and space.  We recently found
[1] that it actually causes memory overhead.  Set up the ranges only
for disjoint address ranges.

[1] https://lore.kernel.org/20260603112306.58490-1-akinobu.mita@gmail.com

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/core.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 7e4b9affc5b06..ce5294cb1b4f3 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1349,21 +1349,33 @@ static struct damon_target *damon_nth_target(int n, struct damon_ctx *ctx)
 static int damon_commit_target_regions(struct damon_target *dst,
 		struct damon_target *src, unsigned long src_min_region_sz)
 {
-	struct damon_region *src_region;
+	struct damon_region *src_region, *prev = NULL;
 	struct damon_addr_range *ranges;
 	int i = 0, err;
 
-	damon_for_each_region(src_region, src)
-		i++;
+	damon_for_each_region(src_region, src) {
+		if (!prev || prev->ar.end != src_region->ar.start)
+			i++;
+		prev = src_region;
+	}
 	if (!i)
 		return 0;
 
 	ranges = kmalloc_objs(*ranges, i, GFP_KERNEL | __GFP_NOWARN);
 	if (!ranges)
 		return -ENOMEM;
+	prev = NULL;
 	i = 0;
-	damon_for_each_region(src_region, src)
-		ranges[i++] = src_region->ar;
+	damon_for_each_region(src_region, src) {
+		if (!prev) {
+			ranges[i].start = src_region->ar.start;
+		} else if (prev->ar.end != src_region->ar.start) {
+			ranges[i++].end = prev->ar.end;
+			ranges[i].start = src_region->ar.start;
+		}
+		prev = src_region;
+	}
+	ranges[i++].end = damon_last_region(src)->ar.end;
 	err = damon_set_regions(dst, ranges, i, src_min_region_sz);
 	kfree(ranges);
 	return err;
-- 
2.47.3
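
The range coalescing described in the changelog above can be sketched in
userspace as below.  This is a single-pass variant written for
illustration, whereas the patch counts first and fills second; struct
addr_range and coalesce_ranges() are hypothetical stand-ins, not DAMON
API.  It assumes the input regions are sorted by address and do not
overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct damon_addr_range: [start, end). */
struct addr_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Coalesce sorted, non-overlapping regions into disjoint ranges.
 * Two regions are merged when the previous range's end equals the
 * next region's start.  'out' must have room for 'nr' entries.
 * Returns the number of ranges written.
 */
static size_t coalesce_ranges(const struct addr_range *regions, size_t nr,
			      struct addr_range *out)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++) {
		assert(regions[i].start <= regions[i].end);
		if (n && out[n - 1].end == regions[i].start) {
			/* Adjacent to the previous range: extend it. */
			out[n - 1].end = regions[i].end;
		} else {
			/* Gap before this region: start a new range. */
			out[n++] = regions[i];
		}
	}
	return n;
}
```

For example, regions [0,10), [10,20), [30,40) would collapse to the two
ranges [0,20) and [30,40), so only disjoint boundaries would be handed
to a damon_set_regions()-like consumer.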



* [RFC PATCH v1.1 07/11] selftests/damon/sysfs.sh: test all files in quota goal dir
From: SeongJae Park @ 2026-06-25  5:07 UTC (permalink / raw)
  Cc: SeongJae Park, Shuah Khan, damon, linux-kernel, linux-kselftest,
	linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>

The DAMON sysfs interface for DAMOS quota has been extended quite a bit
since its initial introduction.  The corresponding test case in the
DAMON sysfs interface essential file operations test (sysfs.sh) has not
been extended accordingly, though.  Extend the test case to test all
existing files.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 tools/testing/selftests/damon/sysfs.sh | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/testing/selftests/damon/sysfs.sh b/tools/testing/selftests/damon/sysfs.sh
index ffa8413b5ab3d..15fb9df928818 100755
--- a/tools/testing/selftests/damon/sysfs.sh
+++ b/tools/testing/selftests/damon/sysfs.sh
@@ -199,6 +199,20 @@ test_goal()
 	ensure_dir "$goal_dir" "exist"
 	ensure_file "$goal_dir/target_value" "exist" "600"
 	ensure_file "$goal_dir/current_value" "exist" "600"
+	ensure_file "$goal_dir/target_metric" "exist" "600"
+	local fpath="$goal_dir/target_metric"
+	ensure_write_succ "$fpath" "user_input" "valid input"
+	ensure_write_succ "$fpath" "some_mem_psi_us" "valid input"
+	ensure_write_succ "$fpath" "node_mem_used_bp" "valid input"
+	ensure_write_succ "$fpath" "node_mem_free_bp" "valid input"
+	ensure_write_succ "$fpath" "node_memcg_used_bp" "valid input"
+	ensure_write_succ "$fpath" "node_memcg_free_bp" "valid input"
+	ensure_write_succ "$fpath" "active_mem_bp" "valid input"
+	ensure_write_succ "$fpath" "inactive_mem_bp" "valid input"
+	ensure_write_succ "$fpath" "node_eligible_mem_bp" "valid input"
+	ensure_write_fail "$fpath" "foo" "invalid input"
+	ensure_file "$goal_dir/nid" "exist" "600"
+	ensure_file "$goal_dir/path" "exist" "600"
 }
 
 test_goals()
-- 
2.47.3



* [RFC PATCH v1.1 06/11] selftests/damon/sysfs.sh: test dests dir
From: SeongJae Park @ 2026-06-25  5:07 UTC (permalink / raw)
  Cc: SeongJae Park, Shuah Khan, damon, linux-kernel, linux-kselftest,
	linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>

The DAMON sysfs interface essential file operations test (sysfs.sh)
does not test the DAMOS dests/ directory.  Add the test.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 tools/testing/selftests/damon/sysfs.sh | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/damon/sysfs.sh b/tools/testing/selftests/damon/sysfs.sh
index 07a33995be852..ffa8413b5ab3d 100755
--- a/tools/testing/selftests/damon/sysfs.sh
+++ b/tools/testing/selftests/damon/sysfs.sh
@@ -99,6 +99,29 @@ test_stats()
 	done
 }
 
+test_dest()
+{
+	dest_dir=$1
+	ensure_file "$dest_dir/id" "exist"
+	ensure_file "$dest_dir/weight" "exist"
+}
+
+test_dests()
+{
+	dests_dir=$1
+	ensure_file "$dests_dir/nr_dests" "exist" "600"
+	ensure_write_succ "$dests_dir/nr_dests" "1" "valid input"
+	test_dest "$dests_dir/0"
+
+	ensure_write_succ "$dests_dir/nr_dests" "2" "valid input"
+	test_dest "$dests_dir/0"
+	test_dest "$dests_dir/1"
+
+	ensure_write_succ "$dests_dir/nr_dests" "0" "valid input"
+	ensure_dir "$dests_dir/0" "not_exist"
+	ensure_dir "$dests_dir/1" "not_exist"
+}
+
 test_filter()
 {
 	filter_dir=$1
-- 
2.47.3

