* Re: [PATCH v4] mm: assert exclusive nid/zonenum bits at the page/folio access sites
From: David Hildenbrand (Arm) @ 2026-06-25 6:46 UTC (permalink / raw)
To: Hui Zhu, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Kairui Song, Qi Zheng, Shakeel Butt, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, linux-mm, linux-kernel
Cc: Hui Zhu
In-Reply-To: <20260625053958.918738-1-hui.zhu@linux.dev>
On 6/25/26 07:39, Hui Zhu wrote:
> From: Hui Zhu <zhuhui@kylinos.cn>
>
> KCSAN reports a data race between page_to_nid()/folio_pgdat() reading
> page->flags and folio_trylock()/folio_lock() concurrently doing
> test_and_set_bit_lock(PG_locked, ...) on the same word, e.g.:
>
> BUG: KCSAN: data-race in __lruvec_stat_mod_folio / shmem_get_folio_gfp
>
> The node id and zone id occupy fixed bit-ranges of page->flags that
> are set once at page init and never modified afterwards, so they can
> never overlap with the low PG_locked/PG_waiters bits touched by the
> folio lock path.
>
> ASSERT_EXCLUSIVE_BITS(mdf.f, ...) inside memdesc_nid()/memdesc_zonenum()
> checks a by-value copy of the flags word, not the actual shared
> page->flags/folio->flags being modified concurrently, so it doesn't
> reliably assert anything about the real race.
Is that the case? I thought the existing ASSERT_EXCLUSIVE_BITS() reliably worked
before?
Maybe the compiler optimizing out the local copy sorted that out for us.
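
As a rough illustration of the bit-range argument in the quoted commit message, the userspace sketch below models a flags word with a lock bit at bit 0 and a node id in a fixed high range. The layout constants (DEMO_PG_LOCKED, DEMO_NODES_WIDTH, DEMO_NODES_PGSHIFT) and the helper names are made up for this example and do not match any real kernel configuration. The only point is that the masked node bits read back the same whether or not the lock bit is set.

```c
#include <stdint.h>

/* Hypothetical layout for illustration only, not the real kernel one. */
#define DEMO_PG_LOCKED      0u                       /* lock bit lives at bit 0 */
#define DEMO_NODES_WIDTH    6u                       /* room for 64 node ids */
#define DEMO_NODES_PGSHIFT  (64u - DEMO_NODES_WIDTH) /* node id in the top bits */
#define DEMO_NODES_MASK     ((UINT64_C(1) << DEMO_NODES_WIDTH) - 1)

/* Store a node id into the fixed high bit range (done once at "init"). */
static uint64_t demo_set_nid(uint64_t flags, unsigned int nid)
{
    flags &= ~(DEMO_NODES_MASK << DEMO_NODES_PGSHIFT);
    return flags | (((uint64_t)nid & DEMO_NODES_MASK) << DEMO_NODES_PGSHIFT);
}

/* Read the node id back; only the high bit range is inspected. */
static unsigned int demo_nid(uint64_t flags)
{
    return (unsigned int)((flags >> DEMO_NODES_PGSHIFT) & DEMO_NODES_MASK);
}

/* Set the low lock bit, as a stand-in for the folio lock path. */
static uint64_t demo_lock(uint64_t flags)
{
    return flags | (UINT64_C(1) << DEMO_PG_LOCKED);
}
```

This models only the disjoint bit ranges. It says nothing about what KCSAN observes on the shared word, which is the actual question in this thread.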
> Move the assertion to
> page_to_nid(), folio_nid(), page_zonenum() and folio_zonenum(), where
> flags is dereferenced directly from the page/folio.
>
> On CONFIG_NUMA=n, NODES_MASK is 0 and the old memdesc_nid() body
> folded to a constant, so page->flags/folio->flags was never actually
> read. ASSERT_EXCLUSIVE_BITS() is a real runtime check that can't be
> folded away, so doing it unconditionally would add a pointless read
> of page->flags/folio->flags and a check that can never fire. Keep
> page_to_nid()/folio_nid() as plain "return 0" static inline stubs
> under CONFIG_NUMA=n instead.
>
> Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
> ---
> Changelog:
> v4:
> According to the comments of Andrew and Sashiko, set
> page_to_nid()/folio_nid() as static inline stubs returning 0
> under CONFIG_NUMA=n.
> v3:
> According to the comments of Andrew and Sashiko, move
> ASSERT_EXCLUSIVE_BITS out of memdesc_nid()/memdesc_zonenum()
> into the page/folio call sites.
> v2:
> According to the comments of David, remove useless comments and use
> ASSERT_EXCLUSIVE_BITS() in memdesc_nid() instead of data_race() in
> page_to_nid().
>
> include/linux/mm.h | 9 +++++++++
> include/linux/mmzone.h | 3 ++-
> 2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 485df9c2dbdd..56b39194605a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2294,15 +2294,24 @@ static inline int memdesc_nid(memdesc_flags_t mdf)
> }
> #endif
>
> +#ifdef CONFIG_NUMA
> static inline int page_to_nid(const struct page *page)
> {
> + ASSERT_EXCLUSIVE_BITS(PF_POISONED_CHECK(page)->flags,
> + NODES_MASK << NODES_PGSHIFT);
Performing the PF_POISONED_CHECK() twice is a bit odd. One time is sufficient,
maybe simply before both statements separately?
> return memdesc_nid(PF_POISONED_CHECK(page)->flags);
> }
>
> static inline int folio_nid(const struct folio *folio)
> {
> + ASSERT_EXCLUSIVE_BITS(folio->flags,
> + NODES_MASK << NODES_PGSHIFT);
> return memdesc_nid(folio->flags);
> }
> +#else
> +#define page_to_nid(page) (0)
> +#define folio_nid(folio) (0)
> +#endif
>
LGTM
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
* Re: [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug
From: Gregory Price @ 2026-06-25 6:43 UTC (permalink / raw)
To: Hannes Reinecke
Cc: linux-mm, nvdimm, linux-kernel, linux-cxl, driver-core,
linux-kselftest, kernel-team, david, osalvador, gregkh, rafael,
dakr, djbw, vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka,
rppt, surenb, mhocko, shuah, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <8e42587a-d614-4259-ae6b-5bca1479b425@suse.de>
>
> Why do we need to treat the 'unbind' call as a given thing?
> If we know that we cannot handle online memory during unbind,
> can't we just disallow unbind in that case?
No. Unbind is a violent operation - unbinds cannot fail, and a
straight, uncoordinated unbind is essentially a `--force` flag:
the admin accepts the risks.
To your point, the admin either does the nice thing or they
muck up the system.
But we should still try to do something sane to defend the kernel;
in this case we should try to prevent that task from becoming
deadlocked. The only way to do that is to leak the resources.
I'm making a small modification to this code to reinstate the
legacy behavior when "state!=UNPLUGGED".
~Gregory
* Re: [PATCH] mm/page_vma_mapped: guard check_pmd() with CONFIG_TRANSPARENT_HUGEPAGE
From: Wei Yang @ 2026-06-25 6:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Wei Yang, david, ljs, riel, liam, vbabka, harry, jannh, willy,
linux-mm, linux-kernel, lance.yang, balbirs
In-Reply-To: <20260624215943.332f1bdcc99a3b85b7df0822@linux-foundation.org>
+cc Balbir
On Wed, Jun 24, 2026 at 09:59:43PM -0700, Andrew Morton wrote:
>On Thu, 25 Jun 2026 03:46:29 +0000 Wei Yang <richard.weiyang@gmail.com> wrote:
>
>> >Sashiko had an off-topic complaint about the surrounding code:
>> > https://lore.kernel.org/oe-kbuild-all/202606240042.ffPsEXVc-lkp@intel.com/
>>
>> I see this robot reply, but I don't see the Sashiko comment.
>>
>> How can I view Sashiko's comment?
>
>oops, sorry.
>
>You can go to https://sashiko.dev/ and search for the email subject.
>
>Or append your Message-ID to "https://sashiko.dev/#/patchset":
>
> https://sashiko.dev/#/patchset/20260624082359.2869-1-richard.weiyang@gmail.com
>
Got it, thanks.
This one mentioned two things:
a. page_vma_mapped_walk() return without check
b. whether __split_huge_pmd_locked() would split device-private pmd
For a., it is being fixed in [1].
For b., to be honest, I am not 100% sure. If a device-private pmd could be
file backed, then this looks like a bug.
Balbir,
Would you mind taking a look at the second comment raised by Sashiko?
[1]: https://lore.kernel.org/linux-mm/20260624065353.1622-1-richard.weiyang@gmail.com/
--
Wei Yang
Help you, Help me
* Re: [PATCH] fs/proc: fix KPF_KSM reported for all anonymous pages
From: David Hildenbrand (Arm) @ 2026-06-25 6:39 UTC (permalink / raw)
To: Jinjiang Tu, akpm, ziy, luizcap, willy, linmiaohe, svetly.todorov,
xu.xin16, chengming.zhou, linux-fsdevel, linux-mm
Cc: wangkefeng.wang, sunnanyong
In-Reply-To: <601fb5dd-18e1-4a6c-bc99-dc2a655240e2@huawei.com>
On 6/23/26 03:37, Jinjiang Tu wrote:
>
> On 2026/6/22 19:45, David Hildenbrand (Arm) wrote:
>> On 6/22/26 11:15, Jinjiang Tu wrote:
>>> Reading /proc/kpageflags for any anonymous page returns KPF_KSM set, even
>>> when KSM is not in use. As a result, tools misclassify all anonymous pages
>>> as KSM merged.
>>>
>>> In stable_page_flags(), if the page is anonymous, a (mapping &
>>> FOLIO_MAPPING_KSM) check is used to identify whether it is a KSM page.
>>> However, FOLIO_MAPPING_KSM is FOLIO_MAPPING_ANON | FOLIO_MAPPING_ANON_KSM,
>>> so the (mapping & FOLIO_MAPPING_KSM) check is true for all anonymous pages.
>>>
>>> To fix it, use FOLIO_MAPPING_ANON_KSM instead.
>>>
>>> Fixes: dee3d0bef2b0 ("proc: rewrite stable_page_flags()")
>> Right,
>>
>> #define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_ANON_KSM)
>>
>> Which we later renamed to FOLIO_MAPPING_KSM.
>>
>>
>> Before switching to manual flag checks, PageKsm() translated to folio_test_ksm()
>> that checked whether the values actually matched:
>>
>> ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_KSM;
>>
>>
>> This only affects /proc/kpageflags (well, and hwpoison inject on a weird testing
>> interface), so it's not really that relevant for real workloads (debugging and
>> testing).
>>
>> So not sure whether we should CC:stable. Likely not.
>
> /proc/kpageflags is generally used only for analysis and is unlikely to be
> used in production environments. I found this issue while analyzing which
> stacks allocated the pfns that are KSM-merged. So I think it's unnecessary
> to CC:stable.
>
>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>> ---
>>> fs/proc/page.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/proc/page.c b/fs/proc/page.c
>>> index f9b2c2c906cd..cef8ded97610 100644
>>> --- a/fs/proc/page.c
>>> +++ b/fs/proc/page.c
>>> @@ -173,7 +173,7 @@ u64 stable_page_flags(const struct page *page)
>>> u |= 1 << KPF_MMAP;
>>> if (is_anon) {
>>> u |= 1 << KPF_ANON;
>>> - if (mapping & FOLIO_MAPPING_KSM)
>>> + if (mapping & FOLIO_MAPPING_ANON_KSM)
>>
>> Wonder whether we should just do
>>
>> if ((mapping & FOLIO_MAPPING_FLAGS) == FOLIO_MAPPING_KSM)
>>
>> To match what we have in folio_test_ksm.
>>
>> (although I doubt we would reuse this flag for other purposes, likely
>> it's more future proof to check it like that)
>
> Both are ok. The following check has checked FOLIO_MAPPING_ANON,
>
> if (is_anon) {
> if (mapping & FOLIO_MAPPING_ANON_KSM)
> }
>
> So it's equivalent to do
> if ((mapping & FOLIO_MAPPING_FLAGS) == FOLIO_MAPPING_KSM)
> or
> if (mapping & FOLIO_MAPPING_ANON_KSM)
As I said, matching precisely what we have in folio_test_ksm() is clearer. We
don't have any users of FOLIO_MAPPING_ANON_KSM outside of page-flags.h for a
reason :)
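
To make the difference between the two checks concrete, here is a small userspace sketch. The DEMO_* values are assumptions chosen to mirror the definitions quoted above (an ANON bit, an ANON_KSM bit, and KSM as the OR of both); they are not copied from page-flags.h, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag values mirroring the quoted definitions; illustrative only. */
#define DEMO_MAPPING_ANON      0x1ul
#define DEMO_MAPPING_ANON_KSM  0x2ul
#define DEMO_MAPPING_KSM       (DEMO_MAPPING_ANON | DEMO_MAPPING_ANON_KSM)
#define DEMO_MAPPING_FLAGS     (DEMO_MAPPING_ANON | DEMO_MAPPING_ANON_KSM)

/* Buggy form: any overlap with the two-bit KSM pattern counts as a match. */
static bool demo_is_ksm_buggy(uintptr_t mapping)
{
    return (mapping & DEMO_MAPPING_KSM) != 0;
}

/* Exact-match form, in the style of the folio_test_ksm() check quoted above. */
static bool demo_is_ksm_exact(uintptr_t mapping)
{
    return (mapping & DEMO_MAPPING_FLAGS) == DEMO_MAPPING_KSM;
}
```

With these assumed values, a plain anonymous mapping (only the ANON bit set) passes the buggy test but fails the exact one, which is the misreport described in the patch.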
--
Cheers,
David
* Re: [PATCH] fs/proc: fix KPF_KSM reported for all anonymous pages
From: David Hildenbrand (Arm) @ 2026-06-25 6:38 UTC (permalink / raw)
To: Andrew Morton, Jinjiang Tu
Cc: ziy, luizcap, willy, linmiaohe, svetly.todorov, xu.xin16,
chengming.zhou, linux-fsdevel, linux-mm, wangkefeng.wang,
sunnanyong
In-Reply-To: <20260624180051.134da6553b4f0c4c2785f730@linux-foundation.org>
On 6/25/26 03:00, Andrew Morton wrote:
> On Tue, 23 Jun 2026 09:37:57 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:
>
>>> This only affects /proc/kpageflags (well, and hwpoison inject on a weird testing
>>> interface), so it's not really that relevant for real workloads (debugging and
>>> testing).
>>>
>>> So not sure whether we should CC:stable. Likely not.
>>
>> /proc/kpageflags is generally used only for analysis and is unlikely to be
>> used in production environments. I found this issue while analyzing which
>> stacks allocated the pfns that are KSM-merged. So I think it's unnecessary
>> to CC:stable.
>
> Well, it's a bug. The fix is super-simple so I think it's reasonable
> to feed it back to users of earlier kernels.
Definitely doesn't hurt.
--
Cheers,
David
* Re: [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
From: Dev Jain @ 2026-06-25 6:37 UTC (permalink / raw)
To: Wen Jiang, linux-mm, linux-arm-kernel, catalin.marinas, will,
akpm, urezki
Cc: baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang,
Ard Biesheuvel
In-Reply-To: <20260618084726.1070022-1-jiangwen6@xiaomi.com>
On 18/06/26 2:17 pm, Wen Jiang wrote:
> This patchset accelerates ioremap, vmalloc, and vmap when the memory
> is physically fully or partially contiguous. Two techniques are used:
>
> 1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
> segments
> 2. Use batched mappings wherever possible in both vmalloc and ARM64
> layers
>
> Besides accelerating the mapping path, this also enables large
> mappings (PMD and cont-PTE) for vmap, which are currently not
> supported.
>
> Patches 1-2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
> CONT-PTE regions instead of just one.
>
> Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
> mapping logic between the ioremap and vmalloc/vmap paths, handling both
> CONT_PTE and regular PTE mappings. This prepares for the next patch.
>
> Patch 4 extends the page table walk path to support page shifts other
> than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
> mappings. The function is renamed from vmap_small_pages_range_noflush()
> to vmap_pages_range_noflush_walk().
>
> Patches 5-6 add huge vmap support for contiguous pages, including
> support for non-compound pages with pfn alignment verification.
>
> On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
> the performance CPUfreq policy enabled, benchmark results:
>
> * ioremap(1 MB): 1.35x faster (3407 ns -> 2526 ns)
> * vmalloc(1 MB) mapping time (excluding allocation) with
> VM_ALLOW_HUGE_VMAP: 1.42x faster (5.00 us -> 3.53us)
> * vmap(100MB) with order-8 pages: 8.3x faster (1235 us -> 149 us)
>
> Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.
>
I am still a little nervous about doing vmap-huge by default.
We can play set_memory_* games on a vmap huge mapping partially, thus
forcing a pgtable split, and not all arches can handle a kernel pgtable
split.
For arm64, we can handle that with BBML2_NOABORT, but interestingly, in
change_memory_common, arch/arm64/mm/pageattr.c:
        area = find_vm_area((void *)addr);
        if (!area ||
            ((unsigned long)kasan_reset_tag((void *)end) >
             (unsigned long)kasan_reset_tag(area->addr) + area->size) ||
            ((area->flags & (VM_ALLOC | VM_ALLOW_HUGE_VMAP)) != VM_ALLOC))
                return -EINVAL;
Even before my change fcf8dda8cc48, we were bailing out on
!(area->flags & VM_ALLOC))
So on arm64 we haven't been supporting set_memory_* for vmap memory at all, because
it has VM_MAP set and not VM_ALLOC. Although we have a contradictory comment above
this code so not sure if this was intentional:
"Let's restrict ourselves to mappings created by vmalloc (or vmap)."
So either there is no user in the kernel doing vmap + set_memory_* (which looks
to be the case, based on an LLM scan), or it is not fatal for set_memory_* to fail.
But even if no one does it now, technically the API allows it.
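
The effect of the flag test quoted above can be sketched in userspace as follows. The DEMO_* flag values are placeholders picked for this example (assumed, not taken from a particular kernel tree), and demo_area_flags_ok() is a hypothetical helper that models only the flags part of the condition, not the range check.

```c
#include <stdbool.h>

/* Placeholder flag values; assumed for illustration, not from a real tree. */
#define DEMO_VM_ALLOC            0x2ul
#define DEMO_VM_MAP              0x4ul
#define DEMO_VM_ALLOW_HUGE_VMAP  0x400ul

/*
 * Accept only areas that have VM_ALLOC set and VM_ALLOW_HUGE_VMAP clear,
 * matching the "!= VM_ALLOC" rejection in the quoted snippet.
 */
static bool demo_area_flags_ok(unsigned long flags)
{
    return (flags & (DEMO_VM_ALLOC | DEMO_VM_ALLOW_HUGE_VMAP)) == DEMO_VM_ALLOC;
}
```

Under these assumptions, an area carrying only the VM_MAP-style flag is rejected, which matches the observation that vmap memory never gets through this check.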
>
* Re: [RFC Patch 0/3] mm/mm_init: fix and cleanup mirrored_kernelcore
From: David Hildenbrand (Arm) @ 2026-06-25 6:36 UTC (permalink / raw)
To: Wei Yang, rppt, akpm, izumi.taku; +Cc: linux-mm, linux-kernel, yuan1.liu
In-Reply-To: <20260623092351.13031-1-richard.weiyang@gmail.com>
On 6/23/26 11:23, Wei Yang wrote:
> When reviewing the patch set "mm/memory_hotplug: optimize zone contiguous check
> when changing pfn range" [1], we noticed that mirrored_kernelcore introduces a
> special case to handle.
>
> During discussion and testing, it turned out that the current
> mirrored_kernelcore implementation breaks other memmap_init() behavior:
>
> * disturbs defer_init()
> * some Movable zone ranges would be initialized twice
>
> The reason is that Zone Movable and Zone Normal overlap and the current logic
> doesn't handle it well.
>
> As Yuan shows in [2], physically overlapping zones do not seem to exist in
> practice. If we remove the possibility of overlapping zones, the problem is
> solved. This could also benefit the zone contiguous optimization, IIUC.
>
> Patch [1]: remove the overlapped zone, which fixes the above two problems
> Patch [2]: remove overlap_memmap_init() as there is no overlapped zone, but
> record the current double initialization problem for reference
> Patch [3]: remove absent pages calculation for mirrored_kernelcore
>
> [1]: https://lore.kernel.org/all/20260520093457.3719960-1-yuan1.liu@intel.com/T/#u
> [2]: https://lore.kernel.org/all/20260520093457.3719960-1-yuan1.liu@intel.com/T/#mb0bd07ffd562a5b029f0f07751e580ca339c5b51
>
IIUC, Mike will send an alternative.
--
Cheers,
David
* Re: [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock
From: David Hildenbrand (Arm) @ 2026-06-25 6:32 UTC (permalink / raw)
To: Rik van Riel, linux-kernel
Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Dmitry Ilvokhin,
Borislav Petkov, Dave Hansen, Andrew Morton, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>
On 6/25/26 03:50, Rik van Riel wrote:
> Sometimes processes can get stuck with the mmap_lock held for
> a long time. This slows down, and can even prevent system monitoring
> tools from assessing and logging the situation, because they themselves
> end up getting stuck on the mmap_lock.
>
> However, with the introduction of per-VMA locks, we can improve the
> reliability of system monitoring, and generally speed up __access_remote_vm
> under mmap_lock contention, by adding a fast path that does not require
> the process-wide mmap_lock.
>
> This fast path is only compiled in and used when it is safe to do so,
> meaning a kernel with per-VMA locks and RCU page table freeing, and a VMA
> that is not hugetlbfs, iomap, pfnmap, etc...
>
> v2:
> - simplify the code, which should be ok because these copies are < PAGE_SIZE
> - clean up the code
> - fix locking wrt tlb_remove_table_sync_one()
> - hopefully address all the other comments
You mean, ignoring my comments about not reimplementing GUP entirely?
NAK
--
Cheers,
David
* Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting
From: Harry Yoo @ 2026-06-25 6:32 UTC (permalink / raw)
To: Qi Zheng, akpm, david, kasong, shakeel.butt, baohua,
axelrasmussen, yuanchu, weixugc, hannes, muchun.song, peiyang_he,
mhocko, roman.gushchin, ljs
Cc: linux-mm, linux-kernel, Qi Zheng, stable
In-Reply-To: <f18bf1b1-ccf7-4d77-9389-07311d2d1613@linux.dev>
On 6/25/26 3:11 PM, Qi Zheng wrote:
> On 6/25/26 12:16 PM, Harry Yoo wrote:
>>
> [...]
>
>>
>>> So lock_batch_lruvec() can be implemented like this:
>>>
>>> #ifdef CONFIG_MEMCG
>>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>> {
>>>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>
>>>         rcu_read_lock();
>>>
>>>         /*
>>>          * The memcg can be NULL when the memory controller is disabled.
>>>          * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>          */
>>>         if (!memcg || !css_is_dying(&memcg->css))
>>>                 goto lock;
>>>
>>>         do {
>>>                 memcg = parent_mem_cgroup(memcg);
>>>         } while (memcg && css_is_dying(&memcg->css));
>>>         lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>
>>> lock:
>>>         spin_lock_irq(&lruvec->lru_lock);
>>>
>>>         return lruvec;
>>> }
>>> #else
>>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>> {
>>>         lruvec_lock_irq(lruvec);
>>>
>>>         return lruvec;
>>> }
>>> #endif
>>>
>>> Does this make sense?
>>
>> Yes, looks good to me!
>
> OK, this sync method makes more sense as it doesn't require adding a
> new lrugen->reparente. I'll go with this method and update v3.
Thanks!
Just one thing to clarify...
So, when we check something that's updated _before_ grace period
(CSS_DYING), RCU is sufficient.
But in folio_lruvec_lock*(), that is not the case because reparenting
is performed in the RCU work, under the lruvec lock. So the check needs
to be done under RCU and the lruvec lock.
This is quite subtle :D
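
The ancestor walk in the quoted lock_batch_lruvec() can be sketched in isolation as below. This is a userspace model under stated assumptions: struct demo_cgroup and demo_first_live() are hypothetical stand-ins, and the sketch deliberately leaves out RCU and the lruvec lock, so it shows only the "skip dying ancestors" step and none of the synchronization discussed here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup with a parent link and a dying flag. */
struct demo_cgroup {
    struct demo_cgroup *parent; /* NULL for the root */
    bool dying;
};

/*
 * Return cg itself if it is NULL or not dying; otherwise walk up to the
 * nearest ancestor that is not dying. Returns NULL if the walk runs past
 * the root without finding one.
 */
static struct demo_cgroup *demo_first_live(struct demo_cgroup *cg)
{
    if (!cg || !cg->dying)
        return cg;

    do {
        cg = cg->parent;
    } while (cg && cg->dying);

    return cg;
}
```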
> Hi Barry and Baolin, what do you think? Since the sync method has been
> changed, I will temporarily drop your previous Reviewed-by tags in v3. ;)
And hopefully Peiyang would kindly double-check that the issue still does not
reproduce with v3 on the machine :)
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug
From: Hannes Reinecke @ 2026-06-25 6:17 UTC (permalink / raw)
To: Gregory Price, linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-9-gourry@gourry.net>
On 6/24/26 4:57 PM, Gregory Price wrote:
> There is no atomic mechanism to offline and remove an entire
> multi-block DAX kmem device. This is presently done in two steps:
> 1. offline all
> 2. remove all
>
> This creates a race condition where another entity operates directly
> on the memory blocks and can cause hot-unplug to fail / unbind to
> deadlock.
>
> Add a new 'state' sysfs attribute that enables an atomic whole-device
> hotplug operation across its entire memory region.
>
> daxX.Y/state mirrors the per-block memoryX/state ABI:
> - [offline, online, online_kernel, online_movable]
> - "unplugged" - is added specifically for dax0.0/state
>
> The valid writable states include:
> - "unplugged": memory blocks are not present
> - "online": memory is online, zone chosen by the kernel
> - "online_kernel": memory is online in ZONE_NORMAL
> - "online_movable": memory is online in ZONE_MOVABLE
>
> Valid transitions:
> - unplugged -> online[_kernel|_movable]
> - online[_kernel|_movable] -> unplugged
> - offline -> unplugged
>
> A device can only be onlined from "unplugged", so it must be returned
> there before being onlined into a different state.
>
> For backwards compatibility the memory blocks are always created at
> probe - existing tools expect them to be present after kmem binds.
>
> "offline" is therefore a reportable state but is not writable: it only
> arises from the legacy auto_online_blocks=offline policy. Onlining
> such a device through this attribute requires unplugging it first in
> an effort to get drivers creating DAX devices to set a default.
>
> Unplug is atomic across the whole device: dax_kmem_do_hotremove()
> collects every added range and offlines/removes them in one operation.
> Either the operation succeeds or is entirely rolled back.
>
> Unbind Note:
> We used to call remove_memory() during unbind, which would fire a
> BUG() if any of the memory blocks were online at that time. We lift
> this into a WARN in the cleanup routine and don't attempt hotremove
> if ->state is not DAX_KMEM_UNPLUGGED or MMOP_OFFLINE.
>
> The memory of an offline dax device is removed on unbind, as before.
>
> If online at unbind, the resources are leaked (as before), but now
> we prevent deadlock if a memory region is impossible to hotremove.
>
> Suggested-by: Hannes Reinecke <hare@suse.de>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> Documentation/ABI/testing/sysfs-bus-dax | 26 +++
> drivers/base/memory.c | 9 +
> drivers/dax/kmem.c | 224 ++++++++++++++++++++----
> include/linux/memory_hotplug.h | 1 +
> 4 files changed, 224 insertions(+), 36 deletions(-)
>
That looks good, but the question remains:
Why do we need to treat the 'unbind' call as a given thing?
If we know that we cannot handle online memory during unbind,
can't we just disallow unbind in that case?
I don't think it's too much to ask from an admin to offline
the memory first, _especially_ as now we have a simple knob
to do that ...
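
For reference, the "Valid transitions" list in the quoted commit message can be written down as a small predicate. This is only a sketch: enum demo_state and demo_transition_ok() are hypothetical names, and treating every transition that is not listed (including same-state writes) as invalid is an assumption, since the quoted text does not say how the driver handles those.

```c
#include <stdbool.h>

/* Hypothetical mirror of the states named in the quoted commit message. */
enum demo_state {
    DEMO_UNPLUGGED,
    DEMO_OFFLINE,
    DEMO_ONLINE,
    DEMO_ONLINE_KERNEL,
    DEMO_ONLINE_MOVABLE,
};

static bool demo_is_online(enum demo_state s)
{
    return s == DEMO_ONLINE || s == DEMO_ONLINE_KERNEL ||
           s == DEMO_ONLINE_MOVABLE;
}

/*
 * Listed transitions only:
 *   unplugged -> online[_kernel|_movable]
 *   online[_kernel|_movable] -> unplugged
 *   offline -> unplugged
 * Everything else is rejected (an assumption of this sketch).
 */
static bool demo_transition_ok(enum demo_state from, enum demo_state to)
{
    if (from == DEMO_UNPLUGGED)
        return demo_is_online(to);
    if (to == DEMO_UNPLUGGED)
        return from == DEMO_OFFLINE || demo_is_online(from);
    return false;
}
```

Note that "offline" appears only as a source state here, in line with the commit message describing it as reportable but not writable.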
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting
From: Qi Zheng @ 2026-06-25 6:11 UTC (permalink / raw)
To: Harry Yoo, akpm, david, kasong, shakeel.butt, baohua,
axelrasmussen, yuanchu, weixugc, hannes, muchun.song, peiyang_he,
mhocko, roman.gushchin, ljs
Cc: linux-mm, linux-kernel, Qi Zheng, stable
In-Reply-To: <b5c85cea-5daa-4690-ac41-a6f5aebd1555@kernel.org>
On 6/25/26 12:16 PM, Harry Yoo wrote:
>
[...]
>
>> So lock_batch_lruvec() can be implemented like this:
>>
>> #ifdef CONFIG_MEMCG
>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>> {
>>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>
>>         rcu_read_lock();
>>
>>         /*
>>          * The memcg can be NULL when the memory controller is disabled.
>>          * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>          */
>>         if (!memcg || !css_is_dying(&memcg->css))
>>                 goto lock;
>>
>>         do {
>>                 memcg = parent_mem_cgroup(memcg);
>>         } while (memcg && css_is_dying(&memcg->css));
>>         lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>
>> lock:
>>         spin_lock_irq(&lruvec->lru_lock);
>>
>>         return lruvec;
>> }
>> #else
>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>> {
>>         lruvec_lock_irq(lruvec);
>>
>>         return lruvec;
>> }
>> #endif
>>
>> Does this make sense?
>
> Yes, looks good to me!
OK, this sync method makes more sense as it doesn't require adding a
new lrugen->reparente. I'll go with this method and update v3.
Hi Barry and Baolin, what do you think? Since the sync method has been
changed, I will temporarily drop your previous Reviewed-by tags in v3. ;)
Thanks,
Qi
>
* [PATCH v1 1/2] eventfd: luo: luo support for preserving eventfd
From: Chenghao Duan @ 2026-06-25 5:49 UTC (permalink / raw)
To: viro, brauner, jack, linux-fsdevel, pasha.tatashin, linux-kernel,
rppt, pratyush, kexec, linux-mm
Cc: jianghaoran, duanchenghao
In-Reply-To: <20260625054946.73445-1-duanchenghao@kylinos.cn>
This patch adds support for preserving eventfd file descriptors across
kexec live updates using the Live Update Orchestrator (LUO) framework.
Userspace applications using eventfd for event notification can now
maintain their state across kernel updates.
Preserved State:
The following properties of the eventfd are preserved across kexec:
- Counter Value: The current 64-bit counter value, including any pending
events that have been signaled but not yet consumed by readers.
- File Flags: The creation flags (EFD_SEMAPHORE, EFD_CLOEXEC, EFD_NONBLOCK)
are preserved.
Non-Preserved State:
- File Descriptor Number: The eventfd will be assigned a new fd number
in the target process after restore.
- Wait Queue State: Any processes blocked on read() operations will be
woken up and need to re-establish their blocking state.
- All other internal state is reset to default.
Changes:
- fs/eventfd.c: Add eventfd_luo_get_state() to safely read eventfd state
(count and flags), and eventfd_create() helper function.
- fs/eventfd_luo.c: New file implementing LUO file operations:
preserve, freeze, unpreserve, retrieve, and finish callbacks.
- include/linux/eventfd.h: Export new functions.
- include/linux/kho/abi/eventfd.h: Define the ABI contract with
eventfd_luo_ser structure for serialization.
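
As a minimal userspace sketch of the preserved and non-preserved state listed above: the structures and helpers below (demo_eventfd, demo_ser, demo_preserve(), demo_retrieve()) are hypothetical and are not the eventfd_luo_ser ABI from this patch, whose layout is not shown here. The sketch only models "counter and flags survive, fd number does not".

```c
#include <stdint.h>

/* Hypothetical in-memory view of an eventfd; not the kernel's eventfd_ctx. */
struct demo_eventfd {
    uint64_t count;      /* 64-bit counter, including pending events */
    unsigned int flags;  /* creation flags */
    int fd;              /* descriptor number in the owning process */
};

/* Hypothetical serialized form: only the counter and flags are carried. */
struct demo_ser {
    uint64_t count;
    uint32_t flags;
};

/* Capture the state that is meant to survive the update. */
static struct demo_ser demo_preserve(const struct demo_eventfd *e)
{
    struct demo_ser s = { .count = e->count, .flags = e->flags };
    return s;
}

/* Rebuild an eventfd from serialized state under a freshly assigned fd. */
static struct demo_eventfd demo_retrieve(const struct demo_ser *s, int new_fd)
{
    struct demo_eventfd e = {
        .count = s->count,
        .flags = s->flags,
        .fd = new_fd,
    };
    return e;
}
```

Carrying the counter as a full 64-bit value end to end matters in this model, because a counter above 32 bits would otherwise be truncated on restore.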
Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
---
fs/Makefile | 1 +
fs/eventfd.c | 40 +++++
fs/eventfd_luo.c | 250 ++++++++++++++++++++++++++++++++
include/linux/eventfd.h | 2 +
include/linux/kho/abi/eventfd.h | 39 +++++
kernel/liveupdate/Kconfig | 16 ++
6 files changed, 348 insertions(+)
create mode 100644 fs/eventfd_luo.c
create mode 100644 include/linux/kho/abi/eventfd.h
diff --git a/fs/Makefile b/fs/Makefile
index 89a8a9d207d1..36d568e6cfc7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-y += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
+obj-$(CONFIG_LIVEUPDATE_EVENTFD)+= eventfd_luo.o
obj-$(CONFIG_AIO) += aio.o
obj-$(CONFIG_FS_DAX) += dax.o
obj-$(CONFIG_FS_ENCRYPTION) += crypto/
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 9d33a02757d5..9b76cf06135a 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -376,6 +376,40 @@ struct eventfd_ctx *eventfd_ctx_fileget(struct file *file)
}
EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
+/**
+ * eventfd_luo_get_state - Get eventfd state (count and flags) for LUO
+ * @file: Eventfd file
+ * @count: Output parameter for count value
+ * @flags: Output parameter for flags value
+ *
+ * This function is exported for use by LUO to safely read eventfd state.
+ * Since struct eventfd_ctx is defined in this file, we can access its
+ * members directly here. The function uses the wait queue lock to ensure
+ * atomic access to count.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+int eventfd_luo_get_state(struct file *file, __u64 *count, unsigned int *flags)
+{
+ struct eventfd_ctx *ctx;
+ unsigned long irq_flags;
+
+ ctx = eventfd_ctx_fileget(file);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ /* Read count with lock (flags don't need lock) */
+ spin_lock_irqsave(&ctx->wqh.lock, irq_flags);
+ *count = ctx->count;
+ spin_unlock_irqrestore(&ctx->wqh.lock, irq_flags);
+
+ *flags = ctx->flags;
+
+ eventfd_ctx_put(ctx);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_luo_get_state);
+
static int do_eventfd(unsigned int count, int flags)
{
struct eventfd_ctx *ctx __free(kfree) = NULL;
@@ -411,6 +445,12 @@ static int do_eventfd(unsigned int count, int flags)
return fd_publish(fdf);
}
+int eventfd_create(__u64 count, unsigned int flags)
+{
+ return do_eventfd(count, flags);
+}
+EXPORT_SYMBOL_GPL(eventfd_create);
+
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
{
return do_eventfd(count, flags);
diff --git a/fs/eventfd_luo.c b/fs/eventfd_luo.c
new file mode 100644
index 000000000000..781d90635c52
--- /dev/null
+++ b/fs/eventfd_luo.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2026 KylinSoft Corporation.
+ * Author: Chenghao Duan <duanchenghao@kylinos.cn>
+ */
+
+/**
+ * DOC: Eventfd Preservation via LUO
+ *
+ * Overview
+ * ========
+ *
+ * Event file descriptors (eventfd) can be preserved over a kexec using the Live
+ * Update Orchestrator (LUO) file preservation. This allows userspace applications
+ * that use eventfd for event notification to maintain their state across kernel
+ * updates.
+ *
+ * Eventfd is a simple notification mechanism that uses a 64-bit counter for
+ * signaling events between userspace processes or between userspace and kernel.
+ * The preservation ensures that pending events and configuration are not lost
+ * during kexec.
+ *
+ * The preservation is not intended to be transparent. Only select properties of
+ * the eventfd are preserved. All others are reset to default. The preserved
+ * properties are described below.
+ *
+ * Preserved Properties
+ * ====================
+ *
+ * The following properties of the eventfd are preserved across kexec:
+ *
+ * Counter Value
+ * The current 64-bit counter value is preserved. This includes any pending
+ * events that have been signaled but not yet consumed by readers.
+ *
+ * File Flags
+ * The creation flags (EFD_SEMAPHORE, EFD_CLOEXEC, EFD_NONBLOCK) are preserved.
+ * These control the behavior of read/write operations and file descriptor
+ * inheritance.
+ *
+ * Non-Preserved Properties
+ * ========================
+ *
+ * All properties which are not preserved must be assumed to be reset to
+ * default. This section describes some of those properties which may be more of
+ * note.
+ *
+ * File Descriptor Number
+ * The file descriptor number itself is not preserved. After restore, the
+ * eventfd will be assigned a new file descriptor number in the target process.
+ *
+ * Wait Queue State
+ * Any processes currently blocked on read() operations will be woken up and
+ * need to re-establish their blocking state if desired.
+ *
+ * File Position
+ * Eventfd files don't have a traditional file position, but any internal
+ * state related to the file descriptor is reset.
+ */
+
+#include <linux/err.h>
+#include <linux/file.h>
+#include <linux/io.h>
+#include <linux/kexec_handover.h>
+#include <linux/kho/abi/eventfd.h>
+#include <linux/liveupdate.h>
+#include <linux/module.h>
+#include <linux/eventfd.h>
+#include <linux/anon_inodes.h>
+#include <linux/idr.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/kref.h>
+#include <linux/fdtable.h>
+
+static int eventfd_luo_preserve(struct liveupdate_file_op_args *args)
+{
+ struct eventfd_luo_ser *ser;
+ u64 count;
+ unsigned int flags;
+ int err = 0;
+
+ /* Get eventfd state safely */
+ err = eventfd_luo_get_state(args->file, &count, &flags);
+ if (err) {
+ pr_err("Failed to get eventfd state: %d\n", err);
+ return err;
+ }
+
+ ser = kho_alloc_preserve(sizeof(*ser));
+ if (IS_ERR(ser)) {
+ err = PTR_ERR(ser);
+ pr_err("Failed to allocate preserve memory: %d\n", err);
+ return err;
+ }
+
+ /* Save eventfd state */
+ ser->count = count;
+ ser->flags = flags;
+
+ pr_debug("Preserved eventfd: count=%llu, flags=0x%x\n",
+ ser->count, ser->flags);
+
+ /* Return physical address of serialization structure */
+ args->serialized_data = virt_to_phys(ser);
+
+ return 0;
+}
+
+static int eventfd_luo_freeze(struct liveupdate_file_op_args *args)
+{
+ struct eventfd_luo_ser *ser;
+ u64 count;
+ unsigned int flags;
+ int err;
+
+ if (WARN_ON_ONCE(!args->serialized_data))
+ return -EINVAL;
+
+ ser = phys_to_virt(args->serialized_data);
+
+ /* Get current state and update if changed */
+ err = eventfd_luo_get_state(args->file, &count, &flags);
+ if (err)
+ return err;
+
+ if (ser->count != count) {
+ pr_debug("WARNING: Count changed during preserve->freeze! old=%llu, new=%llu\n",
+ ser->count, count);
+ }
+
+ ser->count = count;
+
+ return 0;
+}
+
+static void eventfd_luo_unpreserve(struct liveupdate_file_op_args *args)
+{
+ struct eventfd_luo_ser *ser;
+
+ if (WARN_ON_ONCE(!args->serialized_data))
+ return;
+
+ ser = phys_to_virt(args->serialized_data);
+ kho_unpreserve_free(ser);
+}
+
+static int eventfd_luo_retrieve(struct liveupdate_file_op_args *args)
+{
+ struct eventfd_luo_ser *ser;
+ struct eventfd_ctx *ctx;
+ struct file *file = NULL;
+ int eventfd;
+
+ if (!args->serialized_data)
+ return -EINVAL;
+ ser = phys_to_virt(args->serialized_data);
+
+ /* Create a new eventfd with the preserved count and flags */
+ eventfd = eventfd_create(ser->count, ser->flags);
+ if (eventfd < 0) {
+ pr_err("Failed to create eventfd: %d\n", eventfd);
+ return eventfd;
+ }
+
+ file = fget(eventfd);
+ if (!file) {
+ pr_err("Failed to get file from fd\n");
+ close_fd(eventfd);
+ return -EBADF;
+ }
+
+ close_fd(eventfd);
+
+ /* Verify the created file has correct internal state */
+ ctx = eventfd_ctx_fileget(file);
+ if (IS_ERR(ctx)) {
+ pr_err("Failed to get context from file\n");
+ fput(file);
+ return PTR_ERR(ctx);
+ }
+
+ eventfd_ctx_put(ctx);
+
+ args->file = file;
+ return 0;
+}
+
+static void eventfd_luo_finish(struct liveupdate_file_op_args *args)
+{
+ struct eventfd_luo_ser *ser;
+
+ if (args->retrieve_status)
+ return;
+
+ if (!args->serialized_data)
+ return;
+
+ ser = phys_to_virt(args->serialized_data);
+ if (!ser)
+ return;
+
+ kho_restore_free(ser);
+}
+
+static bool eventfd_luo_can_preserve(struct liveupdate_file_handler *handler,
+ struct file *file)
+{
+ struct eventfd_ctx *ctx;
+
+ if (!file->f_op)
+ return false;
+
+ /* Try to get eventfd context - this will fail if not an eventfd */
+ ctx = eventfd_ctx_fileget(file);
+ if (IS_ERR(ctx))
+ return false;
+
+ eventfd_ctx_put(ctx);
+ return true;
+}
+
+static const struct liveupdate_file_ops eventfd_luo_file_ops = {
+ .preserve = eventfd_luo_preserve,
+ .unpreserve = eventfd_luo_unpreserve,
+ .freeze = eventfd_luo_freeze,
+ .retrieve = eventfd_luo_retrieve,
+ .finish = eventfd_luo_finish,
+ .can_preserve = eventfd_luo_can_preserve,
+ .owner = THIS_MODULE,
+};
+
+static struct liveupdate_file_handler eventfd_luo_handler = {
+ .ops = &eventfd_luo_file_ops,
+ .compatible = EVENTFD_LUO_FH_COMPATIBLE,
+};
+
+static int __init eventfd_luo_init(void)
+{
+ int err = liveupdate_register_file_handler(&eventfd_luo_handler);
+
+ if (err && err != -EOPNOTSUPP) {
+ pr_err("Could not register eventfd LUO handler: %pe\n",
+ ERR_PTR(err));
+ return err;
+ }
+
+ return 0;
+}
+late_initcall(eventfd_luo_init);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index e32bee4345fb..703e1a126c4d 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -35,6 +35,8 @@ void eventfd_ctx_put(struct eventfd_ctx *ctx);
struct file *eventfd_fget(int fd);
struct eventfd_ctx *eventfd_ctx_fdget(int fd);
struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
+int eventfd_luo_get_state(struct file *file, __u64 *count, unsigned int *flags);
+int eventfd_create(__u64 count, unsigned int flags);
void eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask);
int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
__u64 *cnt);
diff --git a/include/linux/kho/abi/eventfd.h b/include/linux/kho/abi/eventfd.h
new file mode 100644
index 000000000000..148beac6bcc7
--- /dev/null
+++ b/include/linux/kho/abi/eventfd.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 KylinSoft Corporation.
+ * Author: Chenghao Duan <duanchenghao@kylinos.cn>
+ */
+
+#ifndef _LINUX_KHO_ABI_EVENTFD_H
+#define _LINUX_KHO_ABI_EVENTFD_H
+
+#include <linux/types.h>
+
+/*
+ * Eventfd Live Update ABI
+ *
+ * This header defines the ABI for preserving eventfd state across kexec.
+ *
+ * The state is serialized into a packed structure `struct eventfd_luo_ser`
+ * which is handed over to the next kernel via the KHO mechanism.
+ *
+ */
+
+/**
+ * struct eventfd_luo_ser - Serialized state of an eventfd
+ * @count: The current counter value
+ * @flags: File flags (EFD_SEMAPHORE, EFD_CLOEXEC, EFD_NONBLOCK)
+ *
+ * This structure contains the minimal state needed to restore an eventfd
+ * after kexec. The count represents the current value of the event counter,
+ * and flags represent the file creation flags.
+ */
+struct eventfd_luo_ser {
+ __u64 count;
+ unsigned int flags;
+} __packed;
+
+/* The compatibility string for eventfd file handler */
+#define EVENTFD_LUO_FH_COMPATIBLE "eventfd-v1"
+
+#endif /* _LINUX_KHO_ABI_EVENTFD_H */
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index c13af38ba23a..1361b5733f41 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -86,4 +86,20 @@ config LIVEUPDATE_MEMFD
If unsure, say N.
+config LIVEUPDATE_EVENTFD
+ bool "Eventfd Live Update Orchestrator support"
+ depends on EVENTFD
+ depends on LIVEUPDATE
+ help
+ Enable Live Update Orchestrator support for eventfd file descriptors.
+ This allows eventfd files to be preserved and restored across kexec
+ operations, maintaining their counter values and flags.
+
+ Eventfd files are commonly used for event notification between
+ userspace processes or between userspace and kernel. With this
+ option enabled, eventfd state can be handed over to a new kernel
+ during live update operations.
+
+ If unsure, say N.
+
endmenu
--
2.25.1
^ permalink raw reply related
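For readers following the ABI header in the patch above, here is a minimal userspace sketch of the serialized layout. It assumes `unsigned int` is 32 bits wide on the architectures of interest; the `demo_` names and the fixed-width types are illustrative assumptions and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of struct eventfd_luo_ser.
 * Fixed-width types are used here only to make the layout explicit.
 */
struct demo_eventfd_ser {
	uint64_t count;	/* eventfd counter value */
	uint32_t flags;	/* creation flags (EFD_*) */
} __attribute__((packed));

/* With packing there is no tail padding: 8 + 4 bytes. */
_Static_assert(sizeof(struct demo_eventfd_ser) == 12,
	       "packed layout should be 12 bytes");
_Static_assert(offsetof(struct demo_eventfd_ser, flags) == 8,
	       "flags should follow count directly");

/* Fill a record the way the preserve step stores count and flags. */
static size_t demo_ser_fill(struct demo_eventfd_ser *ser,
			    uint64_t count, uint32_t flags)
{
	assert(ser != NULL);
	ser->count = count;
	ser->flags = flags;
	return sizeof(*ser);
}
```

If the two kernels involved in a handover could ever disagree on the width of `unsigned int`, a fixed-width type in the ABI header would remove that ambiguity; whether that matters here is for the reviewers to judge.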
* [PATCH v1 2/2] selftests: liveupdate: Add selftest for eventfd LUO
From: Chenghao Duan @ 2026-06-25 5:49 UTC (permalink / raw)
To: viro, brauner, jack, linux-fsdevel, pasha.tatashin, linux-kernel,
rppt, pratyush, kexec, linux-mm
Cc: jianghaoran, duanchenghao
In-Reply-To: <20260625054946.73445-1-duanchenghao@kylinos.cn>
This test verifies the Live Update Orchestrator (LUO) support for
preserving eventfd file descriptors across kexec. It creates multiple
LUO sessions, each preserving different eventfd types, and verifies
their state after kexec.
The test covers six different eventfd configurations:
1. **Empty eventfd** - Zero count, default flags
- Verifies flag preservation without behavioral testing
2. **Default eventfd** - Initial count + write, verifies count preservation
- Tests basic counter value retention across kexec
3. **Semaphore eventfd** - EFD_SEMAPHORE flag, multiple reads
- Verifies semaphore behavior (returns 1 per read)
4. **Non-blocking eventfd** - EFD_NONBLOCK flag
- Tests O_NONBLOCK flag preservation
5. **Large-count eventfd** - UINT_MAX count value
- Tests handling of maximum counter values
6. **Modified-after-preserve** - Count changed during handover
- Verifies freeze callback captures final state
The test validates the following sequence:
Stage 1 (pre-kexec):
- Creates state file for inter-stage communication
- Creates multiple LUO sessions
- Preserves eventfds with different configurations
- Triggers kexec reboot
Stage 2 (post-kexec):
- Retrieves preserved eventfd sessions
- Verifies flags and counter values for each type
- Tests semaphore read behavior
- Finalizes all sessions
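The eventfd semantics that the configurations above rely on can be sketched in plain userspace C, independent of LUO. This is only an illustration: the `demo_` helpers are hypothetical names and are not taken from the selftest. It shows counter mode (one read returns the accumulated sum), semaphore mode (each read returns 1) and where the two creation flags become visible. Note that `O_NONBLOCK` is reported by `F_GETFL`, while close-on-exec is a descriptor flag reported by `F_GETFD`.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Read one 8-byte value; returns 0 on success, a negative errno otherwise. */
static int demo_efd_read(int efd, uint64_t *val)
{
	ssize_t n;

	assert(val != NULL);
	n = read(efd, val, sizeof(*val));
	if (n == (ssize_t)sizeof(*val))
		return 0;
	return n < 0 ? -errno : -EIO;
}

/*
 * Counter mode: a single read returns the whole sum and resets it to 0.
 * Returns the result of the second read, which is -EAGAIN on the
 * expected path because the descriptor is non-blocking and now empty.
 */
static int demo_counter_mode(uint64_t *first)
{
	uint64_t add = 10, val;
	int efd = eventfd(42, EFD_NONBLOCK);
	int ret;

	if (efd < 0)
		return -errno;
	if (write(efd, &add, sizeof(add)) != (ssize_t)sizeof(add)) {
		ret = -EIO;
		goto out;
	}
	ret = demo_efd_read(efd, first);	/* 42 + 10 */
	if (!ret)
		ret = demo_efd_read(efd, &val);	/* counter is empty */
out:
	close(efd);
	return ret;
}

/* Semaphore mode: each read returns 1; returns the number of such reads. */
static int demo_semaphore_mode(unsigned int initial)
{
	int efd = eventfd(initial, EFD_SEMAPHORE | EFD_NONBLOCK);
	uint64_t val;
	int reads = 0;

	if (efd < 0)
		return -errno;
	while (demo_efd_read(efd, &val) == 0 && val == 1)
		reads++;
	close(efd);
	return reads;
}

/* Report where EFD_NONBLOCK and EFD_CLOEXEC show up after creation. */
static int demo_flags(int *nonblock, int *cloexec)
{
	int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
	int fl, fdfl;

	if (efd < 0)
		return -errno;
	fl = fcntl(efd, F_GETFL);	/* file status flags */
	fdfl = fcntl(efd, F_GETFD);	/* descriptor flags */
	close(efd);
	if (fl < 0 || fdfl < 0)
		return -EINVAL;
	*nonblock = !!(fl & O_NONBLOCK);
	*cloexec = !!(fdfl & FD_CLOEXEC);
	return 0;
}
```

Calling `demo_counter_mode()` mirrors the default case (42 + 10), and `demo_semaphore_mode(5)` mirrors the semaphore case with five reads.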
Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
---
tools/testing/selftests/liveupdate/Makefile | 1 +
tools/testing/selftests/liveupdate/config | 2 +
.../selftests/liveupdate/luo_test_eventfd.c | 376 ++++++++++++++++++
3 files changed, 379 insertions(+)
create mode 100644 tools/testing/selftests/liveupdate/luo_test_eventfd.c
diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile
index 30689d22cb02..e092e38fa6f6 100644
--- a/tools/testing/selftests/liveupdate/Makefile
+++ b/tools/testing/selftests/liveupdate/Makefile
@@ -8,6 +8,7 @@ TEST_GEN_PROGS_EXTENDED += luo_kexec_simple
TEST_GEN_PROGS_EXTENDED += luo_multi_session
TEST_GEN_PROGS_EXTENDED += luo_stress_sessions
TEST_GEN_PROGS_EXTENDED += luo_stress_files
+TEST_GEN_PROGS_EXTENDED += luo_test_eventfd
TEST_FILES += do_kexec.sh
diff --git a/tools/testing/selftests/liveupdate/config b/tools/testing/selftests/liveupdate/config
index 91d03f9a6a39..d388bd755245 100644
--- a/tools/testing/selftests/liveupdate/config
+++ b/tools/testing/selftests/liveupdate/config
@@ -9,3 +9,5 @@ CONFIG_LIVEUPDATE_TEST=y
CONFIG_MEMFD_CREATE=y
CONFIG_TMPFS=y
CONFIG_SHMEM=y
+CONFIG_EVENTFD=y
+CONFIG_LIVEUPDATE_EVENTFD=y
diff --git a/tools/testing/selftests/liveupdate/luo_test_eventfd.c b/tools/testing/selftests/liveupdate/luo_test_eventfd.c
new file mode 100644
index 000000000000..94ef3bc66ad9
--- /dev/null
+++ b/tools/testing/selftests/liveupdate/luo_test_eventfd.c
@@ -0,0 +1,376 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2026 KylinSoft Corporation.
+ * Author: Chenghao Duan <duanchenghao@kylinos.cn>
+ */
+
+/*
+ * Multi-session kexec selftest for eventfd LUO support.
+ *
+ * Modeled after luo_multi_session.c.
+ * It creates multiple LUO sessions, each preserving different eventfd types,
+ * and verifies them after kexec.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/eventfd.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+
+#include "luo_test_utils.h"
+
+/* Session names */
+#define SESSION_EMPTY "eventfd-empty"
+#define SESSION_DEFAULT "eventfd-default"
+#define SESSION_SEM "eventfd-sem"
+#define SESSION_NONBLOCK "eventfd-nonblock"
+#define SESSION_LARGE "eventfd-large"
+#define SESSION_MODIFIED "eventfd-modified-after-preserve"
+
+/* Tokens */
+#define TOKEN_DEFAULT 0xCAFEBABE
+#define TOKEN_SEM 0xDEADBEEF
+#define TOKEN_NONBLOCK 0xFEEDBEEF
+#define TOKEN_LARGE 0xBEEFCAFE
+#define TOKEN_MODIFIED 0xABCD1234
+
+/* Counts */
+#define COUNT_INITIAL 42
+#define COUNT_WRITE 10
+#define COUNT_EXPECTED (COUNT_INITIAL + COUNT_WRITE) /* 52 */
+#define COUNT_SEM 5
+#define COUNT_NONBLOCK 100
+#define COUNT_LARGE ((unsigned int)-1) /* UINT_MAX */
+#define COUNT_MODIFIED 25
+#define COUNT_MODIFY_DELTA 99
+
+/* State tracking */
+#define STATE_SESSION_NAME "eventfd_multi_state"
+#define STATE_TOKEN 997
+
+/* Eventfd verification modes */
+enum verify_mode {
+ VERIFY_FLAGS, /* Only verify flags, no behavior testing */
+ VERIFY_ONCE, /* Read once and verify count */
+ VERIFY_SEMAPHORE, /* Read multiple times, verify returns 1 each time */
+ VERIFY_NONBLOCK, /* Read once in nonblock mode */
+ VERIFY_LARGE /* Read once, verify large count */
+};
+
+/* Test session configuration */
+struct test_session_config {
+ const char *session_name;
+ unsigned long token;
+ unsigned int count;
+ unsigned int flags;
+ enum verify_mode verify_mode;
+ const char *desc;
+};
+
+/* Test session configurations */
+static const struct test_session_config test_configs[] = {
+ {SESSION_EMPTY, 0, 0, 0, VERIFY_FLAGS, "Empty"},
+ {SESSION_DEFAULT, TOKEN_DEFAULT, COUNT_EXPECTED, 0, VERIFY_ONCE, "Default"},
+ {SESSION_SEM, TOKEN_SEM, COUNT_SEM, EFD_SEMAPHORE, VERIFY_SEMAPHORE, "Semaphore"},
+ {SESSION_NONBLOCK, TOKEN_NONBLOCK, COUNT_NONBLOCK, EFD_NONBLOCK, VERIFY_NONBLOCK, "Nonblock"},
+ {SESSION_LARGE, TOKEN_LARGE, COUNT_LARGE, 0, VERIFY_LARGE, "Large-count"},
+ {SESSION_MODIFIED, TOKEN_MODIFIED, COUNT_MODIFIED + COUNT_MODIFY_DELTA, 0, VERIFY_ONCE, "Modified after preserve (kexec handover)"},
+};
+#define NUM_TEST_SESSIONS ARRAY_SIZE(test_configs)
+
+static int verify_eventfd_flags(int efd, unsigned int expected_flags, const char *desc)
+{
+ int actual_flags = fcntl(efd, F_GETFL);
+
+ if (actual_flags < 0)
+ return -errno;
+
+ int expected_fd_flags = expected_flags & (EFD_NONBLOCK | EFD_CLOEXEC);
+ int actual_fd_flags = actual_flags & (O_NONBLOCK | O_CLOEXEC);
+
+ if (actual_fd_flags != expected_fd_flags) {
+ ksft_print_msg("%s: flag mismatch - expected 0x%x, got 0x%x\n",
+ desc, expected_fd_flags, actual_fd_flags);
+ return -EINVAL;
+ }
+
+ ksft_print_msg(" %s eventfd flags OK (0x%x)\n", desc, expected_fd_flags);
+ return actual_flags; /* Return actual flags for further use */
+}
+
+static int ensure_nonblock(int efd, int current_flags)
+{
+ return fcntl(efd, F_SETFL, current_flags | O_NONBLOCK);
+}
+
+static int verify_count_once(int efd, unsigned int expected_count, const char *desc)
+{
+ uint64_t val;
+
+ if (read(efd, &val, sizeof(val)) != sizeof(val))
+ return -errno;
+
+ if (val != (uint64_t)expected_count) {
+ ksft_print_msg("%s: expected %u got %llu\n",
+ desc, expected_count, (unsigned long long)val);
+ return -EINVAL;
+ }
+
+ ksft_print_msg(" %s eventfd OK: %u\n", desc, expected_count);
+ return 0;
+}
+
+static int verify_semaphore_behavior(int efd, unsigned int expected_count, const char *desc)
+{
+ uint64_t val;
+
+ /* Read expected_count times, each should return 1 */
+ for (unsigned int i = 0; i < expected_count; i++) {
+ if (read(efd, &val, sizeof(val)) != sizeof(val))
+ return -errno;
+
+ if (val != 1) {
+ ksft_print_msg("%s: expected 1, got %llu at read %u\n",
+ desc, (unsigned long long)val, i + 1);
+ return -EINVAL;
+ }
+ ksft_print_msg(" %s eventfd OK: %u at read %u\n", desc, (unsigned int)val, i + 1);
+ }
+
+ /* Next read should return EAGAIN (no more events) */
+ if (read(efd, &val, sizeof(val)) >= 0 || errno != EAGAIN) {
+ ksft_print_msg("%s: expected EAGAIN after %u reads\n",
+ desc, expected_count);
+ return -EINVAL;
+ }
+
+ ksft_print_msg(" %s eventfd OK (%u reads)\n", desc, expected_count);
+ return 0;
+}
+
+static int restore_and_verify_eventfd_generic(int session_fd,
+ unsigned long token,
+ unsigned int expected_count,
+ unsigned int expected_flags,
+ enum verify_mode mode,
+ const char *desc)
+{
+ struct liveupdate_session_retrieve_fd arg = { .size = sizeof(arg) };
+ int efd, ret = 0, actual_flags;
+
+ arg.token = token;
+ if (ioctl(session_fd, LIVEUPDATE_SESSION_RETRIEVE_FD, &arg) < 0)
+ return -errno;
+ efd = arg.fd;
+
+ switch (mode) {
+ case VERIFY_FLAGS:
+ /* Only verify flags, no behavior testing */
+ ret = verify_eventfd_flags(efd, expected_flags, desc);
+ if (ret < 0)
+ close(efd);
+ return ret < 0 ? ret : 0;
+
+ case VERIFY_SEMAPHORE:
+ /* Verify flags + semaphore behavior */
+ actual_flags = verify_eventfd_flags(efd, expected_flags, desc);
+ if (actual_flags < 0) {
+ ret = actual_flags;
+ goto out;
+ }
+
+ if (ensure_nonblock(efd, actual_flags) < 0) {
+ ret = -errno;
+ goto out;
+ }
+
+ ret = verify_semaphore_behavior(efd, expected_count, desc);
+ break;
+
+ case VERIFY_ONCE:
+ case VERIFY_NONBLOCK:
+ case VERIFY_LARGE:
+ /* Verify flags + count behavior */
+ actual_flags = verify_eventfd_flags(efd, expected_flags, desc);
+ if (actual_flags < 0) {
+ ret = actual_flags;
+ goto out;
+ }
+
+ ret = verify_count_once(efd, expected_count, desc);
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+out:
+ close(efd);
+ return ret;
+}
+
+static int create_and_preserve_eventfd_keep_fd(int session_fd,
+ unsigned long token,
+ unsigned int count,
+ unsigned int flags,
+ const char *desc,
+ int *efd_out)
+{
+ struct liveupdate_session_preserve_fd arg = { .size = sizeof(arg) };
+ int efd = eventfd(count, flags);
+
+ if (efd < 0)
+ return -errno;
+
+ arg.fd = efd;
+ arg.token = token;
+ if (ioctl(session_fd, LIVEUPDATE_SESSION_PRESERVE_FD, &arg) < 0) {
+ int ret = -errno;
+
+ close(efd);
+ return ret;
+ }
+
+ if (efd_out)
+ *efd_out = efd;
+ else
+ close(efd);
+ return 0;
+}
+
+static int create_and_preserve_eventfd(int session_fd,
+ unsigned long token,
+ unsigned int count,
+ unsigned int flags,
+ const char *desc)
+{
+ return create_and_preserve_eventfd_keep_fd(session_fd, token, count,
+ flags, desc, NULL);
+}
+
+static int create_session_checked(int luo_fd, const char *session_name)
+{
+ int session_fd = luo_create_session(luo_fd, session_name);
+
+ if (session_fd < 0)
+ fail_exit("luo_create_session for '%s'", session_name);
+ return session_fd;
+}
+
+static int retrieve_session_checked(int luo_fd, const char *session_name)
+{
+ int session_fd = luo_retrieve_session(luo_fd, session_name);
+
+ if (session_fd < 0)
+ fail_exit("luo_retrieve_session for '%s'", session_name);
+ return session_fd;
+}
+
+static void finish_session_checked(int session_fd, const char *session_name)
+{
+ if (luo_session_finish(session_fd) < 0)
+ fail_exit("luo_session_finish for '%s'", session_name);
+ close(session_fd);
+}
+
+static int verify_eventfd_config(int session_fd, const struct test_session_config *config)
+{
+ return restore_and_verify_eventfd_generic(session_fd, config->token,
+ config->count, config->flags,
+ config->verify_mode, config->desc);
+}
+
+
+static void run_stage_1(int luo_fd)
+{
+ ksft_print_msg("[STAGE 1] Starting pre-kexec setup for multi-eventfd test...\n");
+ ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n");
+ create_state_file(luo_fd, STATE_SESSION_NAME, STATE_TOKEN, 2);
+
+ /* Create all test sessions */
+ for (size_t i = 0; i < NUM_TEST_SESSIONS; i++) {
+ const struct test_session_config *config = &test_configs[i];
+
+ ksft_print_msg("[STAGE 1] Creating session '%s'...\n", config->session_name);
+ int session_fd = create_session_checked(luo_fd, config->session_name);
+
+ /* Special handling for modified session (preserve then modify) */
+ if (config->token == TOKEN_MODIFIED) {
+ int preserved_efd;
+
+ if (create_and_preserve_eventfd_keep_fd(session_fd,
+ config->token,
+ COUNT_MODIFIED,
+ config->flags,
+ "modified-after-preserve",
+ &preserved_efd) < 0)
+ fail_exit("create_and_preserve_eventfd_keep_fd modified");
+
+ /* Now modify the preserved eventfd's count */
+ uint64_t modify_value = COUNT_MODIFY_DELTA;
+
+ if (write(preserved_efd, &modify_value,
+ sizeof(modify_value)) != sizeof(modify_value))
+ fail_exit("write to preserved eventfd after preserve");
+
+ close(preserved_efd);
+ } else {
+ /* Standard session creation */
+ if (create_and_preserve_eventfd(session_fd, config->token,
+ config->count, config->flags,
+ config->desc) < 0)
+ fail_exit("create_and_preserve_eventfd %s", config->desc);
+ }
+ }
+
+ close(luo_fd);
+ daemonize_and_wait();
+}
+
+static void run_stage_2(int luo_fd, int state_session_fd)
+{
+ int session_fds[NUM_TEST_SESSIONS];
+ int stage;
+
+ ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n");
+
+ restore_and_read_stage(state_session_fd, STATE_TOKEN, &stage);
+ if (stage != 2)
+ fail_exit("Expected stage 2, but state file contains %d", stage);
+
+ ksft_print_msg("[STAGE 2] Retrieving all sessions...\n");
+ for (size_t i = 0; i < NUM_TEST_SESSIONS; i++)
+ session_fds[i] = retrieve_session_checked(luo_fd, test_configs[i].session_name);
+
+ ksft_print_msg("[STAGE 2] Verifying eventfds...\n");
+ for (size_t i = 0; i < NUM_TEST_SESSIONS; i++) {
+ if (verify_eventfd_config(session_fds[i], &test_configs[i]) < 0)
+ fail_exit("verify %s eventfd", test_configs[i].desc);
+ }
+
+ ksft_print_msg("[STAGE 2] All eventfd sessions verified successfully.\n");
+
+ ksft_print_msg("[STAGE 2] Finalizing all sessions...\n");
+ for (size_t i = 0; i < NUM_TEST_SESSIONS; i++)
+ finish_session_checked(session_fds[i], test_configs[i].session_name);
+
+ ksft_print_msg("[STAGE 2] Finalizing state session...\n");
+ if (luo_session_finish(state_session_fd) < 0)
+ fail_exit("luo_session_finish for state session");
+ close(state_session_fd);
+
+ ksft_print_msg("\n--- EVENTFD_LUO TEST PASSED ---\n");
+}
+
+int main(int argc, char *argv[])
+{
+ return luo_test(argc, argv, STATE_SESSION_NAME,
+ run_stage_1, run_stage_2);
+}
--
2.25.1
^ permalink raw reply related
* [PATCH v1 0/2] luo support for preserving eventfd
From: Chenghao Duan @ 2026-06-25 5:49 UTC (permalink / raw)
To: viro, brauner, jack, linux-fsdevel, pasha.tatashin, linux-kernel,
rppt, pratyush, kexec, linux-mm
Cc: jianghaoran, duanchenghao
It is my great honor to participate in the development of LiveUpdate.
The current patch implements logic to preserve and retrieve eventfd
states, which I developed by referencing memfd_luo while learning the
LiveUpdate framework.
eventfd serves as a critical notification mechanism between Guest and
Host. During host kernel upgrades, we can preserve the corresponding
eventfd states and restore them after the kernel update completes.
Patch 0001 implements eventfd_luo, while Patch 0002 contains selftest code.
Test procedures:
1. ./luo_test_eventfd --stage 1
2. kexec reboot
3. ./luo_test_eventfd --stage 2
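When running these steps by hand, it can be useful to check an eventfd's counter before stage 1 triggers the kexec without consuming it with read(). A minimal sketch, assuming /proc is mounted: it parses the `eventfd-count:` line (a hexadecimal value) from /proc/self/fdinfo/<fd>. The `demo_` helpers are hypothetical and are not part of the selftest.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Peek at the counter of an eventfd via fdinfo, leaving it untouched.
 * Returns 0 on success, a negative errno if fdinfo cannot be opened,
 * or -ENOENT if no "eventfd-count:" line is present.
 */
static int demo_eventfd_peek(int efd, uint64_t *count)
{
	char path[64], line[128];
	unsigned long long val;
	int ret = -ENOENT;
	FILE *f;

	assert(count != NULL);
	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", efd);
	f = fopen(path, "r");
	if (!f)
		return -errno;
	while (fgets(line, sizeof(line), f)) {
		/* The kernel prints the count in hex, padded with spaces. */
		if (sscanf(line, "eventfd-count: %llx", &val) == 1) {
			*count = val;
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}

/* Create an eventfd with 25, add 99, peek, then consume with read(). */
static int demo_peek_then_read(uint64_t *peeked, uint64_t *consumed)
{
	uint64_t add = 99;
	int efd = eventfd(25, EFD_NONBLOCK);
	int ret;

	if (efd < 0)
		return -errno;
	if (write(efd, &add, sizeof(add)) != (ssize_t)sizeof(add)) {
		ret = -EIO;
		goto out;
	}
	ret = demo_eventfd_peek(efd, peeked);
	if (!ret && read(efd, consumed, sizeof(*consumed)) !=
	    (ssize_t)sizeof(*consumed))
		ret = -EIO;
out:
	close(efd);
	return ret;
}
```

The values 25 and 99 match the modified-after-preserve case in the selftest, so both the peeked and the consumed value should be 124.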
Chenghao Duan (2):
eventfd: luo: luo support for preserving eventfd
selftests: liveupdate: Add selftest for eventfd LUO
fs/Makefile | 1 +
fs/eventfd.c | 40 ++
fs/eventfd_luo.c | 250 ++++++++++++
include/linux/eventfd.h | 2 +
include/linux/kho/abi/eventfd.h | 39 ++
kernel/liveupdate/Kconfig | 16 +
tools/testing/selftests/liveupdate/Makefile | 1 +
tools/testing/selftests/liveupdate/config | 2 +
.../selftests/liveupdate/luo_test_eventfd.c | 376 ++++++++++++++++++
9 files changed, 727 insertions(+)
create mode 100644 fs/eventfd_luo.c
create mode 100644 include/linux/kho/abi/eventfd.h
create mode 100644 tools/testing/selftests/liveupdate/luo_test_eventfd.c
--
2.25.1
^ permalink raw reply
* Re: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
From: kernel test robot @ 2026-06-25 5:45 UTC (permalink / raw)
To: Dev Jain, akpm, david, ljs
Cc: llvm, oe-kbuild-all, Dev Jain, riel, liam, vbabka, harry, jannh,
kas, linux-mm, linux-kernel, ryan.roberts, anshuman.khandual,
stable
In-Reply-To: <20260625042853.2752898-1-dev.jain@arm.com>
Hi Dev,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-rmap-use-huge_ptep_get-in-try_to_unmap_one/20260625-123050
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260625042853.2752898-1-dev.jain%40arm.com
patch subject: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
config: hexagon-allnoconfig (https://download.01.org/0day-ci/archive/20260625/202606251341.jfIr1D7m-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 6cc609bb250b21b47fc7d394b4019101e9983597)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260625/202606251341.jfIr1D7m-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606251341.jfIr1D7m-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/rmap.c:2100:13: error: call to undeclared function 'huge_ptep_get'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
2100 | pteval = huge_ptep_get(mm, address, pvmw.pte);
| ^
>> mm/rmap.c:2100:11: error: assigning to 'pte_t' from incompatible type 'int'
2100 | pteval = huge_ptep_get(mm, address, pvmw.pte);
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.
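For context, errors of this kind typically appear when a helper is declared only under a config option while a caller references it unconditionally. The sketch below is a userspace illustration of the usual stub pattern. It assumes that no declaration of huge_ptep_get() is visible in this allnoconfig build; all `demo_` and `DEMO_` names are hypothetical, and this is neither the kernel's actual header layout nor a proposed fix.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for a Kconfig symbol. It is left undefined here
 * to mimic a build where the feature is configured out.
 */
/* #define DEMO_CONFIG_HUGETLB 1 */

typedef struct { unsigned long val; } demo_pte_t;

#ifdef DEMO_CONFIG_HUGETLB
/* Real helper, only available when the feature is enabled. */
static inline demo_pte_t demo_huge_ptep_get(const demo_pte_t *ptep)
{
	return *ptep;
}

static inline int demo_is_hugetlb(void)
{
	return 1;
}
#else
/*
 * Stub: gives the compiler a declaration so that callers still build.
 * It must never run, because demo_is_hugetlb() is constant 0 here.
 */
static inline demo_pte_t demo_huge_ptep_get(const demo_pte_t *ptep)
{
	(void)ptep;
	assert(0 && "hugetlb helper called with the feature disabled");
	return (demo_pte_t){ 0 };
}

static inline int demo_is_hugetlb(void)
{
	return 0;
}
#endif

/* Caller that references the helper unconditionally, as in the report. */
static demo_pte_t demo_read_pte(const demo_pte_t *ptep)
{
	if (demo_is_hugetlb())
		return demo_huge_ptep_get(ptep);
	return *ptep;
}
```

Without the `#else` branch, the call in `demo_read_pte()` would fail to compile in the same way as the report above, even though the branch is dead code in that configuration.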
vim +/huge_ptep_get +2100 mm/rmap.c
1980
1981 /*
1982 * @arg: enum ttu_flags will be passed to this argument
1983 */
1984 static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
1985 unsigned long address, void *arg)
1986 {
1987 struct mm_struct *mm = vma->vm_mm;
1988 DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
1989 bool anon_exclusive, ret = true;
1990 pte_t pteval;
1991 struct page *subpage;
1992 struct mmu_notifier_range range;
1993 enum ttu_flags flags = (enum ttu_flags)(long)arg;
1994 unsigned long nr_pages = 1, end_addr;
1995 unsigned long pfn;
1996 unsigned long hsz = 0;
1997 int ptes = 0;
1998
1999 /*
2000 * When racing against e.g. zap_pte_range() on another cpu,
2001 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
2002 * try_to_unmap() may return before folio_mapped() has become false,
2003 * if page table locking is skipped: use TTU_SYNC to wait for that.
2004 */
2005 if (flags & TTU_SYNC)
2006 pvmw.flags = PVMW_SYNC;
2007
2008 /*
2009 * For THP, we have to assume the worse case ie pmd for invalidation.
2010 * For hugetlb, it could be much worse if we need to do pud
2011 * invalidation in the case of pmd sharing.
2012 *
2013 * Note that the folio can not be freed in this function as call of
2014 * try_to_unmap() must hold a reference on the folio.
2015 */
2016 range.end = vma_address_end(&pvmw);
2017 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
2018 address, range.end);
2019 if (folio_test_hugetlb(folio)) {
2020 /*
2021 * If sharing is possible, start and end will be adjusted
2022 * accordingly.
2023 */
2024 adjust_range_if_pmd_sharing_possible(vma, &range.start,
2025 &range.end);
2026
2027 /* We need the huge page size for set_huge_pte_at() */
2028 hsz = huge_page_size(hstate_vma(vma));
2029 }
2030 mmu_notifier_invalidate_range_start(&range);
2031
2032 while (page_vma_mapped_walk(&pvmw)) {
2033 nr_pages = 1;
2034
2035 /*
2036 * If the folio is in an mlock()d vma, we must not swap it out.
2037 */
2038 if (!(flags & TTU_IGNORE_MLOCK) &&
2039 (vma->vm_flags & VM_LOCKED)) {
2040 ptes++;
2041
2042 /*
2043 * Set 'ret' to indicate the page cannot be unmapped.
2044 *
2045 * Do not jump to walk_abort immediately as additional
2046 * iteration might be required to detect fully mapped
2047 * folio an mlock it.
2048 */
2049 ret = false;
2050
2051 /* Only mlock fully mapped pages */
2052 if (pvmw.pte && ptes != pvmw.nr_pages)
2053 continue;
2054
2055 /*
2056 * All PTEs must be protected by page table lock in
2057 * order to mlock the page.
2058 *
2059 * If page table boundary has been cross, current ptl
2060 * only protect part of ptes.
2061 */
2062 if (pvmw.flags & PVMW_PGTABLE_CROSSED)
2063 goto walk_done;
2064
2065 /* Restore the mlock which got missed */
2066 mlock_vma_folio(folio, vma);
2067 goto walk_done;
2068 }
2069
2070 if (!pvmw.pte) {
2071 if (folio_test_lazyfree(folio)) {
2072 if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
2073 goto walk_done;
2074 /*
2075 * unmap_huge_pmd_locked has either already marked
2076 * the folio as swap-backed or decided to retain it
2077 * due to GUP or speculative references.
2078 */
2079 goto walk_abort;
2080 }
2081
2082 if (flags & TTU_SPLIT_HUGE_PMD) {
2083 /*
2084 * We temporarily have to drop the PTL and
2085 * restart so we can process the PTE-mapped THP.
2086 */
2087 split_huge_pmd_locked(vma, pvmw.address,
2088 pvmw.pmd, false);
2089 flags &= ~TTU_SPLIT_HUGE_PMD;
2090 page_vma_mapped_walk_restart(&pvmw);
2091 continue;
2092 }
2093 }
2094
2095 /* Unexpected PMD-mapped THP? */
2096 VM_BUG_ON_FOLIO(!pvmw.pte, folio);
2097
2098 address = pvmw.address;
2099 if (folio_test_hugetlb(folio)) {
> 2100 pteval = huge_ptep_get(mm, address, pvmw.pte);
2101 } else {
2102 /*
2103 * Handle PFN swap PTEs, such as device-exclusive ones,
2104 * that actually map pages.
2105 */
2106 pteval = ptep_get(pvmw.pte);
2107 }
2108 if (likely(pte_present(pteval))) {
2109 pfn = pte_pfn(pteval);
2110 } else {
2111 const softleaf_t entry = softleaf_from_pte(pteval);
2112
2113 pfn = softleaf_to_pfn(entry);
2114 VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
2115 }
2116
2117 subpage = folio_page(folio, pfn - folio_pfn(folio));
2118 anon_exclusive = folio_test_anon(folio) &&
2119 PageAnonExclusive(subpage);
2120
2121 if (folio_test_hugetlb(folio)) {
2122 bool anon = folio_test_anon(folio);
2123
2124 /*
2125 * The try_to_unmap() is only passed a hugetlb page
2126 * in the case where the hugetlb page is poisoned.
2127 */
2128 VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
2129 /*
2130 * huge_pmd_unshare may unmap an entire PMD page.
2131 * There is no way of knowing exactly which PMDs may
2132 * be cached for this mm, so we must flush them all.
2133 * start/end were already adjusted above to cover this
2134 * range.
2135 */
2136 flush_cache_range(vma, range.start, range.end);
2137
2138 /*
2139 * To call huge_pmd_unshare, i_mmap_rwsem must be
2140 * held in write mode. Caller needs to explicitly
2141 * do this outside rmap routines.
2142 *
2143 * We also must hold hugetlb vma_lock in write mode.
2144 * Lock order dictates acquiring vma_lock BEFORE
2145 * i_mmap_rwsem. We can only try lock here and fail
2146 * if unsuccessful.
2147 */
2148 if (!anon) {
2149 struct mmu_gather tlb;
2150
2151 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
2152 if (!hugetlb_vma_trylock_write(vma))
2153 goto walk_abort;
2154
2155 tlb_gather_mmu_vma(&tlb, vma);
2156 if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
2157 hugetlb_vma_unlock_write(vma);
2158 huge_pmd_unshare_flush(&tlb, vma);
2159 tlb_finish_mmu(&tlb);
2160 /*
2161 * The PMD table was unmapped,
2162 * consequently unmapping the folio.
2163 */
2164 goto walk_done;
2165 }
2166 hugetlb_vma_unlock_write(vma);
2167 tlb_finish_mmu(&tlb);
2168 }
2169 pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
2170 if (pte_dirty(pteval))
2171 folio_mark_dirty(folio);
2172 } else if (likely(pte_present(pteval))) {
2173 nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
2174 end_addr = address + nr_pages * PAGE_SIZE;
2175 flush_cache_range(vma, address, end_addr);
2176
2177 /* Nuke the page table entry. */
2178 pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
2179 /*
2180 * We clear the PTE but do not flush so potentially
2181 * a remote CPU could still be writing to the folio.
2182 * If the entry was previously clean then the
2183 * architecture must guarantee that a clear->dirty
2184 * transition on a cached TLB entry is written through
2185 * and traps if the PTE is unmapped.
2186 */
2187 if (should_defer_flush(mm, flags))
2188 set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
2189 else
2190 flush_tlb_range(vma, address, end_addr);
2191 if (pte_dirty(pteval))
2192 folio_mark_dirty(folio);
2193 } else {
2194 pte_clear(mm, address, pvmw.pte);
2195 }
2196
2197 /*
2198 * Now the pte is cleared. If this pte was uffd-wp armed,
2199 * we may want to replace a none pte with a marker pte if
2200 * it's file-backed, so we don't lose the tracking info.
2201 */
2202 pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
2203
2204 /* Update high watermark before we lower rss */
2205 update_hiwater_rss(mm);
2206
2207 if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
2208 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
2209 if (folio_test_hugetlb(folio)) {
2210 hugetlb_count_sub(folio_nr_pages(folio), mm);
2211 set_huge_pte_at(mm, address, pvmw.pte, pteval,
2212 hsz);
2213 } else {
2214 dec_mm_counter(mm, mm_counter(folio));
2215 set_pte_at(mm, address, pvmw.pte, pteval);
2216 }
2217 } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
2218 !userfaultfd_armed(vma)) {
2219 /*
2220 * The guest indicated that the page content is of no
2221 * interest anymore. Simply discard the pte, vmscan
2222 * will take care of the rest.
2223 * A future reference will then fault in a new zero
2224 * page. When userfaultfd is active, we must not drop
2225 * this page though, as its main user (postcopy
2226 * migration) will not expect userfaults on already
2227 * copied pages.
2228 */
2229 dec_mm_counter(mm, mm_counter(folio));
2230 } else if (folio_test_anon(folio)) {
2231 swp_entry_t entry = page_swap_entry(subpage);
2232 pte_t swp_pte;
2233 /*
2234 * Store the swap location in the pte.
2235 * See handle_pte_fault() ...
2236 */
2237 if (unlikely(folio_test_swapbacked(folio) !=
2238 folio_test_swapcache(folio))) {
2239 WARN_ON_ONCE(1);
2240 goto walk_abort;
2241 }
2242
2243 /* MADV_FREE page check */
2244 if (!folio_test_swapbacked(folio)) {
2245 int ref_count, map_count;
2246
2247 /*
2248 * Synchronize with gup_pte_range():
2249 * - clear PTE; barrier; read refcount
2250 * - inc refcount; barrier; read PTE
2251 */
2252 smp_mb();
2253
2254 ref_count = folio_ref_count(folio);
2255 map_count = folio_mapcount(folio);
2256
2257 /*
2258 * Order reads for page refcount and dirty flag
2259 * (see comments in __remove_mapping()).
2260 */
2261 smp_rmb();
2262
2263 if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
2264 /*
2265 * redirtied either using the page table or a previously
2266 * obtained GUP reference.
2267 */
2268 set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
2269 folio_set_swapbacked(folio);
2270 goto walk_abort;
2271 } else if (ref_count != 1 + map_count) {
2272 /*
2273 * Additional reference. Could be a GUP reference or any
2274 * speculative reference. GUP users must mark the folio
2275 * dirty if there was a modification. This folio cannot be
2276 * reclaimed right now either way, so act just like nothing
2277 * happened.
2278 * We'll come back here later and detect if the folio was
2279 * dirtied when the additional reference is gone.
2280 */
2281 set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
2282 goto walk_abort;
2283 }
2284 add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
2285 goto discard;
2286 }
2287
2288 if (folio_dup_swap(folio, subpage) < 0) {
2289 set_pte_at(mm, address, pvmw.pte, pteval);
2290 goto walk_abort;
2291 }
2292
2293 /*
2294 * arch_unmap_one() is expected to be a NOP on
2295 * architectures where we could have PFN swap PTEs,
2296 * so we'll not check/care.
2297 */
2298 if (arch_unmap_one(mm, vma, address, pteval) < 0) {
2299 folio_put_swap(folio, subpage);
2300 set_pte_at(mm, address, pvmw.pte, pteval);
2301 goto walk_abort;
2302 }
2303
2304 /* See folio_try_share_anon_rmap(): clear PTE first. */
2305 if (anon_exclusive &&
2306 folio_try_share_anon_rmap_pte(folio, subpage)) {
2307 folio_put_swap(folio, subpage);
2308 set_pte_at(mm, address, pvmw.pte, pteval);
2309 goto walk_abort;
2310 }
2311 if (list_empty(&mm->mmlist)) {
2312 spin_lock(&mmlist_lock);
2313 if (list_empty(&mm->mmlist))
2314 list_add(&mm->mmlist, &init_mm.mmlist);
2315 spin_unlock(&mmlist_lock);
2316 }
2317 dec_mm_counter(mm, MM_ANONPAGES);
2318 inc_mm_counter(mm, MM_SWAPENTS);
2319 swp_pte = swp_entry_to_pte(entry);
2320 if (anon_exclusive)
2321 swp_pte = pte_swp_mkexclusive(swp_pte);
2322 if (likely(pte_present(pteval))) {
2323 if (pte_soft_dirty(pteval))
2324 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2325 if (pte_uffd_wp(pteval))
2326 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2327 } else {
2328 if (pte_swp_soft_dirty(pteval))
2329 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2330 if (pte_swp_uffd_wp(pteval))
2331 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2332 }
2333 set_pte_at(mm, address, pvmw.pte, swp_pte);
2334 } else {
2335 /*
2336 * This is a locked file-backed folio,
2337 * so it cannot be removed from the page
2338 * cache and replaced by a new folio before
2339 * mmu_notifier_invalidate_range_end, so no
2340 * concurrent thread might update its page table
2341 * to point at a new folio while a device is
2342 * still using this folio.
2343 *
2344 * See Documentation/mm/mmu_notifier.rst
2345 */
2346 add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
2347 }
2348 discard:
2349 if (unlikely(folio_test_hugetlb(folio))) {
2350 hugetlb_remove_rmap(folio);
2351 } else {
2352 folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
2353 }
2354 if (vma->vm_flags & VM_LOCKED)
2355 mlock_drain_local();
2356 folio_put_refs(folio, nr_pages);
2357
2358 /*
2359 * If we are sure that we batched the entire folio and cleared
2360 * all PTEs, we can just optimize and stop right here.
2361 */
2362 if (nr_pages == folio_nr_pages(folio))
2363 goto walk_done;
2364 continue;
2365 walk_abort:
2366 ret = false;
2367 walk_done:
2368 page_vma_mapped_walk_done(&pvmw);
2369 break;
2370 }
2371
2372 mmu_notifier_invalidate_range_end(&range);
2373
2374 return ret;
2375 }
2376
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
From: kernel test robot @ 2026-06-25 5:45 UTC
To: Dev Jain, akpm, david, ljs
Cc: oe-kbuild-all, Dev Jain, riel, liam, vbabka, harry, jannh, kas,
linux-mm, linux-kernel, ryan.roberts, anshuman.khandual, stable
In-Reply-To: <20260625042853.2752898-1-dev.jain@arm.com>
Hi Dev,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-rmap-use-huge_ptep_get-in-try_to_unmap_one/20260625-123050
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260625042853.2752898-1-dev.jain%40arm.com
patch subject: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260625/202606251311.CCKYInqf-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260625/202606251311.CCKYInqf-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606251311.CCKYInqf-lkp@intel.com/
All errors (new ones prefixed by >>):
   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:2100:34: error: implicit declaration of function 'huge_ptep_get' [-Werror=implicit-function-declaration]
    2100 |                         pteval = huge_ptep_get(mm, address, pvmw.pte);
         |                                  ^~~~~~~~~~~~~
>> mm/rmap.c:2100:34: error: incompatible types when assigning to type 'pte_t' from type 'int'
   cc1: some warnings being treated as errors
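
Both errors above appear to share one cause: in this configuration no declaration of huge_ptep_get() is visible at the call site (presumably because hugetlb support is not configured in an allnoconfig build), so the compiler assumes an implicit function returning int, and assigning that int to a pte_t then fails. One common way to keep such a call site compiling is a static inline stub for the disabled configuration. The following is only a userspace sketch of that pattern; HAVE_HUGETLB_SKETCH, pte_t_sketch, ptep_get_sketch() and huge_ptep_get_sketch() are hypothetical stand-ins, not kernel definitions, and this is not necessarily how the patch author will resolve the report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table entry; not the kernel's pte_t. */
typedef struct { unsigned long val; } pte_t_sketch;

/* Plain accessor that is always available (stand-in for ptep_get()). */
static inline pte_t_sketch ptep_get_sketch(const pte_t_sketch *ptep)
{
	assert(ptep != NULL);
	return *ptep;
}

#ifdef HAVE_HUGETLB_SKETCH
/* With the feature enabled, a real implementation is declared elsewhere. */
pte_t_sketch huge_ptep_get_sketch(const pte_t_sketch *ptep);
#else
/*
 * Stub for builds without the feature: the call site still sees a
 * prototype with the correct return type, so there is neither an
 * implicit declaration nor an int-to-struct assignment.
 */
static inline pte_t_sketch huge_ptep_get_sketch(const pte_t_sketch *ptep)
{
	return ptep_get_sketch(ptep);
}
#endif

/* Call site modeled on the failing line; builds in both configurations. */
static inline unsigned long read_entry_sketch(const pte_t_sketch *ptep,
					       int is_hugetlb)
{
	pte_t_sketch pteval;

	if (is_hugetlb)
		pteval = huge_ptep_get_sketch(ptep);
	else
		pteval = ptep_get_sketch(ptep);
	return pteval.val;
}
```

An alternative with the same effect is to guard the call site itself with the configuration option, at the cost of an #ifdef inside the function body.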
vim +/huge_ptep_get +2100 mm/rmap.c
1980
1981 /*
1982 * @arg: enum ttu_flags will be passed to this argument
1983 */
1984 static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
1985 unsigned long address, void *arg)
1986 {
1987 struct mm_struct *mm = vma->vm_mm;
1988 DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
1989 bool anon_exclusive, ret = true;
1990 pte_t pteval;
1991 struct page *subpage;
1992 struct mmu_notifier_range range;
1993 enum ttu_flags flags = (enum ttu_flags)(long)arg;
1994 unsigned long nr_pages = 1, end_addr;
1995 unsigned long pfn;
1996 unsigned long hsz = 0;
1997 int ptes = 0;
1998
1999 /*
2000 * When racing against e.g. zap_pte_range() on another cpu,
2001 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
2002 * try_to_unmap() may return before folio_mapped() has become false,
2003 * if page table locking is skipped: use TTU_SYNC to wait for that.
2004 */
2005 if (flags & TTU_SYNC)
2006 pvmw.flags = PVMW_SYNC;
2007
2008 /*
2009 * For THP, we have to assume the worse case ie pmd for invalidation.
2010 * For hugetlb, it could be much worse if we need to do pud
2011 * invalidation in the case of pmd sharing.
2012 *
2013 * Note that the folio can not be freed in this function as call of
2014 * try_to_unmap() must hold a reference on the folio.
2015 */
2016 range.end = vma_address_end(&pvmw);
2017 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
2018 address, range.end);
2019 if (folio_test_hugetlb(folio)) {
2020 /*
2021 * If sharing is possible, start and end will be adjusted
2022 * accordingly.
2023 */
2024 adjust_range_if_pmd_sharing_possible(vma, &range.start,
2025 &range.end);
2026
2027 /* We need the huge page size for set_huge_pte_at() */
2028 hsz = huge_page_size(hstate_vma(vma));
2029 }
2030 mmu_notifier_invalidate_range_start(&range);
2031
2032 while (page_vma_mapped_walk(&pvmw)) {
2033 nr_pages = 1;
2034
2035 /*
2036 * If the folio is in an mlock()d vma, we must not swap it out.
2037 */
2038 if (!(flags & TTU_IGNORE_MLOCK) &&
2039 (vma->vm_flags & VM_LOCKED)) {
2040 ptes++;
2041
2042 /*
2043 * Set 'ret' to indicate the page cannot be unmapped.
2044 *
2045 * Do not jump to walk_abort immediately as additional
2046 * iteration might be required to detect fully mapped
2047 * folio an mlock it.
2048 */
2049 ret = false;
2050
2051 /* Only mlock fully mapped pages */
2052 if (pvmw.pte && ptes != pvmw.nr_pages)
2053 continue;
2054
2055 /*
2056 * All PTEs must be protected by page table lock in
2057 * order to mlock the page.
2058 *
2059 * If page table boundary has been cross, current ptl
2060 * only protect part of ptes.
2061 */
2062 if (pvmw.flags & PVMW_PGTABLE_CROSSED)
2063 goto walk_done;
2064
2065 /* Restore the mlock which got missed */
2066 mlock_vma_folio(folio, vma);
2067 goto walk_done;
2068 }
2069
2070 if (!pvmw.pte) {
2071 if (folio_test_lazyfree(folio)) {
2072 if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
2073 goto walk_done;
2074 /*
2075 * unmap_huge_pmd_locked has either already marked
2076 * the folio as swap-backed or decided to retain it
2077 * due to GUP or speculative references.
2078 */
2079 goto walk_abort;
2080 }
2081
2082 if (flags & TTU_SPLIT_HUGE_PMD) {
2083 /*
2084 * We temporarily have to drop the PTL and
2085 * restart so we can process the PTE-mapped THP.
2086 */
2087 split_huge_pmd_locked(vma, pvmw.address,
2088 pvmw.pmd, false);
2089 flags &= ~TTU_SPLIT_HUGE_PMD;
2090 page_vma_mapped_walk_restart(&pvmw);
2091 continue;
2092 }
2093 }
2094
2095 /* Unexpected PMD-mapped THP? */
2096 VM_BUG_ON_FOLIO(!pvmw.pte, folio);
2097
2098 address = pvmw.address;
2099 if (folio_test_hugetlb(folio)) {
> 2100 pteval = huge_ptep_get(mm, address, pvmw.pte);
2101 } else {
2102 /*
2103 * Handle PFN swap PTEs, such as device-exclusive ones,
2104 * that actually map pages.
2105 */
2106 pteval = ptep_get(pvmw.pte);
2107 }
2108 if (likely(pte_present(pteval))) {
2109 pfn = pte_pfn(pteval);
2110 } else {
2111 const softleaf_t entry = softleaf_from_pte(pteval);
2112
2113 pfn = softleaf_to_pfn(entry);
2114 VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
2115 }
2116
2117 subpage = folio_page(folio, pfn - folio_pfn(folio));
2118 anon_exclusive = folio_test_anon(folio) &&
2119 PageAnonExclusive(subpage);
2120
2121 if (folio_test_hugetlb(folio)) {
2122 bool anon = folio_test_anon(folio);
2123
2124 /*
2125 * The try_to_unmap() is only passed a hugetlb page
2126 * in the case where the hugetlb page is poisoned.
2127 */
2128 VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
2129 /*
2130 * huge_pmd_unshare may unmap an entire PMD page.
2131 * There is no way of knowing exactly which PMDs may
2132 * be cached for this mm, so we must flush them all.
2133 * start/end were already adjusted above to cover this
2134 * range.
2135 */
2136 flush_cache_range(vma, range.start, range.end);
2137
2138 /*
2139 * To call huge_pmd_unshare, i_mmap_rwsem must be
2140 * held in write mode. Caller needs to explicitly
2141 * do this outside rmap routines.
2142 *
2143 * We also must hold hugetlb vma_lock in write mode.
2144 * Lock order dictates acquiring vma_lock BEFORE
2145 * i_mmap_rwsem. We can only try lock here and fail
2146 * if unsuccessful.
2147 */
2148 if (!anon) {
2149 struct mmu_gather tlb;
2150
2151 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
2152 if (!hugetlb_vma_trylock_write(vma))
2153 goto walk_abort;
2154
2155 tlb_gather_mmu_vma(&tlb, vma);
2156 if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
2157 hugetlb_vma_unlock_write(vma);
2158 huge_pmd_unshare_flush(&tlb, vma);
2159 tlb_finish_mmu(&tlb);
2160 /*
2161 * The PMD table was unmapped,
2162 * consequently unmapping the folio.
2163 */
2164 goto walk_done;
2165 }
2166 hugetlb_vma_unlock_write(vma);
2167 tlb_finish_mmu(&tlb);
2168 }
2169 pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
2170 if (pte_dirty(pteval))
2171 folio_mark_dirty(folio);
2172 } else if (likely(pte_present(pteval))) {
2173 nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
2174 end_addr = address + nr_pages * PAGE_SIZE;
2175 flush_cache_range(vma, address, end_addr);
2176
2177 /* Nuke the page table entry. */
2178 pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
2179 /*
2180 * We clear the PTE but do not flush so potentially
2181 * a remote CPU could still be writing to the folio.
2182 * If the entry was previously clean then the
2183 * architecture must guarantee that a clear->dirty
2184 * transition on a cached TLB entry is written through
2185 * and traps if the PTE is unmapped.
2186 */
2187 if (should_defer_flush(mm, flags))
2188 set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
2189 else
2190 flush_tlb_range(vma, address, end_addr);
2191 if (pte_dirty(pteval))
2192 folio_mark_dirty(folio);
2193 } else {
2194 pte_clear(mm, address, pvmw.pte);
2195 }
2196
2197 /*
2198 * Now the pte is cleared. If this pte was uffd-wp armed,
2199 * we may want to replace a none pte with a marker pte if
2200 * it's file-backed, so we don't lose the tracking info.
2201 */
2202 pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
2203
2204 /* Update high watermark before we lower rss */
2205 update_hiwater_rss(mm);
2206
2207 if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
2208 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
2209 if (folio_test_hugetlb(folio)) {
2210 hugetlb_count_sub(folio_nr_pages(folio), mm);
2211 set_huge_pte_at(mm, address, pvmw.pte, pteval,
2212 hsz);
2213 } else {
2214 dec_mm_counter(mm, mm_counter(folio));
2215 set_pte_at(mm, address, pvmw.pte, pteval);
2216 }
2217 } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
2218 !userfaultfd_armed(vma)) {
2219 /*
2220 * The guest indicated that the page content is of no
2221 * interest anymore. Simply discard the pte, vmscan
2222 * will take care of the rest.
2223 * A future reference will then fault in a new zero
2224 * page. When userfaultfd is active, we must not drop
2225 * this page though, as its main user (postcopy
2226 * migration) will not expect userfaults on already
2227 * copied pages.
2228 */
2229 dec_mm_counter(mm, mm_counter(folio));
2230 } else if (folio_test_anon(folio)) {
2231 swp_entry_t entry = page_swap_entry(subpage);
2232 pte_t swp_pte;
2233 /*
2234 * Store the swap location in the pte.
2235 * See handle_pte_fault() ...
2236 */
2237 if (unlikely(folio_test_swapbacked(folio) !=
2238 folio_test_swapcache(folio))) {
2239 WARN_ON_ONCE(1);
2240 goto walk_abort;
2241 }
2242
2243 /* MADV_FREE page check */
2244 if (!folio_test_swapbacked(folio)) {
2245 int ref_count, map_count;
2246
2247 /*
2248 * Synchronize with gup_pte_range():
2249 * - clear PTE; barrier; read refcount
2250 * - inc refcount; barrier; read PTE
2251 */
2252 smp_mb();
2253
2254 ref_count = folio_ref_count(folio);
2255 map_count = folio_mapcount(folio);
2256
2257 /*
2258 * Order reads for page refcount and dirty flag
2259 * (see comments in __remove_mapping()).
2260 */
2261 smp_rmb();
2262
2263 if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
2264 /*
2265 * redirtied either using the page table or a previously
2266 * obtained GUP reference.
2267 */
2268 set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
2269 folio_set_swapbacked(folio);
2270 goto walk_abort;
2271 } else if (ref_count != 1 + map_count) {
2272 /*
2273 * Additional reference. Could be a GUP reference or any
2274 * speculative reference. GUP users must mark the folio
2275 * dirty if there was a modification. This folio cannot be
2276 * reclaimed right now either way, so act just like nothing
2277 * happened.
2278 * We'll come back here later and detect if the folio was
2279 * dirtied when the additional reference is gone.
2280 */
2281 set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
2282 goto walk_abort;
2283 }
2284 add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
2285 goto discard;
2286 }
2287
2288 if (folio_dup_swap(folio, subpage) < 0) {
2289 set_pte_at(mm, address, pvmw.pte, pteval);
2290 goto walk_abort;
2291 }
2292
2293 /*
2294 * arch_unmap_one() is expected to be a NOP on
2295 * architectures where we could have PFN swap PTEs,
2296 * so we'll not check/care.
2297 */
2298 if (arch_unmap_one(mm, vma, address, pteval) < 0) {
2299 folio_put_swap(folio, subpage);
2300 set_pte_at(mm, address, pvmw.pte, pteval);
2301 goto walk_abort;
2302 }
2303
2304 /* See folio_try_share_anon_rmap(): clear PTE first. */
2305 if (anon_exclusive &&
2306 folio_try_share_anon_rmap_pte(folio, subpage)) {
2307 folio_put_swap(folio, subpage);
2308 set_pte_at(mm, address, pvmw.pte, pteval);
2309 goto walk_abort;
2310 }
2311 if (list_empty(&mm->mmlist)) {
2312 spin_lock(&mmlist_lock);
2313 if (list_empty(&mm->mmlist))
2314 list_add(&mm->mmlist, &init_mm.mmlist);
2315 spin_unlock(&mmlist_lock);
2316 }
2317 dec_mm_counter(mm, MM_ANONPAGES);
2318 inc_mm_counter(mm, MM_SWAPENTS);
2319 swp_pte = swp_entry_to_pte(entry);
2320 if (anon_exclusive)
2321 swp_pte = pte_swp_mkexclusive(swp_pte);
2322 if (likely(pte_present(pteval))) {
2323 if (pte_soft_dirty(pteval))
2324 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2325 if (pte_uffd_wp(pteval))
2326 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2327 } else {
2328 if (pte_swp_soft_dirty(pteval))
2329 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2330 if (pte_swp_uffd_wp(pteval))
2331 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2332 }
2333 set_pte_at(mm, address, pvmw.pte, swp_pte);
2334 } else {
2335 /*
2336 * This is a locked file-backed folio,
2337 * so it cannot be removed from the page
2338 * cache and replaced by a new folio before
2339 * mmu_notifier_invalidate_range_end, so no
2340 * concurrent thread might update its page table
2341 * to point at a new folio while a device is
2342 * still using this folio.
2343 *
2344 * See Documentation/mm/mmu_notifier.rst
2345 */
2346 add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
2347 }
2348 discard:
2349 if (unlikely(folio_test_hugetlb(folio))) {
2350 hugetlb_remove_rmap(folio);
2351 } else {
2352 folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
2353 }
2354 if (vma->vm_flags & VM_LOCKED)
2355 mlock_drain_local();
2356 folio_put_refs(folio, nr_pages);
2357
2358 /*
2359 * If we are sure that we batched the entire folio and cleared
2360 * all PTEs, we can just optimize and stop right here.
2361 */
2362 if (nr_pages == folio_nr_pages(folio))
2363 goto walk_done;
2364 continue;
2365 walk_abort:
2366 ret = false;
2367 walk_done:
2368 page_vma_mapped_walk_done(&pvmw);
2369 break;
2370 }
2371
2372 mmu_notifier_invalidate_range_end(&range);
2373
2374 return ret;
2375 }
2376
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock()
From: Harry Yoo @ 2026-06-25 5:40 UTC
To: hu.shengming
Cc: vbabka, hao.li, cl, rientjes, roman.gushchin, linux-mm,
linux-kernel, zhang.run, cai.qu
In-Reply-To: <20260624172234202jw3y4yP1YfgOYbPCQdVIw@zte.com.cn>
On 6/24/26 6:22 PM, hu.shengming@zte.com.cn wrote:
> Harry wrote:
>> Currently, k[v]free_rcu() cannot be called in unknown context since
>> it could lead to a deadlock when called in the middle of k[v]free_rcu().
>>
>> Make users' lives easier by introducing kfree_rcu_nolock() variant,
>> now that kfree_rcu_sheaf() is available on PREEMPT_RT and
>> __kfree_rcu_sheaf() handles unknown context.
>>
>> Unlike k[v]free_rcu(), kfree_rcu_nolock() does not fall back to
>> the kvfree_rcu batching when the sheaves path fails, and falls back to
>> defer_kfree_rcu() instead. In most cases, the sheaves path is expected
>> to succeed and it's unnecessary to add complexity to the existing
>> kvfree_rcu batching.
>>
>> Since defer_kfree_rcu() can be called on caches without sheaves, move
>> deferred_work_barrier() and rcu_barrier() outside the branch in
>> kvfree_rcu_barrier_on_cache().
>>
>> Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
>
> Hi Harry,
>
> Thanks for the series. These patches fill a clear functional gap in the
> existing free APIs by adding an RCU-deferred free interface for contexts
> where kfree_rcu() cannot safely be used.
Thanks for looking into this, Shengming.
>> ---
>> include/linux/rcupdate.h | 12 ++++++++++++
>> mm/slab.h | 1 +
>> mm/slab_common.c | 22 ++++++++++++++++++++--
>> mm/slub.c | 23 ++++++++++++++++++++++-
>> 4 files changed, 55 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>> index 807924a94fb0..5a39e6225160 100644
>> --- a/mm/slab_common.c
>> +++ b/mm/slab_common.c
>> @@ -1263,6 +1263,23 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
>> EXPORT_TRACEPOINT_SYMBOL(kfree);
>> EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
>>
>> +void kfree_call_rcu_nolock(struct rcu_head *head, void *ptr)
>> +{
>> + struct slab *slab;
>> + struct kmem_cache *s;
>> +
>> + VM_WARN_ON_ONCE(is_vmalloc_addr(ptr) || !virt_to_slab(ptr));
>> +
>> + slab = virt_to_slab(ptr);
>> + s = slab->slab_cache;
>> +
>> + if (__kfree_rcu_sheaf(s, ptr, /* allow_spin = */ false))
>> + return;
>> +
>
> One consistency issue to address here: kfree_rcu_sheaf() only calls
> __kfree_rcu_sheaf() for objects belonging to the local NUMA node. This
> avoids filling a CPU's per-CPU sheaves with objects from remote slabs.
>
> kfree_call_rcu_nolock() currently skips that check and may therefore
> place remote-node objects into the local CPU's RCU sheaf.
That was intentional, but actually, this is a good point. Thanks.
> Could you add the same local-node check used by kfree_rcu_sheaf()
> before calling __kfree_rcu_sheaf(), and route remote-node objects
> directly to the defer_kfree_rcu() fallback path instead?
Falling back to defer_kfree_rcu() for remote-node objects didn't make much
sense in v3, as the object is inserted into a global list, which would cause
more trouble than a NUMA miss.
But once we make the fallback path per-CPU, your suggestion would make
more sense.
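
As an illustration of the routing discussed in this exchange, here is a toy userspace model: local-node objects may use the per-CPU sheaf, while remote-node objects and sheaf failures go to a per-CPU fallback list. Every name in it (cpu_sketch, sheaf_try_put_sketch(), free_rcu_nolock_sketch(), the route enum) is hypothetical, and the per-CPU fallback is the future design mentioned above, not what v3 implements.

```c
#include <stdbool.h>

enum free_route_sketch { ROUTE_SHEAF, ROUTE_DEFERRED };

/* Hypothetical per-CPU state; the real structures differ. */
struct cpu_sketch {
	int node;        /* NUMA node this CPU belongs to */
	int sheaf_free;  /* free slots left in this CPU's RCU sheaf */
	int deferred;    /* objects queued on this CPU's fallback list */
};

/* Try to place one object into the per-CPU sheaf; fails when it is full. */
static inline bool sheaf_try_put_sketch(struct cpu_sketch *cpu)
{
	if (cpu->sheaf_free == 0)
		return false;
	cpu->sheaf_free--;
	return true;
}

static inline enum free_route_sketch
free_rcu_nolock_sketch(struct cpu_sketch *cpu, int obj_node)
{
	/* Only local-node objects are allowed to consume sheaf slots. */
	if (obj_node == cpu->node && sheaf_try_put_sketch(cpu))
		return ROUTE_SHEAF;

	/* Remote-node objects and sheaf failures take the fallback list. */
	cpu->deferred++;
	return ROUTE_DEFERRED;
}
```

With a global fallback list instead of the per-CPU counter modeled here, the remote-node branch would add contention on shared state, which is the concern raised in the reply.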
--
Cheers,
Harry / Hyeonggon
* [PATCH v4] mm: assert exclusive nid/zonenum bits at the page/folio access sites
From: Hui Zhu @ 2026-06-25 5:39 UTC
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Kairui Song, Qi Zheng,
Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
linux-mm, linux-kernel
Cc: Hui Zhu
From: Hui Zhu <zhuhui@kylinos.cn>
KCSAN reports a data race between page_to_nid()/folio_pgdat() reading
page->flags and folio_trylock()/folio_lock() concurrently doing
test_and_set_bit_lock(PG_locked, ...) on the same word, e.g.:
BUG: KCSAN: data-race in __lruvec_stat_mod_folio / shmem_get_folio_gfp
The node id and zone id occupy fixed bit-ranges of page->flags that
are set once at page init and never modified afterwards, so they can
never overlap with the low PG_locked/PG_waiters bits touched by the
folio lock path.
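
The layout argument above can be made concrete with a small userspace sketch. The field widths, bit positions and all SK_/sk_ names are assumptions chosen for illustration and do not reproduce the kernel's actual page->flags layout; the point is only that fields packed at the top of the word are unaffected by setting or clearing a low lock bit.

```c
#include <stdint.h>

/* Hypothetical layout of a 64-bit flags word. */
#define SK_PG_LOCKED_MASK  (UINT64_C(1) << 0)  /* low bit used by the lock path */
#define SK_NODES_WIDTH     10
#define SK_ZONES_WIDTH     3
#define SK_NODES_PGSHIFT   (64 - SK_NODES_WIDTH)
#define SK_ZONES_PGSHIFT   (SK_NODES_PGSHIFT - SK_ZONES_WIDTH)
#define SK_NODES_MASK      ((UINT64_C(1) << SK_NODES_WIDTH) - 1)
#define SK_ZONES_MASK      ((UINT64_C(1) << SK_ZONES_WIDTH) - 1)

/* Build a flags word with the node and zone fields set once at "init". */
static inline uint64_t sk_make_flags(unsigned int nid, unsigned int zone)
{
	return (((uint64_t)nid & SK_NODES_MASK) << SK_NODES_PGSHIFT) |
	       (((uint64_t)zone & SK_ZONES_MASK) << SK_ZONES_PGSHIFT);
}

/* All bits occupied by the node and zone fields. */
static inline uint64_t sk_field_bits(void)
{
	return (SK_NODES_MASK << SK_NODES_PGSHIFT) |
	       (SK_ZONES_MASK << SK_ZONES_PGSHIFT);
}

static inline unsigned int sk_nid(uint64_t flags)
{
	return (unsigned int)((flags >> SK_NODES_PGSHIFT) & SK_NODES_MASK);
}

static inline unsigned int sk_zonenum(uint64_t flags)
{
	return (unsigned int)((flags >> SK_ZONES_PGSHIFT) & SK_ZONES_MASK);
}
```

Because sk_field_bits() and SK_PG_LOCKED_MASK share no bits, toggling the lock bit leaves sk_nid() and sk_zonenum() unchanged. The real code still needs an annotation for the race detector, since the read and the atomic bit operation touch the same word.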
ASSERT_EXCLUSIVE_BITS(mdf.f, ...) inside memdesc_nid()/memdesc_zonenum()
checks a by-value copy of the flags word, not the actual shared
page->flags/folio->flags being modified concurrently, so it doesn't
reliably assert anything about the real race. Move the assertion to
page_to_nid(), folio_nid(), page_zonenum() and folio_zonenum(), where
flags is dereferenced directly from the page/folio.
On CONFIG_NUMA=n, NODES_MASK is 0 and the old memdesc_nid() body
folded to a constant, so page->flags/folio->flags was never actually
read. ASSERT_EXCLUSIVE_BITS() is a real runtime check that can't be
folded away, so doing it unconditionally would add a pointless read
of page->flags/folio->flags and a check that can never fire. Keep
page_to_nid()/folio_nid() as plain "return 0" static inline stubs
under CONFIG_NUMA=n instead.
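
Note that the diff in this posting defines the CONFIG_NUMA=n variants as function-like macros expanding to (0), while the text above calls them static inline stubs. Both forms avoid reading the flags word; one practical difference is that a static inline stub still type-checks its argument. A minimal userspace sketch of the static inline form follows, assuming a hypothetical struct page_sketch and a hypothetical SKETCH_NUMA switch in place of the real struct page and CONFIG_NUMA.

```c
#include <limits.h>

/* Hypothetical stand-in; the real struct page is far larger. */
struct page_sketch { unsigned long flags; };

#define SKETCH_NODES_WIDTH    10
#define SKETCH_NODES_PGSHIFT  (sizeof(unsigned long) * CHAR_BIT - SKETCH_NODES_WIDTH)
#define SKETCH_NODES_MASK     ((1UL << SKETCH_NODES_WIDTH) - 1)

#ifdef SKETCH_NUMA
/* NUMA build: extract the node id from the top bits of the flags word. */
static inline int page_to_nid_sketch(const struct page_sketch *page)
{
	return (int)((page->flags >> SKETCH_NODES_PGSHIFT) & SKETCH_NODES_MASK);
}
#else
/*
 * Non-NUMA build: there is only node 0, so the stub never reads
 * page->flags, yet callers passing a wrong pointer type still get a
 * compiler diagnostic.
 */
static inline int page_to_nid_sketch(const struct page_sketch *page)
{
	(void)page;
	return 0;
}
#endif
```

Built without SKETCH_NUMA, the stub returns 0 regardless of the flags contents, which mirrors the constant-folding behavior described above.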
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
Changelog:
v4:
Following the comments from Andrew and Sashiko, make
page_to_nid()/folio_nid() static inline stubs returning 0
under CONFIG_NUMA=n.
v3:
Following the comments from Andrew and Sashiko, move
ASSERT_EXCLUSIVE_BITS out of memdesc_nid()/memdesc_zonenum()
into the page/folio call sites.
v2:
Following David's comments, remove useless comments and use
ASSERT_EXCLUSIVE_BITS() in memdesc_nid() instead of data_race() in
page_to_nid().
include/linux/mm.h | 9 +++++++++
include/linux/mmzone.h | 3 ++-
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 485df9c2dbdd..56b39194605a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2294,15 +2294,24 @@ static inline int memdesc_nid(memdesc_flags_t mdf)
}
#endif
+#ifdef CONFIG_NUMA
static inline int page_to_nid(const struct page *page)
{
+ ASSERT_EXCLUSIVE_BITS(PF_POISONED_CHECK(page)->flags,
+ NODES_MASK << NODES_PGSHIFT);
return memdesc_nid(PF_POISONED_CHECK(page)->flags);
}
static inline int folio_nid(const struct folio *folio)
{
+ ASSERT_EXCLUSIVE_BITS(folio->flags,
+ NODES_MASK << NODES_PGSHIFT);
return memdesc_nid(folio->flags);
}
+#else
+#define page_to_nid(page) (0)
+#define folio_nid(folio) (0)
+#endif
#ifdef CONFIG_NUMA_BALANCING
/* page access time bits needs to hold at least 4 seconds */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca2712187147..56dffa966343 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1274,17 +1274,18 @@ static inline bool zone_is_empty(const struct zone *zone)
static inline enum zone_type memdesc_zonenum(memdesc_flags_t flags)
{
- ASSERT_EXCLUSIVE_BITS(flags.f, ZONES_MASK << ZONES_PGSHIFT);
return (flags.f >> ZONES_PGSHIFT) & ZONES_MASK;
}
static inline enum zone_type page_zonenum(const struct page *page)
{
+ ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
return memdesc_zonenum(page->flags);
}
static inline enum zone_type folio_zonenum(const struct folio *folio)
{
+ ASSERT_EXCLUSIVE_BITS(folio->flags, ZONES_MASK << ZONES_PGSHIFT);
return memdesc_zonenum(folio->flags);
}
--
2.43.0
* Re: [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock()
From: Harry Yoo @ 2026-06-25 5:27 UTC
To: XIAO WU, Vlastimil Babka, Andrew Morton, Hao Li,
Christoph Lameter, David Rientjes, Roman Gushchin,
Alexei Starovoitov, Andrii Nakryiko, Puranjay Mohan, Amery Hung,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
Suren Baghdasaryan
Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf
In-Reply-To: <tencent_12A1049445F2CE248F20CB235AC111353A0A@qq.com>
On 6/22/26 11:56 PM, XIAO WU wrote:
> Hi Harry,
>
> On Mon, Jun 22, 2026 at 02:28:44PM +0900, Harry Yoo wrote:
>> On 6/21/26 9:29 AM, XIAO WU wrote:
>> > I was able to reproduce this in QEMU with KASAN. The trigger is as
>> > simple as passing a large (>8KB) kmalloc buffer to the new function.
>>
>> Thanks for taking a look, but this was intentional.
>>
>> I should have documented that only kmalloc_nolock() ->
>> kfree_rcu_nolock() is allowed and kmalloc() -> kfree_rcu_nolock()
>> is not allowed (yet).
>>
>> Since kmalloc_nolock() does not support large kmalloc, the warning
>> is not supposed to trigger. That is why I added only debug warnings.
>
> Thank you very much for taking the time to explain — I really
> appreciate it, especially since I'm still learning my way around the
> mm/ subsystem. You are absolutely right that kmalloc_nolock() returns
> NULL for sizes above KMALLOC_MAX_CACHE_SIZE, so a proper caller using
> the kmalloc_nolock() → kfree_rcu_nolock() pairing would never hit this.
>
> I did notice one small thing that I wanted to gently bring up, though.
> Please forgive me if I'm missing something obvious here.
>
> When I was reading through the surrounding code to understand the
> pattern better, I noticed that kfree_nolock() — which has the same
> "only for kmalloc_nolock()" constraint (documented in the comment at
> mm/slub.c:6828-6835) — actually does check for a NULL slab:
>
> void kfree_nolock(const void *object)
> {
> ...
> slab = virt_to_slab(object);
> if (unlikely(!slab)) {
> WARN_ONCE(1, "large_kmalloc is not supported by kfree_nolock()");
> return;
> }
> s = slab->slab_cache;
> ...
>
> So kfree_nolock() gracefully returns with a warning even though it too
> expects only kmalloc_nolock() callers. That pattern seemed really
> sensible to me — it costs almost nothing and prevents a panic if
> someone ever passes the wrong pointer (which they shouldn't, but as you
> mentioned, the constraint isn't documented on kfree_call_rcu_nolock()
> yet).
>
> I also wondered about the difference between WARN_ONCE (used in
> kfree_nolock) and VM_WARN_ON_ONCE (used in kfree_call_rcu_nolock). If
> I understand correctly, VM_WARN_ON_ONCE compiles away entirely on
> production kernels without CONFIG_DEBUG_VM, which would make the
> subsequent NULL dereference completely silent — no warning, just a
> panic.
It would crash without the debug option anyway, but the warnings are there
to make it easier to pinpoint what went wrong.
And not testing the code path at least once with the debug option enabled
is a big problem :)
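
The two styles of check being compared in this exchange can be sketched in userspace as follows. SKETCH_DEBUG_VM, SKETCH_VM_WARN_ON(), struct slab_sketch and both free_*_sketch() helpers are hypothetical names for illustration only; they are not the kernel macros or functions, and the sketch takes no side on which style the patch should use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical switch standing in for a debug configuration option. */
#ifdef SKETCH_DEBUG_VM
#define SKETCH_VM_WARN_ON(cond) \
	do { if (cond) fprintf(stderr, "debug warning: %s\n", #cond); } while (0)
#else
/* Compiles away: sizeof type-checks the expression without evaluating it. */
#define SKETCH_VM_WARN_ON(cond) ((void)sizeof(!(cond)))
#endif

struct slab_sketch { int cache_id; };

/*
 * Debug-only check: a NULL slab is reported only in debug builds, and
 * the dereference below faults in either build.
 */
static inline int free_debug_only_sketch(const struct slab_sketch *slab)
{
	SKETCH_VM_WARN_ON(slab == NULL);
	return slab->cache_id;
}

/* Always-on check: warn and bail out instead of dereferencing NULL. */
static inline bool free_checked_sketch(const struct slab_sketch *slab,
				       int *cache_id)
{
	if (slab == NULL) {
		fprintf(stderr, "unsupported object passed to free_checked_sketch()\n");
		return false;
	}
	*cache_id = slab->cache_id;
	return true;
}
```

The debug-only form costs nothing in production builds and relies on the path being exercised at least once with debugging enabled; the always-on form adds a branch but turns misuse into a warning rather than a fault.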
> And since you mentioned that kmalloc() → kfree_rcu_nolock() support is
> planned for the future (the "yet") — wouldn't this code path need the
> NULL check at that point anyway?
>
> I was thinking something like this would make the function consistent
> with kfree_nolock() and also make it forward-compatible with the
> planned kmalloc() support:
>
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1266,10 +1266,16 @@ void kfree_call_rcu_nolock(struct rcu_head
> *head, void *ptr)
> {
> struct slab *slab;
> struct kmem_cache *s;
>
> - VM_WARN_ON_ONCE(is_vmalloc_addr(ptr) || !virt_to_slab(ptr));
> -
> slab = virt_to_slab(ptr);
> + /*
> + * kmalloc_nolock() never produces large-kmalloc or vmalloc
> + * addresses, but be defensive: fall back to defer_kfree_rcu()
> + * for unsupported pointer types, consistent with kfree_nolock().
> + */
> + if (unlikely(!slab))
> + goto fallback;
Just FYI, virt_to_slab() and virt_to_page()
don't work correctly for vmalloc addresses.
And I don't think making it silently work is a good idea.
> +
> s = slab->slab_cache;
>
> if (__kfree_rcu_sheaf(s, ptr, /* allow_spin = */ false))
> return;
>
> +fallback:
> defer_kfree_rcu(head);
> }
>
> Of course, this is just a suggestion — you know this code far better
> than I do. If you feel the current code is fine as-is with proper
> documentation, I completely understand and won't press the point
> further.
>
> Either way, thank you again for the explanation, and for working on
> this series — having kfree_rcu_nolock() available for BPF and other
> contexts will be really valuable.
>
> Thanks,
> XIAO
* [RFC PATCH v1.1 11/11] mm/damon/sysfs: fix typos in probe_{add,rm}_dirs: s/attr/probe/
From: SeongJae Park @ 2026-06-25 5:07 UTC
Cc: SeongJae Park, Andrew Morton, damon, linux-kernel, linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>
damon_sysfs_probe_{add,rm}_dirs() name their damon_sysfs_probe parameter
'attr'. This is probably a trivial copy-pasta error, but it makes the code
unpleasant to read. Fix those.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/damon/sysfs.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index f3bb146b204df..36d71f1675426 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -1068,7 +1068,7 @@ static struct damon_sysfs_probe *damon_sysfs_probe_alloc(void)
return kzalloc_obj(struct damon_sysfs_probe);
}
-static int damon_sysfs_probe_add_dirs(struct damon_sysfs_probe *attr)
+static int damon_sysfs_probe_add_dirs(struct damon_sysfs_probe *probe)
{
struct damon_sysfs_filters *filters;
int err;
@@ -1076,22 +1076,22 @@ static int damon_sysfs_probe_add_dirs(struct damon_sysfs_probe *attr)
filters = damon_sysfs_filters_alloc();
if (!filters)
return -ENOMEM;
- attr->filters = filters;
+ probe->filters = filters;
err = kobject_init_and_add(&filters->kobj, &damon_sysfs_filters_ktype,
- &attr->kobj, "filters");
+ &probe->kobj, "filters");
if (err) {
kobject_put(&filters->kobj);
- attr->filters = NULL;
+ probe->filters = NULL;
}
return err;
}
-static void damon_sysfs_probe_rm_dirs(struct damon_sysfs_probe *attr)
+static void damon_sysfs_probe_rm_dirs(struct damon_sysfs_probe *probe)
{
- if (attr->filters) {
- damon_sysfs_filters_rm_dirs(attr->filters);
- kobject_put(&attr->filters->kobj);
+ if (probe->filters) {
+ damon_sysfs_filters_rm_dirs(probe->filters);
+ kobject_put(&probe->filters->kobj);
}
}
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v1.1 10/11] mm/damon/sysfs: split out filters setup function
From: SeongJae Park @ 2026-06-25 5:07 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, damon, linux-kernel, linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>
damon_sysfs_set_probe() does not only the probe setup but also the
filters setup. Split the filters setup out into
damon_sysfs_set_filters() for readability.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/damon/sysfs.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index 982d824f63c21..f3bb146b204df 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -1899,16 +1899,11 @@ static int damon_sysfs_set_attrs(struct damon_ctx *ctx,
return damon_set_attrs(ctx, &attrs);
}
-static int damon_sysfs_set_probe(struct damon_probe *probe,
- struct damon_sysfs_probe *sys_probe)
+static int damon_sysfs_set_filters(struct damon_probe *probe,
+ struct damon_sysfs_filters *sys_filters)
{
- struct damon_sysfs_filters *sys_filters;
int i;
- sys_filters = sys_probe->filters;
- if (!sys_filters)
- return 0;
-
for (i = 0; i < sys_filters->nr; i++) {
struct damon_sysfs_filter *sys_filter =
sys_filters->filters_arr[i];
@@ -1935,6 +1930,17 @@ static int damon_sysfs_set_probe(struct damon_probe *probe,
return 0;
}
+static int damon_sysfs_set_probe(struct damon_probe *probe,
+ struct damon_sysfs_probe *sys_probe)
+{
+ struct damon_sysfs_filters *sys_filters;
+
+ sys_filters = sys_probe->filters;
+ if (!sys_filters)
+ return 0;
+ return damon_sysfs_set_filters(probe, sys_filters);
+}
+
static int damon_sysfs_set_probes(struct damon_ctx *ctx,
struct damon_sysfs_probes *sys_probes)
{
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v1.1 09/11] mm/damon/sysfs: split probe setup function out
From: SeongJae Park @ 2026-06-25 5:07 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, damon, linux-kernel, linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>
damon_sysfs_set_probes() is relatively long. It has two nested loops
for setting up two nested entities, namely probes and filters. Split
out the probe-level setup for readability.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/damon/sysfs.c | 80 ++++++++++++++++++++++++++++--------------------
1 file changed, 46 insertions(+), 34 deletions(-)
diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index 2e95e3bac774d..982d824f63c21 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -1899,47 +1899,59 @@ static int damon_sysfs_set_attrs(struct damon_ctx *ctx,
return damon_set_attrs(ctx, &attrs);
}
-static int damon_sysfs_set_probes(struct damon_ctx *ctx,
- struct damon_sysfs_probes *sys_probes)
+static int damon_sysfs_set_probe(struct damon_probe *probe,
+ struct damon_sysfs_probe *sys_probe)
{
+ struct damon_sysfs_filters *sys_filters;
int i;
- for (i = 0; i < sys_probes->nr; i++) {
- struct damon_sysfs_filters *sys_filters =
- sys_probes->probes_arr[i]->filters;
- struct damon_probe *c;
- int j;
+ sys_filters = sys_probe->filters;
+ if (!sys_filters)
+ return 0;
- if (!sys_filters)
- continue;
- c = damon_new_probe();
- if (!c)
+ for (i = 0; i < sys_filters->nr; i++) {
+ struct damon_sysfs_filter *sys_filter =
+ sys_filters->filters_arr[i];
+ struct damon_filter *filter;
+
+ filter = damon_new_filter(sys_filter->type,
+ sys_filter->matching,
+ sys_filter->allow);
+ if (!filter)
return -ENOMEM;
- damon_add_probe(ctx, c);
-
- for (j = 0; j < sys_filters->nr; j++) {
- struct damon_sysfs_filter *sys_filter =
- sys_filters->filters_arr[j];
- struct damon_filter *filter;
-
- filter = damon_new_filter(sys_filter->type,
- sys_filter->matching,
- sys_filter->allow);
- if (!filter)
- return -ENOMEM;
- if (filter->type == DAMON_FILTER_TYPE_MEMCG) {
- int err;
-
- err = damon_sysfs_memcg_path_to_id(
- sys_filter->path,
- &filter->memcg_id);
- if (err) {
- damon_destroy_filter(filter);
- return err;
- }
+ if (filter->type == DAMON_FILTER_TYPE_MEMCG) {
+ int err;
+
+ err = damon_sysfs_memcg_path_to_id(
+ sys_filter->path,
+ &filter->memcg_id);
+ if (err) {
+ damon_destroy_filter(filter);
+ return err;
}
- damon_add_filter(c, filter);
}
+ damon_add_filter(probe, filter);
+ }
+ return 0;
+}
+
+static int damon_sysfs_set_probes(struct damon_ctx *ctx,
+ struct damon_sysfs_probes *sys_probes)
+{
+ int i, err;
+
+ for (i = 0; i < sys_probes->nr; i++) {
+ struct damon_sysfs_probe *sys_probe;
+ struct damon_probe *p;
+
+ p = damon_new_probe();
+ if (!p)
+ return -ENOMEM;
+ damon_add_probe(ctx, p);
+ sys_probe = sys_probes->probes_arr[i];
+ err = damon_sysfs_set_probe(p, sys_probe);
+ if (err)
+ return err;
}
return 0;
}
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v1.1 08/11] mm/damon/core: reduce range setup in damon_commit_target_regions()
From: SeongJae Park @ 2026-06-25 5:07 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, damon, linux-kernel, linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>
damon_commit_target_regions() calls damon_set_regions() to update the
destination target's monitoring region boundaries. It passes every
source region as a separate range, even when the regions are adjacent.
Meanwhile, damon_set_regions() makes the destination target regions
exactly the same as the source only when the destination has no
regions. When there are existing target regions, only a few regions
are expanded or shrunk to fit the boundaries of the disjoint ranges in
the source. Hence the boundaries between adjacent source ranges mean
nothing in common cases. When there are many regions, such adjacent
range setup is only a waste of time and space. We recently found [1]
that it is actually causing memory overhead. Set up ranges only for
the distinct (non-adjacent) ranges.
[1] https://lore.kernel.org/20260603112306.58490-1-akinobu.mita@gmail.com
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/damon/core.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 7e4b9affc5b06..ce5294cb1b4f3 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1349,21 +1349,33 @@ static struct damon_target *damon_nth_target(int n, struct damon_ctx *ctx)
static int damon_commit_target_regions(struct damon_target *dst,
struct damon_target *src, unsigned long src_min_region_sz)
{
- struct damon_region *src_region;
+ struct damon_region *src_region, *prev = NULL;
struct damon_addr_range *ranges;
int i = 0, err;
- damon_for_each_region(src_region, src)
- i++;
+ damon_for_each_region(src_region, src) {
+ if (!prev || prev->ar.end != src_region->ar.start)
+ i++;
+ prev = src_region;
+ }
if (!i)
return 0;
ranges = kmalloc_objs(*ranges, i, GFP_KERNEL | __GFP_NOWARN);
if (!ranges)
return -ENOMEM;
+ prev = NULL;
i = 0;
- damon_for_each_region(src_region, src)
- ranges[i++] = src_region->ar;
+ damon_for_each_region(src_region, src) {
+ if (!prev) {
+ ranges[i].start = src_region->ar.start;
+ } else if (prev->ar.end != src_region->ar.start) {
+ ranges[i++].end = prev->ar.end;
+ ranges[i].start = src_region->ar.start;
+ }
+ prev = src_region;
+ }
+ ranges[i++].end = damon_last_region(src)->ar.end;
err = damon_set_regions(dst, ranges, i, src_min_region_sz);
kfree(ranges);
return err;
--
2.47.3
^ permalink raw reply related
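The coalescing that patch 08/11 describes can be sketched in plain
userspace C. This is only an illustration under stated assumptions:
struct addr_range and coalesce_ranges() are hypothetical stand-ins, not
DAMON code, and the sketch merges in a single pass into a
caller-provided buffer rather than counting first and then filling a
kmalloc'ed array as the patch does. The input is assumed to be sorted
and non-overlapping, as DAMON regions of a target are.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct damon_addr_range: [start, end). */
struct addr_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Coalesce sorted, non-overlapping regions into the minimal set of
 * disjoint ranges.  Two regions are merged when the previous one ends
 * exactly where the next one starts.  'out' must have room for 'nr'
 * entries.  Returns the number of ranges written to 'out'.
 */
static size_t coalesce_ranges(const struct addr_range *regions, size_t nr,
			      struct addr_range *out)
{
	size_t i, n = 0;

	assert(nr == 0 || (regions && out));
	for (i = 0; i < nr; i++) {
		if (n && out[n - 1].end == regions[i].start) {
			/* Adjacent: extend the current output range. */
			out[n - 1].end = regions[i].end;
		} else {
			/* First region or a gap: open a new range. */
			out[n++] = regions[i];
		}
	}
	return n;
}
```

With the five regions [0,10), [10,20), [30,40), [40,50) and [60,70) as
input, only the two gaps separate ranges, so the function writes three
ranges instead of five.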
* [RFC PATCH v1.1 07/11] selftests/damon/sysfs.sh: test all files in quota goal dir
From: SeongJae Park @ 2026-06-25 5:07 UTC (permalink / raw)
Cc: SeongJae Park, Shuah Khan, damon, linux-kernel, linux-kselftest,
linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>
The DAMON sysfs interface for DAMOS quota has been extended
considerably since its initial introduction. The corresponding test
case in the DAMON sysfs interface essential file operations test
(sysfs.sh) has not been extended accordingly, though. Extend the test
case to cover all existing files.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
tools/testing/selftests/damon/sysfs.sh | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/tools/testing/selftests/damon/sysfs.sh b/tools/testing/selftests/damon/sysfs.sh
index ffa8413b5ab3d..15fb9df928818 100755
--- a/tools/testing/selftests/damon/sysfs.sh
+++ b/tools/testing/selftests/damon/sysfs.sh
@@ -199,6 +199,20 @@ test_goal()
ensure_dir "$goal_dir" "exist"
ensure_file "$goal_dir/target_value" "exist" "600"
ensure_file "$goal_dir/current_value" "exist" "600"
+ ensure_file "$goal_dir/target_metric" "exist" "600"
+ local fpath="$goal_dir/target_metric"
+ ensure_write_succ "$fpath" "user_input" "valid input"
+ ensure_write_succ "$fpath" "some_mem_psi_us" "valid input"
+ ensure_write_succ "$fpath" "node_mem_used_bp" "valid input"
+ ensure_write_succ "$fpath" "node_mem_free_bp" "valid input"
+ ensure_write_succ "$fpath" "node_memcg_used_bp" "valid input"
+ ensure_write_succ "$fpath" "node_memcg_free_bp" "valid input"
+ ensure_write_succ "$fpath" "active_mem_bp" "valid input"
+ ensure_write_succ "$fpath" "inactive_mem_bp" "valid input"
+ ensure_write_succ "$fpath" "node_eligible_mem_bp" "valid input"
+ ensure_write_fail "$fpath" "foo" "invalid input"
+ ensure_file "$goal_dir/nid" "exist" "600"
+ ensure_file "$goal_dir/path" "exist" "600"
}
test_goals()
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v1.1 06/11] selftests/damon/sysfs.sh: test dests dir
From: SeongJae Park @ 2026-06-25 5:07 UTC (permalink / raw)
Cc: SeongJae Park, Shuah Khan, damon, linux-kernel, linux-kselftest,
linux-mm
In-Reply-To: <20260625050756.91115-1-sj@kernel.org>
The DAMON sysfs interface essential file operations test (sysfs.sh)
does not test the DAMOS dests/ directory. Add the test.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
tools/testing/selftests/damon/sysfs.sh | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/tools/testing/selftests/damon/sysfs.sh b/tools/testing/selftests/damon/sysfs.sh
index 07a33995be852..ffa8413b5ab3d 100755
--- a/tools/testing/selftests/damon/sysfs.sh
+++ b/tools/testing/selftests/damon/sysfs.sh
@@ -99,6 +99,29 @@ test_stats()
done
}
+test_dest()
+{
+ dest_dir=$1
+ ensure_file "$dest_dir/id" "exist"
+ ensure_file "$dest_dir/weight" "exist"
+}
+
+test_dests()
+{
+ dests_dir=$1
+ ensure_file "$dests_dir/nr_dests" "exist" "600"
+ ensure_write_succ "$dests_dir/nr_dests" "1" "valid input"
+ test_dest "$dests_dir/0"
+
+ ensure_write_succ "$dests_dir/nr_dests" "2" "valid input"
+ test_dest "$dests_dir/0"
+ test_dest "$dests_dir/1"
+
+ ensure_write_succ "$dests_dir/nr_dests" "0" "valid input"
+ ensure_dir "$dests_dir/0" "not_exist"
+ ensure_dir "$dests_dir/1" "not_exist"
+}
+
test_filter()
{
filter_dir=$1
--
2.47.3
^ permalink raw reply related