Linux-mm Archive on lore.kernel.org
* Re: [PATCH] mm/mm_init: fix incorrect node_spanned_pages
From: Wei Yang @ 2026-06-25  3:14 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Wei Yang, akpm, izumi.taku, linux-mm, Yuan Liu, David Hildenbrand
In-Reply-To: <ajv4KewVYU5YzVea@kernel.org>

On Wed, Jun 24, 2026 at 06:30:49PM +0300, Mike Rapoport wrote:
>On Tue, Jun 23, 2026 at 09:26:53AM +0000, Wei Yang wrote:
>> On Tue, Jun 23, 2026 at 12:22:15PM +0300, Mike Rapoport wrote:
>> >
>> >After I applied your patch, I did some checks to see why
>> >mirrored_kernelcore causes us troubles. I found that unlike other variants
>> >of kernelcore/movablecore settings, mirrored_kernelcore creates zone
>> >overlap for no apparent reason. I did some git archaeology and I didn't
>> >find a justification for making overlapping pages absent in ZONE_NORMAL.
>> >
>> >So I came up with this cleanup:
>> >
>> >https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kernelcore-mirror
>> >
>> >I'm waiting for the bots to chew on it before posting the patches.
>> > 
>> 
>> Ah, just the same as I do :-).
> 
>I'm going to send my version with your co-developed-by if you don't mind.
> 

My pleasure :-)

>> -- 
>> Wei Yang
>> Help you, Help me
>
>-- 
>Sincerely yours,
>Mike.

-- 
Wei Yang
Help you, Help me


^ permalink raw reply

* Re: [PATCH v3] mm: assert exclusive nid/zonenum bits at the page/folio access sites
From: Andrew Morton @ 2026-06-25  3:09 UTC (permalink / raw)
  To: Hui Zhu
  Cc: David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Kairui Song, Qi Zheng, Shakeel Butt, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-mm, linux-kernel, Hui Zhu
In-Reply-To: <20260625024019.838786-1-hui.zhu@linux.dev>

On Thu, 25 Jun 2026 10:40:19 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:

> KCSAN reports a data race between page_to_nid()/folio_pgdat() reading
> page->flags and folio_trylock()/folio_lock() concurrently doing
> test_and_set_bit_lock(PG_locked, ...) on the same word, e.g.:
> 
>   BUG: KCSAN: data-race in __lruvec_stat_mod_folio / shmem_get_folio_gfp
> 
> The node id and zone id occupy fixed bit-ranges of page->flags that
> are set once at page init and never modified afterwards, so they can
> never overlap with the low PG_locked/PG_waiters bits touched by the
> folio lock path.
> 
> ASSERT_EXCLUSIVE_BITS(mdf.f, ...) inside memdesc_nid()/memdesc_zonenum()
> checks a by-value copy of the flags word, not the actual shared
> page->flags/folio->flags being modified concurrently, so it doesn't
> reliably assert anything about the real race. Move the assertion to
> page_to_nid(), folio_nid(), page_zonenum() and folio_zonenum(), where
> flags is dereferenced directly from the page/folio.

Thanks.  Sashiko is worried about CONFIG_NUMA=n:
	https://sashiko.dev/#/patchset/20260625024019.838786-1-hui.zhu@linux.dev


I'm a bit surprised that these functions even exist on NUMA=n.  That we
don't have simply

#else
static inline int folio_nid(const struct folio *folio)
{
	return 0;
}
#endif


^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-25  3:01 UTC (permalink / raw)
  To: David Laight, Christian König, Jani Nikula,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <20260624152324.3def88ce@pumpkin>

On 2026/6/24 22:23, David Laight wrote:
> On Wed, 24 Jun 2026 15:23:47 +0200
> Christian König <christian.koenig@amd.com> wrote:
>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>> On 2026/6/22 16:42, David Laight wrote:
>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>  
>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>
>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>> every call site, even though most users only need it for the iterator
>>>>> implementation and never reference it in the loop body.
>>>>>
>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>> a unique internal cursor.  
>>>>
>>>> I'm not really sure 'mutable' means anything either.
>>>> It is possible to make it valid for the loop body (or even other threads)
>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>
>>>> It might be worth doing something that doesn't need the extra variable,
>>>> but there is little point doing all the churn just to rename things.
>>>>  
>>>>>
>>>>> This makes call sites that only mutate the list through the current entry
>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>> compatibility.
>>>>>
>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>> ---
>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>> --- a/include/linux/list.h
>>>>> +++ b/include/linux/list.h
>>>>> @@ -7,6 +7,7 @@
>>>>>  #include <linux/stddef.h>
>>>>>  #include <linux/poison.h>
>>>>>  #include <linux/const.h>
>>>>> +#include <linux/args.h>
>>>>>  
>>>>>  #include <asm/barrier.h>
>>>>>  
>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>  #define list_for_each_prev(pos, head) \
>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>  
>>>>> -/**
>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> +/*
>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>   */
>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>  	     !list_is_head(pos, (head)); \
>>>>>  	     pos = n, n = pos->next)
>>>>>  
>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>
>>>> Use auto
>>>>  
>>>>> +	     !list_is_head(pos, (head));				\
>>>>> +	     pos = tmp, tmp = pos->next)
>>>>> +
>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>> +
>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>> +	list_for_each_safe(pos, next, head)
>>>>> +
>>>>>  /**
>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> + * @...:	either (head) or (next, head)
>>>>> + *
>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>> + *		the caller.
>>>>> + * head:	the head for your list.
>>>>> + */
>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>> +		(pos, __VA_ARGS__)  
>>>>
>>>> The variable argument count logic really just slows down compilation.
>>>> Maybe there aren't enough copies of this code to make that significant.
>>>> But just because you can do it doesn't mean it is a good idea.
>>>> I'm also not sure it really adds anything to the readability.
>>>>
>>>> And, if you are going to make the middle argument optional there is
>>>> no need to change the macro name.  
>>>
>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>> implementation approach. If we abandon that method, it means we will
>>> inevitably need to add some new macros. If mutable is not a good name,
>>> suggestions for better alternatives would be welcome; coming up with a
>>> suitable name is indeed rather tricky.  
>>
>> I don't think you need to add a new macro for the specific use case where people want to modify the next element of the iteration.
>>
>> If I remember your numbers correctly, that is really a corner case, and continuing to use the existing *_safe() macros for that sounds perfectly fine to me.
> 
> IIRC currently you have a choice of either:
> 	define                 Item that can't be deleted
> 	list_for_each()        The current item.
> 	list_for_each_safe()   The next item.
> There is also likely to be code that updates the variables to allow
> for other scenarios.
> 
> Note that if you increase a reference count and release a lock then list_for_each()
> is likely safer than list_for_each_safe() :-)
> 
> list.h has 9 variants of the 'safe' loop.
> The bloat of another 9 is getting excessive.
> 
> It has to be said that this is one of my least favourite type of list...

Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
Andy Shevchenko, Alexei Starovoitov

For ease of discussion, I need to summarize the currently possible
approaches and briefly describe their respective pros and cons,
using the list_for_each_entry* interfaces as examples.

1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
would be used specifically for safe deletion scenarios that do not
need to expose the temporary cursor externally. The code can refer to
the v1 version.

Pros: Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: Requires adding a whole set of mutable interfaces, which makes the
      code somewhat redundant.

2. Directly optimize away the temporary cursor in list_for_each_entry_safe
and define it inside the loop instead, changing the interface from four
arguments to three.

Pros: Does not add redundant interfaces.
Cons: (1) Users need to manually update special cases that use the
      traversal variable of list_for_each_entry_safe; the new
      list_for_each_entry_safe would no longer apply there and would
      need to be open-coded.
      (2) Because the macro arguments change, all list_for_each_entry_safe
      callers would need to be modified and merged together, making it
      difficult to merge such a large amount of code at once.

3. Use a variadic macro approach to optimize list_for_each_entry_safe,
so that it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can
      be merged directly.
Cons: (1) Increases compile time.
      (2) Makes the interface harder for users to use.

4. Optimize list_for_each_entry by defining the temporary cursor internally,
making it compatible with the functionality of list_for_each_entry_safe.
The code can refer to the v2 version.

Pros: (1) Does not add redundant interfaces.
      (2) The number of externally visible arguments of list_for_each_entry
      remains unchanged, still three.
Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.
      (2) Users need to manually update special cases that use the traversal
      variable of list_for_each_entry; the new list_for_each_entry would no
      longer apply there and would need to be open-coded. There are 15 such
      cases in total.

5. Use a variadic macro approach to optimize list_for_each_entry, so that
it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: (1) Increases compile time.
      (2) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.

6. Make no changes, keep the current logic unchanged, and close the current
email discussion.


Which of the six solutions above do people prefer?
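
For readers weighing the options that hide the temporary cursor (2 and 4), a minimal userspace sketch of a removal-safe loop with an internal lookahead cursor might look like the following. It is an illustration under stated assumptions, not code from any posted version: `list_for_each_internal_tmp`, `sketch_tmp_`, `struct item`, `remove_even()` and `count_after_prune()` are hypothetical names, and `list_head`, `list_add_tail()`, `list_del()` and `container_of()` are simplified stand-ins for the kernel helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = NULL;	/* make any later use of a stale link obvious */
	entry->prev = NULL;
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Hypothetical macro: the lookahead cursor is declared in the for-init
 * clause, so call sites only pass the cursor and the head.
 */
#define list_for_each_internal_tmp(pos, head)				\
	for (struct list_head *sketch_tmp_ = ((pos) = (head)->next)->next; \
	     (pos) != (head);						\
	     (pos) = sketch_tmp_, sketch_tmp_ = (pos)->next)

struct item {
	int value;
	struct list_head node;
};

/* Unlink every even-valued item and return how many were removed. */
int remove_even(struct list_head *head)
{
	struct list_head *pos;
	int removed = 0;

	list_for_each_internal_tmp(pos, head) {
		struct item *it = container_of(pos, struct item, node);

		if (it->value % 2 == 0) {
			list_del(pos);	/* fine: the successor is already cached */
			removed++;
		}
	}
	return removed;
}

/* Build the list 1..n in caller-provided storage, prune it, count the rest. */
int count_after_prune(struct item *items, int n)
{
	struct list_head head, *pos;
	int removed, left = 0;

	list_init(&head);
	for (int i = 0; i < n; i++) {
		items[i].value = i + 1;
		list_add_tail(&items[i].node, &head);
	}

	removed = remove_even(&head);
	assert(removed == n / 2);
	(void)removed;

	for (pos = head.next; pos != &head; pos = pos->next)
		left++;
	return left;
}
```

The fixed cursor name is a simplification: nested loops would shadow it, which is presumably why the quoted patch generates the name with __UNIQUE_ID(). The sketch also shows the limitation raised earlier in the thread: only the current entry may be removed, because the next one is already cached.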

-- 
Thanks
Kaitao Cheng



^ permalink raw reply

* Re: Re: [PATCH 1/1] mm/shrinker: add NULL checks after rcu_dereference() in shrinker bit functions
From: 傅清爽 @ 2026-06-25  3:01 UTC (permalink / raw)
  To: fffsqian, Andrew Morton, Dave Chinner, Roman Gushchin,
	Muchun Song, Qi Zheng
  Cc: linux-kernel, linux-mm

[-- Attachment #1: Type: text/html, Size: 6021 bytes --]

^ permalink raw reply

* Re: [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
From: Andrew Morton @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, urezki, baohua,
	Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang
In-Reply-To: <20260618084726.1070022-1-jiangwen6@xiaomi.com>

On Thu, 18 Jun 2026 16:47:20 +0800 Wen Jiang <jiangwenxiaomi@gmail.com> wrote:

> This patchset accelerates ioremap, vmalloc, and vmap when the memory
> is physically fully or partially contiguous. Two techniques are used:

Thanks.

> 1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
>    segments
> 2. Use batched mappings wherever possible in both vmalloc and ARM64
>    layers
> 
> Besides accelerating the mapping path, this also enables large
> mappings (PMD and cont-PTE) for vmap, which are currently not
> supported.
> 
> Patches 1-2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
> CONT-PTE regions instead of just one.
> 
> Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
> mapping logic between the ioremap and vmalloc/vmap paths, handling both
> CONT_PTE and regular PTE mappings. This prepares for the next patch.
> 
> Patch 4 extends the page table walk path to support page shifts other
> than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
> mappings. The function is renamed from vmap_small_pages_range_noflush()
> to vmap_pages_range_noflush_walk().
> 
> Patches 5-6 add huge vmap support for contiguous pages, including
> support for non-compound pages with pfn alignment verification.
> 
> On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
> the performance CPUfreq policy enabled, benchmark results:
> 
> * ioremap(1 MB): 1.35x faster (3407 ns -> 2526 ns)
> * vmalloc(1 MB) mapping time (excluding allocation) with
>   VM_ALLOW_HUGE_VMAP: 1.42x faster (5.00 us -> 3.53 us)
> * vmap(100MB) with order-8 pages: 8.3x faster (1235 us -> 149 us)

Nice.

> Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.

Indeed.

I see Dev had a good look at v3 - hopefully he (and Ulad) (and more ARM
folks) have time to go through this.

Is there any effect on anything other than arm64?  I'm wondering how
much testing these changes will really get in mm.git and linux-next.

How is our selftests coverage of these changes?  Is there some existing
selftest which will exercise these new features?

You diligently went through the Sashiko report against v3 (thanks). 
Please pass an eye across its v4 report, see if something new popped
up?
	https://sashiko.dev/#/patchset/20260618084726.1070022-1-jiangwen6@xiaomi.com



^ permalink raw reply

* [PATCH v3] mm: assert exclusive nid/zonenum bits at the page/folio access sites
From: Hui Zhu @ 2026-06-25  2:40 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	linux-mm, linux-kernel
  Cc: Hui Zhu

From: Hui Zhu <zhuhui@kylinos.cn>

KCSAN reports a data race between page_to_nid()/folio_pgdat() reading
page->flags and folio_trylock()/folio_lock() concurrently doing
test_and_set_bit_lock(PG_locked, ...) on the same word, e.g.:

  BUG: KCSAN: data-race in __lruvec_stat_mod_folio / shmem_get_folio_gfp

The node id and zone id occupy fixed bit-ranges of page->flags that
are set once at page init and never modified afterwards, so they can
never overlap with the low PG_locked/PG_waiters bits touched by the
folio lock path.

ASSERT_EXCLUSIVE_BITS(mdf.f, ...) inside memdesc_nid()/memdesc_zonenum()
checks a by-value copy of the flags word, not the actual shared
page->flags/folio->flags being modified concurrently, so it doesn't
reliably assert anything about the real race. Move the assertion to
page_to_nid(), folio_nid(), page_zonenum() and folio_zonenum(), where
flags is dereferenced directly from the page/folio.

Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
Changelog:
v3:
According to the comments of Andrew and Sashiko, move
ASSERT_EXCLUSIVE_BITS out of memdesc_nid()/memdesc_zonenum()
into the page/folio call sites.
v2:
According to the comments of David, remove useless comments and use
ASSERT_EXCLUSIVE_BITS() in memdesc_nid() instead of data_race() in
page_to_nid().

 include/linux/mm.h     | 4 ++++
 include/linux/mmzone.h | 3 ++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 485df9c2dbdd..734e9de8f4ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2296,11 +2296,15 @@ static inline int memdesc_nid(memdesc_flags_t mdf)
 
 static inline int page_to_nid(const struct page *page)
 {
+	ASSERT_EXCLUSIVE_BITS(PF_POISONED_CHECK(page)->flags,
+			      NODES_MASK << NODES_PGSHIFT);
 	return memdesc_nid(PF_POISONED_CHECK(page)->flags);
 }
 
 static inline int folio_nid(const struct folio *folio)
 {
+	ASSERT_EXCLUSIVE_BITS(folio->flags,
+			      NODES_MASK << NODES_PGSHIFT);
 	return memdesc_nid(folio->flags);
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca2712187147..56dffa966343 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1274,17 +1274,18 @@ static inline bool zone_is_empty(const struct zone *zone)
 
 static inline enum zone_type memdesc_zonenum(memdesc_flags_t flags)
 {
-	ASSERT_EXCLUSIVE_BITS(flags.f, ZONES_MASK << ZONES_PGSHIFT);
 	return (flags.f >> ZONES_PGSHIFT) & ZONES_MASK;
 }
 
 static inline enum zone_type page_zonenum(const struct page *page)
 {
+	ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
 	return memdesc_zonenum(page->flags);
 }
 
 static inline enum zone_type folio_zonenum(const struct folio *folio)
 {
+	ASSERT_EXCLUSIVE_BITS(folio->flags, ZONES_MASK << ZONES_PGSHIFT);
 	return memdesc_zonenum(folio->flags);
 }
 
-- 
2.43.0



^ permalink raw reply related

* Re: [PATCH v2 1/4] mm: mincore: use walk_page_range_vma() in do_mincore()
From: Andrew Morton @ 2026-06-25  2:38 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: Pedro Falcato, David Hildenbrand (Arm), Zi Yan, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Suren Baghdasaryan, linux-mm
In-Reply-To: <e7244b54-d794-4e46-88e7-69204da2b60d@huawei.com>

On Mon, 22 Jun 2026 11:23:11 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:

> >>>> -	err = walk_page_range(vma->vm_mm, addr, end, &mincore_walk_ops, vec);
> >>>> +
> >>>> +	/*
> >>>> +	 * walk_page_range_vma() does not call walk_page_test(), which
> >>>> +	 * handles VM_PFNMAP VMA by invoking ->pte_hole() to skip the
> >>>> +	 * page table walk. Without this check, PFNMAP PTEs would be
> >>>> +	 * treated as present by mincore_pte_range(), changing the returned
> >>>> +	 * residency status from the historical "not resident" to "resident".
> >>>> +	 * Handle VM_PFNMAP explicitly to preserve the original behavior.
> >>>> +	 */
> > 
> > I would rather we amend this comment to something like:
> > 
> > 	/* mincore (historically) reports PFNMAP mappings as non-resident. */
> > 
> > because we don't need to explain internal differences in walk_page_range
> > functions in a random comment in mincore. And perhaps attempt a separate
> 
> Hope Andrew can fix the comments when pickup patches.

I added this:

--- a/mm/mincore.c~mm-mincore-use-walk_page_range_vma-in-do_mincore-fix
+++ a/mm/mincore.c
@@ -261,12 +261,7 @@ static long do_mincore(unsigned long add
 	}
 
 	/*
-	 * walk_page_range_vma() does not call walk_page_test(), which
-	 * handles VM_PFNMAP VMA by invoking ->pte_hole() to skip the
-	 * page table walk. Without this check, PFNMAP PTEs would be
-	 * treated as present by mincore_pte_range(), changing the returned
-	 * residency status from the historical "not resident" to "resident".
-	 * Handle VM_PFNMAP explicitly to preserve the original behavior.
+	 * mincore historically reports PFNMAP mappings as non-resident.
 	 */
 	if (vma->vm_flags & VM_PFNMAP) {
 		__mincore_unmapped_range(addr, end, vma, vec);
_


Sashiko is OK with your patchset, but it might have found four(!)
pre-existing issues:

	https://sashiko.dev/#/patchset/20260618092845.3905740-1-wangkefeng.wang@huawei.com


^ permalink raw reply

* Re: [PATCH] mm/page_alloc: don't build vm_numa_stat_key if CONFIG_NUMA=n
From: Andrew Morton @ 2026-06-25  2:27 UTC (permalink / raw)
  To: Ben Dooks
  Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <20260618100614.1321950-1-ben.dooks@codethink.co.uk>

On Thu, 18 Jun 2026 11:06:14 +0100 Ben Dooks <ben.dooks@codethink.co.uk> wrote:

> The vm_numa_stat_key is only exported if CONFIG_NUMA is set,
> so avoid the following warning by guarding it in an #ifdef
> on CONFIG_NUMA:
> 
> mm/page_alloc.c:165:1: warning: symbol 'vm_numa_stat_key' was not declared. Should it be static?
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -162,7 +162,9 @@ DEFINE_PER_CPU(int, numa_node);
>  EXPORT_PER_CPU_SYMBOL(numa_node);
>  #endif
>  
> +#ifdef CONFIG_NUMA
>  DEFINE_STATIC_KEY_TRUE(vm_numa_stat_key);
> +#endif
>  
>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
>  /*

It might be tidier to move this into mm/vmstat.c, around line 38?



^ permalink raw reply

* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-25  2:25 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <CAEvNRgG1nHipzw4=eBgwhvyXi8xYo7FQD_sy9Ax6FDf7YDu3Og@mail.gmail.com>

On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Tue, Jun 23, 2026, Yan Zhao wrote:
> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >> > > > >  	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> >> > > > >  		return -EIO;
> >> > > > >
> >> > > > > -	if (!src_page)
> >> > > > > -		return -EOPNOTSUPP;
> >> > > > > +	if (!src_page) {
> >> > > > > +		if (!gmem_in_place_conversion)
> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
> >> > > > error.
> >> > >
> >> > > Why MMAP?
> >> > Hmm, I was showing a scenario in which in-place conversion couldn't occur.
> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
> >> >
> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> >> > > and written memory.  And when write() lands, MMAP wouldn't be necessary to
> >> > > initialize the memory.
> >> > Do you mean using up-to-date flag as below?
> >
> > Yes?  I didn't actually look at the implementation details.
> >
> >> > if (!src_page) {
> >> > 	src_page = pfn_to_page(pfn);
> >> > 	if (!folio_test_uptodate(page_folio(src_page)))
> >> > 		return -EOPNOTSUPP;
> >> > }
> 
> Yan is right that with the earlier patch "Zero page while getting pfn",
> folio_test_uptodate() here will always return true.
> 
> Actually, this is an alternative fix for the issue Sashiko pointed out
> on v7 where userspace can do a populate() (either TDX or SNP) without
> first allocating the page, with src_address == NULL, and leak
> uninitialized memory into the guest.
> 
> Advantage of using the uptodate check in populate: if the host never
> allocates the page, populate doesn't incur zeroing before writing the
> page anyway in populate().
> 
> Disadvantage: Both TDX and SNP will have to implement this uptodate
> check. guest_memfd can't check centrally because for SNP, for a
> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
> firmware will zero and there's no leakage of uninitialized host memory?
Another disadvantage: the uptodate flag is per-folio. What if the folio
is only partially initialized by userspace, especially once huge pages are
supported?


> >> Another concern with this fix is that:
> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
> >> folio uptodate before reaching post_populate().
> >>
> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
> >>
> >> > One concern is that TDX currently does not care much about the up-to-date flag since
> >> > TDX doesn't rely on the flag to clear pages on conversions.
> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
> >> > user access.
> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
> >> > to take more effort than introducing a dedicated flag for TDX.
> >> >
> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> >> > > > 1. create guest_memfd with MMAP flag
> >> > > > 2. mmap the guest_memfd.
> >> > > > 3. convert the initial memory range to shared.
> >> > > > 4. copy initial content to the source page.
> >> > > > 5. convert the initial memory range to private
> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> >> > > > 7. do not unmap the source backend.
> >> > > >
> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
> >> > >
> >> > > Why?  It's userspace's responsibility to get the above right.  If userspace fails
> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
> 
> Yan, is your concern that userspace forgot to update the code and
> forgets to provide a src_page, and if we keep the "Zero page while
Yes. Previously, it would be rejected after GUP fails.

> getting pfn" patch, ends up with the guest silently having a zero page?
> I think that would be found quite early in userspace VMM testing...
I actually encountered this while testing this patch.
I updated most code paths to follow this sequence. However, there are still some
corner cases for TDVF HOB, which are less obvious and harder to update.
The TD just booted up and hung silently.

> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
> >> > kernel to detect this mistake, similar to how it validates whether source_addr
> >> > is PAGE_ALIGNED.
> >
> > The alignment case is different.  If userspace provides an unaligned value, KVM
> > *can't* do what userspace is asking because hardware and thus KVM only supports
> > converting on page boundaries.
> >
> > For a NULL source, KVM can still do what userspace is asking.  Rejecting userspace's
> > request would then be making assumptions about what userspace wants.
> >
> 
> Also, +1 on this, what if userspace, knowing that pages are zeroed on
> allocation, actually wants to rely on that to get a zero page in the guest?
What if 0 uaddr is a valid address? :)

> >> > Since userspace already needs to perform additional steps to enable in-place
> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
> >> > intentional seems like a reasonable burden.
> >
> > I don't see how it adds any value.  I wouldn't be at all surprised if most VMMs
> > just wind up with code that does:
> >
> > 	if (in-place) {
> > 		src = NULL;
> > 		flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
> > 	}
> 


^ permalink raw reply

* Re: [PATCH 0/2] mm/page_owner: fix TOCTOU races in lockless page state reading
From: Andrew Morton @ 2026-06-25  2:04 UTC (permalink / raw)
  To: Ye Liu
  Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <20260625014708.87386-1-ye.liu@linux.dev>

On Thu, 25 Jun 2026 09:47:03 +0800 Ye Liu <ye.liu@linux.dev> wrote:

> Fix two TOCTOU races found during review of [1].
> 
> page_owner reads page state locklessly by design. In two places the
> code reads the same metadata twice — once as a guard, then again as
> a use — and the page can be concurrently reallocated between the two:
> 
> Patch 1: buddy_order_unsafe() in skip_buddy_pages() can return garbage
> if the page is allocated between PageBuddy() and the private read,
> causing the PFN to skip past a pfn_valid() boundary.  Clamp the
> advance at MAX_ORDER_NR_PAGES.
> 
> Patch 2: PageMemcgKmem() in print_page_owner_memcg() re-reads
> folio->memcg_data and triggers VM_BUG_ON assertions if the page
> became a tail page or slab page.  Use the snapshot taken at entry.

That was fast.  I haven't pushed out mm-new yet, so Sashiko wasn't able
to apply these.

> [1] https://lore.kernel.org/all/20260623065234.31866-2-ye.liu@linux.dev/
> [2] https://sashiko.dev/#/patchset/20260623065234.31866-2-ye.liu@linux.dev

Nothing cites "[2]".    That's OK.
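
The two fixes described in the cover letter follow a common shape for lockless readers: load the racy word once and run both the guard and the use on that snapshot, and bound any value derived from a field that may be garbage. A minimal userspace sketch of that shape, assuming C11 atomics, is given below. It is not the posted patches: `struct fake_folio`, `SKETCH_KMEM_FLAG`, `SKETCH_MAX_ADVANCE` (standing in for MAX_ORDER_NR_PAGES) and all function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_KMEM_FLAG	0x1UL	/* hypothetical tag bit in the word */
#define SKETCH_MAX_ADVANCE	1024UL	/* hypothetical stand-in for MAX_ORDER_NR_PAGES */

/* Hypothetical stand-in for a folio whose metadata word can change under us. */
struct fake_folio {
	_Atomic unsigned long memcg_data;
};

/* Guard and use both operate on one value that was loaded exactly once. */
bool kmem_payload_from_snapshot(unsigned long snap, unsigned long *payload)
{
	assert(payload != NULL);

	if (!(snap & SKETCH_KMEM_FLAG))
		return false;
	*payload = snap & ~SKETCH_KMEM_FLAG;
	return true;
}

/* Take the snapshot; later writers cannot invalidate the guard's answer. */
bool read_kmem_payload(struct fake_folio *folio, unsigned long *payload)
{
	unsigned long snap = atomic_load_explicit(&folio->memcg_data,
						  memory_order_relaxed);

	return kmem_payload_from_snapshot(snap, payload);
}

/* Bound a PFN advance computed from an order that was read locklessly. */
unsigned long clamped_advance(unsigned int order)
{
	unsigned long advance;

	/* An out-of-range shift would be undefined, so reject it first. */
	if (order >= sizeof(unsigned long) * CHAR_BIT)
		return SKETCH_MAX_ADVANCE;

	advance = 1UL << order;
	return advance < SKETCH_MAX_ADVANCE ? advance : SKETCH_MAX_ADVANCE;
}
```

The snapshot may still be stale by the time it is used; the point is only that the guard and the use can no longer disagree with each other.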




^ permalink raw reply

* Re: [PATCH v2] mm: avoid KCSAN false positive in memdesc_nid()
From: Andrew Morton @ 2026-06-25  1:58 UTC (permalink / raw)
  To: Hui Zhu
  Cc: David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, Hui Zhu
In-Reply-To: <20bd8bbbd5a8cc52d267f550fc0314cd0d81a223@linux.dev>

On Thu, 25 Jun 2026 01:32:49 +0000 "Hui Zhu" <hui.zhu@linux.dev> wrote:

> Good catch. ASSERT_EXCLUSIVE_BITS(mdf.f, ...) is checking a by-value
> copy of the flags word inside memdesc_nid(), not the actual shared
> page->flags/folio->flags being modified by folio_trylock(). Whatever
> made it appear to suppress the KCSAN report is likely an artifact of
> inlining/codegen (kcsan_atomic_next() happening to land on the real
> load after inlining), not a principled fix - so Sashiko's pass is
> not reassuring here.

Yeah, I was wondering if the inlining accidentally gave the macro the
correct thing.  Which seems wrong - an inlined function should treat an
incoming arg purely as a local thing.  Maybe we fooled the compiler.

> I'll move the assertion to where the real dereference happens (at
> the page_to_nid()/folio_nid() call sites) instead of inside the
> by-value helper. This probably also applies to the existing
> memdesc_zonenum() pattern - is that one actually verified to work,
> or does it have the same issue?

I assume the memdesc_zonenum() code worked, for the same (poorly
understood) reason as did your patch.

Yes, moving this into the sites where we officially have access to the
shared storage seems the right approach.
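
For illustration only, a userspace sketch of why an address-keyed check on a
by-value copy cannot observe the shared word. It assumes the assertion macro
watches the address of the variable it is given. struct flags_word and both
helpers are hypothetical names, not kernel API:

```c
#include <assert.h>

/* Hypothetical stand-in for a flags word wrapped in a struct. */
struct flags_word {
	unsigned long f;
};

/* By-value helper: @w is a private copy, so &w.f is never the shared word. */
static int by_value_sees_shared(struct flags_word w,
				const unsigned long *shared)
{
	return &w.f == shared;
}

/* Call-site style: the address inspected is the shared word itself. */
static int by_pointer_sees_shared(const struct flags_word *w,
				  const unsigned long *shared)
{
	return &w->f == shared;
}
```

The first helper always returns 0 and the second returns 1 when handed the
shared object, which is the distinction behind moving the assertion to the
call sites.
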



^ permalink raw reply

* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-25  1:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajx5Vrz9ma--hrGH@google.com>

On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > This means this module parameter only enables per-gmem memory attribute and does
> > > not guarantee that gmem in-place conversion will actually occur.
> 
> KVM module params are pretty much always about what KVM supports, not what is
> guaranteed to happen.
> 
>   - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
>     because maybe the guest never accesses emulated MMIO.
>   - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
>     not to advertise one.
>   - and so on and so forth...
> 
> Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> to "I need to set memory attributes on the guest_memfd instance, not the VM",
> but I don't see that as a big hurdle, certainly not in the long term.  And once
> the VMM code is written, I really do think most people are going to care about
> whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
Sorry, I just saw this mail after posting my reply in [1].

I'm OK with gmem_in_place_conversion=true meaning only that KVM supports
in-place conversion, while VMs can still be created with shared memory that
does not come from gmem.

Though it still feels a bit odd for TDX huge pages to depend on
gmem_in_place_conversion=true when shared memory is not currently allocated from
gmem, it should become more natural over time once gmem supports in-place
conversion for huge pages.

[1] https://lore.kernel.org/all/ajyCn0PnFtQK+Nka@yzhao56-desk.sh.intel.com


> > > To avoid confusion, could we rename this module parameter to something more
> > > accurate, such as gmem_memory_attribute?
> > 
> > I asked Sean about this after getting some fixes off list. Sean said
> > gmem_in_place_conversion is named for a host admin to use, and something
> > like gmem_memory_attributes is too much implementation details for the
> > admin.
> > 
> > Sean, would you reconsider since Yan also asked? If the admin compiled
> > the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> > admin would also be able to use a param like gmem_memory_attributes?
> 
> No, because it's not all memory attributes, it's very specifically the PRIVATE
> attribute that will get moved to guest_memfd.  I don't want to pick a name that
> will become stale and confusing when RWX attributes come along.  The RWX bits
> will be per-VM, while PRIVATE will be per-guest_memfd.


^ permalink raw reply

* [PATCH 3/3] mm: read remote memory without the mmap lock where possible
From: Rik van Riel @ 2026-06-25  1:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>

__access_remote_vm() takes mmap_read_lock() for the entire transfer and
uses get_user_pages_remote(), which faults pages in.  For the common case
of reading memory that is already resident -- /proc/PID/cmdline,
/proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
unnecessary and is badly contended on large machines.

Add an opportunistic, read-only fast path.  It takes the per-VMA lock with
lock_vma_under_rcu() and, only when the whole request lies within that one
VMA, copies the resident pages out using folio_walk_start(FW_VMA_LOCKED)
to grab a short-lived page reference from a page table walk run with
interrupts disabled.  Interrupts are disabled only across the walk (until
the folio is pinned): page table freeing -- a concurrent munmap() or THP
collapse of an adjacent region -- serializes against lockless walkers via
tlb_remove_table_sync_one(), which IPIs and waits for every CPU to enable
interrupts, the same contract gup_fast relies on.  The copy then runs with
interrupts on, holding only the folio reference.

A request that spans more than one VMA is left entirely to the mmap_lock
path: relocking per VMA could observe a structurally inconsistent address
space (a neighbouring VMA unmapped and a different one mapped in its place
between locks), whereas the mmap_lock path sees a stable VMA tree for the
whole transfer.

The per-VMA permission check mirrors the read side of check_vma_flags(),
including the FOLL_ANON restriction that /proc/PID/{cmdline,environ} rely
on (CVE-2018-1120).  Anything not positively allowed -- a not-present
page, a hugetlb or VM_IO/VM_PFNMAP or secretmem mapping, or a race with a
VMA writer -- falls back to the mmap_lock path for the remainder, which
re-validates everything.  Pages read on the fast path are marked accessed,
matching the FOLL_TOUCH behaviour of the get_user_pages_remote() slow
path.

untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
for the fast path; the untag mask is a stable per-mm value.

Only reads are handled here; writes keep using the slow path.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
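
A rough userspace sketch of the per-iteration span arithmetic used by the
fast path, for illustration only. It assumes 4 KiB pages; fast_span() and
min3ul() are hypothetical names, and the folio is described by plain
integers rather than a struct folio:

```c
#include <assert.h>

#define PAGE_SHIFT	12			/* assumed: 4 KiB pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

static unsigned long min3ul(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long m = a < b ? a : b;

	return m < c ? m : c;
}

/*
 * @len:         bytes still to copy (already known to fit in the VMA)
 * @addr:        current virtual address
 * @entry_size:  size of the mapping entry the walk landed on (power of two)
 * @folio_pages: number of pages in the folio
 * @page_idx:    index within the folio of the page that maps @addr
 *
 * Returns how many bytes one walk may copy: up to the end of the mapping
 * entry, the end of the folio, or the end of the request, whichever is first.
 */
static unsigned long fast_span(unsigned long len, unsigned long addr,
			       unsigned long entry_size,
			       unsigned long folio_pages,
			       unsigned long page_idx)
{
	unsigned long folio_off = (page_idx << PAGE_SHIFT) +
				  (addr & (PAGE_SIZE - 1));
	unsigned long to_entry_end = entry_size - (addr & (entry_size - 1));
	unsigned long to_folio_end = (folio_pages << PAGE_SHIFT) - folio_off;

	assert(len > 0 && folio_off < (folio_pages << PAGE_SHIFT));
	return min3ul(len, to_entry_end, to_folio_end);
}
```

With a PTE-mapped page the span stops at the page end, so a request that
crosses a page boundary takes a second walk; with a PMD-sized entry a
sub-page request is satisfied in one walk.
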
 arch/x86/include/asm/uaccess_64.h |  14 ++-
 include/linux/uaccess.h           |  11 ++
 mm/memory.c                       | 195 +++++++++++++++++++++++++++++-
 3 files changed, 217 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 4a52497ba6a1..933b0b8b4d60 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -39,11 +39,23 @@ static inline unsigned long __untagged_addr(unsigned long addr)
 	(__force __typeof__(addr))__untagged_addr(__addr);		\
 })
 
+/* Strip the tag bits from a remote mm's address; usable without the mmap lock. */
+static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
+							    unsigned long addr)
+{
+	return addr & READ_ONCE(mm->context.untag_mask);
+}
+
+#define untagged_addr_remote_unlocked(mm, addr)	({			\
+	unsigned long __addr = (__force unsigned long)(addr);		\
+	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
+})
+
 static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 						   unsigned long addr)
 {
 	mmap_assert_locked(mm);
-	return addr & READ_ONCE((mm)->context.untag_mask);
+	return __untagged_addr_remote_unlocked(mm, addr);
 }
 
 #define untagged_addr_remote(mm, addr)	({				\
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 8a264662b242..c8c83372c9d8 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -34,6 +34,17 @@
 })
 #endif
 
+/*
+ * Like untagged_addr_remote(), but for callers that stabilize @mm by other
+ * means (e.g. a per-VMA lock) and must not assert the mmap lock.
+ */
+#ifndef untagged_addr_remote_unlocked
+#define untagged_addr_remote_unlocked(mm, addr)	({	\
+	(void)(mm);					\
+	untagged_addr(addr);				\
+})
+#endif
+
 #ifdef masked_user_access_begin
  #define can_do_masked_user_access() 1
 # ifndef masked_user_write_access_begin
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..d2b2f0014a0c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -42,6 +42,8 @@
 #include <linux/kernel_stat.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
+#include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -7062,6 +7064,180 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL_GPL(generic_access_phys);
 #endif
 
+/*
+ * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
+ * lock and RCU-freed page tables to walk page tables without the mmap lock.
+ */
+#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
+/*
+ * Read-side VMA checks for the lockless fast path, mirroring the read side of
+ * check_vma_flags(): reject what FW_VMA_LOCKED cannot handle (hugetlb), what
+ * needs the ->access() handler (VM_IO/VM_PFNMAP), or what has no struct page to
+ * copy (secretmem); enforce the FOLL_ANON restriction that
+ * /proc/PID/{cmdline,environ} rely on (CVE-2018-1120); and require read access
+ * (honoring FOLL_FORCE).  Anything not positively allowed falls back to the slow
+ * path, which re-validates everything.
+ */
+static bool vma_permits_fast_access(struct vm_area_struct *vma,
+				    unsigned int gup_flags)
+{
+	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+		return false;
+	if (is_vm_hugetlb_page(vma) || vma_is_secretmem(vma))
+		return false;
+	if ((gup_flags & FOLL_ANON) && !vma_is_anonymous(vma))
+		return false;
+	if (!(vma->vm_flags & VM_READ) &&
+	    (!(gup_flags & FOLL_FORCE) || !(vma->vm_flags & VM_MAYREAD)))
+		return false;
+	return true;
+}
+
+/* Size of the single mapping entry folio_walk_start() landed on. */
+static unsigned long fw_entry_size(enum folio_walk_level level)
+{
+	switch (level) {
+	case FW_LEVEL_PUD:
+		return PUD_SIZE;
+	case FW_LEVEL_PMD:
+		return PMD_SIZE;
+	default:
+		return PAGE_SIZE;
+	}
+}
+
+/*
+ * Copy @len bytes of the pinned @folio out to @buf, starting at byte offset
+ * @folio_off within the folio (the position of @addr).  Maps and copies one
+ * page at a time -- kmap_local_folio() for HIGHMEM, copy_from_user_page() for
+ * the per-page flush on aliasing caches -- without re-walking page tables.
+ * Each page borrows the caller's single folio reference, so the mapping is
+ * dropped with kunmap_local() rather than folio_release_kmap().
+ */
+static void copy_folio_pages(struct vm_area_struct *vma, struct folio *folio,
+			     unsigned long folio_off, unsigned long addr,
+			     void *buf, unsigned long len)
+{
+	unsigned long done = 0;
+
+	while (done < len) {
+		unsigned long pos = folio_off + done;
+		unsigned long page_idx = pos >> PAGE_SHIFT;
+		unsigned int page_off = pos & ~PAGE_MASK;
+		unsigned int chunk = min_t(unsigned long, len - done,
+					   PAGE_SIZE - page_off);
+		void *kaddr = kmap_local_folio(folio, page_idx << PAGE_SHIFT);
+
+		copy_from_user_page(vma, folio_page(folio, page_idx),
+				    addr + done, buf + done, kaddr + page_off,
+				    chunk);
+		kunmap_local(kaddr);
+		done += chunk;
+	}
+}
+
+/*
+ * Opportunistic lockless fast path for __access_remote_vm() reads.
+ *
+ * Memory already resident in @mm can be read without taking the frequently
+ * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
+ * with FW_VMA_LOCKED grabs a short-lived reference to a present page from a page
+ * table walk run with interrupts disabled, which serializes against concurrent
+ * page table freeing the same way gup_fast does (relying on
+ * MMU_GATHER_RCU_TABLE_FREE).
+ *
+ * Only a request that lies entirely within a single VMA is handled here,
+ * which should not be an issue in practice since every caller has a
+ * buffer of PAGE_SIZE or smaller. Loop iteration inside this function
+ * should be rare, too.
+ *
+ * Returns the number of bytes transferred via the fast path.
+ */
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	void *old_buf = buf;
+	struct vm_area_struct *vma;
+
+	addr = untagged_addr_remote_unlocked(mm, addr);
+
+	vma = lock_vma_under_rcu(mm, addr);
+	if (!vma)
+		return 0;
+
+	/* Only handle a request contained entirely within this one VMA. */
+	if (len > vma->vm_end - addr)
+		goto out_unlock;
+
+	if (!vma_permits_fast_access(vma, gup_flags))
+		goto out_unlock;
+
+	while (len) {
+		struct folio_walk fw;
+		struct folio *folio;
+		struct page *page;
+		unsigned long entry_size, folio_off, span, irq_flags;
+
+		/*
+		 * The lockless page table walk must run with interrupts
+		 * disabled: page table freeing (munmap or THP collapse, which
+		 * IPI via tlb_remove_table_sync_one() and wait) then cannot free
+		 * a table mid-walk -- the same contract gup_fast relies on.  IRQs
+		 * are restored once the folio is pinned; the copy below holds only
+		 * the folio reference.
+		 */
+		local_irq_save(irq_flags);
+		folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
+		if (!folio) {
+			local_irq_restore(irq_flags);
+			goto out_unlock;	/* not present: let the slow path fault it in */
+		}
+		page = fw.page;
+		if (!page) {
+			/* No struct page to copy (e.g. a special PTE). */
+			folio_walk_end(&fw, vma);
+			local_irq_restore(irq_flags);
+			goto out_unlock;
+		}
+		entry_size = fw_entry_size(fw.level);
+		folio_get(folio);
+		folio_walk_end(&fw, vma);
+		local_irq_restore(irq_flags);
+
+		/*
+		 * folio_walk_start() validated one present mapping entry
+		 * (PAGE/PMD/PUD_SIZE).  Copy to the end of that entry, bounded by
+		 * the folio and the remaining length (already within the VMA), so
+		 * a huge mapping is handled in a single walk.
+		 */
+		folio_off = (folio_page_idx(folio, page) << PAGE_SHIFT) +
+			    offset_in_page(addr);
+		span = min3((unsigned long)len,
+			    entry_size - (addr & (entry_size - 1)),
+			    (folio_nr_pages(folio) << PAGE_SHIFT) - folio_off);
+
+		copy_folio_pages(vma, folio, folio_off, addr, buf, span);
+
+		/* Match the FOLL_TOUCH behaviour of the slow (GUP) path. */
+		folio_mark_accessed(folio);
+		folio_put(folio);
+		len -= span;
+		buf += span;
+		addr += span;
+	}
+
+out_unlock:
+	vma_end_read(vma);
+	return buf - old_buf;
+}
+#else
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	return 0;
+}
+#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
+
 /*
  * Access another process' address space as given in mm.
  */
@@ -7071,15 +7247,30 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
+	/*
+	 * Try the lockless fast path for reads first; it transfers what it can
+	 * from resident memory without taking mmap_lock, and leaves the
+	 * remainder (if any) to the slow path below.
+	 */
+	if (!write) {
+		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
+
+		addr += done;
+		buf += done;
+		len -= done;
+		if (!len)
+			return buf - old_buf;
+	}
+
 	if (mmap_read_lock_killable(mm))
-		return 0;
+		return buf - old_buf;
 
 	/* Untag the address before looking up the VMA */
 	addr = untagged_addr_remote(mm, addr);
 
 	/* Avoid triggering the temporary warning in __get_user_pages */
 	if (!vma_lookup(mm, addr) && !expand_stack(mm, addr))
-		return 0;
+		return buf - old_buf;
 
 	/* ignore errors, just check how much was successfully transferred */
 	while (len) {
-- 
2.53.0-Meta



^ permalink raw reply related

* [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
From: Rik van Riel @ 2026-06-25  1:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>

folio_walk_start() asserts the mmap lock is held.  For callers that only
need to read a single, already-present page, the mmap lock is a heavy and
often badly contended hammer.  Such a caller can instead hold the per-VMA
lock, which keeps the VMA itself stable.

The per-VMA lock does not, however, keep the page tables walked below that
VMA from being freed.  A concurrent munmap() or THP collapse of an
adjacent region in the same mm can free a shared upper-level table, and
THP collapse (collapse_huge_page() -> retract_page_tables()) frees page
tables of VMAs whose lock it does not hold.  Page table freeing
synchronizes against lockless walkers the way gup_fast relies on:
tlb_remove_table_sync_one() sends an IPI and waits for every CPU to enable
interrupts, so a walker that keeps interrupts disabled across the walk
cannot be observing a table that is about to be freed.  rcu_read_lock() is
not sufficient -- it does not block that IPI -- so the caller must keep
interrupts disabled, not merely hold an RCU read-side critical section.

Add an FW_VMA_LOCKED flag.  When passed, folio_walk_start() asserts the
per-VMA lock and that interrupts are disabled, instead of asserting the
mmap lock; it requires CONFIG_MMU_GATHER_RCU_TABLE_FREE and refuses
hugetlb VMAs (PMD sharing maps page tables this VMA's lock does not
cover).  The caller must keep interrupts disabled until folio_walk_end().

No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/pagewalk.h |  7 +++++++
 mm/pagewalk.c            | 29 +++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b41d7265c01b..d0387470d732 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -150,6 +150,13 @@ typedef int __bitwise folio_walk_flags_t;
 
 /* Walk shared zeropages (small + huge) as well. */
 #define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(0))
+/*
+ * The caller holds the per-VMA lock instead of the mmap lock, with interrupts
+ * disabled across the walk (until folio_walk_end()) to serialize against page
+ * table freeing, the same way gup_fast does. Only valid with RCU-freed page
+ * tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
+ */
+#define FW_VMA_LOCKED			((__force folio_walk_flags_t)BIT(1))
 
 enum folio_walk_level {
 	FW_LEVEL_PTE,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..ab1e81983cb8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -890,7 +890,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
  * not correspond to the first physical entry of a logical hugetlb entry.
  *
- * The mmap lock must be held in read mode.
+ * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
+ * passed, the VMA's per-VMA lock must be held and interrupts must be disabled
+ * across the walk and until folio_walk_end() (only supported with RCU-freed page
+ * tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
  *
  * Return: folio pointer on success, otherwise NULL.
  */
@@ -908,7 +911,29 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 	pgd_t *pgdp;
 	p4d_t *p4dp;
 
-	mmap_assert_locked(vma->vm_mm);
+	if (flags & FW_VMA_LOCKED) {
+		/*
+		 * Lockless walk under the per-VMA lock instead of the mmap
+		 * lock. The VMA lock keeps the VMA stable, but the page tables
+		 * walked below it can still be freed concurrently: a munmap() or
+		 * THP collapse of an adjacent region in the same mm can free a
+		 * shared upper-level table, and collapse_huge_page() ->
+		 * retract_page_tables() frees page tables of VMAs whose lock it
+		 * does not hold. Page table freeing serializes against lockless
+		 * walkers via tlb_remove_table_sync_one(), which IPIs and waits
+		 * for every CPU to enable interrupts; an RCU read-side critical
+		 * section does not block that IPI, so the caller must keep
+		 * interrupts disabled across the whole walk, like gup_fast.
+		 * Hugetlb (PMD sharing) maps page tables not covered by this
+		 * VMA's lock and is not supported.
+		 */
+		VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
+		VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
+		lockdep_assert_irqs_disabled();
+		vma_assert_locked(vma);
+	} else {
+		mmap_assert_locked(vma->vm_mm);
+	}
 	vma_pgtable_walk_begin(vma);
 
 	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
-- 
2.53.0-Meta



^ permalink raw reply related

* [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask
From: Rik van Riel @ 2026-06-25  1:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>

mm->context.untag_mask is written once, when LAM is enabled
(mm_enable_lam(), under mmap_write_lock and while the process is still
single-threaded), and is otherwise stable and never reverted.
untagged_addr_remote() reads it for a remote mm, and the new
untagged_addr_remote_unlocked() (used by the per-VMA-lock
access_remote_vm() fast path) reads it without the mmap lock.

The field is a single aligned word and cannot tear, but annotate the
reads and writes with READ_ONCE()/WRITE_ONCE() to make the lockless
access explicit and keep the compiler from reloading or tearing it.

No functional change.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
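
A minimal userspace sketch of the annotated access pattern, for
illustration only. The volatile-cast macros only approximate the kernel's
READ_ONCE()/WRITE_ONCE(), and struct fake_mm, fake_enable_lam() and
fake_untag() are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace approximation of READ_ONCE()/WRITE_ONCE() for a 64-bit word. */
#define READ_ONCE_U64(x)	(*(const volatile uint64_t *)&(x))
#define WRITE_ONCE_U64(x, v)	(*(volatile uint64_t *)&(x) = (v))

/* Bits h..l set, in the style of the kernel's GENMASK() for 64-bit values. */
#define GENMASK_U64(h, l) \
	((~UINT64_C(0) >> (63 - (h))) & (~UINT64_C(0) << (l)))

/* Hypothetical stand-in for the untag_mask field of mm->context. */
struct fake_mm {
	uint64_t untag_mask;
};

/* Written once: clear bits 62..57 from the mask, as with LAM_U57. */
static void fake_enable_lam(struct fake_mm *mm)
{
	WRITE_ONCE_U64(mm->untag_mask, ~GENMASK_U64(62, 57));
}

/* Lockless reader: a single load of the mask, then strip the tag bits. */
static uint64_t fake_untag(struct fake_mm *mm, uint64_t addr)
{
	assert(mm);
	return addr & READ_ONCE_U64(mm->untag_mask);
}
```

Before the mask is narrowed, an all-ones mask leaves the address unchanged;
afterwards only bits 62..57 are stripped.
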
 arch/x86/include/asm/mmu_context.h | 6 +++---
 arch/x86/include/asm/uaccess_64.h  | 2 +-
 arch/x86/kernel/process_64.c       | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ef5b507de34e..cee710f64658 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -100,18 +100,18 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
 static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
-	mm->context.untag_mask = oldmm->context.untag_mask;
+	WRITE_ONCE(mm->context.untag_mask, READ_ONCE(oldmm->context.untag_mask));
 }
 
 #define mm_untag_mask mm_untag_mask
 static inline unsigned long mm_untag_mask(struct mm_struct *mm)
 {
-	return mm->context.untag_mask;
+	return READ_ONCE(mm->context.untag_mask);
 }
 
 static inline void mm_reset_untag_mask(struct mm_struct *mm)
 {
-	mm->context.untag_mask = -1UL;
+	WRITE_ONCE(mm->context.untag_mask, -1UL);
 }
 
 #define arch_pgtable_dma_compat arch_pgtable_dma_compat
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 20de34cc9aa6..4a52497ba6a1 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -43,7 +43,7 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 						   unsigned long addr)
 {
 	mmap_assert_locked(mm);
-	return addr & (mm)->context.untag_mask;
+	return addr & READ_ONCE((mm)->context.untag_mask);
 }
 
 #define untagged_addr_remote(mm, addr)	({				\
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d44afbe005bb..55096136de53 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -814,7 +814,7 @@ static void enable_lam_func(void *__mm)
 static void mm_enable_lam(struct mm_struct *mm)
 {
 	mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
-	mm->context.untag_mask =  ~GENMASK(62, 57);
+	WRITE_ONCE(mm->context.untag_mask, ~GENMASK(62, 57));
 
 	/*
 	 * Even though the process must still be single-threaded at this
-- 
2.53.0-Meta



^ permalink raw reply related

* [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock
From: Rik van Riel @ 2026-06-25  1:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, kernel-team

Sometimes processes can get stuck with the mmap_lock held for
a long time. This slows down, and can even prevent, system monitoring
tools from assessing and logging the situation, because they themselves
end up getting stuck on the mmap_lock.

However, with the introduction of per-VMA locks, we can improve the
reliability of system monitoring, and generally speed up __access_remote_vm()
under mmap_lock contention, by adding a fast path that does not require
the process-wide mmap_lock.

This fast path is only compiled in and used when it is safe to do so,
meaning a kernel with per-VMA locks and RCU page table freeing, and a VMA
that is not hugetlb, VM_IO, VM_PFNMAP, etc.

v2:
 - simplify the code, which should be ok because these copies are < PAGE_SIZE
 - clean up the code
 - fix locking wrt tlb_remove_table_sync_one()
 - hopefully address all the other comments


^ permalink raw reply

* Re: [PATCH] Docs/mm: fix documentation warning for GFP parameter in kmalloc_obj, kmalloc_objs and kmalloc_flex
From: Andrew Morton @ 2026-06-25  1:48 UTC (permalink / raw)
  To: Jakov Novak
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Harry Yoo, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	linux-kernel-mentees, Shuah Khan
In-Reply-To: <20260619113622.11712-1-jakovnovak30@gmail.com>

On Fri, 19 Jun 2026 13:36:22 +0200 Jakov Novak <jakovnovak30@gmail.com> wrote:

> Subject: [PATCH] Docs/mm: fix documentation warning for GFP parameter in kmalloc_obj, kmalloc_objs and kmalloc_flex

Thanks.

"mm/slab: ..." would be a better subject.

> Date: Fri, 19 Jun 2026 13:36:22 +0200
> X-Mailer: git-send-email 2.54.0
> 
> Compiling the documentation currently gives the errors:
> 
> WARNING: ./include/linux/slab.h:1100 Excess function parameter 'GFP' description in 'kmalloc_obj'
> WARNING: ./include/linux/slab.h:1112 Excess function parameter 'GFP' description in 'kmalloc_objs'
> WARNING: ./include/linux/slab.h:1127 Excess function parameter 'GFP' description in 'kmalloc_flex'
> WARNING: ./include/linux/slab.h:1100 Excess function parameter 'GFP' description in 'kmalloc_obj'
> WARNING: ./include/linux/slab.h:1112 Excess function parameter 'GFP' description in 'kmalloc_objs'
> WARNING: ./include/linux/slab.h:1127 Excess function parameter 'GFP' description in 'kmalloc_flex'
> 
> This effectively omits the GFP parameter from the current kernel
> documentation. This patch marks the "..." parameter with the previous
> description of the GFP parameter along with an "optional" tag in
> parantheses.

"parentheses".

I'll assume that Vlastimil will be processing this patch.


^ permalink raw reply

* [PATCH 2/2] mm/page_owner: use memcg_data snapshot instead of PageMemcgKmem() to avoid TOCTOU VM_BUG_ON
From: Ye Liu @ 2026-06-25  1:47 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Ye Liu, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <20260625014708.87386-1-ye.liu@linux.dev>

print_page_owner_memcg() takes a snapshot of page->memcg_data via
READ_ONCE at the top of the function and guards against tail pages
and NULL memcg_data.  However, at the end it calls PageMemcgKmem(page)
which internally calls folio_memcg_kmem() — and that function re-reads
folio->memcg_data and page->compound_head locklessly, wrapping both
in VM_BUG_ON assertions:

    VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
    VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);

If the page is concurrently freed and reallocated as a THP tail page
or a slab page between the initial guards and this final call, the
VM_BUG_ON assertions can fire on debug builds (CONFIG_DEBUG_VM=y),
causing a kernel panic.

Fix by reusing the memcg_data snapshot already taken at function entry
instead of calling PageMemcgKmem(), which is semantically equivalent:
PageMemcgKmem()->folio_memcg_kmem()->folio->memcg_data & MEMCG_DATA_KMEM.
This avoids both the TOCTOU window and the assertions entirely.

Signed-off-by: Ye Liu <ye.liu@linux.dev>
---
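
A small userspace sketch of the snapshot pattern described above, for
illustration only. The flag values, struct fake_page and charge_kind() are
hypothetical, and a relaxed C11 atomic load stands in for READ_ONCE():

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, standing in for the low bits of memcg_data. */
#define DATA_OBJEXTS	(1UL << 0)
#define DATA_KMEM	(1UL << 1)

struct fake_page {
	_Atomic unsigned long memcg_data;
};

/*
 * Classify the page from a single load. Returns -1 when the snapshot shows
 * nothing to report, 1 for a kmem charge, 0 for any other charge.
 */
static int charge_kind(struct fake_page *page)
{
	unsigned long snap;

	assert(page);
	snap = atomic_load_explicit(&page->memcg_data, memory_order_relaxed);

	if (!snap || (snap & DATA_OBJEXTS))
		return -1;

	/* Every later decision uses snap; the shared word is not re-read. */
	return (snap & DATA_KMEM) ? 1 : 0;
}
```

Because the guard and the use both come from the same snapshot, a concurrent
change to the shared word cannot make them disagree.
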
 mm/page_owner.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 5c403bce35ce..b3252ebc0307 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -568,7 +568,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
 	cgroup_name(memcg->css.cgroup, name, sizeof(name));
 	ret += scnprintf(kbuf + ret, count - ret,
 			"Charged %sto %smemcg %s\n",
-			PageMemcgKmem(page) ? "(via objcg) " : "",
+			(memcg_data & MEMCG_DATA_KMEM) ? "(via objcg) " : "",
 			online ? "" : "offline ",
 			name);
 out_unlock:
-- 
2.43.0



^ permalink raw reply related

* [PATCH 1/2] mm/page_owner: clamp skip_buddy_pages() PFN advance at MAX_ORDER_NR_PAGES boundary
From: Ye Liu @ 2026-06-25  1:47 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Ye Liu, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <20260625014708.87386-1-ye.liu@linux.dev>

The lockless buddy_order_unsafe() read can return a garbage order
value if the page is concurrently allocated between the PageBuddy
check and the private read.  If this bogus order is <= MAX_PAGE_ORDER,
skip_buddy_pages() would arbitrarily advance the PFN, potentially
jumping past a MAX_ORDER_NR_PAGES boundary whose pfn_valid() check
would have caught an offline memory section.

In read_page_owner(), which relies solely on boundary-aligned
pfn_valid() to guard pfn_to_page(), skipping the boundary could
cause pfn_to_page() to access an unmapped mem_section.

Clamp the advance so it never crosses the next MAX_ORDER_NR_PAGES
boundary.  This is safe for all three callers: the pageblock-iterating
ones already handle boundary transitions in their outer loops, and
for read_page_owner() the worst case is one extra PageBuddy check per
1024 pages for a huge buddy block straddling the boundary.
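
The clamp arithmetic can be sketched in userspace as follows. This is
only an illustration: MAX_PAGE_ORDER = 10 (hence 1024 pages) is an
assumed configuration, and ALIGN_UP() and clamped_skip() are
hypothetical stand-ins for the kernel's ALIGN() and the body of
skip_buddy_pages().

```c
#include <assert.h>

/* Assumed for illustration: MAX_PAGE_ORDER = 10, i.e. 1024 pages. */
#define MAX_PAGE_ORDER		10
#define MAX_ORDER_NR_PAGES	(1UL << MAX_PAGE_ORDER)

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a)		(((x) + (a) - 1) & ~((a) - 1))

/*
 * Userspace model of the clamped advance. Returns the value *pfn would
 * hold after the skip, i.e. the last PFN skipped, before the caller's
 * loop increments it.
 */
static unsigned long clamped_skip(unsigned long pfn, unsigned int order)
{
	unsigned long new_pfn, boundary;

	assert(order <= MAX_PAGE_ORDER);

	new_pfn = pfn + (1UL << order);
	boundary = ALIGN_UP(pfn + 1, MAX_ORDER_NR_PAGES);

	/* Never step past the next MAX_ORDER_NR_PAGES boundary. */
	return (new_pfn < boundary ? new_pfn : boundary) - 1;
}
```

With these assumed values, clamped_skip(1000, 10) returns 1023, where
the unclamped "*pfn += (1UL << order) - 1" would have produced 2023 and
skipped the pfn_valid() check at PFN 1024.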

Signed-off-by: Ye Liu <ye.liu@linux.dev>
---
 mm/page_owner.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index ec9600025127..5c403bce35ce 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -435,6 +435,12 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old)
  * to skip less than the full buddy block, but that is acceptable for page owner
  * iteration purposes.
  *
+ * The lockless read of buddy_order_unsafe() can also return a garbage order if
+ * the page is concurrently allocated and PageBuddy is cleared between the check
+ * and the read. Clamp the advance at the next MAX_ORDER_NR_PAGES boundary so
+ * that a bogus order cannot carry @pfn into an unvalidated memory section,
+ * which would break callers that rely on boundary-aligned pfn_valid() checks.
+ *
  * Return: true if the page was skipped (caller should continue its loop),
  *         false if the page is not a buddy page and should be processed normally.
  */
@@ -446,8 +452,12 @@ static inline bool skip_buddy_pages(unsigned long *pfn, struct page *page)
 		return false;
 
 	order = buddy_order_unsafe(page);
-	if (order <= MAX_PAGE_ORDER)
-		*pfn += (1UL << order) - 1;
+	if (order <= MAX_PAGE_ORDER) {
+		unsigned long new_pfn = *pfn + (1UL << order);
+		unsigned long boundary = ALIGN(*pfn + 1, MAX_ORDER_NR_PAGES);
+
+		*pfn = min(new_pfn, boundary) - 1;
+	}
 
 	return true;
 }
-- 
2.43.0




* [PATCH 0/2] mm/page_owner: fix TOCTOU races in lockless page state reading
From: Ye Liu @ 2026-06-25  1:47 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Ye Liu, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, linux-mm, linux-kernel

Fix two TOCTOU races found during review of [1].

page_owner reads page state locklessly by design. In two places the
code reads the same metadata twice — once as a guard, then again as
a use — and the page can be concurrently reallocated between the two:

Patch 1: buddy_order_unsafe() in skip_buddy_pages() can return garbage
if the page is allocated between PageBuddy() and the private read,
causing the PFN to skip past a pfn_valid() boundary.  Clamp the
advance at MAX_ORDER_NR_PAGES.

Patch 2: PageMemcgKmem() in print_page_owner_memcg() re-reads
folio->memcg_data and triggers VM_BUG_ON assertions if the page
became a tail page or slab page.  Use the snapshot taken at entry.
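
The snapshot idea behind patch 2 can be sketched in userspace as
follows. This is only an illustration: struct demo_folio,
DEMO_DATA_KMEM and both helpers are hypothetical stand-ins, not the
kernel's definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's MEMCG_DATA_KMEM flag bit. */
#define DEMO_DATA_KMEM	(1UL << 1)

/* Hypothetical stand-in for a word another CPU may rewrite at any time. */
struct demo_folio {
	_Atomic unsigned long memcg_data;
};

/* Read the shared word exactly once; later decisions use this copy. */
static unsigned long demo_snapshot(struct demo_folio *folio)
{
	return atomic_load_explicit(&folio->memcg_data,
				    memory_order_relaxed);
}

/* Test the flag on the snapshot: no second read, so no TOCTOU window. */
static bool demo_snapshot_is_kmem(unsigned long memcg_data)
{
	return (memcg_data & DEMO_DATA_KMEM) != 0;
}
```

If the shared word changes after demo_snapshot() returns, the answer
computed from the snapshot stays consistent with the value the earlier
checks were made against, which is the property the patch relies on.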

[1] https://lore.kernel.org/all/20260623065234.31866-2-ye.liu@linux.dev/
[2] https://sashiko.dev/#/patchset/20260623065234.31866-2-ye.liu@linux.dev

Ye Liu (2):
  mm/page_owner: clamp skip_buddy_pages() PFN advance at
    MAX_ORDER_NR_PAGES boundary
  mm/page_owner: use memcg_data snapshot instead of PageMemcgKmem() to
    avoid TOCTOU VM_BUG_ON

 mm/page_owner.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

--
2.43.0




* Re: [RFC PATCH] reserve_mem: add support for static memory
From: Shyam Saini @ 2026-06-25  1:42 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: linux-mm, linux-doc, linux-kernel, rppt, akpm, kees, tony.luck,
	gpiccoli, bp, rdunlap, peterz, feng.tang, dapeng1.mi, elver,
	enelsonmoore, kuba, lirongqing, ebiggers
In-Reply-To: <2vxzpl1hmmgn.fsf@kernel.org>

Hi,

On 23 Jun 2026 15:10, Pratyush Yadav wrote:
> On Thu, Jun 18 2026, Shyam Saini wrote:
> 
> > reserve_mem relies on dynamic memory allocation, this limits the
> > usecase where memory and its address is required to be preserved
> > across the boots. Eg: ramoops memory reservation on ACPI platforms
> >
> > So add support to pass a pre-determined static address and reserve
> > memory at this specified address. This enables use case like ramoops
> > on ACPI platforms to reliably access ramoops region across the boots.
> 
> Doesn't memmap= do exactly this? How is this different?

Yes, but memmap= is not available on ARM platforms. There was an
unsuccessful attempt [1] to add memmap support for ARM.

> I always thought the point of reserve_mem was that you _don't_ have to
> provide an explicit address, one is chosen for your machine
> automatically.

OK, but I am not sure that was the only intent.
> >
> > Also skip parsing of "align" parameter when static address is passed.
> >
> > Example syntax for static address
> >  reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops
> >
> > Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
> [...]
> 
> -- 
> Regards,
> Pratyush Yadav

By the way, RFC v2 of this change has already been posted here [2].

Thanks,
Shyam


[1] https://lkml.kernel.org/lkml/20201118063314.22940-1-song.bao.hua@hisilicon.com/T/ 
[2] https://lore.kernel.org/lkml/20260619062331.348789-1-shyamsaini@linux.microsoft.com/



* Re: [PATCH 0/8] blk-cgroup: remove queue_lock nesting from blkcg paths
From: yu kuai @ 2026-06-25  1:42 UTC (permalink / raw)
  To: Jens Axboe, yukuai, nilay, tom.leiming, bvanassche, tj, josef
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <34d48fb5-4952-4a48-b92a-f189bc3edd0b@kernel.dk>

Hi,

On 2026/6/24 20:43, Jens Axboe wrote:
> On 6/24/26 12:57 AM, yu kuai wrote:
>> Friendly ping ...
>>
>> This set can still be applied cleanly for block-7.2 branch.
> Not sure how you checked that, because patch 3 very much needs some
> manual attention to get applied. I have applied it now.

Thanks!

This was built on top of my other set:
blk-cgroup: fix blkg list and policy data races

I'll rebase and resend this set :)

>
-- 
Thanks,
Kuai



* Re: [PATCH v8 09/46] KVM: guest_memfd: Introduce function to check GFN private/shared status
From: Binbin Wu @ 2026-06-25  1:39 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <CAEvNRgG-WDzHp-15Mig4hiU5Dag0pFCu70-R-9b=PkD69W=ZMg@mail.gmail.com>



On 6/24/2026 10:38 PM, Ackerley Tng wrote:
> Binbin Wu <binbin.wu@linux.intel.com> writes:
> 
>>
>> [...snip...]
>>
>>> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
>>> +{
>>> +	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>>> +	struct inode *inode;
>>> +
>>> +	/*
>>> +	 * If this gfn has no associated memslot, there's no chance of the gfn
>>> +	 * being backed by private memory, since guest_memfd must be used for
>>> +	 * private memory,
>>
>> "guest_memfd must be used for private memory" is a bit confusing to me.
>>
> 
> Hmm good point. Is the source of confusion that guest_memfd can be used
> for both shared and private memory?

Yes.

> 
> Perhaps this can be rephrased as:
> 
> guest_memfd is the only provider of private memory and guest_memfd must
> be used with a memslot, hence if there's no associated memslot, there's
> no chance of this gfn being private.

LGTM.

> 
>>> and guest_memfd must be associated with some memslot.
>>> +	 */
>>> +	if (!slot)
>>> +		return 0;
>>> +
>>>
>>> [...snip...]
>>>
> 




* Re: [PATCH v2] mm: avoid KCSAN false positive in memdesc_nid()
From: Hui Zhu @ 2026-06-25  1:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, Hui Zhu
In-Reply-To: <20260624140104.eacc15e291eec123bc7b3349@linux-foundation.org>

> 
> On Tue, 23 Jun 2026 16:44:32 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:
> 
> > 
> > From: Hui Zhu <zhuhui@kylinos.cn>
> >  
> >  KCSAN reports a data race between page_to_nid()/folio_nid() reading
> >  page->flags and folio_trylock()/folio_lock() concurrently doing
> >  test_and_set_bit_lock(PG_locked, ...) on the same word, e.g.:
> >  
> >  BUG: KCSAN: data-race in __lruvec_stat_mod_folio / shmem_get_folio_gfp
> >  
> >  The node id occupies a fixed bit-range of page->flags that is set
> >  once at page init and never modified afterwards, so it can never
> >  overlap with the low PG_locked/PG_waiters bits touched by the folio
> >  lock path.
> >  
> >  Use ASSERT_EXCLUSIVE_BITS() in memdesc_nid() to scope the exemption
> >  to just the node-id bits, consistent with how memdesc_zonenum()
> >  already handles the same class of race for the zone-id bits.
> >  
> >  ...
> > 
> >  --- a/include/linux/mm.h
> >  +++ b/include/linux/mm.h
> >  @@ -2290,6 +2290,7 @@ int memdesc_nid(memdesc_flags_t mdf);
> >  #else
> >  static inline int memdesc_nid(memdesc_flags_t mdf)
> >  {
> >  + ASSERT_EXCLUSIVE_BITS(mdf.f, NODES_MASK << NODES_PGSHIFT);
> >  return (mdf.f >> NODES_PGSHIFT) & NODES_MASK;
> >  }
> >  #endif
> > 
> It seems weird to be doing this against a local variable within a
> random function, seemingly unrelated to the problematic functions which
> you've identified.
> 
> Seems that it fooled Sashiko:
>  https://sashiko.dev/#/patchset/20260623084432.701120-1-hui.zhu@linux.dev
> 
> I'm wondering what the heck is going on here?
>

Hi Andrew,

Good catch. ASSERT_EXCLUSIVE_BITS(mdf.f, ...) is checking a by-value
copy of the flags word inside memdesc_nid(), not the actual shared
page->flags/folio->flags being modified by folio_trylock(). Whatever
made it appear to suppress the KCSAN report is likely an artifact of
inlining/codegen (kcsan_atomic_next() happening to land on the real
load after inlining), not a principled fix - so Sashiko's pass is
not reassuring here.

I'll move the assertion to where the real dereference happens (at
the page_to_nid()/folio_nid() call sites) instead of inside the
by-value helper. This probably also applies to the existing
memdesc_zonenum() pattern - is that one actually verified to work,
or does it have the same issue?
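
The by-value problem can be sketched in userspace as follows. This is
only an illustration: demo_flags_t and both helpers are hypothetical
stand-ins, and the sketch assumes the assertion checks the address of
the variable it is handed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for memdesc_flags_t: a struct wrapping one word. */
typedef struct {
	unsigned long f;
} demo_flags_t;

/*
 * By-value helper: mdf is a private copy, so &mdf.f is the address of
 * the copy and can never equal the address of the shared word.
 */
static bool by_value_sees_shared(demo_flags_t mdf,
				 const unsigned long *shared)
{
	return &mdf.f == shared;
}

/* By-pointer helper: &mdf->f is the address of the shared word itself. */
static bool by_pointer_sees_shared(const demo_flags_t *mdf,
				   const unsigned long *shared)
{
	return &mdf->f == shared;
}
```

Given "demo_flags_t flags", by_value_sees_shared(flags, &flags.f) is
false while by_pointer_sees_shared(&flags, &flags.f) is true, which is
why the check needs to sit where the real dereference happens.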

Best,
Hui



* [PATCH] mm/damon/paddr: remove folio_put from damon_pa_invalid_damos_folio
From: Yu Qin @ 2026-06-25  1:22 UTC (permalink / raw)
  To: sj; +Cc: akpm, damon, linux-mm, Yu Qin

damon_pa_invalid_damos_folio() is a boolean predicate, yet it calls
folio_put() as a side effect when the folio matches s->last_applied.
Remove the put and let callers handle it explicitly, making the get/put
pairing clearer.
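
The resulting caller pattern can be sketched in userspace as follows.
This is only an illustration: the demo_* types and helpers are
hypothetical stand-ins with a plain integer reference count, not the
DAMON or folio APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct folio and struct damos. */
struct demo_folio {
	int refcount;
};

struct demo_scheme {
	struct demo_folio *last_applied;
};

/* Take a reference; returns NULL when there is nothing to get. */
static struct demo_folio *demo_get_folio(struct demo_folio *folio)
{
	if (folio)
		folio->refcount++;
	return folio;
}

static void demo_put_folio(struct demo_folio *folio)
{
	folio->refcount--;
}

/* Pure predicate: it never touches the reference count. */
static bool demo_invalid_folio(struct demo_folio *folio,
			       struct demo_scheme *s)
{
	if (!folio)
		return true;
	return folio == s->last_applied;
}

/* Caller owns the reference, so every get has a visible put. */
static bool demo_visit(struct demo_folio *candidate, struct demo_scheme *s)
{
	struct demo_folio *folio = demo_get_folio(candidate);

	if (demo_invalid_folio(folio, s)) {
		if (folio)
			demo_put_folio(folio);
		return false;
	}

	s->last_applied = folio;
	demo_put_folio(folio);
	return true;
}
```

Both exits from demo_visit() drop the reference in the same function
that took it, so the count is balanced whether or not the folio was
skipped.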

Signed-off-by: Yu Qin <qin.yuA@h3c.com>
---
 mm/damon/paddr.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 85cd64a55..f45c7939a 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -313,15 +313,12 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
 	return scheme->ops_filters_default_reject;
 }
 
-static bool damon_pa_invalid_damos_folio(struct folio *folio, struct damos *s)
+static inline bool damon_pa_invalid_damos_folio(struct folio *folio,
+		struct damos *s)
 {
 	if (!folio)
 		return true;
-	if (folio == s->last_applied) {
-		folio_put(folio);
-		return true;
-	}
-	return false;
+	return folio == s->last_applied;
 }
 
 static unsigned long damon_pa_pageout(struct damon_region *r,
@@ -353,6 +350,8 @@ static unsigned long damon_pa_pageout(struct damon_region *r,
 	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
 		folio = damon_get_folio(PHYS_PFN(addr));
 		if (damon_pa_invalid_damos_folio(folio, s)) {
+			if (folio)
+				folio_put(folio);
 			addr += PAGE_SIZE;
 			continue;
 		}
@@ -394,6 +393,8 @@ static inline unsigned long damon_pa_de_activate(
 	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
 		folio = damon_get_folio(PHYS_PFN(addr));
 		if (damon_pa_invalid_damos_folio(folio, s)) {
+			if (folio)
+				folio_put(folio);
 			addr += PAGE_SIZE;
 			continue;
 		}
@@ -442,6 +443,8 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
 	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
 		folio = damon_get_folio(PHYS_PFN(addr));
 		if (damon_pa_invalid_damos_folio(folio, s)) {
+			if (folio)
+				folio_put(folio);
 			addr += PAGE_SIZE;
 			continue;
 		}
@@ -478,6 +481,8 @@ static unsigned long damon_pa_stat(struct damon_region *r,
 	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
 		folio = damon_get_folio(PHYS_PFN(addr));
 		if (damon_pa_invalid_damos_folio(folio, s)) {
+			if (folio)
+				folio_put(folio);
 			addr += PAGE_SIZE;
 			continue;
 		}
-- 
2.43.0




