* Re: [PATCH v2 1/4] mm: mincore: use walk_page_range_vma() in do_mincore()
From: Andrew Morton @ 2026-06-25 2:38 UTC
To: Kefeng Wang
Cc: Pedro Falcato, David Hildenbrand (Arm), Zi Yan, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Suren Baghdasaryan, linux-mm
In-Reply-To: <e7244b54-d794-4e46-88e7-69204da2b60d@huawei.com>
On Mon, 22 Jun 2026 11:23:11 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>> - err = walk_page_range(vma->vm_mm, addr, end, &mincore_walk_ops, vec);
> >>>> +
> >>>> + /*
> >>>> + * walk_page_range_vma() does not call walk_page_test(), which
> >>>> + * handles VM_PFNMAP VMA by invoking ->pte_hole() to skip the
> >>>> + * page table walk. Without this check, PFNMAP PTEs would be
> >>>> + * treated as present by mincore_pte_range(), changing the returned
> >>>> + * residency status from the historical "not resident" to "resident".
> >>>> + * Handle VM_PFNMAP explicitly to preserve the original behavior.
> >>>> + */
> >
> > I would rather we amend this comment to something like:
> >
> > /* mincore (historically) reports PFNMAP mappings as non-resident. */
> >
> > because we don't need to explain internal differences in walk_page_range
> > functions in a random comment in mincore. And perhaps attempt a separate
>
> Hopefully Andrew can fix the comment when picking up the patches.
I added this:
--- a/mm/mincore.c~mm-mincore-use-walk_page_range_vma-in-do_mincore-fix
+++ a/mm/mincore.c
@@ -261,12 +261,7 @@ static long do_mincore(unsigned long add
}
/*
- * walk_page_range_vma() does not call walk_page_test(), which
- * handles VM_PFNMAP VMA by invoking ->pte_hole() to skip the
- * page table walk. Without this check, PFNMAP PTEs would be
- * treated as present by mincore_pte_range(), changing the returned
- * residency status from the historical "not resident" to "resident".
- * Handle VM_PFNMAP explicitly to preserve the original behavior.
+ * mincore historically reports PFNMAP mappings as non-resident.
*/
if (vma->vm_flags & VM_PFNMAP) {
__mincore_unmapped_range(addr, end, vma, vec);
_
Sashiko is OK with your patchset, but it might have found four(!)
pre-existing issues:
https://sashiko.dev/#/patchset/20260618092845.3905740-1-wangkefeng.wang@huawei.com
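
For readers unfamiliar with the interface under discussion, here is a minimal userspace sketch of what mincore() reports. It assumes Linux and an ordinary anonymous mapping; the helper name touched_anon_page_is_resident() is made up for illustration. It does not exercise a VM_PFNMAP mapping, which would need a suitable device mapping; per the comment above, such mappings are reported as non-resident.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one anonymous page, touch it, and ask mincore()
 * whether it is resident. Returns 1 if resident, 0 if not, -1 on error.
 */
int touched_anon_page_is_resident(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char vec[1];	/* one status byte per page */
	unsigned char *p;
	int ret;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = 1;	/* write fault brings the page in */

	ret = mincore(p, (size_t)page, vec);
	munmap(p, (size_t)page);
	if (ret)
		return -1;

	/* Only the least significant bit of each byte is defined. */
	return vec[0] & 1;
}
```

Only bit 0 of each vector byte carries the residency status, which is why the helper masks the result.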
* Re: [PATCH] mm/page_alloc: don't build vm_numa_stat_key if CONFIG_NUMA=n
From: Andrew Morton @ 2026-06-25 2:27 UTC
To: Ben Dooks
Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <20260618100614.1321950-1-ben.dooks@codethink.co.uk>
On Thu, 18 Jun 2026 11:06:14 +0100 Ben Dooks <ben.dooks@codethink.co.uk> wrote:
> The vm_numa_stat_key is only exported if CONFIG_NUMA is set,
> so avoid the following warning by guarding it in an #ifdef
> on CONFIG_NUMA:
>
> mm/page_alloc.c:165:1: warning: symbol 'vm_numa_stat_key' was not declared. Should it be static?
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -162,7 +162,9 @@ DEFINE_PER_CPU(int, numa_node);
> EXPORT_PER_CPU_SYMBOL(numa_node);
> #endif
>
> +#ifdef CONFIG_NUMA
> DEFINE_STATIC_KEY_TRUE(vm_numa_stat_key);
> +#endif
>
> #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> /*
It might be tidier to move this into mm/vmstat.c, around line 38?
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-25 2:25 UTC
To: Ackerley Tng
Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <CAEvNRgG1nHipzw4=eBgwhvyXi8xYo7FQD_sy9Ax6FDf7YDu3Og@mail.gmail.com>
On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Tue, Jun 23, 2026, Yan Zhao wrote:
> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> >> > > > > return -EIO;
> >> > > > >
> >> > > > > - if (!src_page)
> >> > > > > - return -EOPNOTSUPP;
> >> > > > > + if (!src_page) {
> >> > > > > + if (!gmem_in_place_conversion)
> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
> >> > > > error.
> >> > >
> >> > > Why MMAP?
> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
> >> >
> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just
> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> >> > > and written memory. And when write() lands, MMAP wouldn't be necessary to
> >> > > initialize the memory.
> >> > Do you mean using up-to-date flag as below?
> >
> > Yes? I didn't actually look at the implementation details.
> >
> >> > if (!src_page) {
> >> > src_page = pfn_to_page(pfn);
> >> > if (!folio_test_uptodate(page_folio(src_page)))
> >> > return -EOPNOTSUPP;
> >> > }
>
> Yan is right that with the earlier patch "Zero page while getting pfn",
> folio_test_uptodate() here will always return true.
>
> Actually, this is an alternative fix for the issue Sashiko pointed out
> on v7 where userspace can do a populate() (either TDX or SNP) without
> first allocating the page, with src_address == NULL, and leak
> uninitialized memory into the guest.
>
> Advantage of using the uptodate check in populate: if the host never
> allocates the page, populate doesn't incur zeroing before writing the
> page anyway in populate().
>
> Disadvantage: Both TDX and SNP will have to implement this uptodate
> check. guest_memfd can't check centrally because for SNP, for a
> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
> firmware will zero and there's no leakage of uninitialized host memory?
Another disadvantage: the uptodate flag is per-folio. What if the folio
is only partially initialized by userspace, especially once huge pages are
supported?
> >> Another concern with this fix is that:
> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
> >> folio uptodate before reaching post_populate().
> >>
> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
> >>
> >> > One concern is that TDX now does not much care about the up-to-date flag since
> >> > TDX doesn't rely on the flag to clear pages on conversions.
> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
> >> > user access.
> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
> >> > to have more effort than introducing a dedicated flag for TDX.
> >> >
> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> >> > > > 1. create guest_memfd with MMAP flag
> >> > > > 2. mmap the guest_memfd.
> >> > > > 3. convert the initial memory range to shared.
> >> > > > 4. copy initial content to the source page.
> >> > > > 5. convert the initial memory range to private
> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> >> > > > 7. do not unmap the source backend.
> >> > > >
> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
> >> > >
> >> > > Why? It's userspace's responsibility to get the above right. If userspace fails
> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
>
> Yan, is your concern that userspace forgot to update the code and
> forgets to provide a src_page, and if we keep the "Zero page while
Yes. Previously, it would be rejected after GUP failed.
> getting pfn" patch, ends up with the guest silently having a zero page?
> I think that would be found quite early in userspace VMM testing...
I actually encountered this while testing this patch.
I updated most code paths to follow this sequence. However, there are still some
corner cases for the TDVF HOB, which are less obvious and harder to update.
The TD just booted up and hung silently.
> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
> >> > kernel to detect this mistake, similar to how it validates whether source_addr
> >> > is PAGE_ALIGNED.
> >
> > The alignment case is different. If userspace provides an unaligned value, KVM
> > *can't* do what userspace is asking because hardware and thus KVM only supports
> > converting on page boundaries.
> >
> > For a NULL source, KVM can still do what userspace is asking. Rejecting userspace's
> > request would then be making assumptions about what userspace wants.
> >
>
> Also, +1 on this, what if userspace, knowing that pages are zeroed on
> allocation, actually wants to rely on that to get a zero page in the guest?
What if 0 uaddr is a valid address? :)
> >> > Since userspace already needs to perform additional steps to enable in-place
> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
> >> > intentional seems like a reasonable burden.
> >
> > I don't see how it adds any value. I wouldn't be at all surprised if most VMMs
> > just end up with code that does:
> >
> > if (in-place) {
> > src = NULL;
> > flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
> > }
>
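
The up-to-date check debated in this message, and the per-folio concern raised against it, can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct model_folio, model_user_write() and model_post_populate() are hypothetical stand-ins, not KVM or guest_memfd code, and the four-page folio size is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGES_PER_FOLIO 4	/* arbitrary size for the model */

/* Hypothetical stand-in for a folio: one up-to-date flag for all pages. */
struct model_folio {
	bool uptodate;
	bool written[MODEL_PAGES_PER_FOLIO];	/* what userspace really wrote */
};

/* Model of a user access: touching any page marks the whole folio up-to-date. */
void model_user_write(struct model_folio *f, size_t idx)
{
	assert(idx < MODEL_PAGES_PER_FOLIO);
	f->written[idx] = true;
	f->uptodate = true;
}

/*
 * Model of the proposed check: with no source page, populate in place only
 * if the folio is up-to-date; otherwise reject the request.
 */
int model_post_populate(const struct model_folio *f, bool have_src_page)
{
	if (!have_src_page && !f->uptodate)
		return -EOPNOTSUPP;
	return 0;
}
```

In this model a single write to page 0 makes model_post_populate() accept a NULL source for the whole folio, even though pages 1 to 3 were never written. That is the partial-initialization case the per-folio flag cannot distinguish.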
* Re: [PATCH 0/2] mm/page_owner: fix TOCTOU races in lockless page state reading
From: Andrew Morton @ 2026-06-25 2:04 UTC
To: Ye Liu
Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <20260625014708.87386-1-ye.liu@linux.dev>
On Thu, 25 Jun 2026 09:47:03 +0800 Ye Liu <ye.liu@linux.dev> wrote:
> Fix two TOCTOU races found during review of [1].
>
> page_owner reads page state locklessly by design. In two places the
> code reads the same metadata twice — once as a guard, then again as
> a use — and the page can be concurrently reallocated between the two:
>
> Patch 1: buddy_order_unsafe() in skip_buddy_pages() can return garbage
> if the page is allocated between PageBuddy() and the private read,
> causing the PFN to skip past a pfn_valid() boundary. Clamp the
> advance at MAX_ORDER_NR_PAGES.
>
> Patch 2: PageMemcgKmem() in print_page_owner_memcg() re-reads
> folio->memcg_data and triggers VM_BUG_ON assertions if the page
> became a tail page or slab page. Use the snapshot taken at entry.
That was fast. I haven't pushed out mm-new yet, so Sashiko wasn't able
to apply these.
> [1] https://lore.kernel.org/all/20260623065234.31866-2-ye.liu@linux.dev/
> [2] https://sashiko.dev/#/patchset/20260623065234.31866-2-ye.liu@linux.dev
Nothing cites "[2]". That's OK.
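
The two fixes described in the cover letter follow a common pattern for lockless readers: bound any value that may be garbage, and make the guard and the use consume the same single read. A minimal userspace sketch, assuming a maximum order of 10 and a made-up flag bit; model_buddy_skip(), model_is_kmem_snapshot() and the MODEL_* constants are illustrative names, not the code in the patches:

```c
#include <assert.h>

#define MODEL_MAX_ORDER		10UL	/* assumed value for the model */
#define MODEL_MAX_ORDER_NR_PAGES	(1UL << MODEL_MAX_ORDER)
#define MODEL_MEMCG_KMEM	0x2UL	/* made-up flag bit */

/*
 * Patch 1 pattern: the order was read without a lock and may be garbage if
 * the page was reallocated, so clamp the number of PFNs skipped.
 */
unsigned long model_buddy_skip(unsigned long raw_order)
{
	if (raw_order > MODEL_MAX_ORDER)
		return MODEL_MAX_ORDER_NR_PAGES;
	return 1UL << raw_order;
}

struct model_page {
	volatile unsigned long memcg_data;	/* may change concurrently */
};

/*
 * Patch 2 pattern: read the shared word once, then guard and use the same
 * snapshot instead of re-reading the field.
 */
int model_is_kmem_snapshot(const struct model_page *page)
{
	unsigned long snap = page->memcg_data;	/* the only read */

	if (!snap)
		return 0;
	return (snap & MODEL_MEMCG_KMEM) != 0;
}
```

The clamp does not make the skipped distance correct after a race; it only keeps the walker from jumping past the range it has validated.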
* Re: [PATCH v2] mm: avoid KCSAN false positive in memdesc_nid()
From: Andrew Morton @ 2026-06-25 1:58 UTC
To: Hui Zhu
Cc: David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel, Hui Zhu
In-Reply-To: <20bd8bbbd5a8cc52d267f550fc0314cd0d81a223@linux.dev>
On Thu, 25 Jun 2026 01:32:49 +0000 "Hui Zhu" <hui.zhu@linux.dev> wrote:
> Good catch. ASSERT_EXCLUSIVE_BITS(mdf.f, ...) is checking a by-value
> copy of the flags word inside memdesc_nid(), not the actual shared
> page->flags/folio->flags being modified by folio_trylock(). Whatever
> made it appear to suppress the KCSAN report is likely an artifact of
> inlining/codegen (kcsan_atomic_next() happening to land on the real
> load after inlining), not a principled fix - so Sashiko's pass is
> not reassuring here.
Yeah, I was wondering if the inlining accidentally gave the macro the
correct thing. Which seems wrong - an inlined function should treat an
incoming arg purely as a local thing. Maybe we fooled the compiler.
> I'll move the assertion to where the real dereference happens (at
> the page_to_nid()/folio_nid() call sites) instead of inside the
> by-value helper. This probably also applies to the existing
> memdesc_zonenum() pattern - is that one actually verified to work,
> or does it have the same issue?
I assume the memdesc_zonenum() code worked, for the same (poorly
understood) reason as did your patch.
Yes, moving this into the sites where we officially have access to the
shared storage seems the right approach.
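
The by-value problem can be shown without KCSAN at all. In the sketch below, an address-based check placed inside a by-value helper can only ever name the helper's private copy, never the shared word. The type and function names are hypothetical, and the code says nothing about how KCSAN itself is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a by-value flags wrapper. */
struct model_flags {
	unsigned long f;
};

/*
 * Helper taking the flags by value: &mdf.f is the address of a local copy,
 * so a check keyed on that address cannot cover the shared storage.
 */
bool by_value_watches_shared(struct model_flags mdf,
			     const struct model_flags *shared)
{
	return &mdf.f == &shared->f;
}

/* Helper taking a pointer: the check names the shared storage itself. */
bool by_pointer_watches_shared(const struct model_flags *mdf,
			       const struct model_flags *shared)
{
	return &mdf->f == &shared->f;
}
```

This matches the conclusion above: the check belongs at the call sites that still hold a pointer to the shared flags word.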
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-25 1:51 UTC
To: Sean Christopherson
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajx5Vrz9ma--hrGH@google.com>
On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > This means this module parameter only enables per-gmem memory attribute and does
> > > not guarantee that gmem in-place conversion will actually occur.
>
> KVM module params are pretty much always about what KVM supports, not what is
> guaranteed to happen.
>
> - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
> because maybe the guest never accesses emulated MMIO.
> - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
> not to advertise one.
> - and so on and so forth...
>
> Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> to "I need to set memory attributes on the guest_memfd instance, not the VM",
> but I don't see that as a big hurdle, certainly not in the long term. And once
> the VMM code is written, I really do think most people are going to care about
> whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
Sorry, I just saw this mail after posting my reply in [1].
I'm ok with gmem_in_place_conversion=true meaning only that KVM supports in-place
conversion, while we can still create VMs with shared memory not from gmem.
Though it still feels a bit odd for TDX huge pages to depend on
gmem_in_place_conversion=true when shared memory is not currently allocated from
gmem, it should become more natural over time once gmem supports in-place
conversion for huge pages.
[1] https://lore.kernel.org/all/ajyCn0PnFtQK+Nka@yzhao56-desk.sh.intel.com
> > > To avoid confusion, could we rename this module parameter to something more
> > > accurate, such as gmem_memory_attribute?
> >
> > I asked Sean about this after getting some fixes off list. Sean said
> > gmem_in_place_conversion is named for a host admin to use, and something
> > like gmem_memory_attributes is too much implementation details for the
> > admin.
> >
> > Sean, would you reconsider since Yan also asked? If the admin compiled
> > the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> > admin would also be able to use a param like gmem_memory_attributes?
>
> No, because it's not all memory attributes, it's very specifically the PRIVATE
> attribute that will get moved to guest_memfd. I don't want to pick a name that
> will become stale and confusing when RWX attributes come along. The RWX bits
> will be per-VM, while PRIVATE will be per-guest_memfd.
* [PATCH 3/3] mm: read remote memory without the mmap lock where possible
From: Rik van Riel @ 2026-06-25 1:50 UTC
To: linux-kernel
Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>
__access_remote_vm() takes mmap_read_lock() for the entire transfer and
uses get_user_pages_remote(), which faults pages in. For the common case
of reading memory that is already resident -- /proc/PID/cmdline,
/proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
unnecessary and is badly contended on large machines.
Add an opportunistic, read-only fast path. It takes the per-VMA lock with
lock_vma_under_rcu() and, only when the whole request lies within that one
VMA, copies the resident pages out using folio_walk_start(FW_VMA_LOCKED)
to grab a short-lived page reference from a page table walk run with
interrupts disabled. Interrupts are disabled only across the walk (until
the folio is pinned): page table freeing -- a concurrent munmap() or THP
collapse of an adjacent region -- serializes against lockless walkers via
tlb_remove_table_sync_one(), which IPIs and waits for every CPU to enable
interrupts, the same contract gup_fast relies on. The copy then runs with
interrupts on, holding only the folio reference.
A request that spans more than one VMA is left entirely to the mmap_lock
path: relocking per VMA could observe a structurally inconsistent address
space (a neighbouring VMA unmapped and a different one mapped in its place
between locks), whereas the mmap_lock path sees a stable VMA tree for the
whole transfer.
The per-VMA permission check mirrors the read side of check_vma_flags(),
including the FOLL_ANON restriction that /proc/PID/{cmdline,environ} rely
on (CVE-2018-1120). Anything not positively allowed -- a not-present
page, a hugetlb or VM_IO/VM_PFNMAP or secretmem mapping, or a race with a
VMA writer -- falls back to the mmap_lock path for the remainder, which
re-validates everything. Pages read on the fast path are marked accessed,
matching the FOLL_TOUCH behaviour of the get_user_pages_remote() slow
path.
untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
for the fast path; the untag mask is a stable per-mm value.
Only reads are handled here; writes keep using the slow path.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/uaccess_64.h | 14 ++-
include/linux/uaccess.h | 11 ++
mm/memory.c | 195 +++++++++++++++++++++++++++++-
3 files changed, 217 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 4a52497ba6a1..933b0b8b4d60 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -39,11 +39,23 @@ static inline unsigned long __untagged_addr(unsigned long addr)
(__force __typeof__(addr))__untagged_addr(__addr); \
})
+/* Strip the tag bits from a remote mm's address; usable without the mmap lock. */
+static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
+ unsigned long addr)
+{
+ return addr & READ_ONCE(mm->context.untag_mask);
+}
+
+#define untagged_addr_remote_unlocked(mm, addr) ({ \
+ unsigned long __addr = (__force unsigned long)(addr); \
+ (__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
+})
+
static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
unsigned long addr)
{
mmap_assert_locked(mm);
- return addr & READ_ONCE((mm)->context.untag_mask);
+ return __untagged_addr_remote_unlocked(mm, addr);
}
#define untagged_addr_remote(mm, addr) ({ \
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 8a264662b242..c8c83372c9d8 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -34,6 +34,17 @@
})
#endif
+/*
+ * Like untagged_addr_remote(), but for callers that stabilize @mm by other
+ * means (e.g. a per-VMA lock) and must not assert the mmap lock.
+ */
+#ifndef untagged_addr_remote_unlocked
+#define untagged_addr_remote_unlocked(mm, addr) ({ \
+ (void)(mm); \
+ untagged_addr(addr); \
+})
+#endif
+
#ifdef masked_user_access_begin
#define can_do_masked_user_access() 1
# ifndef masked_user_write_access_begin
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..d2b2f0014a0c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -42,6 +42,8 @@
#include <linux/kernel_stat.h>
#include <linux/mm.h>
#include <linux/mm_inline.h>
+#include <linux/secretmem.h>
+#include <linux/pagewalk.h>
#include <linux/sched/mm.h>
#include <linux/sched/numa_balancing.h>
#include <linux/sched/task.h>
@@ -7062,6 +7064,180 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
EXPORT_SYMBOL_GPL(generic_access_phys);
#endif
+/*
+ * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
+ * lock and RCU-freed page tables to walk page tables without the mmap lock.
+ */
+#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
+/*
+ * Read-side VMA checks for the lockless fast path, mirroring the read side of
+ * check_vma_flags(): reject what FW_VMA_LOCKED cannot handle (hugetlb), what
+ * needs the ->access() handler (VM_IO/VM_PFNMAP), or what has no struct page to
+ * copy (secretmem); enforce the FOLL_ANON restriction that
+ * /proc/PID/{cmdline,environ} rely on (CVE-2018-1120); and require read access
+ * (honoring FOLL_FORCE). Anything not positively allowed falls back to the slow
+ * path, which re-validates everything.
+ */
+static bool vma_permits_fast_access(struct vm_area_struct *vma,
+ unsigned int gup_flags)
+{
+ if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+ return false;
+ if (is_vm_hugetlb_page(vma) || vma_is_secretmem(vma))
+ return false;
+ if ((gup_flags & FOLL_ANON) && !vma_is_anonymous(vma))
+ return false;
+ if (!(vma->vm_flags & VM_READ) &&
+ (!(gup_flags & FOLL_FORCE) || !(vma->vm_flags & VM_MAYREAD)))
+ return false;
+ return true;
+}
+
+/* Size of the single mapping entry folio_walk_start() landed on. */
+static unsigned long fw_entry_size(enum folio_walk_level level)
+{
+ switch (level) {
+ case FW_LEVEL_PUD:
+ return PUD_SIZE;
+ case FW_LEVEL_PMD:
+ return PMD_SIZE;
+ default:
+ return PAGE_SIZE;
+ }
+}
+
+/*
+ * Copy @len bytes of the pinned @folio out to @buf, starting at byte offset
+ * @folio_off within the folio (the position of @addr). Maps and copies one
+ * page at a time -- kmap_local_folio() for HIGHMEM, copy_from_user_page() for
+ * the per-page flush on aliasing caches -- without re-walking page tables.
+ * Each page borrows the caller's single folio reference, so the mapping is
+ * dropped with kunmap_local() rather than folio_release_kmap().
+ */
+static void copy_folio_pages(struct vm_area_struct *vma, struct folio *folio,
+ unsigned long folio_off, unsigned long addr,
+ void *buf, unsigned long len)
+{
+ unsigned long done = 0;
+
+ while (done < len) {
+ unsigned long pos = folio_off + done;
+ unsigned long page_idx = pos >> PAGE_SHIFT;
+ unsigned int page_off = pos & ~PAGE_MASK;
+ unsigned int chunk = min_t(unsigned long, len - done,
+ PAGE_SIZE - page_off);
+ void *kaddr = kmap_local_folio(folio, page_idx << PAGE_SHIFT);
+
+ copy_from_user_page(vma, folio_page(folio, page_idx),
+ addr + done, buf + done, kaddr + page_off,
+ chunk);
+ kunmap_local(kaddr);
+ done += chunk;
+ }
+}
+
+/*
+ * Opportunistic lockless fast path for __access_remote_vm() reads.
+ *
+ * Memory already resident in @mm can be read without taking the frequently
+ * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
+ * with FW_VMA_LOCKED grabs a short-lived reference to a present page from a page
+ * table walk run with interrupts disabled, which serializes against concurrent
+ * page table freeing the same way gup_fast does (relying on
+ * MMU_GATHER_RCU_TABLE_FREE).
+ *
+ * Only a request that lies entirely within a single VMA is handled here,
+ * which should not be an issue in practice since every caller has a
+ * buffer of PAGE_SIZE or smaller. Loop iteration inside this function
+ * should be rare, too.
+ *
+ * Returns the number of bytes transferred via the fast path.
+ */
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, unsigned int gup_flags)
+{
+ void *old_buf = buf;
+ struct vm_area_struct *vma;
+
+ addr = untagged_addr_remote_unlocked(mm, addr);
+
+ vma = lock_vma_under_rcu(mm, addr);
+ if (!vma)
+ return 0;
+
+ /* Only handle a request contained entirely within this one VMA. */
+ if (len > vma->vm_end - addr)
+ goto out_unlock;
+
+ if (!vma_permits_fast_access(vma, gup_flags))
+ goto out_unlock;
+
+ while (len) {
+ struct folio_walk fw;
+ struct folio *folio;
+ struct page *page;
+ unsigned long entry_size, folio_off, span, irq_flags;
+
+ /*
+ * The lockless page table walk must run with interrupts
+ * disabled: page table freeing (munmap or THP collapse, which
+ * IPI via tlb_remove_table_sync_one() and wait) then cannot free
+ * a table mid-walk -- the same contract gup_fast relies on. IRQs
+ * are restored once the folio is pinned; the copy below holds only
+ * the folio reference.
+ */
+ local_irq_save(irq_flags);
+ folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
+ if (!folio) {
+ local_irq_restore(irq_flags);
+ goto out_unlock; /* not present: let the slow path fault it in */
+ }
+ page = fw.page;
+ if (!page) {
+ /* No struct page to copy (e.g. a special PTE). */
+ folio_walk_end(&fw, vma);
+ local_irq_restore(irq_flags);
+ goto out_unlock;
+ }
+ entry_size = fw_entry_size(fw.level);
+ folio_get(folio);
+ folio_walk_end(&fw, vma);
+ local_irq_restore(irq_flags);
+
+ /*
+ * folio_walk_start() validated one present mapping entry
+ * (PAGE/PMD/PUD_SIZE). Copy to the end of that entry, bounded by
+ * the folio and the remaining length (already within the VMA), so
+ * a huge mapping is handled in a single walk.
+ */
+ folio_off = (folio_page_idx(folio, page) << PAGE_SHIFT) +
+ offset_in_page(addr);
+ span = min3((unsigned long)len,
+ entry_size - (addr & (entry_size - 1)),
+ (folio_nr_pages(folio) << PAGE_SHIFT) - folio_off);
+
+ copy_folio_pages(vma, folio, folio_off, addr, buf, span);
+
+ /* Match the FOLL_TOUCH behaviour of the slow (GUP) path. */
+ folio_mark_accessed(folio);
+ folio_put(folio);
+ len -= span;
+ buf += span;
+ addr += span;
+ }
+
+out_unlock:
+ vma_end_read(vma);
+ return buf - old_buf;
+}
+#else
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, unsigned int gup_flags)
+{
+ return 0;
+}
+#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
+
/*
* Access another process' address space as given in mm.
*/
@@ -7071,15 +7247,30 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *old_buf = buf;
int write = gup_flags & FOLL_WRITE;
+ /*
+ * Try the lockless fast path for reads first; it transfers what it can
+ * from resident memory without taking mmap_lock, and leaves the
+ * remainder (if any) to the slow path below.
+ */
+ if (!write) {
+ int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
+
+ addr += done;
+ buf += done;
+ len -= done;
+ if (!len)
+ return buf - old_buf;
+ }
+
if (mmap_read_lock_killable(mm))
- return 0;
+ return buf - old_buf;
/* Untag the address before looking up the VMA */
addr = untagged_addr_remote(mm, addr);
/* Avoid triggering the temporary warning in __get_user_pages */
if (!vma_lookup(mm, addr) && !expand_stack(mm, addr))
- return 0;
+ return buf - old_buf;
/* ignore errors, just check how much was successfully transferred */
while (len) {
--
2.53.0-Meta
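
The span computation in access_remote_vm_fast() above is easy to check in isolation. The following userspace sketch mirrors the min3() expression from the patch under the assumption of 4 KiB pages; model_span() and the MODEL_* constants are illustrative names, and page_idx stands in for folio_page_idx(folio, page).

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT	12	/* assumes 4 KiB pages */
#define MODEL_PAGE_SIZE		(1UL << MODEL_PAGE_SHIFT)

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Bytes that one walk may copy: limited by the remaining length, by the end
 * of the mapping entry (entry_size must be a power of two), and by the end
 * of the folio. page_idx is the index within the folio of the page that
 * maps addr.
 */
unsigned long model_span(unsigned long addr, unsigned long len,
			 unsigned long entry_size,
			 unsigned long folio_nr_pages,
			 unsigned long page_idx)
{
	unsigned long folio_off = (page_idx << MODEL_PAGE_SHIFT) +
				  (addr & (MODEL_PAGE_SIZE - 1));
	unsigned long to_entry_end = entry_size - (addr & (entry_size - 1));
	unsigned long to_folio_end = (folio_nr_pages << MODEL_PAGE_SHIFT) -
				     folio_off;

	return min_ul(len, min_ul(to_entry_end, to_folio_end));
}
```

For a 64-byte read starting 16 bytes before the end of a PTE-mapped page, the model yields a 16-byte span, so the loop in the patch would walk again for the remainder. A 4096-byte read inside a PMD-mapped 2 MiB folio fits in a single span.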
* [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
From: Rik van Riel @ 2026-06-25 1:50 UTC
To: linux-kernel
Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>
folio_walk_start() asserts the mmap lock is held. For callers that only
need to read a single, already-present page, the mmap lock is a heavy and
often badly contended hammer. Such a caller can instead hold the per-VMA
lock, which keeps the VMA itself stable.
The per-VMA lock does not, however, keep the page tables walked below that
VMA from being freed. A concurrent munmap() or THP collapse of an
adjacent region in the same mm can free a shared upper-level table, and
THP collapse (collapse_huge_page() -> retract_page_tables()) frees page
tables of VMAs whose lock it does not hold. Page table freeing
synchronizes against lockless walkers the way gup_fast relies on:
tlb_remove_table_sync_one() sends an IPI and waits for every CPU to enable
interrupts, so a walker that keeps interrupts disabled across the walk
cannot be observing a table that is about to be freed. rcu_read_lock() is
not sufficient -- it does not block that IPI -- so the caller must keep
interrupts disabled, not merely hold an RCU read-side critical section.
Add an FW_VMA_LOCKED flag. When passed, folio_walk_start() asserts the
per-VMA lock and that interrupts are disabled, instead of asserting the
mmap lock; it requires CONFIG_MMU_GATHER_RCU_TABLE_FREE and refuses
hugetlb VMAs (PMD sharing maps page tables this VMA's lock does not
cover). The caller must keep interrupts disabled until folio_walk_end().
No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
include/linux/pagewalk.h | 7 +++++++
mm/pagewalk.c | 29 +++++++++++++++++++++++++++--
2 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b41d7265c01b..d0387470d732 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -150,6 +150,13 @@ typedef int __bitwise folio_walk_flags_t;
/* Walk shared zeropages (small + huge) as well. */
#define FW_ZEROPAGE ((__force folio_walk_flags_t)BIT(0))
+/*
+ * The caller holds the per-VMA lock instead of the mmap lock, with interrupts
+ * disabled across the walk (until folio_walk_end()) to serialize against page
+ * table freeing, the same way gup_fast does. Only valid with RCU-freed page
+ * tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
+ */
+#define FW_VMA_LOCKED ((__force folio_walk_flags_t)BIT(1))
enum folio_walk_level {
FW_LEVEL_PTE,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..ab1e81983cb8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -890,7 +890,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
* huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
* not correspond to the first physical entry of a logical hugetlb entry.
*
- * The mmap lock must be held in read mode.
+ * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
+ * passed, the VMA's per-VMA lock must be held and interrupts must be disabled
+ * across the walk and until folio_walk_end() (only supported with RCU-freed page
+ * tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
*
* Return: folio pointer on success, otherwise NULL.
*/
@@ -908,7 +911,29 @@ struct folio *folio_walk_start(struct folio_walk *fw,
pgd_t *pgdp;
p4d_t *p4dp;
- mmap_assert_locked(vma->vm_mm);
+ if (flags & FW_VMA_LOCKED) {
+ /*
+ * Lockless walk under the per-VMA lock instead of the mmap
+ * lock. The VMA lock keeps the VMA stable, but the page tables
+ * walked below it can still be freed concurrently: a munmap() or
+ * THP collapse of an adjacent region in the same mm can free a
+ * shared upper-level table, and collapse_huge_page() ->
+ * retract_page_tables() frees page tables of VMAs whose lock it
+ * does not hold. Page table freeing serializes against lockless
+ * walkers via tlb_remove_table_sync_one(), which IPIs and waits
+ * for every CPU to enable interrupts; an RCU read-side critical
+ * section does not block that IPI, so the caller must keep
+ * interrupts disabled across the whole walk, like gup_fast.
+ * Hugetlb (PMD sharing) maps page tables not covered by this
+ * VMA's lock and is not supported.
+ */
+ VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
+ VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
+ lockdep_assert_irqs_disabled();
+ vma_assert_locked(vma);
+ } else {
+ mmap_assert_locked(vma->vm_mm);
+ }
vma_pgtable_walk_begin(vma);
if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask
From: Rik van Riel @ 2026-06-25 1:50 UTC (permalink / raw)
To: linux-kernel
Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>
mm->context.untag_mask is written once, when LAM is enabled
(mm_enable_lam(), under mmap_write_lock and while the process is still
single-threaded), and is otherwise stable and never reverted.
untagged_addr_remote() reads it for a remote mm, and the new
untagged_addr_remote_unlocked() (used by the per-VMA-lock
access_remote_vm() fast path) reads it without the mmap lock.
The field is a single aligned word, so the hardware access itself cannot
tear, but annotate the reads and writes with READ_ONCE()/WRITE_ONCE() to
make the lockless access explicit and keep the compiler from reloading or
splitting it.
No functional change.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/mmu_context.h | 6 +++---
arch/x86/include/asm/uaccess_64.h | 2 +-
arch/x86/kernel/process_64.c | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ef5b507de34e..cee710f64658 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -100,18 +100,18 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
{
mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
- mm->context.untag_mask = oldmm->context.untag_mask;
+ WRITE_ONCE(mm->context.untag_mask, READ_ONCE(oldmm->context.untag_mask));
}
#define mm_untag_mask mm_untag_mask
static inline unsigned long mm_untag_mask(struct mm_struct *mm)
{
- return mm->context.untag_mask;
+ return READ_ONCE(mm->context.untag_mask);
}
static inline void mm_reset_untag_mask(struct mm_struct *mm)
{
- mm->context.untag_mask = -1UL;
+ WRITE_ONCE(mm->context.untag_mask, -1UL);
}
#define arch_pgtable_dma_compat arch_pgtable_dma_compat
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 20de34cc9aa6..4a52497ba6a1 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -43,7 +43,7 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
unsigned long addr)
{
mmap_assert_locked(mm);
- return addr & (mm)->context.untag_mask;
+ return addr & READ_ONCE((mm)->context.untag_mask);
}
#define untagged_addr_remote(mm, addr) ({ \
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d44afbe005bb..55096136de53 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -814,7 +814,7 @@ static void enable_lam_func(void *__mm)
static void mm_enable_lam(struct mm_struct *mm)
{
mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
- mm->context.untag_mask = ~GENMASK(62, 57);
+ WRITE_ONCE(mm->context.untag_mask, ~GENMASK(62, 57));
/*
* Even though the process must still be single-threaded at this
--
2.53.0-Meta
^ permalink raw reply related
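As a rough illustration of the pattern in the patch above, here is a userspace sketch. Everything prefixed toy_ is hypothetical, the READ_ONCE()/WRITE_ONCE() macros are simplified volatile-cast approximations of the kernel ones (assuming GCC or Clang for __typeof__), and GENMASK64() only mimics the kernel's GENMASK() for 64-bit values. It is not the kernel implementation.

```c
#include <stdint.h>

/* Simplified userspace approximations of the kernel macros (assumption). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Bits l..h set, in the spirit of the kernel's GENMASK(h, l). */
#define GENMASK64(h, l)  ((~UINT64_C(0) << (l)) & (~UINT64_C(0) >> (63 - (h))))

/* Hypothetical stand-in for mm->context. */
struct toy_mm_context {
	uint64_t untag_mask;
};

/* Mirrors mm_reset_untag_mask(): no tag bits are stripped. */
static inline void toy_reset_untag_mask(struct toy_mm_context *ctx)
{
	WRITE_ONCE(ctx->untag_mask, ~UINT64_C(0));
}

/* Mirrors the LAM_U57 case: bits 62..57 become ignorable tag bits. */
static inline void toy_enable_lam_u57(struct toy_mm_context *ctx)
{
	WRITE_ONCE(ctx->untag_mask, ~GENMASK64(62, 57));
}

/* One marked load of the mask, then a plain AND on the local value. */
static inline uint64_t toy_untagged_addr(struct toy_mm_context *ctx,
					 uint64_t addr)
{
	return addr & READ_ONCE(ctx->untag_mask);
}
```

The point of the sketch is only that the mask is loaded exactly once per untagging operation; it says nothing about the ordering guarantees a lockless reader may additionally need.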
* [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock
From: Rik van Riel @ 2026-06-25 1:50 UTC (permalink / raw)
To: linux-kernel
Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Suren Baghdasaryan, kernel-team
Sometimes processes can get stuck with the mmap_lock held for
a long time. This slows down, and can even prevent system monitoring
tools from assessing and logging the situation, because they themselves
end up getting stuck on the mmap_lock.
However, with the introduction of per-VMA locks, we can improve the
reliability of system monitoring, and generally speed up __access_remote_vm
under mmap_lock contention, by adding a fast path that does not require
the process-wide mmap_lock.
This fast path is only compiled in and used when it is safe to do so:
the kernel has per-VMA locks and RCU page table freeing, and the VMA
is not hugetlbfs, iomap, pfnmap, etc.
v2:
- simplify the code, which should be ok because these copies are < PAGE_SIZE
- clean up the code
- fix locking wrt tlb_remove_table_sync_one()
- hopefully address all the other comments
^ permalink raw reply
* Re: [PATCH] Docs/mm: fix documentation warning for GFP parameter in kmalloc_obj, kmalloc_objs and kmalloc_flex
From: Andrew Morton @ 2026-06-25 1:48 UTC (permalink / raw)
To: Jakov Novak
Cc: linux-mm, linux-kernel, Vlastimil Babka, Harry Yoo, Hao Li,
Christoph Lameter, David Rientjes, Roman Gushchin,
linux-kernel-mentees, Shuah Khan
In-Reply-To: <20260619113622.11712-1-jakovnovak30@gmail.com>
On Fri, 19 Jun 2026 13:36:22 +0200 Jakov Novak <jakovnovak30@gmail.com> wrote:
> Subject: [PATCH] Docs/mm: fix documentation warning for GFP parameter in kmalloc_obj, kmalloc_objs and kmalloc_flex
Thanks.
"mm/slab: ..." would be a better subject.
> Date: Fri, 19 Jun 2026 13:36:22 +0200
> X-Mailer: git-send-email 2.54.0
>
> Compiling the documentation currently gives the errors:
>
> WARNING: ./include/linux/slab.h:1100 Excess function parameter 'GFP' description in 'kmalloc_obj'
> WARNING: ./include/linux/slab.h:1112 Excess function parameter 'GFP' description in 'kmalloc_objs'
> WARNING: ./include/linux/slab.h:1127 Excess function parameter 'GFP' description in 'kmalloc_flex'
> WARNING: ./include/linux/slab.h:1100 Excess function parameter 'GFP' description in 'kmalloc_obj'
> WARNING: ./include/linux/slab.h:1112 Excess function parameter 'GFP' description in 'kmalloc_objs'
> WARNING: ./include/linux/slab.h:1127 Excess function parameter 'GFP' description in 'kmalloc_flex'
>
> This effectively omits the GFP parameter from the current kernel
> documentation. This patch marks the "..." parameter with the previous
> description of the GFP parameter along with an "optional" tag in
> parantheses.
"parentheses".
I'll assume that Vlastimil will be processing this patch.
^ permalink raw reply
* [PATCH 2/2] mm/page_owner: use memcg_data snapshot instead of PageMemcgKmem() to avoid TOCTOU VM_BUG_ON
From: Ye Liu @ 2026-06-25 1:47 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Ye Liu, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <20260625014708.87386-1-ye.liu@linux.dev>
print_page_owner_memcg() takes a snapshot of page->memcg_data via
READ_ONCE at the top of the function and guards against tail pages
and NULL memcg_data. However, at the end it calls PageMemcgKmem(page)
which internally calls folio_memcg_kmem() — and that function re-reads
folio->memcg_data and page->compound_head locklessly, wrapping both
in VM_BUG_ON assertions:
VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
If the page is concurrently freed and reallocated as a THP tail page
or a slab page between the initial guards and this final call, the
VM_BUG_ON assertions can fire on debug builds (CONFIG_DEBUG_VM=y),
causing a kernel panic.
Fix by reusing the memcg_data snapshot already taken at function entry
instead of calling PageMemcgKmem(), which is semantically equivalent:
PageMemcgKmem()->folio_memcg_kmem()->folio->memcg_data & MEMCG_DATA_KMEM.
This avoids both the TOCTOU window and the assertions entirely.
Signed-off-by: Ye Liu <ye.liu@linux.dev>
---
mm/page_owner.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 5c403bce35ce..b3252ebc0307 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -568,7 +568,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
cgroup_name(memcg->css.cgroup, name, sizeof(name));
ret += scnprintf(kbuf + ret, count - ret,
"Charged %sto %smemcg %s\n",
- PageMemcgKmem(page) ? "(via objcg) " : "",
+ (memcg_data & MEMCG_DATA_KMEM) ? "(via objcg) " : "",
online ? "" : "offline ",
name);
out_unlock:
--
2.43.0
^ permalink raw reply related
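The snapshot idea in the patch above can be sketched in userspace as follows. The struct, the toy_ names and the flag values are illustrative assumptions (the real flags are defined in include/linux/memcontrol.h), and a relaxed C11 atomic load stands in for READ_ONCE().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative flag bits; assumed values, not taken from the patch. */
#define TOY_DATA_OBJEXTS (1UL << 0)
#define TOY_DATA_KMEM    (1UL << 1)

/* Hypothetical stand-in for the memcg_data word of a page. */
struct toy_page {
	_Atomic unsigned long memcg_data;
};

/*
 * Read memcg_data once and make every later decision from that local
 * copy. Returns false if the page does not look charged; otherwise sets
 * *via_objcg from the same snapshot that passed the guards.
 */
static bool toy_charge_info(struct toy_page *page, bool *via_objcg)
{
	unsigned long memcg_data =
		atomic_load_explicit(&page->memcg_data, memory_order_relaxed);

	if (!memcg_data || (memcg_data & TOY_DATA_OBJEXTS))
		return false;

	/* No second load: a concurrent writer cannot change the answer. */
	*via_objcg = (memcg_data & TOY_DATA_KMEM) != 0;
	return true;
}
```

Because the guard and the use share one value, the result is self-consistent even if the page is reused in between, which is the property the patch relies on.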
* [PATCH 1/2] mm/page_owner: clamp skip_buddy_pages() PFN advance at MAX_ORDER_NR_PAGES boundary
From: Ye Liu @ 2026-06-25 1:47 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Ye Liu, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <20260625014708.87386-1-ye.liu@linux.dev>
The lockless buddy_order_unsafe() read can return a garbage order
value if the page is concurrently allocated between the PageBuddy
check and the private read. If this bogus order is <= MAX_PAGE_ORDER,
skip_buddy_pages() would arbitrarily advance the PFN, potentially
jumping past a MAX_ORDER_NR_PAGES boundary whose pfn_valid() check
would have caught an offline memory section.
In read_page_owner(), which relies solely on boundary-aligned
pfn_valid() to guard pfn_to_page(), skipping the boundary could
cause pfn_to_page() to access an unmapped mem_section.
Clamp the advance so it never crosses the next MAX_ORDER_NR_PAGES
boundary. This is safe for all three callers: the pageblock-iterating
ones already handle boundary transitions in their outer loops, and
for read_page_owner() the worst case is one extra PageBuddy check per
1024 pages for a huge buddy block straddling the boundary.
Signed-off-by: Ye Liu <ye.liu@linux.dev>
---
mm/page_owner.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/mm/page_owner.c b/mm/page_owner.c
index ec9600025127..5c403bce35ce 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -435,6 +435,12 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old)
* to skip less than the full buddy block, but that is acceptable for page owner
* iteration purposes.
*
+ * The lockless read of buddy_order_unsafe() can also return a garbage order if
+ * the page is concurrently allocated and PageBuddy is cleared between the check
+ * and the read. Clamp the advance at the next MAX_ORDER_NR_PAGES boundary so
+ * that a bogus order cannot carry @pfn into an unvalidated memory section,
+ * which would break callers that rely on boundary-aligned pfn_valid() checks.
+ *
* Return: true if the page was skipped (caller should continue its loop),
* false if the page is not a buddy page and should be processed normally.
*/
@@ -446,8 +452,12 @@ static inline bool skip_buddy_pages(unsigned long *pfn, struct page *page)
return false;
order = buddy_order_unsafe(page);
- if (order <= MAX_PAGE_ORDER)
- *pfn += (1UL << order) - 1;
+ if (order <= MAX_PAGE_ORDER) {
+ unsigned long new_pfn = *pfn + (1UL << order);
+ unsigned long boundary = ALIGN(*pfn + 1, MAX_ORDER_NR_PAGES);
+
+ *pfn = min(new_pfn, boundary) - 1;
+ }
return true;
}
--
2.43.0
^ permalink raw reply related
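The clamping arithmetic from the patch above can be checked in isolation with a small userspace sketch. The toy_ names are hypothetical, and MAX_PAGE_ORDER = 10 (so MAX_ORDER_NR_PAGES = 1024) is an assumed configuration chosen to match the "1024 pages" figure in the commit message.

```c
/* Assumed configuration: MAX_PAGE_ORDER = 10, MAX_ORDER_NR_PAGES = 1024. */
#define TOY_MAX_PAGE_ORDER     10U
#define TOY_MAX_ORDER_NR_PAGES (1UL << TOY_MAX_PAGE_ORDER)

/* Round x up to a multiple of a; a must be a power of two. */
#define TOY_ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/*
 * Return the PFN the caller's loop should resume from (before its own
 * pfn++), given the current PFN and a possibly bogus buddy order.
 */
static unsigned long toy_skip_buddy(unsigned long pfn, unsigned int order)
{
	unsigned long new_pfn, boundary;

	if (order > TOY_MAX_PAGE_ORDER)
		return pfn;	/* implausible order: skip this page only */

	new_pfn = pfn + (1UL << order);
	boundary = TOY_ALIGN(pfn + 1, TOY_MAX_ORDER_NR_PAGES);

	/* Never step past the next boundary, where pfn_valid() is rechecked. */
	return (new_pfn < boundary ? new_pfn : boundary) - 1;
}
```

For example, at pfn 1000 a reported order of 10 would previously have moved the cursor to 2023; with the clamp it stops at 1023, so the next iteration lands on the 1024 boundary.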
* [PATCH 0/2] mm/page_owner: fix TOCTOU races in lockless page state reading
From: Ye Liu @ 2026-06-25 1:47 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Ye Liu, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, linux-mm, linux-kernel
Fix two TOCTOU races found during review of [1].
page_owner reads page state locklessly by design. In two places the
code reads the same metadata twice — once as a guard, then again as
a use — and the page can be concurrently reallocated between the two:
Patch 1: buddy_order_unsafe() in skip_buddy_pages() can return garbage
if the page is allocated between PageBuddy() and the private read,
causing the PFN to skip past a pfn_valid() boundary. Clamp the
advance at MAX_ORDER_NR_PAGES.
Patch 2: PageMemcgKmem() in print_page_owner_memcg() re-reads
folio->memcg_data and triggers VM_BUG_ON assertions if the page
became a tail page or slab page. Use the snapshot taken at entry.
[1] https://lore.kernel.org/all/20260623065234.31866-2-ye.liu@linux.dev/
[2] https://sashiko.dev/#/patchset/20260623065234.31866-2-ye.liu@linux.dev
Ye Liu (2):
mm/page_owner: clamp skip_buddy_pages() PFN advance at
MAX_ORDER_NR_PAGES boundary
mm/page_owner: use memcg_data snapshot instead of PageMemcgKmem() to
avoid TOCTOU VM_BUG_ON
mm/page_owner.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
--
2.43.0
^ permalink raw reply
* Re: [RFC PATCH] reserve_mem: add support for static memory
From: Shyam Saini @ 2026-06-25 1:42 UTC (permalink / raw)
To: Pratyush Yadav
Cc: linux-mm, linux-doc, linux-kernel, rppt, akpm, kees, tony.luck,
gpiccoli, bp, rdunlap, peterz, feng.tang, dapeng1.mi, elver,
enelsonmoore, kuba, lirongqing, ebiggers
In-Reply-To: <2vxzpl1hmmgn.fsf@kernel.org>
Hi,
On 23 Jun 2026 15:10, Pratyush Yadav wrote:
> On Thu, Jun 18 2026, Shyam Saini wrote:
>
> > reserve_mem relies on dynamic memory allocation, this limits the
> > usecase where memory and its address is required to be preserved
> > across the boots. Eg: ramoops memory reservation on ACPI platforms
> >
> > So add support to pass a pre-determined static address and reserve
> > memory at this specified address. This enables use case like ramoops
> > on ACPI platforms to reliably access ramoops region across the boots.
>
> Doesn't memmap= do exactly this? How is this different?
Yes, but memmap= is not available on ARM platforms.
There was an unsuccessful attempt [1] to add memmap support for ARM.
> I always thought the point of reserve_mem was that you _don't_ have to
> provide an explicit address, one is chosen for your machine
> automatically.
ok, but I am not sure if that was the only intent.
> >
> > Also skip parsing of "align" parameter when static address is passed.
> >
> > Example syntax for static address
> > reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops
> >
> > Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
> [...]
>
> --
> Regards,
> Pratyush Yadav
By the way, RFC v2 of this change has already been posted [2].
Thanks,
Shyam
[1] https://lkml.kernel.org/lkml/20201118063314.22940-1-song.bao.hua@hisilicon.com/T/
[2] https://lore.kernel.org/lkml/20260619062331.348789-1-shyamsaini@linux.microsoft.com/
^ permalink raw reply
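To make the proposed size@address:label syntax concrete, here is a hypothetical userspace parser for strings such as 4M@0x1E0000000:oops. It is not the code from the RFC: the toy_ names, the 16-byte label limit and the use of strtoull() in place of the kernel's memparse() are all assumptions made for illustration.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing "size@address:label". */
struct toy_reserve {
	unsigned long long size;
	unsigned long long addr;
	char label[16];		/* assumed limit, for the sketch only */
};

static int toy_parse_reserve(const char *arg, struct toy_reserve *out)
{
	char *end;
	size_t len;

	/* Size with an optional K/M/G suffix, e.g. "4M". */
	out->size = strtoull(arg, &end, 0);
	if (end == arg)
		return -EINVAL;
	switch (*end) {
	case 'G': case 'g': out->size <<= 10; /* fall through */
	case 'M': case 'm': out->size <<= 10; /* fall through */
	case 'K': case 'k': out->size <<= 10; end++; break;
	default: break;
	}
	if (out->size == 0 || *end != '@')
		return -EINVAL;

	/* Static address; base 0 accepts the 0x prefix. */
	arg = end + 1;
	out->addr = strtoull(arg, &end, 0);
	if (end == arg || *end != ':')
		return -EINVAL;

	/* Non-empty label that fits the buffer. */
	len = strlen(end + 1);
	if (len == 0 || len >= sizeof(out->label))
		return -EINVAL;
	memcpy(out->label, end + 1, len + 1);
	return 0;
}
```

A real implementation would additionally have to validate that the requested range is usable memory and not already reserved, which is the concern raised elsewhere in this thread.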
* Re: [PATCH 0/8] blk-cgroup: remove queue_lock nesting from blkcg paths
From: yu kuai @ 2026-06-25 1:42 UTC (permalink / raw)
To: Jens Axboe, yukuai, nilay, tom.leiming, bvanassche, tj, josef
Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <34d48fb5-4952-4a48-b92a-f189bc3edd0b@kernel.dk>
Hi,
On 2026/6/24 20:43, Jens Axboe wrote:
> On 6/24/26 12:57 AM, yu kuai wrote:
>> Friendly ping ...
>>
>> This set can still be applied cleanly for block-7.2 branch.
> Not sure how you checked that, because patch 3 very much needs some
> manual attention to get applied. I have applied it now.
Thanks!
This was built on top of my other set:
blk-cgroup: fix blkg list and policy data races
I'll rebase and resend this set :)
>
--
Thanks,
Kuai
^ permalink raw reply
* Re: [PATCH v8 09/46] KVM: guest_memfd: Introduce function to check GFN private/shared status
From: Binbin Wu @ 2026-06-25 1:39 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <CAEvNRgG-WDzHp-15Mig4hiU5Dag0pFCu70-R-9b=PkD69W=ZMg@mail.gmail.com>
On 6/24/2026 10:38 PM, Ackerley Tng wrote:
> Binbin Wu <binbin.wu@linux.intel.com> writes:
>
>>
>> [...snip...]
>>
>>> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
>>> +{
>>> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>>> + struct inode *inode;
>>> +
>>> + /*
>>> + * If this gfn has no associated memslot, there's no chance of the gfn
>>> + * being backed by private memory, since guest_memfd must be used for
>>> + * private memory,
>>
>> "guest_memfd must be used for private memory" is a bit confusing to me.
>>
>
> Hmm good point. Is the source of confusion that guest_memfd can be used
> for both shared and private memory?
Yes.
>
> Perhaps this can be rephrased as:
>
> guest_memfd is the only provider of private memory and guest_memfd must
> be used with a memslot, hence if there's no associated memslot, there's
> no chance of this gfn being private.
LGTM.
>
>>> and guest_memfd must be associated with some memslot.
>>> + */
>>> + if (!slot)
>>> + return 0;
>>> +
>>>
>>> [...snip...]
>>>
>
^ permalink raw reply
* Re: [PATCH v2] mm: avoid KCSAN false positive in memdesc_nid()
From: Hui Zhu @ 2026-06-25 1:32 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel, Hui Zhu
In-Reply-To: <20260624140104.eacc15e291eec123bc7b3349@linux-foundation.org>
>
> On Tue, 23 Jun 2026 16:44:32 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:
>
> >
> > From: Hui Zhu <zhuhui@kylinos.cn>
> >
> > KCSAN reports a data race between page_to_nid()/folio_nid() reading
> > page->flags and folio_trylock()/folio_lock() concurrently doing
> > test_and_set_bit_lock(PG_locked, ...) on the same word, e.g.:
> >
> > BUG: KCSAN: data-race in __lruvec_stat_mod_folio / shmem_get_folio_gfp
> >
> > The node id occupies a fixed bit-range of page->flags that is set
> > once at page init and never modified afterwards, so it can never
> > overlap with the low PG_locked/PG_waiters bits touched by the folio
> > lock path.
> >
> > Use ASSERT_EXCLUSIVE_BITS() in memdesc_nid() to scope the exemption
> > to just the node-id bits, consistent with how memdesc_zonenum()
> > already handles the same class of race for the zone-id bits.
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2290,6 +2290,7 @@ int memdesc_nid(memdesc_flags_t mdf);
> > #else
> > static inline int memdesc_nid(memdesc_flags_t mdf)
> > {
> > + ASSERT_EXCLUSIVE_BITS(mdf.f, NODES_MASK << NODES_PGSHIFT);
> > return (mdf.f >> NODES_PGSHIFT) & NODES_MASK;
> > }
> > #endif
> >
> It seems weird to be doing this against a local variable within a
> random function, seemingly unrelated to the problematic functions which
> you've identified.
>
> Seems that it fooled Sashiko:
> https://sashiko.dev/#/patchset/20260623084432.701120-1-hui.zhu@linux.dev
>
> I'm wondering what the heck is going on here?
>
Hi Andrew,
Good catch. ASSERT_EXCLUSIVE_BITS(mdf.f, ...) is checking a by-value
copy of the flags word inside memdesc_nid(), not the actual shared
page->flags/folio->flags being modified by folio_trylock(). Whatever
made it appear to suppress the KCSAN report is likely an artifact of
inlining/codegen (kcsan_atomic_next() happening to land on the real
load after inlining), not a principled fix - so Sashiko's pass is
not reassuring here.
I'll move the assertion to where the real dereference happens (at
the page_to_nid()/folio_nid() call sites) instead of inside the
by-value helper. This probably also applies to the existing
memdesc_zonenum() pattern - is that one actually verified to work,
or does it have the same issue?
Best,
Hui
^ permalink raw reply
* [PATCH] mm/damon/paddr: remove folio_put from damon_pa_invalid_damos_folio
From: Yu Qin @ 2026-06-25 1:22 UTC (permalink / raw)
To: sj; +Cc: akpm, damon, linux-mm, Yu Qin
This boolean function calls folio_put() implicitly. Remove the put and let
callers handle it explicitly, making the get/put pairing clearer.
Signed-off-by: Yu Qin <qin.yuA@h3c.com>
---
mm/damon/paddr.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 85cd64a55..f45c7939a 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -313,15 +313,12 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
return scheme->ops_filters_default_reject;
}
-static bool damon_pa_invalid_damos_folio(struct folio *folio, struct damos *s)
+static inline bool damon_pa_invalid_damos_folio(struct folio *folio,
+ struct damos *s)
{
if (!folio)
return true;
- if (folio == s->last_applied) {
- folio_put(folio);
- return true;
- }
- return false;
+ return folio == s->last_applied;
}
static unsigned long damon_pa_pageout(struct damon_region *r,
@@ -353,6 +350,8 @@ static unsigned long damon_pa_pageout(struct damon_region *r,
while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
folio = damon_get_folio(PHYS_PFN(addr));
if (damon_pa_invalid_damos_folio(folio, s)) {
+ if (folio)
+ folio_put(folio);
addr += PAGE_SIZE;
continue;
}
@@ -394,6 +393,8 @@ static inline unsigned long damon_pa_de_activate(
while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
folio = damon_get_folio(PHYS_PFN(addr));
if (damon_pa_invalid_damos_folio(folio, s)) {
+ if (folio)
+ folio_put(folio);
addr += PAGE_SIZE;
continue;
}
@@ -442,6 +443,8 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
folio = damon_get_folio(PHYS_PFN(addr));
if (damon_pa_invalid_damos_folio(folio, s)) {
+ if (folio)
+ folio_put(folio);
addr += PAGE_SIZE;
continue;
}
@@ -478,6 +481,8 @@ static unsigned long damon_pa_stat(struct damon_region *r,
while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
folio = damon_get_folio(PHYS_PFN(addr));
if (damon_pa_invalid_damos_folio(folio, s)) {
+ if (folio)
+ folio_put(folio);
addr += PAGE_SIZE;
continue;
}
--
2.43.0
^ permalink raw reply related
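The ownership rule the patch above aims for (a predicate that only tests, and a caller that owns both the get and the put) can be modelled in userspace roughly as follows. All toy_ names are hypothetical and a plain int stands in for the folio reference count.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_folio  { int refcount; };
struct toy_scheme { struct toy_folio *last_applied; };

/* Stand-in for damon_get_folio(): takes a reference when non-NULL. */
static struct toy_folio *toy_get_folio(struct toy_folio *f)
{
	if (f)
		f->refcount++;
	return f;
}

static void toy_folio_put(struct toy_folio *f)
{
	f->refcount--;
}

/* Pure predicate: no side effects on the reference count. */
static bool toy_invalid_folio(const struct toy_folio *folio,
			      const struct toy_scheme *s)
{
	return !folio || folio == s->last_applied;
}

/* Caller pairs every successful get with exactly one put. */
static bool toy_visit(struct toy_folio *candidate, struct toy_scheme *s)
{
	struct toy_folio *folio = toy_get_folio(candidate);

	if (toy_invalid_folio(folio, s)) {
		if (folio)
			toy_folio_put(folio);
		return false;
	}
	s->last_applied = folio;	/* "apply" the scheme */
	toy_folio_put(folio);
	return true;
}
```

Whether the predicate returns true or false, the reference count is unchanged after toy_visit() returns, which is the invariant that is harder to see when the put is hidden inside the predicate.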
* Re: [RFC v2 PATCH] reserve_mem: add support for static memory
From: Shyam Saini @ 2026-06-25 1:22 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, linux-doc, linux-kernel, akpm, tgopinath, bboscaccy,
kees, tony.luck, gpiccoli, bp, rdunlap, peterz, feng.tang,
dapeng1.mi, elver, enelsonmoore, kuba, lirongqing, ebiggers
In-Reply-To: <aje-nY6QbwZP9XLG@kernel.org>
Hi Mike,
On 21 Jun 2026 13:36, Mike Rapoport wrote:
> On Thu, Jun 18, 2026 at 11:23:31PM -0700, Shyam Saini wrote:
> > reserve_mem relies on dynamic memory allocation, this limits the
> > usecase where memory is required to be preserved across the boots.
> > Eg: ramoops memory reservation on ACPI platforms
> >
> > So add support to pass a pre-determined static address and reserve
> > memory at a specified location. This enables use case like ramoops
> > on ACPI platforms to reliably access ramoops region with previous
> > boot logs.
> >
> > Also skip the parsing of <align> when static address is passed.
> >
> > Example syntax for static address
> > reserve_mem=4M@0x1E0000000:oops
>
> reserve_mem is best effort by design because such hacks as well as memmap=
> cannot guarantee this memory is actually free.
>
> If you want to preserve ramoops reliably, use KHO with reserve_mem.
> The first kernel will allocate memory, this memory will be preserved by KHO
> and could be picked up by the second kernel.
OK. On ARM64 DTS systems, we can reserve ramoops memory in the device tree
during a warm reboot.
For an equivalent ARM64 ACPI platform, what is the recommended way to reserve
and preserve that memory across boots?
> > Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
> > ---
> > v1: https://lore.kernel.org/lkml/0eaf3be2-5121-48b7-aeed-196405c0a480@infradead.org/
> > v2: Fix code logic and incorporate Randy's suggestion
> > ---
> > .../admin-guide/kernel-parameters.txt | 15 ++++++
> > mm/memblock.c | 47 +++++++++++++------
> > 2 files changed, 47 insertions(+), 15 deletions(-)
>
> --
> Sincerely yours,
> Mike.
Thanks,
Shyam
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-25 1:21 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgHYTFnHbsLLgMTCSitmnp1_j9Pomikm9qmpGTh1w8YE5Q@mail.gmail.com>
On Wed, Jun 24, 2026 at 05:05:44PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
>
> >
> > [...snip...]
> >
> >>
> >> #ifdef kvm_arch_has_private_mem
> >> -bool __ro_after_init gmem_in_place_conversion = false;
> >> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
> >> +module_param(gmem_in_place_conversion, bool, 0444);
> >
> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > MMAP flag. In such cases, shared memory is allocated from different backends.
> > This means this module parameter only enables per-gmem memory attribute and does
> > not guarantee that gmem in-place conversion will actually occur.
> >
> > To avoid confusion, could we rename this module parameter to something more
> > accurate, such as gmem_memory_attribute?
> >
>
> I asked Sean about this after getting some fixes off list. Sean said
> gmem_in_place_conversion is named for a host admin to use, and something
> like gmem_memory_attributes is too much implementation details for the
> admin.
Thanks for this background.
Some more context on why I'm asking:
Currently, I'm testing TDX huge pages with the following two gmem components:
1. The gmem memory attribute in this gmem in-place conversion v8.
2. The gmem 2MB from buddy allocator. (for development/testing only).
The gmem 2MB from buddy allocator allocates 2MB folios from buddy for private
memory, while shared memory is allocated from a different backend.
(To avoid fragmentation, only private mappings are split during private-to-shared
conversions. In this approach, the 2MB folios are always retained in the gmem
inode filemap cache without splitting.)
Since shared memory is not allocated from gmem, there are no in-place conversions.
The reason I'm using "gmem memory attribute" is that the per-VM attribute is
being deprecated, as suggested by Sean [1].
Besides my current usage, there may be other scenarios where gmem memory
attributes are preferred without allocating shared memory from gmem
(e.g., PAGE.ADD from a temporary extra shared source memory).
For such use cases, I'm concerned that admins may find it confusing if they
enable gmem_in_place_conversion but still observe extra memory consumption for
shared memory.
[1] https://lore.kernel.org/kvm/aWmEegVP_A613WIr@google.com/
> Sean, would you reconsider since Yan also asked? If the admin compiled
> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> admin would also be able to use a param like gmem_memory_attributes?
>
> There's the additional benefit that the similar naming aids in
> understanding for both the admin and software engineers.
>
> Either way, in the next revision, I'll also add this documentation for
> this module_param:
>
> Setting the module parameter gmem_in_place_conversion to true will
> enable the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disables
> the KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
> true, the private/shared attribute will be tracked per-guest_memfd
> instead of per-VM.
>
> Let me know what y'all think of the wording!
>
> >>
> >> [...snip...]
> >>
^ permalink raw reply
* Re: [RFC v2 PATCH] reserve_mem: add support for static memory
From: Shyam Saini @ 2026-06-25 1:09 UTC (permalink / raw)
To: Randy Dunlap
Cc: linux-mm, linux-doc, linux-kernel, rppt, akpm, tgopinath,
bboscaccy, kees, tony.luck, gpiccoli, bp, peterz, feng.tang,
dapeng1.mi, elver, enelsonmoore, kuba, lirongqing, ebiggers
In-Reply-To: <3e206be0-3ef4-468f-b7e7-7bc03848b0d0@infradead.org>
Hi,
On 19 Jun 2026 11:35, Randy Dunlap wrote:
> Hi,
>
> On 6/18/26 11:23 PM, Shyam Saini wrote:
> > reserve_mem relies on dynamic memory allocation, this limits the
> > usecase where memory is required to be preserved across the boots.
> > Eg: ramoops memory reservation on ACPI platforms
> >
> > So add support to pass a pre-determined static address and reserve
> > memory at a specified location. This enables use case like ramoops
> > on ACPI platforms to reliably access ramoops region with previous
> > boot logs.
> >
> > Also skip the parsing of <align> when static address is passed.
> >
> > Example syntax for static address
> > reserve_mem=4M@0x1E0000000:oops
> >
> > Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
> > ---
> > v1: https://lore.kernel.org/lkml/0eaf3be2-5121-48b7-aeed-196405c0a480@infradead.org/
> > v2: Fix code logic and incorporate Randy's suggestion
>
> OK, you fixed a few typos.
> There are some bigger things that you seem to have ignored.
Thanks for calling this out. You are right that I did not address all
comments in v2.
My goal for v2 was to quickly fix the core logic issue and keep
discussion focused on the reserve_mem static address direction in this
RFC cycle. I should have stated that clearly.
> > ---
> > .../admin-guide/kernel-parameters.txt | 15 ++++++
> > mm/memblock.c | 47 +++++++++++++------
> > 2 files changed, 47 insertions(+), 15 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index b5493a7f8f228..7e0baca564b97 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -6563,6 +6563,21 @@ Kernel parameters
> >
> > reserve_mem=12M:4096:oops ramoops.mem_name=oops
> >
> > + reserve_mem= [RAM]
>
> [RAM] means "RAM disk support is enabled."
> Is that the case here? Is "reserve_mem=" only for use in case
> RAM disk support is enabled?
>
> ISTM that you need a new designator instead of RAM...
> or overload the use of RAM by adding more info near the top of
> Documentation/admin-guide/kernel-parameters.txt.
I will address these in a future iteration.
>
> > + Format: nn[KMG]:<@offset>:<label>
> > + Reserve physical memory at predetermined location and label it with
> > + a name that other subsystems can use to access it. This is typically
> > + used for systems that do not wipe the RAM, and this command
> > + line will try to reserve the same physical memory on
> > + soft reboots. Note, it is guaranteed to be the same
> > + location unless some other early allocation, e.g.: crashkernel=256M
> > + (without static address) is reserved or overlaps this region.
> > +
> > + The format is size:offset:label for example, to request
> > + 4 megabytes for ramoops at 0x1E0000000:
> > +
> > + reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops
> > +
> > reservetop= [X86-32,EARLY]
> > Format: nn[KMG]
> > Reserves a hole at the top of the kernel virtual
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 6349c48154f4b..c76cefa0a8a83 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -2721,6 +2721,7 @@ static int __init reserve_mem(char *p)
> > char *name;
> > char *oldp;
> > int len;
> > + bool addr_is_static = false;
> >
> > if (!p)
> > goto err_param;
> > @@ -2736,19 +2737,27 @@ static int __init reserve_mem(char *p)
> > if (!size || p == oldp)
> > goto err_param;
> >
> > - if (*p != ':')
> > - goto err_param;
> > + /* parse the static memory address */
> > + if (*p == '@') {
> > + start = memparse(p+1, &p);
> > + addr_is_static = true;
> > + }
> >
> > - align = memparse(p+1, &p);
> > if (*p != ':')
> > goto err_param;
> >
> > - /*
> > - * memblock_phys_alloc() doesn't like a zero size align,
> > - * but it is OK for this command to have it.
> > - */
> > - if (align < SMP_CACHE_BYTES)
> > - align = SMP_CACHE_BYTES;
> > + if (!addr_is_static) {
> > + align = memparse(p+1, &p);
> > + if (*p != ':')
> > + goto err_param;
> > +
> > + /*
> > + * memblock_phys_alloc() doesn't like a zero size align,
> > + * but it is OK for this command to have it.
> > + */
> > + if (align < SMP_CACHE_BYTES)
> > + align = SMP_CACHE_BYTES;
> > + }
> >
> > name = p + 1;
> > len = strlen(name);
> > @@ -2772,14 +2781,22 @@ static int __init reserve_mem(char *p)
> > }
> >
> > /* Pick previous allocations up from KHO if available */
> > - if (reserve_mem_kho_revive(name, size, align))
> > + if (!addr_is_static && reserve_mem_kho_revive(name, size, align))
> > return 1;
> >
> > - /* TODO: Allocation must be outside of scratch region */
> > - start = memblock_phys_alloc(size, align);
> > - if (!start) {
> > - pr_err("reserve_mem: memblock allocation failed\n");
> > - return -ENOMEM;
>
> return 1;
>
> > + if (addr_is_static) {
> > + if (memblock_reserve(start, size)) {
> > + pr_err("reserve_mem: memblock reservation failed\n");
> > + return -ENOMEM;
>
> return 1;
>
> > + }
> > +
> > + } else {
> > + /* TODO: Allocation must be outside of scratch region */
> > + start = memblock_phys_alloc(size, align);
> > + if (!start) {
> > + pr_err("reserve_mem: memblock allocation failed\n");
> > + return -ENOMEM;
>
> return 1;
>
> > + }
> > }
> >
> > reserved_mem_add(start, size, name);
>
>
> __setup() functions return 1 for "yes, I recognized this string/option
> and attempted to handle it" or 0 for "This string/option is meaningless."
> There is no "return -Eerror".
> If you need that, you could consider using early_param() [see
> <linux/init.h>].
>
Same for this concern; I will address it in the next iteration.
Thanks,
Shyam
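
For readers following the thread, here is a minimal userspace sketch of
the two syntaxes discussed above: the existing size:align:label form and
the proposed size@address:label form. It is not the patch itself.
parse_size() is a hypothetical stand-in for the kernel's memparse() that
only handles K/M/G suffixes, and struct reserve_req and
parse_reserve_mem() are names invented for this illustration.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the kernel's memparse(): a number in any
 * base accepted by strtoull(), followed by an optional K/M/G suffix.
 */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/* Illustrative result type, not a kernel structure. */
struct reserve_req {
	unsigned long long size;
	unsigned long long start;	/* meaningful only if addr_is_static */
	unsigned long long align;	/* meaningful only if !addr_is_static */
	bool addr_is_static;
	const char *name;
};

/* Returns true if arg is well formed, false otherwise. */
static bool parse_reserve_mem(const char *arg, struct reserve_req *req)
{
	const char *oldp = arg;
	char *p;

	if (!arg)
		return false;

	memset(req, 0, sizeof(*req));

	req->size = parse_size(arg, &p);
	if (!req->size || p == oldp)
		return false;

	if (*p == '@') {
		/* Static form: size@address:label, no <align> field. */
		oldp = p + 1;
		req->start = parse_size(p + 1, &p);
		if (p == oldp)
			return false;	/* '@' with no address after it */
		req->addr_is_static = true;
	} else {
		/* Dynamic form: size:align:label. */
		if (*p != ':')
			return false;
		req->align = parse_size(p + 1, &p);
	}

	if (*p != ':' || p[1] == '\0')
		return false;

	req->name = p + 1;
	return true;
}
```

Unlike the quoted hunk, this sketch also rejects an '@' that is not
followed by an address. That extra check is an assumption of the sketch
and not something the posted patch does.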
* Re: [PATCH] fs/proc: fix KPF_KSM reported for all anonymous pages
From: Andrew Morton @ 2026-06-25 1:00 UTC (permalink / raw)
To: Jinjiang Tu
Cc: David Hildenbrand (Arm), ziy, luizcap, willy, linmiaohe,
svetly.todorov, xu.xin16, chengming.zhou, linux-fsdevel, linux-mm,
wangkefeng.wang, sunnanyong
In-Reply-To: <601fb5dd-18e1-4a6c-bc99-dc2a655240e2@huawei.com>
On Tue, 23 Jun 2026 09:37:57 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:
> > This only affects /proc/kpageflags (well, and hwpoison inject on a weird testing
> > interface), so it's not really that relevant for real workloads (debugging and
> > testing).
> >
> > So not sure whether we should CC:stable. Likely not.
>
> /proc/kpageflags is generally used only for analysis and is unlikely to be
> used in production environments. I found this issue while analyzing which
> allocation stacks produced the pfns that were KSM-merged. So I think it's
> unnecessary to CC:stable.
Well, it's a bug. The fix is super-simple so I think it's reasonable
to feed it back to users of earlier kernels.
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Sean Christopherson @ 2026-06-25 0:41 UTC (permalink / raw)
To: Ackerley Tng
Cc: Yan Zhao, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgHYTFnHbsLLgMTCSitmnp1_j9Pomikm9qmpGTh1w8YE5Q@mail.gmail.com>
On Wed, Jun 24, 2026, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > MMAP flag. In such cases, shared memory is allocated from different backends.
> > This means this module parameter only enables per-gmem memory attribute and does
> > not guarantee that gmem in-place conversion will actually occur.
KVM module params are pretty much always about what KVM supports, not what is
guaranteed to happen.
- enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
because maybe the guest never accesses emulated MMIO.
- enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
not to advertise one.
- and so on and so forth...
Yes, there's a small mental jump to get from "KVM supports in-place conversion"
to "I need to set memory attributes on the guest_memfd instance, not the VM",
but I don't see that as a big hurdle, certainly not in the long term. And once
the VMM code is written, I really do think most people are going to care about
whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
> > To avoid confusion, could we rename this module parameter to something more
> > accurate, such as gmem_memory_attribute?
>
> I asked Sean about this after getting some fixes off list. Sean said
> gmem_in_place_conversion is named for a host admin to use, and something
> like gmem_memory_attributes exposes too much implementation detail to
> the admin.
>
> Sean, would you reconsider since Yan also asked? If the admin compiled
> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> admin would also be able to use a param like gmem_memory_attributes?
No, because it's not all memory attributes, it's very specifically the PRIVATE
attribute that will get moved to guest_memfd. I don't want to pick a name that
will become stale and confusing when RWX attributes come along. The RWX bits
will be per-VM, while PRIVATE will be per-guest_memfd.
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-25 0:35 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgE8HZDOnexMJeim6TjmxGG1AUXFY2+HH1YyKB=aM6D-DQ@mail.gmail.com>
On Wed, Jun 24, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Thu, Jun 18, 2026, Ackerley Tng wrote:
> >> When checking if a guest_memfd folio is safe for conversion, its refcount
> >> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> >> temporarily increases its refcount.
> >
> > Under what circumstances does this happen,
>
> It happened 100% of the time in selftests. Perhaps it's because in the
> selftests the pages are almost always freshly allocated and so the
> lru_add fbatch isn't full yet? (and that the host isn't super busy so
> lru_add fbatch doesn't get drained yet).
I chatted with Ackerley about this. What I wanted to understand is why guest_memfd
pages were getting put onto per-CPU batches for lru_add(), given that guest_memfd
pages are unevictable. The answer (assuming I read the code right) is that
lruvec_add_folio() updates stats and other per-lru metadata for the unevictable
lru, and does so under a per-lru lock. I.e. we don't want to skip that stuff
entirely.
One thought I had, to avoid the IPIs that draining all per-CPU caches requires,
was to disallow putting guest_memfd pages in folio batches, e.g. by hacking
something into folio_may_be_lru_cached(). But due to taking a per-lru lock,
that would penalize the relatively hot path and definitely common operation of
faulting in guest memory. On the other hand, memory conversion is already a
relatively slow operation and is relatively uncommon compared to page faults
(and likely very uncommon for real world setups). I.e. having to drain all
caches if conversion isn't safe penalizes a relatively slow, relatively uncommon
path.
If we're concerned about noisy neighbor problems, or outright abuse, I think a
simple (per process?) ratelimit would suffice. But it's not clear to me that we
even need that, because there are already many flows in the kernel that allow
blasting IPIs without too much effort.
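
To make the ratelimit idea concrete, here is a minimal userspace sketch
of a fixed-window limiter (at most "burst" events per interval). It is
only an illustration of the concept. struct drain_ratelimit and
drain_allowed() are hypothetical names, and nothing in this thread says
KVM would implement it this way or reuse an existing kernel helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-process limiter state, for illustration only. */
struct drain_ratelimit {
	uint64_t interval_ns;		/* length of one window */
	unsigned int burst;		/* events allowed per window */
	uint64_t window_start_ns;	/* start of the current window */
	unsigned int used;		/* events consumed in this window */
};

/*
 * Returns true if the caller may trigger another "drain all per-CPU
 * caches" operation at time now_ns, false if it should back off.
 */
static bool drain_allowed(struct drain_ratelimit *rl, uint64_t now_ns)
{
	/* Open a new window once the current one has expired. */
	if (now_ns - rl->window_start_ns >= rl->interval_ns) {
		rl->window_start_ns = now_ns;
		rl->used = 0;
	}

	if (rl->used >= rl->burst)
		return false;

	rl->used++;
	return true;
}
```

A caller that gets false back could fail the conversion with a
retryable error or sleep until the window expires; which of those is
appropriate is left open here.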