Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: JP Kobryn @ 2026-06-10 19:52 UTC (permalink / raw)
  To: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park
  Cc: linux-kernel, linux-trace-kernel

LRU add batches can be drained before they reach capacity. This can be a
source of LRU lock contention, but it is not currently possible to
attribute these drains to callers with existing tracepoints.

Add mm_lru_add_drain to report the CPU and lru_add batch count when an
lru_add batch is drained. This allows tracing to distinguish full drains
from partial drains and attribute them to the calling stack.

Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
whether they set the force flag for all CPUs. The tracepoint resembles
the signature of the enclosing function, but is needed because of
potential inlining.

Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
---
 include/trace/events/pagemap.h | 37 ++++++++++++++++++++++++++++++++++
 mm/swap.c                      |  7 ++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 171524d3526d..ff3da07ccb40 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -77,6 +77,43 @@ TRACE_EVENT(mm_lru_activate,
 	TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
 );
 
+TRACE_EVENT(mm_lru_add_drain,
+
+	TP_PROTO(int cpu, unsigned int nr),
+
+	TP_ARGS(cpu, nr),
+
+	TP_STRUCT__entry(
+		__field(int,		cpu	)
+		__field(unsigned int,	nr	)
+	),
+
+	TP_fast_assign(
+		__entry->cpu	= cpu;
+		__entry->nr	= nr;
+	),
+
+	TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
+);
+
+TRACE_EVENT(mm_lru_add_drain_all,
+
+	TP_PROTO(bool force_all_cpus),
+
+	TP_ARGS(force_all_cpus),
+
+	TP_STRUCT__entry(
+		__field(bool,	force_all_cpus	)
+	),
+
+	TP_fast_assign(
+		__entry->force_all_cpus	= force_all_cpus;
+	),
+
+	TP_printk("force_all_cpus=%s",
+		__entry->force_all_cpus ? "true" : "false")
+);
+
 #endif /* _TRACE_PAGEMAP_H */
 
 /* This part must be outside protection */
diff --git a/mm/swap.c b/mm/swap.c
index 588f50d8f1a8..e14b7612f896 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 	struct folio_batch *fbatch = &fbatches->lru_add;
+	unsigned int nr_folios_add = folio_batch_count(fbatch);
 
-	if (folio_batch_count(fbatch))
+	if (nr_folios_add) {
 		folio_batch_move_lru(fbatch, lru_add);
+		trace_mm_lru_add_drain(cpu, nr_folios_add);
+	}
 
 	fbatch = &fbatches->lru_move_tail;
 	/* Disabling interrupts below acts as a compiler barrier. */
@@ -869,6 +872,8 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
+	trace_mm_lru_add_drain_all(force_all_cpus);
+
 	/*
 	 * Guarantee folio_batch counter stores visible by this CPU
 	 * are visible to other CPUs before loading the current drain
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Mathieu Desnoyers @ 2026-06-10 20:05 UTC (permalink / raw)
  To: Steven Rostedt, David Laight
  Cc: Masami Hiramatsu (Google),
	Markus Schneider-Pargmann (The Capable Hub), Heiko Carstens,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260610155139.01b6def4@gandalf.local.home>

On 2026-06-10 15:51, Steven Rostedt wrote:
> On Wed, 10 Jun 2026 12:06:59 +0100
> David Laight <david.laight.linux@gmail.com> wrote:
> 
>> So you only want __packed on structures that might be misaligned and those
>> that contain misaligned members.
>>
>> If the structure is only guaranteed to be 32bit aligned then use __packed
>> __aligned(4) so that two 32bit accesses get used instead of 8 8bit ones.
>>
>> -- David
>>
>>>
>>> Thank you,
>>>    
>>>> Signed-off-by: Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
>>>> ---
>>>>   kernel/trace/fprobe.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
>>>> index cc49ebd2a773..21751dcdb7b9 100644
>>>> --- a/kernel/trace/fprobe.c
>>>> +++ b/kernel/trace/fprobe.c
>>>> @@ -181,7 +181,7 @@ static inline void read_fprobe_header(unsigned long *stack,
>>>>   struct __fprobe_header {
>>>>   	struct fprobe *fp;
>>>>   	unsigned long size_words;
>>>> -} __packed;
>>>> +};
>>>>   
> 
> Does "__packed" really do anything between a pointer and a long?

If that structure is allocated at a non-void-ptr-aligned address, the
packed attribute will ensure that the compiler don't emit instructions
that require aligned loads/stores when accessing those fields.

It does not change the layout of the structure per se in this specific
case, but it informs the compiler about the lack of guarantees about
alignment for the entire structure.

x86 32/64 cannot care less about this, but it's relevant on other
architectures.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 20:12 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <d01fb1ed-2418-42ee-aea2-37f9a5c5729c@kernel.org>

On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> On 6/10/26 18:37, Gregory Price wrote:
> > On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/10/26 12:41, Gregory Price wrote:
> > 
> > So, I remember this being asked, and I didn't fully grok the request.
> > 
> > I'm still not sure I fully understand the question, so apologies if I'm
> > answer the wrong things here.
> > 
> > I understand this question in two ways:
> > 
> >   1) Can we disallow PAGE allocation and limit this to FOLIO allocation
> 
> Yes. Can we only allow folios to be allocated from private memory nodes. So let
> me reply to that one below.
> 
... snip ...
> 
> At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
> context might be better. I think there was also talk about how the memalloc_*
> interface might be a better way forward. Maybe we would start giving the
> allocator more context ("we are allocating a folio").
> 
> The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
>

Ok, the mental gap I have is not knowing the full context behind
memalloc.  I'll take this and do some reading / prototyping, but
this looks entirely reasonable.

I will still probably send the next RFC version tomorrow or friday,
as I want to get some eyes on the __GFP_PRIVATE-less pattern.

Also, I made a new `anondax` driver which enables userland testing
of this functionality without any specialty hardware.

tl;dr:

fd = open("/dev/anondax0.0", ....);
buf = mmap(fd, ...);
buf[0] = 0xDEADBEEF; /* fault to anondax driver */

static vm_fault_t anon_dax_fault(struct vm_fault *vmf)
{
        struct dev_dax *dev_dax = vmf->vma->vm_file->private_data;
        vm_fault_t ret;
        int id;

        id = dax_read_lock();
        if (!dax_alive(dev_dax->dax_dev))
                ret = VM_FAULT_SIGBUS;
        else
                ret = do_anonymous_page_node(vmf, dev_dax->target_node);
        dax_read_unlock(id);

        if (ret & VM_FAULT_OOM)
                return VM_FAULT_SIGBUS;
        return ret ? ret : VM_FAULT_NOPAGE;
}

With:
  qemu-system-x86_64 -m 5G \
    -object memory-backend-ram,id=m0,size=4G -numa node,nodeid=0,memdev=m0 \
    -object memory-backend-ram,id=m1,size=1G -numa node,nodeid=1,memdev=m1 \
    -append "... memmap=0x40000000!0x140000000"

Voila - buddy-managed private anonymous memory (1G region)

No need to reinvent page_alloc.c or fault handling :]

This can be used to hammer on reclaim/compaction/whatever support
without needing any particular hardware setup, and in fact it gives
some memory devices a path to support in userland while standards
get worked out.

do_anonymous_page_node is a bit of a bodge right now but I just haven't
fleshed it out yet.  The idea is - don't reinvent the fault path, just
provide the appropriate context to memory.c to do the right thing.

If this is acceptable, I imagine whatever interface gets implemented
will carry an in-tree driver export only, similar to hotplug/kmem.

> From 64aaff5f40497201ecc089c3339df6576184c433 Mon Sep 17 00:00:00 2001
> From: "David Hildenbrand (Arm)" <david@kernel.org>
> Date: Wed, 10 Jun 2026 20:55:49 +0200
> Subject: [PATCH] tmp
> 
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
>  include/linux/sched.h    |  2 +-
>  include/linux/sched/mm.h | 11 +++++++++++
>  mm/mempolicy.c           | 14 ++++++++++++--
>  mm/page_alloc.c          |  7 ++++++-
>  4 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ee06cba5c6f5..9c850b7be6bf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1778,7 +1778,7 @@ extern struct pid *cad_pid;
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> -#define PF__HOLE__00800000	0x00800000
> +#define PF__MEMALLOC_FOLIO	0x00800000	/* Allocating a folio that can end up on
> private memory nodes */
>  #define PF__HOLE__01000000	0x01000000
>  #define PF__HOLE__02000000	0x02000000
>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with
> cpus_mask */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 95d0040df584..2101a447c084 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -471,6 +471,17 @@ static inline void memalloc_pin_restore(unsigned int flags)
>  	memalloc_flags_restore(flags);
>  }
> 
> +static inline unsigned int memalloc_folio_save(void)
> +{
> +	return memalloc_flags_save(PF_MEMALLOC_FOLIO);
> +}
> +
> +static inline void memalloc_folio_restore(unsigned int flags)
> +{
> +	memalloc_flags_restore(flags);
> +}
> +
> +
>  #ifdef CONFIG_MEMCG
>  DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>  /**
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c2..a78b0e5a1fce 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2506,8 +2506,13 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned
> int order,
>  struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
>  		struct mempolicy *pol, pgoff_t ilx, int nid)
>  {
> -	struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
> +	struct page *page;
> +	int flags;
> +
> +	flags = memalloc_folio_save();
> +	page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
>  			ilx, nid);
> +	memalloc_folio_restore(flags);
>  	if (!page)
>  		return NULL;
> 
> @@ -2588,7 +2593,12 @@ EXPORT_SYMBOL(alloc_pages_noprof);
> 
>  struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
>  {
> -	return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
> +	struct folio *folio;
> +	int flags;
> +
> +	flags = memalloc_folio_save();
> +	folio = page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
> +	memalloc_folio_restore(flags);
> +	return folio;
>  }
>  EXPORT_SYMBOL(folio_alloc_noprof);
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ee902a468c2f..37434b37f7af 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5345,8 +5345,13 @@ EXPORT_SYMBOL(__alloc_pages_noprof);
>  struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int
> preferred_nid,
>  		nodemask_t *nodemask)
>  {
> -	struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
> +	struct page *page;
> +	int flags;
> +
> +	flags = memalloc_folio_save();
> +	page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
>  					preferred_nid, nodemask);
> +	memalloc_folio_restore(flags);
>  	return page_rmappable_folio(page);
>  }
>  EXPORT_SYMBOL(__folio_alloc_noprof);
> -- 
> 2.43.0
> 
> 
> -- 
> Cheers,
> 
> David

^ permalink raw reply

* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: Barry Song @ 2026-06-10 21:03 UTC (permalink / raw)
  To: JP Kobryn
  Cc: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <20260610195220.12403-1-jp.kobryn@linux.dev>

On Thu, Jun 11, 2026 at 3:53 AM JP Kobryn <jp.kobryn@linux.dev> wrote:
>
> LRU add batches can be drained before they reach capacity. This can be a
> source of LRU lock contention, but it is not currently possible to
> attribute these drains to callers with existing tracepoints.
>
> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
> lru_add batch is drained. This allows tracing to distinguish full drains
> from partial drains and attribute them to the calling stack.
>
> Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
> whether they set the force flag for all CPUs. The tracepoint resembles
> the signature of the enclosing function, but is needed because of
> potential inlining.
>
> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>

Reviewed-by: Barry Song <baohua@kernel.org>

Some minor nits:

[...]

> +       unsigned int nr_folios_add = folio_batch_count(fbatch);
>
> -       if (folio_batch_count(fbatch))
> +       if (nr_folios_add) {
>                 folio_batch_move_lru(fbatch, lru_add);
> +               trace_mm_lru_add_drain(cpu, nr_folios_add);
> +       }

Would "nr_folios" work here, given the surrounding lru_add context?
Alternatively, nr_folios_added might make the meaning a little clearer.

Best Regards
Barry

^ permalink raw reply

* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: Shakeel Butt @ 2026-06-10 21:13 UTC (permalink / raw)
  To: JP Kobryn
  Cc: linux-mm, willy, usama.arif, akpm, vbabka, mhocko, rostedt,
	mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <20260610195220.12403-1-jp.kobryn@linux.dev>

On Wed, Jun 10, 2026 at 12:52:20PM -0700, JP Kobryn wrote:
> LRU add batches can be drained before they reach capacity. This can be a
> source of LRU lock contention, but it is not currently possible to
> attribute these drains to callers with existing tracepoints.
> 
> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
> lru_add batch is drained. This allows tracing to distinguish full drains
> from partial drains and attribute them to the calling stack.
> 
> Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
> whether they set the force flag for all CPUs. The tracepoint resembles
> the signature of the enclosing function, but is needed because of
> potential inlining.
> 
> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 22:18 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <d01fb1ed-2418-42ee-aea2-37f9a5c5729c@kernel.org>

On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> 
> At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
> context might be better. I think there was also talk about how the memalloc_*
> interface might be a better way forward. Maybe we would start giving the
> allocator more context ("we are allocating a folio").
> 
> The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
> 

Ok, this was easier to test than I expected, and hugetlb is indeed a
stickler.  We can't get there 100% with just MEMALLOC_FOLIO, we still
need a MEMALLOC_PRIVATE - specifically because of users like hugetlb.

hugetlb uses __GFP_THISNODE to do its allocations, and all hugetlb
allocations are folio allocations - so the code you shared by itself
does not gate hugetlb from spilling into private nodes.

That means we still need something like this in hugetlb:

  if (node_is_private(nid))
      /* fail allocation */

HOWEVER... if you have MEMALLOC_PRIVATE - you make the allocation
failure a *page allocator* problem, and it serves exactly the same
purpose that __GFP_PRIVATE did.

the resulting code is two lines in my anondax driver:

    unsigned int priv_flags = memalloc_private_save();
    ret = do_anonymous_page_node(vmf, dev_dax->target_node);
    memalloc_private_restore(priv_flags);

No special hugetlb, slab, arch code handling - they all just fail
to allocate / fall back.  If they fail - it means that code is using
a bad nodemask and we need to go fix it (exactly what we want!)

I think additionally, we might be able to repurpose MEMALLOC_PRIVATE
flag for Brendan's needs as well [1].

Their goal (IIRC) was to have a pile of unmapped blocks that could
be opportunistically converted to normal memory, but otherwise left
unmapped and sitting in the buddy.

Same thing - different filter point (blocks vs nodes).

If you set MEMALLOC_PRIVATE - it makes private node allocations
possible, and "private block" access (without conversion) possible.

Otherwise private nodes are unreachable, and private blocks would be
treated like CMA (last-resort stealing, lazy-direct-mapping).

And they stack (private blocks on private nodes :V).

I don't have enough time looking at his proposal, but it seems like we
can kill two birds with one stone on this.

[1] https://lore.kernel.org/linux-mm/agYJcRgOHho8upVv@gourry-fedora-PF4VCD3F/

~Gregory

^ permalink raw reply

* Re: [PATCH v7 04/42] KVM: Stub in ability to disable per-VM memory attribute tracking
From: Sean Christopherson @ 2026-06-10 22:19 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-4-2f0fae496530@google.com>

On Fri, May 22, 2026, Ackerley Tng wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Introduce the basic infrastructure to allow per-VM memory attribute
> tracking to be disabled. This will be built-upon in a later patch, where a
> module param can disable per-VM memory attribute tracking.
> 
> Split the Kconfig option into a base KVM_MEMORY_ATTRIBUTES and the
> existing KVM_VM_MEMORY_ATTRIBUTES. The base option provides the core
> plumbing, while the latter enables the full per-VM tracking via an xarray
> and the associated ioctls.
> 
> kvm_get_memory_attributes() now performs a static call that either looks up
> kvm->mem_attr_array with CONFIG_KVM_VM_MEMORY_ATTRIBUTES is enabled, or
> just returns 0 otherwise. The static call can be patched depending on
> whether per-VM tracking is enabled by the CONFIG.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---

...

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index abb9cfa3eb04d..ee26f1d9b5fda 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -101,6 +101,17 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
>  static bool __ro_after_init allow_unsafe_mappings;
>  module_param(allow_unsafe_mappings, bool, 0444);
>  
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static bool vm_memory_attributes = true;
> +#else
> +#define vm_memory_attributes false
> +#endif
> +DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
> +#endif

Fudge.  This morning's PUCK discussion about VBS made me realize that we really
don't want to kill off _all_ per-VM attributes like this, we really just want to
kill off PRIVATE.  And even if RWX protections never arrive, conceptually shoving
all attributes into guest_memfd doesn't make any sense, because it really is only
the private vs. shared state that is tied to the physical memory, things like RWX
protections aren't so tightly couple to the data.

It'll require a bit of minor surgery to these patches, but the silver lining is
that I think the end code will be slightly easier to follow.

I'll sync with you off-list to splice in the changes to your current series (I
have them sketched out).

^ permalink raw reply

* Re: [PATCH v7 06/42] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
From: Sean Christopherson @ 2026-06-10 22:23 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-6-2f0fae496530@google.com>

On Fri, May 22, 2026, Ackerley Tng wrote:
> Update the guest_memfd populate() flow to pull memory attributes from the
> gmem instance instead of the VM when KVM is not configured to track
> shared/private status in the VM.
> 
> Rename the per-VM API to make it clear that it retrieves per-VM
> attributes, i.e. is not suitable for use outside of flows that are
> specific to generic per-VM attributes.
> 
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

We should squash this in with the previous patch, i.e. wire up PRIVATE to gmem
in a single patch (sans the ioctl support).  I had a hell of time figure out how
the range-based lookup was supposed to work when revisiting the "wire up" patch,
until I realized populate() was handled in the next patch.

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-10 23:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aiFtJFqkpbZ9qFvM@gourry-fedora-PF4VCD3F>

On Thu, Jun 04, 2026 at 01:18:44PM +0100, Gregory Price wrote:
> On Thu, Jun 04, 2026 at 08:35:19PM +1000, Balbir Singh wrote:
> > 
> > My concern is that __GFP_PRIVATE is too wide, I wonder if we'll have a
> > need to support N_MEMORY_PRIVATE may not be all homogeneous memory nodes.
> > Very similar to how not all ZONE_DEVICE memory is homogenous.
> >
> 
> Can you more precise about your definition of homogeneous here?
> 
> Are you saying not all memory on a private node will be homogeneous?
>    While possible, I would argue that you should not do this and
>    should instead prefer to use multiple nodes - 1 per memory class.
> 
> Are you saying not all private nodes will be homogenous?
>    I don't see the issue with this.

Yes, I meant, nodes might belong to different devices. These might not
want fallover allocations, for example __GFP_PRIVATE falling back to
unwanted nodes.

> 
> > > 
> > > Agreed, but also one which can be deferred and played with since it's
> > > all kernel-internal.  None of this should have UAPI implications, and we
> > > need need to accept that we're going to get it wrong on the first try.
> > > 
> > 
> > Agreed that we might get the design wrong, until we fix it up. I feel
> > that __GFP_PRIVATE should be an evolution of the design to that point.
> >
> 
> Possibly.  If we can't guarantee isolation without __GFP_PRIVATE, then
> we probably can't merge the baseline without it.
> 

I'll rethink about this, but I am concerned that __GFP_PRIVATE is too
broad, in fact it breaks isolation by allocating from any private
device. Again this is a function of how fallback lists are organized.

> > > Because pagecache pages are associated with potentially many VMAs.
> > > 
> > > The fault can be a soft fault or a hard fault.  On soft fault - the page
> > > was already present, and will simply fault into VMA without being
> > > migrated.
> > > 
> > 
> > Let's split this into two:
> > 
> > 1. unmapped page cache is never impacted by mempolicy and should not
> >    end up on private memory nodes
> > 2. For shared pages, mempolicy would be hard, but it would need to
> >    be on a set of nodes backed by private memory, depending on mbind()
> >    policy
> >
> ... snip ...
> > 
> > I'd need to think more about this. For now, my basic requirement would
> > be that unmapped page cache should not come from/to private nodes.
> > 
> 
> This does not fully describe the problem.
> 
> A file can be opened and cached as unmapped page cache, and then mapped
> at a later time - at which point the mapped copy would share the filemap
> page cache page.
> 
> Worse, because it's file-backed, you can have the memory faulted onto
> your remote node - reclaimed - and the faulted back in via the process
> accessing the file via unmapped operations (read/write), at which point
> you've had a silent migration occur.
> 
> Basically consider
> 
> Process A:
>    fd = open("myfile", ..., RO);
>    read(fd, ...);  /* mm/filemap.c fills page cache */
> 
> Process B:
>    fd = open("myfile", ...);
>    mem = mmap(fd, ...);
>    mbind(mem, ..., private_node);
>    for page in mem:
>        int tmp = mem[page]; /* fault into vma */
> 
> The result of Process A running first is Process B thinks it has faulted
> the memory onto private_node, but in reality it's taking soft faults and
> just getting the filemap folio mapped in.
> 
> If you wanted mbind() support from the start, we would have to limit
> applicability to anon memory only.
> 
> Shared anon memory is different, as there is a radix tree that deals
> with a shared mempolicy state.

Ack, need to think through this.

> 
> > 
> > I am open to this, I was coming from the blueprint approach of:
> > - Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
> >   what features to change or make specific to the implementation
> >
> 
> N_MEMORY essentially states:
> 	"This is normal memory touch it however you like"
> 
> N_MEMORY_PRIVATE (_MANAGED, w/e) says
> 	"This is NOT normal memory, there are special rules here"
> 
> So, no, lets not mimic N_MEMORY.  This is a "closed by default" design,
> while N_MEMORY is an "open by default" design.  This design choice is
> explicit to make reasoning about these nodes feasible.
> 
> > > This is informed by a single use case / device.
> > > 
> > > There are users / devices that don't want any UAPI for their memory,
> > > but simply wish to re-utilize some subsection of mm/ (page_alloc,
> > > reclaim, etc).
> > > 
> > 
> > But then, why do they need NUMA nodes? Do we have a list of use cases?
> >
> 
> So far i have collected:
> 
> - Network accelerators carrying their own memory for message buffers
> - GPUs with semi-general-purpose working memory across coherent links
> - Acceptionally slow distributed memory that you do not want fallback
>   allocations to (so you want to deliberately tier what lands there)
> - Compressed memory (just another form of accelerator really) which
>   has *special access rules* (i.e. writes need to be controlled)
> 
> In most if not all of these cases, the right abstraction to reason about
> where memory *should come from* IS a NUMA node.
> 
> - the network stack can be taught to check if the target device has a
>   node with memory and prefer that node over local memory
> 
> - accelerators can be given private nodes to manage memory using
>   core mm/ components, without worrying that general kernel operation
>   will put unrelated memory on those nodes or do things like migrate
>   your pages out from under you (unless your driver/service requested
>   that).
> 
> the tiering application should be somewhat obvious / trivial.
> 
> > > 
> > > I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> > > operations access private nodes removed from fallback lists are reached
> > > via something like the possible / online nodemask.
> > > 
> > > I remember, maybe a year ago, there were per-node allocations happening
> > > during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> > > I'm trying to re-collect that data now.
> > > 
> > 
> > Thanks, I look forward to the next set of patches. Let me know if I
> > can help test what's on the list or if you want me to wait for the next
> > round
> >
> 
> Really I want to get the minimized set out the door so we can start
> breaking this up by feature (reclaim, mempolicy, etc), because trying to
> reason about it as a whole is infeasible - and I cannot be the single
> arbiter of every use case (I simply do not have sufficient context).
> 
> I'm reworking it all as we speak.
> 

Look forward to it

Balbir

^ permalink raw reply

* Re: [PATCH 0/2] arm64: ftrace: support DIRECT_CALLS without CALL_OPS
From: Nathan Chancellor @ 2026-06-10 23:36 UTC (permalink / raw)
  To: Jose Fernandez (Anthropic)
  Cc: Steven Rostedt, Masami Hiramatsu, Mark Rutland, Catalin Marinas,
	Will Deacon, Nick Desaulniers, Bill Wendling, Justin Stitt,
	linux-kernel, linux-trace-kernel, linux-arm-kernel, llvm, bpf,
	Florent Revest, Puranjay Mohan, Xu Kuohai
In-Reply-To: <20260609-arm64-ftrace-direct-calls-v1-0-4a46f266697f@linux.dev>

Hi Jose,

On Tue, Jun 09, 2026 at 05:19:25AM +0000, Jose Fernandez (Anthropic) wrote:
> Jose Fernandez (Anthropic) (2):
>       arm64: ftrace: prepare ftrace_modify_call() for use without CALL_OPS
>       arm64: ftrace: allow DIRECT_CALLS without CALL_OPS

Thanks, I applied these two changes on -next and it looks like it
resolves the issue I originally noticed with systemd's restrict-fs
program not working on both of my arm64 machines.

Tested-by: Nathan Chancellor <nathan@kernel.org>

-- 
Cheers,
Nathan

^ permalink raw reply

* [PATCH v4] mm/lruvec: trace LRU add drains and drain-all requests
From: JP Kobryn @ 2026-06-10 23:48 UTC (permalink / raw)
  To: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park
  Cc: linux-kernel, linux-trace-kernel

LRU add batches can be drained before they reach capacity. This can be a
source of LRU lock contention, but it is not currently possible to
attribute these drains to callers with existing tracepoints.

Add mm_lru_add_drain to report the CPU and lru_add batch count when an
lru_add batch is drained. This allows tracing to distinguish full drains
from partial drains and attribute them to the calling stack.

Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
whether they set the force flag for all CPUs. The tracepoint resembles
the signature of the enclosing function, but is needed because of
potential inlining.

Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
v4:
  - renamed nr_folio_add to nr_folios in lru_add_drain()
  - renamed nr to nr_folios in tracepoint for consistency

v3: https://lore.kernel.org/linux-mm/20260610195220.12403-1-jp.kobryn@linux.dev/
  - restored and renamed tracepoint in __lru_add_drain_all

v2: https://lore.kernel.org/linux-mm/20260609041156.31127-1-jp.kobryn@linux.dev/
  - removed mm_lru_drain_all tracepoint

v1: https://lore.kernel.org/linux-mm/20260609041156.31127-1-jp.kobryn@linux.dev/

 include/trace/events/pagemap.h | 37 ++++++++++++++++++++++++++++++++++
 mm/swap.c                      |  7 ++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 171524d3526d..df6ac4d13dcf 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -77,6 +77,43 @@ TRACE_EVENT(mm_lru_activate,
 	TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
 );
 
+TRACE_EVENT(mm_lru_add_drain,
+
+	TP_PROTO(int cpu, unsigned int nr_folios),
+
+	TP_ARGS(cpu, nr_folios),
+
+	TP_STRUCT__entry(
+		__field(int,		cpu	)
+		__field(unsigned int,	nr_folios	)
+	),
+
+	TP_fast_assign(
+		__entry->cpu	= cpu;
+		__entry->nr_folios	= nr_folios;
+	),
+
+	TP_printk("cpu=%d nr_folios=%u", __entry->cpu, __entry->nr_folios)
+);
+
+TRACE_EVENT(mm_lru_add_drain_all,
+
+	TP_PROTO(bool force_all_cpus),
+
+	TP_ARGS(force_all_cpus),
+
+	TP_STRUCT__entry(
+		__field(bool,	force_all_cpus	)
+	),
+
+	TP_fast_assign(
+		__entry->force_all_cpus	= force_all_cpus;
+	),
+
+	TP_printk("force_all_cpus=%s",
+		__entry->force_all_cpus ? "true" : "false")
+);
+
 #endif /* _TRACE_PAGEMAP_H */
 
 /* This part must be outside protection */
diff --git a/mm/swap.c b/mm/swap.c
index 588f50d8f1a8..b506fa912a93 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 	struct folio_batch *fbatch = &fbatches->lru_add;
+	unsigned int nr_folios = folio_batch_count(fbatch);
 
-	if (folio_batch_count(fbatch))
+	if (nr_folios) {
 		folio_batch_move_lru(fbatch, lru_add);
+		trace_mm_lru_add_drain(cpu, nr_folios);
+	}
 
 	fbatch = &fbatches->lru_move_tail;
 	/* Disabling interrupts below acts as a compiler barrier. */
@@ -869,6 +872,8 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
+	trace_mm_lru_add_drain_all(force_all_cpus);
+
 	/*
 	 * Guarantee folio_batch counter stores visible by this CPU
 	 * are visible to other CPUs before loading the current drain
-- 
2.54.0


^ permalink raw reply related

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-10 23:53 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aik_ddHymus2DJ6D@gourry-fedora-PF4VCD3F>

On Wed, Jun 10, 2026 at 06:41:57AM -0400, Gregory Price wrote:
> On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > > 
> > >    __GFP_THISNODE cannot be overloaded to do anything useful here.
> > 
> > Let me clarify, I meant to say, let's use a nodemask for allocation
> > and __GFP_THISNODE gets us to the node we desire, if that is the only
> > node. My earlier comment might not have been clear.
> > 
> 
> I've been tested an stripped back patch set where I drop all FALLBACK
> entries for private nodes (including for itself) and only keep the
> NOFALLBACK entry for private nodes.
> 
> This effectively isolates the nodes for any allocation without
> __GFP_THISNODE.
> 
> This also precludes these nodes from ever using non-mbind mempolicies,
> which I think is a completely reasonable compromise and something I was
> already expecting we would do.
> 
> Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> which causes spillage into private nodes because slub allows private
> nodes in its mask.  I think this is fixable.
> 

Agreed.

> I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> code, etc), but it seems like fully dropping the FALLBACK entries and
> requiring __GFP_THISNODE might be sufficient.
>
> ~Gregory

That's good progress, thanks for the update!

Balbir

^ permalink raw reply

* Re: [PATCH] samples/ftrace: reject zero ftrace-ops call count
From: Steven Rostedt @ 2026-06-11  0:03 UTC (permalink / raw)
  To: Samuel Moelius
  Cc: Masami Hiramatsu, Mark Rutland, open list:FUNCTION HOOKS (FTRACE),
	open list:FUNCTION HOOKS (FTRACE)
In-Reply-To: <CAE+C+DZXcQfyQt-UV2PRt0vVFJqCciW6QkyEd9FaJ8++B3M4Ow@mail.gmail.com>

On Tue, 9 Jun 2026 07:26:27 -0400
Samuel Moelius <sam.moelius@trailofbits.com> wrote:

> Is it okay to keep the same subject line or should I change it?

Yeah, and also note that the tracing subsystem uses capital letters:

  samples/ftrace: Reject zero ftrace-ops call count

But you can change it to:

  samples/ftrace: Prevent division by zero when nr_function_calls is zero

-- Steve

^ permalink raw reply

* Re: [PATCH 1/2] arm64: ftrace: prepare ftrace_modify_call() for use without CALL_OPS
From: Xu Kuohai @ 2026-06-11  4:06 UTC (permalink / raw)
  To: Jose Fernandez (Anthropic), Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Catalin Marinas, Will Deacon, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt
  Cc: linux-kernel, linux-trace-kernel, linux-arm-kernel, llvm, bpf,
	Florent Revest, Puranjay Mohan
In-Reply-To: <20260609-arm64-ftrace-direct-calls-v1-1-4a46f266697f@linux.dev>

On 6/9/2026 1:19 PM, Jose Fernandez (Anthropic) wrote:
> ftrace_modify_call() is guarded by CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
> and calls ftrace_rec_set_ops(rec, arm64_rec_get_ops(rec)) directly,
> which only exists when CALL_OPS is enabled.
> 
> Generic ftrace also needs ftrace_modify_call() when
> CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is enabled, to retarget a
> callsite between two non-FTRACE_ADDR destinations, as happens when a
> direct trampoline is modified. The next patch allows DIRECT_CALLS without
> CALL_OPS, so widen the guard to cover both configurations and switch
> the body to the ftrace_rec_update_ops() wrapper, which already has a
> stub for the !CALL_OPS case. ftrace_make_call() already uses the same
> wrapper today.
> 
> No functional change: with CALL_OPS enabled, ftrace_rec_update_ops()
> expands to the exact call this replaces.
> 
> Assisted-by: Claude:unspecified
> Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
> ---
>   arch/arm64/kernel/ftrace.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
> index 5a1554a441628..e1a3c0b3a0514 100644
> --- a/arch/arm64/kernel/ftrace.c
> +++ b/arch/arm64/kernel/ftrace.c
> @@ -409,7 +409,8 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
>   	return ftrace_modify_code(pc, old, new, true);
>   }
>   
> -#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
> +#if defined(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS) || \
> +	defined(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS)
>   int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
>   		       unsigned long addr)
>   {
> @@ -417,7 +418,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
>   	u32 old, new;
>   	int ret;
>   
> -	ret = ftrace_rec_set_ops(rec, arm64_rec_get_ops(rec));
> +	ret = ftrace_rec_update_ops(rec);
>   	if (ret)
>   		return ret;
>   
>
Acked-by: Xu Kuohai <xukuohai@huawei.com>


^ permalink raw reply

* Re: [PATCH 2/2] arm64: ftrace: allow DIRECT_CALLS without CALL_OPS
From: Xu Kuohai @ 2026-06-11  4:06 UTC (permalink / raw)
  To: Jose Fernandez (Anthropic), Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Catalin Marinas, Will Deacon, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt
  Cc: linux-kernel, linux-trace-kernel, linux-arm-kernel, llvm, bpf,
	Florent Revest, Puranjay Mohan
In-Reply-To: <20260609-arm64-ftrace-direct-calls-v1-2-4a46f266697f@linux.dev>

On 6/9/2026 1:19 PM, Jose Fernandez (Anthropic) wrote:
> arm64 gained ftrace direct calls in commit 2aa6ac03516d ("arm64:
> ftrace: Add direct call support") on top of
> DYNAMIC_FTRACE_WITH_CALL_OPS, using the per-callsite ops pointer as a
> fast path to reach the direct trampoline. Since commit baaf553d3bc3
> ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS"), CALL_OPS is
> mutually exclusive with CFI: the pre-function NOPs would change the
> offset of the pre-function kCFI type hash, and the compiler support
> needed to keep that offset consistent does not exist yet.
> 
> The result is that a CONFIG_CFI=y kernel loses CALL_OPS, and with it
> DIRECT_CALLS, and with it every BPF trampoline attachment to kernel
> functions: register_fentry() returns -ENOTSUPP, so fentry/fexit,
> fmod_ret and BPF LSM programs cannot attach at all. This is a real
> problem for hardened arm64 deployments that rely on BPF LSM for
> security monitoring while keeping kCFI enabled.
> 
> CALL_OPS is an optimization for direct calls, not a dependency. When
> the direct trampoline is within BL range, the callsite branches
> straight to it and ftrace_caller is not involved. When it is out of
> range, ftrace_find_callable_addr() already falls back to
> ftrace_caller, and the DIRECT_CALLS machinery there
> (FREGS_DIRECT_TRAMP, ftrace_caller_direct_late) is gated on
> DIRECT_CALLS alone, not CALL_OPS: the ops dispatch invokes
> call_direct_funcs(), which stores the trampoline address in
> ftrace_regs, and ftrace_caller tail-calls it. s390 and loongarch use
> this same mechanism for HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS without
> having CALL_OPS at all, and DYNAMIC_FTRACE_WITH_ARGS without CALL_OPS
> is already a supported arm64 configuration (GCC builds with
> CC_OPTIMIZE_FOR_SIZE do not satisfy the CALL_OPS select condition).
> 
> Drop the CALL_OPS requirement from the
> HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS select. Configurations that
> keep CALL_OPS (!CFI clang builds, and GCC builds without
> CC_OPTIMIZE_FOR_SIZE) are unchanged. CALL_OPS-less configurations
> take the ftrace_caller ops-dispatch path for out-of-range direct
> calls, trading the per-callsite fast path for working BPF
> trampolines; in-range attachments still branch directly with no
> overhead. GCC -Os builds also gain DIRECT_CALLS as a side effect.
> That is intended: s390 and loongarch already ship DIRECT_CALLS
> without any per-callsite fast path.
> 
> Assisted-by: Claude:unspecified
> Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
> ---
>   arch/arm64/Kconfig | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index fe60738e5943b..2cd7d536671c9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -188,7 +188,7 @@ config ARM64
>   		if (GCC_SUPPORTS_DYNAMIC_FTRACE_WITH_ARGS || \
>   		    CLANG_SUPPORTS_DYNAMIC_FTRACE_WITH_ARGS)
>   	select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS \
> -		if DYNAMIC_FTRACE_WITH_ARGS && DYNAMIC_FTRACE_WITH_CALL_OPS
> +		if DYNAMIC_FTRACE_WITH_ARGS
>   	select HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS \
>   		if (DYNAMIC_FTRACE_WITH_ARGS && !CFI && \
>   		    (CC_IS_CLANG || !CC_OPTIMIZE_FOR_SIZE))
> 
Acked-by: Xu Kuohai <xukuohai@huawei.com>


^ permalink raw reply

* [PATCH] tracing: Remove unused ret assignment in tracing_set_tracer()
From: Wayen.Yan @ 2026-06-11  3:52 UTC (permalink / raw)
  To: rostedt; +Cc: mhiramat, mathieu.desnoyers, linux-trace-kernel, linux-kernel

In tracing_set_tracer(), the assignment 'ret = 0' following the
__tracing_resize_ring_buffer() error check is a dead store. After
this point, all subsequent code paths either return with a constant
value (-EINVAL, 0, -EBUSY) or reassign ret before reading it
(tracing_arm_snapshot_locked, tracer_init).

Remove the unnecessary assignment.

No functional change.

Signed-off-by: Wayen.Yan <win847@gmail.com>
---
 kernel/trace/trace.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a..58f81b36b7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5018,7 +5018,6 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf)
 						RING_BUFFER_ALL_CPUS);
 		if (ret < 0)
 			return ret;
-		ret = 0;
 	}

 	list_for_each_entry(t, &tr->tracers, list) {
-- 
2.51.0

^ permalink raw reply related

* [PATCH v5 1/3] tracing: Use __free() for expr_str() buffer
From: Pengpeng Hou @ 2026-06-11  5:59 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Pengpeng Hou

expr_str() allocates a temporary expression buffer and manually frees it
on some error paths.

Convert the buffer to __free(kfree) and return it with return_ptr() on
success. This keeps ownership handling separate from the later ERR_PTR()
conversion and string-bound change.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 kernel/trace/trace_events_hist.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 82ce492ab268..f778f060e922 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1759,7 +1759,7 @@ static void expr_field_str(struct hist_field *field, char *expr)
 
 static char *expr_str(struct hist_field *field, unsigned int level)
 {
-	char *expr;
+	char *expr __free(kfree) = NULL;
 
 	if (level > 1)
 		return NULL;
@@ -1770,7 +1770,7 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 
 	if (!field->operands[0]) {
 		expr_field_str(field, expr);
-		return expr;
+		return_ptr(expr);
 	}
 
 	if (field->operator == FIELD_OP_UNARY_MINUS) {
@@ -1778,16 +1778,15 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 
 		strcat(expr, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
-		if (!subexpr) {
-			kfree(expr);
+		if (!subexpr)
 			return NULL;
-		}
+
 		strcat(expr, subexpr);
 		strcat(expr, ")");
 
 		kfree(subexpr);
 
-		return expr;
+		return_ptr(expr);
 	}
 
 	expr_field_str(field->operands[0], expr);
@@ -1806,13 +1805,12 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 		strcat(expr, "*");
 		break;
 	default:
-		kfree(expr);
 		return NULL;
 	}
 
 	expr_field_str(field->operands[1], expr);
 
-	return expr;
+	return_ptr(expr);
 }
 
 /*
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* [PATCH v5 2/3] tracing: Return ERR_PTR() from expr_str()
From: Pengpeng Hou @ 2026-06-11  5:59 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Pengpeng Hou

expr_str() currently reports all failure cases as NULL, so callers cannot
distinguish invalid recursion depth from allocation failure or later
string construction errors.

Return ERR_PTR()-encoded errors from expr_str() and make parse_unary()
and parse_expr() propagate them. Clear expr->name before destroying the
hist field so the error pointer is not freed as a string.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 kernel/trace/trace_events_hist.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index f778f060e922..082842e64ccd 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1762,11 +1762,11 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 	char *expr __free(kfree) = NULL;
 
 	if (level > 1)
-		return NULL;
+		return ERR_PTR(-EINVAL);
 
 	expr = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
 	if (!expr)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	if (!field->operands[0]) {
 		expr_field_str(field, expr);
@@ -1778,8 +1778,8 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 
 		strcat(expr, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
-		if (!subexpr)
-			return NULL;
+		if (IS_ERR(subexpr))
+			return subexpr;
 
 		strcat(expr, subexpr);
 		strcat(expr, ")");
@@ -1805,7 +1805,7 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 		strcat(expr, "*");
 		break;
 	default:
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	}
 
 	expr_field_str(field->operands[1], expr);
@@ -2624,6 +2624,11 @@ static struct hist_field *parse_unary(struct hist_trigger_data *hist_data,
 	expr->is_signed = operand1->is_signed;
 	expr->operator = FIELD_OP_UNARY_MINUS;
 	expr->name = expr_str(expr, 0);
+	if (IS_ERR(expr->name)) {
+		ret = PTR_ERR(expr->name);
+		expr->name = NULL;
+		goto free;
+	}
 	expr->type = kstrdup_const(operand1->type, GFP_KERNEL);
 	if (!expr->type) {
 		ret = -ENOMEM;
@@ -2836,6 +2841,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
 		destroy_hist_field(operand1, 0);
 
 		expr->name = expr_str(expr, 0);
+		if (IS_ERR(expr->name)) {
+			ret = PTR_ERR(expr->name);
+			expr->name = NULL;
+			goto free_expr;
+		}
 	} else {
 		/* The operand sizes should be the same, so just pick one */
 		expr->size = operand1->size;
@@ -2849,6 +2859,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
 		}
 
 		expr->name = expr_str(expr, 0);
+		if (IS_ERR(expr->name)) {
+			ret = PTR_ERR(expr->name);
+			expr->name = NULL;
+			goto free_expr;
+		}
 	}
 
 	return expr;
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* [PATCH v5 0/3] tracing: bound histogram expression strings
From: Pengpeng Hou @ 2026-06-11  5:59 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Pengpeng Hou

This v5 starts a new thread and splits the previous expr_str() update
into the preparatory lifetime cleanup, the ERR_PTR() conversion, and the
bounded seq_buf conversion.

The series is based on trace/for-next at 8970865b788e
("Merge trace/for-next").

Patch 1 converts the expr_str() output buffer to __free(kfree).
Patch 2 converts expr_str() failures from NULL to ERR_PTR().
Patch 3 replaces the raw strcat() expression construction with seq_buf
and returns -E2BIG when the rendered expression would exceed
MAX_FILTER_STR_VAL.

Changes since v4:
https://lore.kernel.org/all/20260521022817.38453-1-pengpeng@iscas.ac.cn/
- start a new thread for the new patch revision
- split the __free(kfree) conversion and ERR_PTR() conversion into
  separate patches as requested
- simplify the unary-minus seq_buf conversion with a single
  seq_buf_printf(&s, "-(%s)", subexpr) and keep subexpr manually freed
- keep the seq_buf include unchanged because trace/for-next already has
  <linux/seq_buf.h> in this file

Pengpeng Hou (3):
  tracing: Use __free() for expr_str() buffer
  tracing: Return ERR_PTR() from expr_str()
  tracing: Bound histogram expression strings with seq_buf

 kernel/trace/trace_events_hist.c | 92 ++++++++++++++++++++------------
 1 file changed, 57 insertions(+), 35 deletions(-)

-- 
2.50.1 (Apple Git-155)

^ permalink raw reply

* [PATCH v5 3/3] tracing: Bound histogram expression strings with seq_buf
From: Pengpeng Hou @ 2026-06-11  5:59 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Pengpeng Hou

expr_str() allocates a fixed MAX_FILTER_STR_VAL buffer and then builds
expression names with a series of raw strcat() appends. Nested operands,
constants, field flags, and generated field names can push the rendered
string past that fixed limit before the name is attached to the hist
field.

Build expression strings with seq_buf and return -E2BIG when the
rendered name would exceed MAX_FILTER_STR_VAL. This keeps the existing
tracing-side limit while replacing the raw append logic with bounded
construction.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 kernel/trace/trace_events_hist.c | 57 ++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 082842e64ccd..810c63c6b3df 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -94,7 +94,6 @@ typedef u64 (*hist_field_fn_t) (struct hist_field *field,
 #define HIST_FIELD_OPERANDS_MAX	2
 #define HIST_FIELDS_MAX		(TRACING_MAP_FIELDS_MAX + TRACING_MAP_VARS_MAX)
 #define HIST_ACTIONS_MAX	8
-#define HIST_CONST_DIGITS_MAX	21
 #define HIST_DIV_SHIFT		20  /* For optimizing division by constants */
 
 enum field_op_id {
@@ -1733,33 +1732,36 @@ static const char *get_hist_field_flags(struct hist_field *hist_field)
 	return flags_str;
 }
 
-static void expr_field_str(struct hist_field *field, char *expr)
+static bool expr_field_str(struct hist_field *field, struct seq_buf *s)
 {
+	const char *field_name;
+
 	if (field->flags & HIST_FIELD_FL_VAR_REF) {
 		if (!field->system)
-			strcat(expr, "$");
-	} else if (field->flags & HIST_FIELD_FL_CONST) {
-		char str[HIST_CONST_DIGITS_MAX];
+			seq_buf_putc(s, '$');
+	} else if (field->flags & HIST_FIELD_FL_CONST)
+		seq_buf_printf(s, "%llu", field->constant);
 
-		snprintf(str, HIST_CONST_DIGITS_MAX, "%llu", field->constant);
-		strcat(expr, str);
-	}
+	field_name = hist_field_name(field, 0);
+	if (!field_name)
+		return false;
 
-	strcat(expr, hist_field_name(field, 0));
+	seq_buf_puts(s, field_name);
 
 	if (field->flags && !(field->flags & HIST_FIELD_FL_VAR_REF)) {
 		const char *flags_str = get_hist_field_flags(field);
 
-		if (flags_str) {
-			strcat(expr, ".");
-			strcat(expr, flags_str);
-		}
+		if (flags_str)
+			seq_buf_printf(s, ".%s", flags_str);
 	}
+
+	return !seq_buf_has_overflowed(s);
 }
 
 static char *expr_str(struct hist_field *field, unsigned int level)
 {
 	char *expr __free(kfree) = NULL;
+	struct seq_buf s;
 
 	if (level > 1)
 		return ERR_PTR(-EINVAL);
@@ -1768,47 +1770,54 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 	if (!expr)
 		return ERR_PTR(-ENOMEM);
 
+	seq_buf_init(&s, expr, MAX_FILTER_STR_VAL);
+
 	if (!field->operands[0]) {
-		expr_field_str(field, expr);
+		if (!expr_field_str(field, &s))
+			return ERR_PTR(-E2BIG);
+
 		return_ptr(expr);
 	}
 
 	if (field->operator == FIELD_OP_UNARY_MINUS) {
 		char *subexpr;
 
-		strcat(expr, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
 		if (IS_ERR(subexpr))
 			return subexpr;
 
-		strcat(expr, subexpr);
-		strcat(expr, ")");
-
+		seq_buf_printf(&s, "-(%s)", subexpr);
 		kfree(subexpr);
 
+		if (seq_buf_has_overflowed(&s))
+			return ERR_PTR(-E2BIG);
+
 		return_ptr(expr);
 	}
 
-	expr_field_str(field->operands[0], expr);
+	if (!expr_field_str(field->operands[0], &s))
+		return ERR_PTR(-E2BIG);
 
 	switch (field->operator) {
 	case FIELD_OP_MINUS:
-		strcat(expr, "-");
+		seq_buf_putc(&s, '-');
 		break;
 	case FIELD_OP_PLUS:
-		strcat(expr, "+");
+		seq_buf_putc(&s, '+');
 		break;
 	case FIELD_OP_DIV:
-		strcat(expr, "/");
+		seq_buf_putc(&s, '/');
 		break;
 	case FIELD_OP_MULT:
-		strcat(expr, "*");
+		seq_buf_putc(&s, '*');
 		break;
 	default:
 		return ERR_PTR(-EINVAL);
 	}
 
-	expr_field_str(field->operands[1], expr);
+	if (seq_buf_has_overflowed(&s) ||
+	    !expr_field_str(field->operands[1], &s))
+		return ERR_PTR(-E2BIG);
 
 	return_ptr(expr);
 }
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
From: Huang Shijie @ 2026-06-11  6:18 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

In the UnixBench tests, there is a test "execl" which tests
the execve system call.
  For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
When we test our server with "./Run -c 384 execl",
the test result is not good enough. The i_mmap locks contended heavily on
"libc.so" and "ld.so". The i_mmap tree for "libc.so" can be
over 6000 VMAs, all the VMAs can be in different NUMA mode. The insert/remove
operations do not run quickly enough.

 In order to reduce the competition of the i_mmap lock, this patch does
following:
   1.) Split the single i_mmap tree into several sibling trees:
       Each tree has a lock. The CONFIG_SPLIT_I_MMAP is used to
       turn on/off this feature.
   2.) Introduce a new field "tree_idx" for vm_area_struct to save the
       sibling tree index for this VMA.
   3.) Introduce a new field "vma_count" for address_space.
       The new mapping_mapped() will use it.
   4.) Rewrite the vma_interval_tree_foreach()
   5.) Rewrite the lock functions.	

 After this patch, the VMA insert/remove operations will work faster,
and we can get over 400% performance improvement with the above test.

Signed-off-by: Huang Shijie <huangsj@hygon.cn>
---
 fs/Kconfig               |   8 ++
 fs/hugetlbfs/inode.c     |  20 ++++-
 fs/inode.c               |  75 ++++++++++++++++-
 include/linux/fs.h       | 174 ++++++++++++++++++++++++++++++++++++++-
 include/linux/mm.h       |  80 ++++++++++++++++++
 include/linux/mm_types.h |   3 +
 mm/internal.h            |   3 +-
 mm/mmap.c                |  11 ++-
 mm/nommu.c               |  23 ++++--
 mm/pagewalk.c            |   2 +-
 mm/vma.c                 |  72 +++++++++++-----
 mm/vma_init.c            |   3 +
 12 files changed, 436 insertions(+), 38 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 43cb06de297f..e24804f70432 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -9,6 +9,14 @@ menu "File systems"
 config DCACHE_WORD_ACCESS
        bool
 
+config SPLIT_I_MMAP
+	bool "Split the file's i_mmap to several trees"
+	default n
+	help
+	  Split the file's i_mmap to several trees, each tree has a separate
+	  lock. This will reduce the lock contention of file's i_mmap tree,
+	  but it will cost more memory for per inode.
+
 config VALIDATE_FS_PARSER
 	bool "Validate filesystem parameter description"
 	help
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index da5b41ea5bdd..68d8308418dd 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -891,6 +891,23 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
  */
 static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
 
+#ifdef CONFIG_SPLIT_I_MMAP
+static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++) {
+		lockdep_set_class(&mapping->i_mmap[i].rwsem,
+				&hugetlbfs_i_mmap_rwsem_key);
+	}
+}
+#else
+static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
+{
+	lockdep_set_class(&mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key);
+}
+#endif
+
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					struct mnt_idmap *idmap,
 					struct inode *dir,
@@ -915,8 +932,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 
 		inode->i_ino = get_next_ino();
 		inode_init_owner(idmap, inode, dir, mode);
-		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
-				&hugetlbfs_i_mmap_rwsem_key);
+		hugetlbfs_lockdep_set_class(inode->i_mapping);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		simple_inode_init_ts(inode);
 		info->resv_map = resv_map;
diff --git a/fs/inode.c b/fs/inode.c
index 62c579a0cf7d..cb67ae83f5b3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -214,6 +214,70 @@ static int no_open(struct inode *inode, struct file *file)
 	return -ENXIO;
 }
 
+#ifdef CONFIG_SPLIT_I_MMAP
+int split_tree_num;
+static int split_tree_align __maybe_unused = 32;
+
+static void __init init_split_tree_num(void)
+{
+#ifdef CONFIG_NUMA
+	split_tree_num = nr_node_ids;
+#else
+	split_tree_num = ALIGN(nr_cpu_ids, split_tree_align);
+#endif
+}
+
+static void free_mapping_i_mmap(struct address_space *mapping)
+{
+	int i;
+
+	if (!mapping->i_mmap)
+		return;
+
+	for (i = 0; i < split_tree_num; i++)
+		kfree(mapping->i_mmap[i]);
+
+	kfree(mapping->i_mmap);
+	mapping->i_mmap = NULL;
+}
+
+static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
+{
+	struct i_mmap_tree *tree;
+	int i;
+
+	/* The extra one is used as terminator in vma_interval_tree_foreach() */
+	mapping->i_mmap = kzalloc(sizeof(tree) * (split_tree_num + 1), gfp);
+	if (!mapping->i_mmap)
+		return -ENOMEM;
+
+	for (i = 0; i < split_tree_num; i++) {
+		tree = kzalloc_node(sizeof(*tree), gfp, i);
+		if (!tree)
+			goto nomem;
+
+		tree->root = RB_ROOT_CACHED;
+		init_rwsem(&tree->rwsem);
+
+		mapping->i_mmap[i] = tree;
+	}
+	return 0;
+nomem:
+	free_mapping_i_mmap(mapping);
+	return -ENOMEM;
+}
+#else
+static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
+{
+	mapping->i_mmap = RB_ROOT_CACHED;
+	init_rwsem(&mapping->i_mmap_rwsem);
+	return 0;
+}
+
+static void free_mapping_i_mmap(struct address_space *mapping) { }
+static void __init init_split_tree_num(void) {}
+#endif
+
 /**
  * inode_init_always_gfp - perform inode structure initialisation
  * @sb: superblock inode belongs to
@@ -302,9 +366,14 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
 #endif
 	inode->i_flctx = NULL;
 
-	if (unlikely(security_inode_alloc(inode, gfp)))
+	if (init_mapping_i_mmap(mapping, gfp))
 		return -ENOMEM;
 
+	if (unlikely(security_inode_alloc(inode, gfp))) {
+		free_mapping_i_mmap(mapping);
+		return -ENOMEM;
+	}
+
 	this_cpu_inc(nr_inodes);
 
 	return 0;
@@ -380,6 +449,7 @@ void __destroy_inode(struct inode *inode)
 	if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl))
 		posix_acl_release(inode->i_default_acl);
 #endif
+	free_mapping_i_mmap(&inode->i_data);
 	this_cpu_dec(nr_inodes);
 }
 EXPORT_SYMBOL(__destroy_inode);
@@ -480,9 +550,7 @@ EXPORT_SYMBOL(inc_nlink);
 static void __address_space_init_once(struct address_space *mapping)
 {
 	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
-	init_rwsem(&mapping->i_mmap_rwsem);
 	spin_lock_init(&mapping->i_private_lock);
-	mapping->i_mmap = RB_ROOT_CACHED;
 }
 
 void address_space_init_once(struct address_space *mapping)
@@ -2619,6 +2687,7 @@ void __init inode_init(void)
 					&i_hash_mask,
 					0,
 					0);
+	init_split_tree_num();
 }
 
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cd46615b8f53..f4b3645b61df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -450,6 +450,25 @@ struct mapping_metadata_bhs {
 	struct list_head list;	/* The list of bhs (b_assoc_buffers) */
 };
 
+#ifdef CONFIG_SPLIT_I_MMAP
+/*
+ * struct i_mmap_tree - A single sibling tree of the file's split i_mmap.
+ * @root: The red/black interval tree root.
+ * @rwsem: Protects insert/remove operations on this sibling tree.
+ * @vma_count: Number of VMAs in this sibling tree.
+ *
+ * When CONFIG_SPLIT_I_MMAP is enabled, the file's single i_mmap tree is
+ * split into split_tree_num sibling trees, each with its own lock. This
+ * reduces lock contention by allowing concurrent VMA insert/remove
+ * operations on different sibling trees.
+ */
+struct i_mmap_tree {
+	struct rb_root_cached	root;
+	struct rw_semaphore	rwsem;
+	atomic_t		vma_count;
+};
+#endif
+
 /**
  * struct address_space - Contents of a cacheable, mappable object.
  * @host: Owner, either the inode or the block_device.
@@ -461,8 +480,13 @@ struct mapping_metadata_bhs {
  * @gfp_mask: Memory allocation flags to use for allocating pages.
  * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
  * @nr_thps: Number of THPs in the pagecache (non-shmem only).
- * @i_mmap: Tree of private and shared mappings.
- * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
+ * @i_mmap: Tree of private and shared mappings. When CONFIG_SPLIT_I_MMAP
+ *   is enabled, this is an array of split_tree_num struct i_mmap_tree
+ *   pointers (plus a NULL terminator).
+ * @vma_count: Total number of VMAs across all sibling trees (only when
+ *   CONFIG_SPLIT_I_MMAP is enabled). Used by mapping_mapped().
+ * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable (only when
+ *   CONFIG_SPLIT_I_MMAP is disabled; otherwise per-tree rwsem is used).
  * @nrpages: Number of page entries, protected by the i_pages lock.
  * @writeback_index: Writeback starts here.
  * @a_ops: Methods.
@@ -480,14 +504,19 @@ struct address_space {
 	/* number of thp, only for non-shmem files */
 	atomic_t		nr_thps;
 #endif
+#ifdef CONFIG_SPLIT_I_MMAP
+	struct i_mmap_tree	**i_mmap;
+	atomic_t		vma_count;
+#else
 	struct rb_root_cached	i_mmap;
+	struct rw_semaphore	i_mmap_rwsem;
+#endif
 	unsigned long		nrpages;
 	pgoff_t			writeback_index;
 	const struct address_space_operations *a_ops;
 	unsigned long		flags;
 	errseq_t		wb_err;
 	spinlock_t		i_private_lock;
-	struct rw_semaphore	i_mmap_rwsem;
 } __attribute__((aligned(sizeof(long)))) __randomize_layout;
 	/*
 	 * On most architectures that alignment is already the case; but
@@ -508,6 +537,133 @@ static inline bool mapping_tagged(const struct address_space *mapping, xa_mark_t
 	return xa_marked(&mapping->i_pages, tag);
 }
 
+#ifdef CONFIG_SPLIT_I_MMAP
+static inline int mapping_mapped(const struct address_space *mapping)
+{
+	return	atomic_read(&mapping->vma_count);
+}
+
+static inline void inc_mapping_vma(struct address_space *mapping,
+				struct vm_area_struct *vma)
+{
+	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
+
+	atomic_inc(&tree->vma_count);
+	atomic_inc(&mapping->vma_count);
+}
+
+static inline void dec_mapping_vma(struct address_space *mapping,
+				struct vm_area_struct *vma)
+{
+	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
+
+	atomic_dec(&tree->vma_count);
+	atomic_dec(&mapping->vma_count);
+}
+
+static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping)
+{
+	return (struct rb_root_cached *)mapping->i_mmap;
+}
+
+static inline void i_mmap_tree_lock_write(struct address_space *mapping,
+					struct vm_area_struct *vma)
+{
+	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
+
+	down_write(&tree->rwsem);
+}
+
+static inline void i_mmap_tree_unlock_write(struct address_space *mapping,
+					struct vm_area_struct *vma)
+{
+	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
+
+	up_write(&tree->rwsem);
+}
+
+#define i_mmap_lock_write_prepare(mapping)
+#define i_mmap_unlock_write_complete(mapping)
+
+extern int split_tree_num;
+static inline void i_mmap_lock_write(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		down_write(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline int i_mmap_trylock_write(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++) {
+		if (!down_write_trylock(&mapping->i_mmap[i]->rwsem)) {
+			while (i--)
+				up_write(&mapping->i_mmap[i]->rwsem);
+			return 0;
+		}
+	}
+	return 1;
+}
+
+static inline void i_mmap_unlock_write(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		up_write(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline int i_mmap_trylock_read(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++) {
+		if (!down_read_trylock(&mapping->i_mmap[i]->rwsem)) {
+			while (i--)
+				up_read(&mapping->i_mmap[i]->rwsem);
+			return 0;
+		}
+	}
+	return 1;
+}
+
+static inline void i_mmap_lock_read(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		down_read(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline void i_mmap_unlock_read(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		up_read(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline void i_mmap_assert_locked(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		lockdep_assert_held(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline void i_mmap_assert_write_locked(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		lockdep_assert_held_write(&mapping->i_mmap[i]->rwsem);
+}
+
+#else
+
 static inline void i_mmap_lock_write(struct address_space *mapping)
 {
 	down_write(&mapping->i_mmap_rwsem);
@@ -561,6 +717,18 @@ static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mappi
 	return &mapping->i_mmap;
 }
 
+static inline void inc_mapping_vma(struct address_space *mapping,
+				struct vm_area_struct *vma) { }
+static inline void dec_mapping_vma(struct address_space *mapping,
+				struct vm_area_struct *vma) { }
+
+#define i_mmap_lock_write_prepare(mapping)	i_mmap_lock_write(mapping)
+#define i_mmap_unlock_write_complete(mapping)	i_mmap_unlock_write(mapping)
+#define i_mmap_tree_lock_write(mapping, vma)
+#define i_mmap_tree_unlock_write(mapping, vma)
+
+#endif
+
 /*
  * Might pages of this file have been modified in userspace?
  * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0a45c6a8b9f2..9aa8119fa9bf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4041,11 +4041,91 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
 struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
 				unsigned long start, unsigned long last);
 
+#ifdef CONFIG_SPLIT_I_MMAP
+extern int split_tree_num;
+
+static inline int smallest_tree_idx(struct file *file)
+{
+	struct address_space *mapping = file->f_mapping;
+	int tmp = INT_MAX, count;
+	int i, j = 0;
+
+	/*
+	 * Since a not 100% accurate value is still okay,
+	 * we do not need any lock here.
+	 */
+	for (i = 0; i < split_tree_num; i++) {
+		count = atomic_read(&mapping->i_mmap[i]->vma_count);
+		if (count < tmp) {
+			j = i;
+			tmp = count;
+			if (!tmp)
+				break;
+		}
+	}
+	return j;
+}
+
+static inline void vma_set_tree_idx(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_NUMA
+	vma->tree_idx = numa_node_id();
+#else
+	vma->tree_idx = smallest_tree_idx(vma->vm_file);
+#endif
+}
+
+static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
+					struct address_space *mapping)
+{
+	return &mapping->i_mmap[vma->tree_idx]->root;
+}
+
+/* Find the first valid VMA in the sibling trees */
+static inline struct vm_area_struct *first_vma(struct i_mmap_tree ***__r,
+				unsigned long start, unsigned long last)
+{
+	struct vm_area_struct *vma = NULL;
+	struct i_mmap_tree **tree = *__r;
+	struct rb_root_cached *root;
+
+	while (*tree) {
+		root = &(*tree)->root;
+		tree++;
+		vma = vma_interval_tree_iter_first(root, start, last);
+		if (vma)
+			break;
+	}
+
+	/* Save for the next loop */
+	*__r = tree;
+	return vma;
+}
+
+/*
+ * Please use get_i_mmap_root() to get the @root.
+ * @_tmp is referenced to avoid unused variable warning.
+ */
+#define vma_interval_tree_foreach(vma, root, start, last)		\
+	for (struct i_mmap_tree **_r = (struct i_mmap_tree **)(root),	\
+		**_tmp = (vma = first_vma(&_r, start, last)) ? _r : NULL;\
+	     ((_tmp && vma) || (vma = first_vma(&_r, start, last)));	\
+		vma = vma_interval_tree_iter_next(vma, start, last))
+#else
 /* Please use get_i_mmap_root() to get the @root */
 #define vma_interval_tree_foreach(vma, root, start, last)		\
 	for (vma = vma_interval_tree_iter_first(root, start, last);	\
 	     vma; vma = vma_interval_tree_iter_next(vma, start, last))
 
+static inline void vma_set_tree_idx(struct vm_area_struct *vma) { }
+
+static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
+					struct address_space *mapping)
+{
+	return &mapping->i_mmap;
+}
+#endif
+
 void anon_vma_interval_tree_insert(struct anon_vma_chain *node,
 				   struct rb_root_cached *root);
 void anon_vma_interval_tree_remove(struct anon_vma_chain *node,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..8d6aab3346ce 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1072,6 +1072,9 @@ struct vm_area_struct {
 #ifdef __HAVE_PFNMAP_TRACKING
 	struct pfnmap_track_ctx *pfnmap_track_ctx;
 #endif
+#ifdef CONFIG_SPLIT_I_MMAP
+	int tree_idx;			/* The sibling tree index for the VMA */
+#endif
 } __randomize_layout;
 
 /* Clears all bits in the VMA flags bitmap, non-atomically. */
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..2d35cacffd19 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1888,7 +1888,8 @@ static inline void maybe_rmap_unlock_action(struct vm_area_struct *vma,
 
 	VM_WARN_ON_ONCE(vma_is_anonymous(vma));
 	file = vma->vm_file;
-	i_mmap_unlock_write(file->f_mapping);
+	i_mmap_tree_unlock_write(file->f_mapping, vma);
+	i_mmap_unlock_write_complete(file->f_mapping);
 	action->hide_from_rmap_until_complete = false;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index d714fdb357e5..70036ec9dcaa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1825,15 +1825,20 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			struct address_space *mapping = file->f_mapping;
 
 			get_file(file);
-			i_mmap_lock_write(mapping);
+			i_mmap_lock_write_prepare(mapping);
+			i_mmap_tree_lock_write(mapping, mpnt);
+
 			if (vma_is_shared_maywrite(tmp))
 				mapping_allow_writable(mapping);
 			flush_dcache_mmap_lock(mapping);
 			/* insert tmp into the share list, just after mpnt */
 			vma_interval_tree_insert_after(tmp, mpnt,
-					get_i_mmap_root(mapping));
+					get_rb_root(mpnt, mapping));
+			inc_mapping_vma(mapping, tmp);
 			flush_dcache_mmap_unlock(mapping);
-			i_mmap_unlock_write(mapping);
+
+			i_mmap_tree_unlock_write(mapping, mpnt);
+			i_mmap_unlock_write_complete(mapping);
 		}
 
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
diff --git a/mm/nommu.c b/mm/nommu.c
index 0f18ffc658e9..1f2c60a220f6 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -567,11 +567,16 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
 	if (vma->vm_file) {
 		struct address_space *mapping = vma->vm_file->f_mapping;
 
-		i_mmap_lock_write(mapping);
+		i_mmap_lock_write_prepare(mapping);
+		i_mmap_tree_lock_write(mapping, vma);
+
 		flush_dcache_mmap_lock(mapping);
-		vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
+		vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
+		inc_mapping_vma(mapping, vma);
 		flush_dcache_mmap_unlock(mapping);
-		i_mmap_unlock_write(mapping);
+
+		i_mmap_tree_unlock_write(mapping, vma);
+		i_mmap_unlock_write_complete(mapping);
 	}
 }
 
@@ -583,11 +588,16 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma)
 		struct address_space *mapping;
 		mapping = vma->vm_file->f_mapping;
 
-		i_mmap_lock_write(mapping);
+		i_mmap_lock_write_prepare(mapping);
+		i_mmap_tree_lock_write(mapping, vma);
+
 		flush_dcache_mmap_lock(mapping);
-		vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
+		vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
+		dec_mapping_vma(mapping, vma);
 		flush_dcache_mmap_unlock(mapping);
-		i_mmap_unlock_write(mapping);
+
+		i_mmap_tree_unlock_write(mapping, vma);
+		i_mmap_unlock_write_complete(mapping);
 	}
 }
 
@@ -1063,6 +1073,7 @@ unsigned long do_mmap(struct file *file,
 	if (file) {
 		region->vm_file = get_file(file);
 		vma->vm_file = get_file(file);
+		vma_set_tree_idx(vma);
 	}
 
 	down_write(&nommu_region_sem);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 8df1b5077951..d5745519d95a 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -809,7 +809,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 	if (!check_ops_safe(ops))
 		return -EINVAL;
 
-	lockdep_assert_held(&mapping->i_mmap_rwsem);
+	i_mmap_assert_locked(mapping);
 	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index,
 				  first_index + nr - 1) {
 		/* Clip to the vma */
diff --git a/mm/vma.c b/mm/vma.c
index 6159650c1b42..2055758064a9 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -234,22 +234,23 @@ static void __vma_link_file(struct vm_area_struct *vma,
 		mapping_allow_writable(mapping);
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
+	vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
+	inc_mapping_vma(mapping, vma);
 	flush_dcache_mmap_unlock(mapping);
 }
 
-/*
- * Requires inode->i_mapping->i_mmap_rwsem
- */
 static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 				      struct address_space *mapping)
 {
+	i_mmap_tree_lock_write(mapping, vma);
 	if (vma_is_shared_maywrite(vma))
 		mapping_unmap_writable(mapping);
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
+	vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
+	dec_mapping_vma(mapping, vma);
 	flush_dcache_mmap_unlock(mapping);
+	i_mmap_tree_unlock_write(mapping, vma);
 }
 
 /*
@@ -297,8 +298,9 @@ static void vma_prepare(struct vma_prepare *vp)
 			uprobe_munmap(vp->adj_next, vp->adj_next->vm_start,
 				      vp->adj_next->vm_end);
 
-		i_mmap_lock_write(vp->mapping);
+		i_mmap_lock_write_prepare(vp->mapping);
 		if (vp->insert && vp->insert->vm_file) {
+			i_mmap_tree_lock_write(vp->mapping, vp->insert);
 			/*
 			 * Put into interval tree now, so instantiated pages
 			 * are visible to arm/parisc __flush_dcache_page
@@ -307,6 +309,7 @@ static void vma_prepare(struct vma_prepare *vp)
 			 */
 			__vma_link_file(vp->insert,
 					vp->insert->vm_file->f_mapping);
+			i_mmap_tree_unlock_write(vp->mapping, vp->insert);
 		}
 	}
 
@@ -318,12 +321,17 @@ static void vma_prepare(struct vma_prepare *vp)
 	}
 
 	if (vp->file) {
+		i_mmap_tree_lock_write(vp->mapping, vp->vma);
 		flush_dcache_mmap_lock(vp->mapping);
 		vma_interval_tree_remove(vp->vma,
-					get_i_mmap_root(vp->mapping));
-		if (vp->adj_next)
+					get_rb_root(vp->vma, vp->mapping));
+		dec_mapping_vma(vp->mapping, vp->vma);
+		if (vp->adj_next) {
+			i_mmap_tree_lock_write(vp->mapping, vp->adj_next);
 			vma_interval_tree_remove(vp->adj_next,
-					get_i_mmap_root(vp->mapping));
+					get_rb_root(vp->adj_next, vp->mapping));
+			dec_mapping_vma(vp->mapping, vp->adj_next);
+		}
 	}
 
 }
@@ -340,12 +348,17 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 			 struct mm_struct *mm)
 {
 	if (vp->file) {
-		if (vp->adj_next)
+		if (vp->adj_next) {
 			vma_interval_tree_insert(vp->adj_next,
-					get_i_mmap_root(vp->mapping));
+					get_rb_root(vp->adj_next, vp->mapping));
+			inc_mapping_vma(vp->mapping, vp->adj_next);
+			i_mmap_tree_unlock_write(vp->mapping, vp->adj_next);
+		}
 		vma_interval_tree_insert(vp->vma,
-					get_i_mmap_root(vp->mapping));
+					get_rb_root(vp->vma, vp->mapping));
+		inc_mapping_vma(vp->mapping, vp->vma);
 		flush_dcache_mmap_unlock(vp->mapping);
+		i_mmap_tree_unlock_write(vp->mapping, vp->vma);
 	}
 
 	if (vp->remove && vp->file) {
@@ -370,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 	}
 
 	if (vp->file) {
-		i_mmap_unlock_write(vp->mapping);
+		i_mmap_unlock_write_complete(vp->mapping);
 
 		if (!vp->skip_vma_uprobe) {
 			uprobe_mmap(vp->vma);
@@ -1799,12 +1812,12 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
 	int i;
 
 	mapping = vb->vmas[0]->vm_file->f_mapping;
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_write_prepare(mapping);
 	for (i = 0; i < vb->count; i++) {
 		VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping);
 		__remove_shared_vm_struct(vb->vmas[i], mapping);
 	}
-	i_mmap_unlock_write(mapping);
+	i_mmap_unlock_write_complete(mapping);
 
 	unlink_file_vma_batch_init(vb);
 }
@@ -1836,10 +1849,13 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
 
 	if (file) {
 		mapping = file->f_mapping;
-		i_mmap_lock_write(mapping);
+		i_mmap_lock_write_prepare(mapping);
+		i_mmap_tree_lock_write(mapping, vma);
 		__vma_link_file(vma, mapping);
-		if (!hold_rmap_lock)
-			i_mmap_unlock_write(mapping);
+		if (!hold_rmap_lock) {
+			i_mmap_tree_unlock_write(mapping, vma);
+			i_mmap_unlock_write_complete(mapping);
+		}
 	}
 }
 
@@ -2164,6 +2180,23 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
 	}
 }
 
+#ifdef CONFIG_SPLIT_I_MMAP
+static inline void i_mmap_nest_lock(struct address_space *mapping,
+				struct rw_semaphore *lock)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		down_write_nest_lock(&mapping->i_mmap[i]->rwsem, lock);
+}
+#else
+static inline void i_mmap_nest_lock(struct address_space *mapping,
+				struct rw_semaphore *lock)
+{
+	down_write_nest_lock(&mapping->i_mmap_rwsem, lock);
+}
+#endif
+
 static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
 {
 	if (!test_bit(AS_MM_ALL_LOCKS, &mapping->flags)) {
@@ -2178,7 +2211,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
 		 */
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
-		down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_lock);
+		i_mmap_nest_lock(mapping, &mm->mmap_lock);
 	}
 }
 
@@ -2489,6 +2522,7 @@ static int __mmap_new_file_vma(struct mmap_state *map,
 	int error;
 
 	vma->vm_file = map->file;
+	vma_set_tree_idx(vma);
 	if (!map->file_doesnt_need_get)
 		get_file(map->file);
 
diff --git a/mm/vma_init.c b/mm/vma_init.c
index 3c0b65950510..c115e33d4812 100644
--- a/mm/vma_init.c
+++ b/mm/vma_init.c
@@ -72,6 +72,9 @@ static void vm_area_init_from(const struct vm_area_struct *src,
 #ifdef CONFIG_NUMA
 	dest->vm_policy = src->vm_policy;
 #endif
+#ifdef CONFIG_SPLIT_I_MMAP
+	dest->tree_idx = src->tree_idx;
+#endif
 #ifdef __HAVE_PFNMAP_TRACKING
 	dest->pfnmap_track_ctx = NULL;
 #endif
-- 
2.53.0



^ permalink raw reply related

* [PATCH v2 4/4] docs/mm: update document for split i_mmap tree
From: Huang Shijie @ 2026-06-11  6:19 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

Document the i_mmap locking changes introduced by the following patches:
- Use mapping_mapped() to simplify the code
- Use get_i_mmap_root() to access the file's i_mmap
- Split the file's i_mmap tree (CONFIG_SPLIT_I_MMAP)

Add documentation for:
- CONFIG_SPLIT_I_MMAP split i_mmap tree architecture with per-tree locks
- New per-tree lock helpers: i_mmap_tree_lock_write/unlock_write
- New vm_area_struct.tree_idx field for sibling tree selection
- Updated i_mmap_lock_read/write semantics acquiring all per-tree locks
- Updated lock ordering notes for split tree configuration
- Updated page table freeing section for split tree scenario

Signed-off-by: Huang Shijie <huangsj@hygon.cn>
---
 Documentation/mm/process_addrs.rst | 63 +++++++++++++++++++++++-------
 1 file changed, 49 insertions(+), 14 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 851680ead45f..4aed3100b249 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -60,6 +60,15 @@ Terminology
   :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
   locks as the reverse mapping locks, or 'rmap locks' for brevity.
 
+  When :c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled, the file-backed i_mmap tree
+  is split into multiple sibling trees (one per NUMA node or a number based on
+  CPU count), each with its own :c:type:`!struct i_mmap_tree` containing a
+  red/black interval tree and a :c:type:`!struct rw_semaphore`. In this
+  configuration, :c:func:`!i_mmap_lock_read` and :c:func:`!i_mmap_lock_write`
+  acquire all per-tree locks, while VMA insert/remove operations use the
+  per-tree granularity :c:func:`!i_mmap_tree_lock_write` to lock only the
+  relevant sibling tree, significantly reducing lock contention.
+
 We discuss page table locks separately in the dedicated section below.
 
 The first thing **any** of these locks achieve is to **stabilise** the VMA
@@ -230,12 +239,16 @@ These are the core fields which describe the MM the VMA belongs to and its attri
                                                            Updated under mmap read lock by
                                                            :c:func:`!task_numa_work`.
    :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
-                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
-                                                           either of zero size if userfaultfd is
-                                                           disabled, or containing a pointer
-                                                           to an underlying
-                                                           :c:type:`!userfaultfd_ctx` object which
-                                                           describes userfaultfd metadata.
+                                                            type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
+                                                            either of zero size if userfaultfd is
+                                                            disabled, or containing a pointer
+                                                            to an underlying
+                                                            :c:type:`!userfaultfd_ctx` object which
+                                                            describes userfaultfd metadata.
+   :c:member:`!tree_idx`             CONFIG_SPLIT_I_MMAP   The index of the sibling i_mmap tree     Written once on
+                                                            that this VMA belongs to, set at         initial map.
+                                                            VMA creation time based on the NUMA
+                                                            node or the smallest sibling tree.
    ================================= ===================== ======================================== ===============
 
 These fields are present or not depending on whether the relevant kernel
@@ -247,12 +260,18 @@ configuration option is set.
    Field                               Description                               Write lock
    =================================== ========================================= ============================
    :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
-                                       mapping is file-backed, to place the VMA  i_mmap write.
-                                       in the
-                                       :c:member:`!struct address_space->i_mmap`
-                                       red/black interval tree.
+                                        mapping is file-backed, to place the VMA  i_mmap write (or per-tree
+                                        in the                                    i_mmap write when
+                                        :c:member:`!struct address_space->i_mmap` :c:macro:`!CONFIG_SPLIT_I_MMAP`
+                                        red/black interval tree (or one of the    is set).
+                                        sibling trees when
+                                        :c:macro:`!CONFIG_SPLIT_I_MMAP`
+                                        is enabled).
    :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
-                                       interval tree if the VMA is file-backed.  i_mmap write.
+                                        interval tree if the VMA is file-backed.  i_mmap write (or per-tree
+                                                                                  i_mmap write when
+                                                                                  :c:macro:`!CONFIG_SPLIT_I_MMAP`
+                                                                                  is set).
    :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
                                        :c:type:`!anon_vma` objects and
                                        :c:member:`!vma->anon_vma` if it is
@@ -490,6 +509,16 @@ There is also a file-system specific lock ordering comment located at the top of
 Please check the current state of these comments which may have changed since
 the time of writing of this document.
 
+.. note:: When :c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled, the single
+   ``mapping->i_mmap_rwsem`` is replaced by an array of per-tree locks
+   ``mapping->i_mmap[i]->rwsem``. The lock ordering positions of
+   ``mapping->i_mmap_rwsem`` above apply to each per-tree lock
+   equivalently. VMA insert/remove operations acquire only the relevant
+   per-tree lock via :c:func:`!i_mmap_tree_lock_write`, while operations
+   that require all trees to be locked (such as
+   :c:func:`!unmap_mapping_range`) acquire all per-tree locks via
+   :c:func:`!i_mmap_lock_write` or :c:func:`!i_mmap_lock_read`.
+
 ------------------------------
 Locking Implementation Details
 ------------------------------
@@ -704,11 +733,15 @@ traversed or referenced by concurrent tasks.
 
 It is insufficient to simply hold an mmap write lock and VMA lock (which will
 prevent racing faults, and rmap operations), as a file-backed mapping can be
-truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.
+truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone
+(or, when :c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled, under all per-tree
+``mapping->i_mmap[i]->rwsem`` locks acquired via
+:c:func:`!i_mmap_lock_write`).
 
 As a result, no VMA which can be accessed via the reverse mapping (either
 through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
-address_space->i_mmap` interval trees) can have its page tables torn down.
+address_space->i_mmap` interval trees, or the sibling trees when
+:c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled) can have its page tables torn down.
 
 The operation is typically performed via :c:func:`!free_pgtables`, which assumes
 either the mmap write lock has been taken (as specified by its
@@ -729,7 +762,9 @@ cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_cle
 .. note:: It is possible for leaf page tables to be torn down independent of
           the page tables above it as is done by
           :c:func:`!retract_page_tables`, which is performed under the i_mmap
-          read lock, PMD, and PTE page table locks, without this level of care.
+          read lock (or all per-tree ``mapping->i_mmap[i]->rwsem`` locks in
+          read mode when :c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled), PMD, and
+          PTE page table locks, without this level of care.
 
 Page table moving
 ^^^^^^^^^^^^^^^^^
-- 
2.53.0



^ permalink raw reply related

* [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
From: Huang Shijie @ 2026-06-11  6:18 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

Use mapping_mapped() to simplify the code, make
the code tidy and clean.

Signed-off-by: Huang Shijie <huangsj@hygon.cn>
---
 fs/hugetlbfs/inode.c | 4 ++--
 mm/memory.c          | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 78d61bf2bd9b..216e1a0dd0b2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -614,7 +614,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
-	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
+	if (mapping_mapped(mapping))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
@@ -675,7 +675,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
 	/* Unmap users of full pages in the hole. */
 	if (hole_end > hole_start) {
-		if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
+		if (mapping_mapped(mapping))
 			hugetlb_vmdelete_list(&mapping->i_mmap,
 					      hole_start >> PAGE_SHIFT,
 					      hole_end >> PAGE_SHIFT, 0);
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..5335077765e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4386,7 +4386,7 @@ void unmap_mapping_folio(struct folio *folio)
 	details.zap_flags = ZAP_FLAG_DROP_MARKER;
 
 	i_mmap_lock_read(mapping);
-	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+	if (unlikely(mapping_mapped(mapping)))
 		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
 					 last_index, &details);
 	i_mmap_unlock_read(mapping);
@@ -4416,7 +4416,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
 		last_index = ULONG_MAX;
 
 	i_mmap_lock_read(mapping);
-	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+	if (unlikely(mapping_mapped(mapping)))
 		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
 					 last_index, &details);
 	i_mmap_unlock_read(mapping);
-- 
2.53.0



^ permalink raw reply related

* [PATCH v2 2/4] mm: use get_i_mmap_root to access the file's i_mmap
From: Huang Shijie @ 2026-06-11  6:18 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

Do not access the file's i_mmap directly, use get_i_mmap_root()
to access it. This patch makes preparations for later patch.

Signed-off-by: Huang Shijie <huangsj@hygon.cn>
---
 arch/arm/mm/fault-armv.c   |  3 ++-
 arch/arm/mm/flush.c        |  3 ++-
 arch/nios2/mm/cacheflush.c |  3 ++-
 arch/parisc/kernel/cache.c |  4 +++-
 fs/dax.c                   |  3 ++-
 fs/hugetlbfs/inode.c       |  6 +++---
 include/linux/fs.h         |  5 +++++
 include/linux/mm.h         |  1 +
 kernel/events/uprobes.c    |  3 ++-
 mm/hugetlb.c               |  7 +++++--
 mm/khugepaged.c            |  6 ++++--
 mm/memory-failure.c        |  8 +++++---
 mm/memory.c                |  4 ++--
 mm/mmap.c                  |  2 +-
 mm/nommu.c                 |  9 +++++----
 mm/pagewalk.c              |  2 +-
 mm/rmap.c                  |  2 +-
 mm/vma.c                   | 14 ++++++++------
 18 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 91e488767783..1b5fe151e805 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -126,6 +126,7 @@ make_coherent(struct address_space *mapping, struct vm_area_struct *vma,
 {
 	const unsigned long pmd_start_addr = ALIGN_DOWN(addr, PMD_SIZE);
 	const unsigned long pmd_end_addr = pmd_start_addr + PMD_SIZE;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *mpnt;
 	unsigned long offset;
@@ -140,7 +141,7 @@ make_coherent(struct address_space *mapping, struct vm_area_struct *vma,
 	 * cache coherency.
 	 */
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) {
+	vma_interval_tree_foreach(mpnt, root, pgoff, pgoff) {
 		/*
 		 * If we are using split PTE locks, then we need to take the pte
 		 * lock. Otherwise we are using shared mm->page_table_lock which
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 4d7ef5cc36b6..01588df81bfc 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -238,6 +238,7 @@ void __flush_dcache_folio(struct address_space *mapping, struct folio *folio)
 static void __flush_dcache_aliases(struct address_space *mapping, struct folio *folio)
 {
 	struct mm_struct *mm = current->active_mm;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	struct vm_area_struct *vma;
 	pgoff_t pgoff, pgoff_end;
 
@@ -251,7 +252,7 @@ static void __flush_dcache_aliases(struct address_space *mapping, struct folio *
 	pgoff_end = pgoff + folio_nr_pages(folio) - 1;
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff_end) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff_end) {
 		unsigned long start, offset, pfn;
 		unsigned int nr;
 
diff --git a/arch/nios2/mm/cacheflush.c b/arch/nios2/mm/cacheflush.c
index 8321182eb927..ab6e064fabe2 100644
--- a/arch/nios2/mm/cacheflush.c
+++ b/arch/nios2/mm/cacheflush.c
@@ -78,11 +78,12 @@ static void flush_aliases(struct address_space *mapping, struct folio *folio)
 	unsigned long flags;
 	pgoff_t pgoff;
 	unsigned long nr = folio_nr_pages(folio);
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 
 	pgoff = folio->index;
 
 	flush_dcache_mmap_lock_irqsave(mapping, flags);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + nr - 1) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff + nr - 1) {
 		unsigned long start;
 
 		if (vma->vm_mm != mm)
diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index 0170b69a21d3..f99dffd6cc22 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -473,6 +473,7 @@ static inline unsigned long get_upa(struct mm_struct *mm, unsigned long addr)
 void flush_dcache_folio(struct folio *folio)
 {
 	struct address_space *mapping = folio_flush_mapping(folio);
+	struct rb_root_cached *root;
 	struct vm_area_struct *vma;
 	unsigned long addr, old_addr = 0;
 	void *kaddr;
@@ -494,6 +495,7 @@ void flush_dcache_folio(struct folio *folio)
 		return;
 
 	pgoff = folio->index;
+	root = get_i_mmap_root(mapping);
 
 	/*
 	 * We have carefully arranged in arch_get_unmapped_area() that
@@ -503,7 +505,7 @@ void flush_dcache_folio(struct folio *folio)
 	 * on machines that support equivalent aliasing
 	 */
 	flush_dcache_mmap_lock_irqsave(mapping, flags);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + nr - 1) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff + nr - 1) {
 		unsigned long offset = pgoff - vma->vm_pgoff;
 		unsigned long pfn = folio_pfn(folio);
 
diff --git a/fs/dax.c b/fs/dax.c
index 6d175cd47a99..d402edc3c1b8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1138,6 +1138,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 		struct address_space *mapping, void *entry)
 {
 	unsigned long pfn, index, count, end;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	long ret = 0;
 	struct vm_area_struct *vma;
 
@@ -1201,7 +1202,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 
 	/* Walk all mappings of a given index of a file and writeprotect them */
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
+	vma_interval_tree_foreach(vma, root, index, end) {
 		pfn_mkclean_range(pfn, count, index, vma);
 		cond_resched();
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 216e1a0dd0b2..da5b41ea5bdd 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -380,7 +380,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 					struct address_space *mapping,
 					struct folio *folio, pgoff_t index)
 {
-	struct rb_root_cached *root = &mapping->i_mmap;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	struct hugetlb_vma_lock *vma_lock;
 	unsigned long pfn = folio_pfn(folio);
 	struct vm_area_struct *vma;
@@ -615,7 +615,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
 	if (mapping_mapped(mapping))
-		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
+		hugetlb_vmdelete_list(get_i_mmap_root(mapping), pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
 	remove_inode_hugepages(inode, offset, LLONG_MAX);
@@ -676,7 +676,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	/* Unmap users of full pages in the hole. */
 	if (hole_end > hole_start) {
 		if (mapping_mapped(mapping))
-			hugetlb_vmdelete_list(&mapping->i_mmap,
+			hugetlb_vmdelete_list(get_i_mmap_root(mapping),
 					      hole_start >> PAGE_SHIFT,
 					      hole_end >> PAGE_SHIFT, 0);
 	}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..cd46615b8f53 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -556,6 +556,11 @@ static inline int mapping_mapped(const struct address_space *mapping)
 	return	!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root);
 }
 
+static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping)
+{
+	return &mapping->i_mmap;
+}
+
 /*
  * Might pages of this file have been modified in userspace?
  * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..0a45c6a8b9f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4041,6 +4041,7 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
 struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
 				unsigned long start, unsigned long last);
 
+/* Please use get_i_mmap_root() to get the @root */
 #define vma_interval_tree_foreach(vma, root, start, last)		\
 	for (vma = vma_interval_tree_iter_first(root, start, last);	\
 	     vma; vma = vma_interval_tree_iter_next(vma, start, last))
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4084e926e284..d8561a42aec8 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1201,6 +1201,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
 static struct map_info *
 build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 {
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	unsigned long pgoff = offset >> PAGE_SHIFT;
 	struct vm_area_struct *vma;
 	struct map_info *curr = NULL;
@@ -1210,7 +1211,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 
  again:
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff) {
 		if (!valid_vma(vma, is_register))
 			continue;
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4b80b167cc9c..8bc49d57a116 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5360,6 +5360,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct vm_area_struct *iter_vma;
 	struct address_space *mapping;
+	struct rb_root_cached *root;
 	pgoff_t pgoff;
 
 	/*
@@ -5370,6 +5371,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) +
 			vma->vm_pgoff;
 	mapping = vma->vm_file->f_mapping;
+	root = get_i_mmap_root(mapping);
 
 	/*
 	 * Take the mapping lock for the duration of the table walk. As
@@ -5377,7 +5379,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * __unmap_hugepage_range() is called as the lock is already held
 	 */
 	i_mmap_lock_write(mapping);
-	vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) {
+	vma_interval_tree_foreach(iter_vma, root, pgoff, pgoff) {
 		/* Do not unmap the current VMA */
 		if (iter_vma == vma)
 			continue;
@@ -6850,6 +6852,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) +
 			vma->vm_pgoff;
 	struct vm_area_struct *svma;
@@ -6858,7 +6861,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
+	vma_interval_tree_foreach(svma, root, idx, idx) {
 		if (svma == vma)
 			continue;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b8452dbdb043..0f577e4a2ccd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1773,10 +1773,11 @@ static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
 
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	struct vm_area_struct *vma;
 
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff) {
 		struct mmu_notifier_range range;
 		struct mm_struct *mm;
 		unsigned long addr;
@@ -2194,7 +2195,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 		 * not be able to observe any missing pages due to the
 		 * previously inserted retry entries.
 		 */
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, end) {
+		vma_interval_tree_foreach(vma, get_i_mmap_root(mapping),
+					start, end) {
 			if (userfaultfd_missing(vma)) {
 				result = SCAN_EXCEED_NONE_PTE;
 				goto immap_locked;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..85196d9bb26c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -598,7 +598,7 @@ static void collect_procs_file(const struct folio *folio,
 
 		if (!t)
 			continue;
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
+		vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), pgoff,
 				      pgoff) {
 			/*
 			 * Send early kill signal to tasks where a vma covers
@@ -650,7 +650,8 @@ static void collect_procs_fsdax(const struct page *page,
 			t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), pgoff,
+					pgoff) {
 			if (vma->vm_mm == t->mm)
 				add_to_kill_fsdax(t, page, vma, to_kill, pgoff);
 		}
@@ -2251,7 +2252,8 @@ static void collect_procs_pfn(struct pfn_address_space *pfn_space,
 		t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
+		vma_interval_tree_foreach(vma, get_i_mmap_root(mapping),
+					0, ULONG_MAX) {
 			pgoff_t pgoff;
 
 			if (vma->vm_mm == t->mm &&
diff --git a/mm/memory.c b/mm/memory.c
index 5335077765e2..9ea5d6c8ef4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4387,7 +4387,7 @@ void unmap_mapping_folio(struct folio *folio)
 
 	i_mmap_lock_read(mapping);
 	if (unlikely(mapping_mapped(mapping)))
-		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+		unmap_mapping_range_tree(get_i_mmap_root(mapping), first_index,
 					 last_index, &details);
 	i_mmap_unlock_read(mapping);
 }
@@ -4417,7 +4417,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
 
 	i_mmap_lock_read(mapping);
 	if (unlikely(mapping_mapped(mapping)))
-		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+		unmap_mapping_range_tree(get_i_mmap_root(mapping), first_index,
 					 last_index, &details);
 	i_mmap_unlock_read(mapping);
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 5754d1c36462..d714fdb357e5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1831,7 +1831,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			flush_dcache_mmap_lock(mapping);
 			/* insert tmp into the share list, just after mpnt */
 			vma_interval_tree_insert_after(tmp, mpnt,
-					&mapping->i_mmap);
+					get_i_mmap_root(mapping));
 			flush_dcache_mmap_unlock(mapping);
 			i_mmap_unlock_write(mapping);
 		}
diff --git a/mm/nommu.c b/mm/nommu.c
index ed3934bc2de4..0f18ffc658e9 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -569,7 +569,7 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
 
 		i_mmap_lock_write(mapping);
 		flush_dcache_mmap_lock(mapping);
-		vma_interval_tree_insert(vma, &mapping->i_mmap);
+		vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
 		flush_dcache_mmap_unlock(mapping);
 		i_mmap_unlock_write(mapping);
 	}
@@ -585,7 +585,7 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma)
 
 		i_mmap_lock_write(mapping);
 		flush_dcache_mmap_lock(mapping);
-		vma_interval_tree_remove(vma, &mapping->i_mmap);
+		vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
 		flush_dcache_mmap_unlock(mapping);
 		i_mmap_unlock_write(mapping);
 	}
@@ -1804,6 +1804,7 @@ EXPORT_SYMBOL_GPL(copy_remote_vm_str);
 int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 				size_t newsize)
 {
+	struct rb_root_cached *root = get_i_mmap_root(&inode->i_mapping);
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	pgoff_t low, high;
@@ -1816,7 +1817,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 	i_mmap_lock_read(inode->i_mapping);
 
 	/* search for VMAs that fall within the dead zone */
-	vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) {
+	vma_interval_tree_foreach(vma, root, low, high) {
 		/* found one - only interested if it's shared out of the page
 		 * cache */
 		if (vma->vm_flags & VM_SHARED) {
@@ -1832,7 +1833,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 	 * we don't check for any regions that start beyond the EOF as there
 	 * shouldn't be any
 	 */
-	vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, 0, ULONG_MAX) {
+	vma_interval_tree_foreach(vma, root, 0, ULONG_MAX) {
 		if (!(vma->vm_flags & VM_SHARED))
 			continue;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..8df1b5077951 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -810,7 +810,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 		return -EINVAL;
 
 	lockdep_assert_held(&mapping->i_mmap_rwsem);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
+	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index,
 				  first_index + nr - 1) {
 		/* Clip to the vma */
 		vba = vma->vm_pgoff;
diff --git a/mm/rmap.c b/mm/rmap.c
index 99e1b3dc390b..6cfcdb96071f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -3051,7 +3051,7 @@ static void __rmap_walk_file(struct folio *folio, struct address_space *mapping,
 		i_mmap_lock_read(mapping);
 	}
 lookup:
-	vma_interval_tree_foreach(vma, &mapping->i_mmap,
+	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping),
 			pgoff_start, pgoff_end) {
 		unsigned long address = vma_address(vma, pgoff_start, nr_pages);
 
diff --git a/mm/vma.c b/mm/vma.c
index d90791b00a7b..6159650c1b42 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -234,7 +234,7 @@ static void __vma_link_file(struct vm_area_struct *vma,
 		mapping_allow_writable(mapping);
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_insert(vma, &mapping->i_mmap);
+	vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
 	flush_dcache_mmap_unlock(mapping);
 }
 
@@ -248,7 +248,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 		mapping_unmap_writable(mapping);
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_remove(vma, &mapping->i_mmap);
+	vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
 	flush_dcache_mmap_unlock(mapping);
 }
 
@@ -319,10 +319,11 @@ static void vma_prepare(struct vma_prepare *vp)
 
 	if (vp->file) {
 		flush_dcache_mmap_lock(vp->mapping);
-		vma_interval_tree_remove(vp->vma, &vp->mapping->i_mmap);
+		vma_interval_tree_remove(vp->vma,
+					get_i_mmap_root(vp->mapping));
 		if (vp->adj_next)
 			vma_interval_tree_remove(vp->adj_next,
-						 &vp->mapping->i_mmap);
+					get_i_mmap_root(vp->mapping));
 	}
 
 }
@@ -341,8 +342,9 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 	if (vp->file) {
 		if (vp->adj_next)
 			vma_interval_tree_insert(vp->adj_next,
-						 &vp->mapping->i_mmap);
-		vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap);
+					get_i_mmap_root(vp->mapping));
+		vma_interval_tree_insert(vp->vma,
+					get_i_mmap_root(vp->mapping));
 		flush_dcache_mmap_unlock(vp->mapping);
 	}
 
-- 
2.53.0



^ permalink raw reply related

* [PATCH v2 0/4] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-06-11  6:18 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie

  In NUMA, there are maybe many NUMA nodes and many CPUs.
For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
In the UnixBench tests, there is a test "execl" which tests
the execve system call.

  When we test our server with "./Run -c 384 execl",
the test result is not good enough. The i_mmap locks contended heavily on
"libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have 
over 6000 VMAs, all the VMAs can be in different NUMA mode.
The insert/remove operations do not run quickly enough.

patch 1 & patch 2 are try to hide the direct access of i_mmap.
patch 3 splits the i_mmap into sibling trees, each tree has separate lock,
and we can get better performance with this patch set in our NUMA server:
    we can get over 400% performance improvement.

I did not test the non-NUMA case, since I do not have such server.    
    
v1 --> v2:
	Not only split the immap tree, but also split the lock.
	v1 : https://lkml.org/lkml/2026/4/13/199

Huang Shijie (4):
  mm: use mapping_mapped to simplify the code
  mm: use get_i_mmap_root to access the file's i_mmap
  mm/fs: split the file's i_mmap tree
  docs/mm: update document for split i_mmap tree

 Documentation/mm/process_addrs.rst |  63 +++++++---
 arch/arm/mm/fault-armv.c           |   3 +-
 arch/arm/mm/flush.c                |   3 +-
 arch/nios2/mm/cacheflush.c         |   3 +-
 arch/parisc/kernel/cache.c         |   4 +-
 fs/Kconfig                         |   8 ++
 fs/dax.c                           |   3 +-
 fs/hugetlbfs/inode.c               |  30 +++--
 fs/inode.c                         |  75 +++++++++++-
 include/linux/fs.h                 | 179 ++++++++++++++++++++++++++++-
 include/linux/mm.h                 |  81 +++++++++++++
 include/linux/mm_types.h           |   3 +
 kernel/events/uprobes.c            |   3 +-
 mm/hugetlb.c                       |   7 +-
 mm/internal.h                      |   3 +-
 mm/khugepaged.c                    |   6 +-
 mm/memory-failure.c                |   8 +-
 mm/memory.c                        |   8 +-
 mm/mmap.c                          |  11 +-
 mm/nommu.c                         |  28 +++--
 mm/pagewalk.c                      |   4 +-
 mm/rmap.c                          |   2 +-
 mm/vma.c                           |  74 +++++++++---
 mm/vma_init.c                      |   3 +
 24 files changed, 534 insertions(+), 78 deletions(-)

-- 
2.53.0



^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox