From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mailgw1.hygon.cn (unknown [101.204.27.37]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 068B437DEBB; Fri, 12 Jun 2026 06:44:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=101.204.27.37 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781246691; cv=none; b=Ulx/M1sahAgmG617RQIpjSf2vu3JlFYC9CJ9eMsKmIE0u5lXj5R5GdaB7Uy/UBmtwAUNGFfSD4oonG99GN8RJ/x3w8fC2Efduo5ibFgNDHfaQyh7iPXZQmmn3v4OcsZTdy1ugQadTTV2M+PWoEbVu6UqoUCc42fXRBL56iLh/FQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781246691; c=relaxed/simple; bh=gUWejuYonhWxtwqO1Qh/RWR4s/z8BtTxCjuNLDz2CIw=; h=Date:From:To:CC:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=hUSCbvNcx9wolSpkJFQV4k7BU4opvruMl1BiHvbjIYdG0uN0+DZqineXiEONGYpj6S5Ep1XgHvCcrIaFXsLD3vWh9E+9agjqeoFmmxlwl2mUjNbpwGqsHBCP5hEVEpdY+jTr0sIRa9Jn5/c8Y2Ya5b9wFLOctkywtbw0WSLkjIA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=hygon.cn; spf=pass smtp.mailfrom=hygon.cn; arc=none smtp.client-ip=101.204.27.37 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=hygon.cn Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=hygon.cn Received: from maildlp2.hygon.cn (unknown [127.0.0.1]) by mailgw1.hygon.cn (Postfix) with ESMTP id 4gc93X0TTFz1dd8t; Fri, 12 Jun 2026 14:44:24 +0800 (CST) Received: from maildlp2.hygon.cn (unknown [172.23.18.61]) by mailgw1.hygon.cn (Postfix) with ESMTP id 4gc93V09B9z1dd89; Fri, 12 Jun 2026 14:44:22 +0800 (CST) Received: from cncheex04.Hygon.cn (unknown [172.23.18.114]) by maildlp2.hygon.cn (Postfix) with ESMTPS id F3B4334C3EC5; Fri, 12 Jun 2026 14:42:54 +0800 (CST) Received: from hsj-2U-Workstation (172.19.20.61) by cncheex04.Hygon.cn (172.23.18.114) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.36; Fri, 12 Jun 2026 14:44:16 +0800 Date: Fri, 12 Jun 2026 14:44:14 +0800 From: Huang Shijie To: Pedro Falcato CC: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Subject: Re: [PATCH v2 3/4] mm/fs: split the file's i_mmap tree Message-ID: References: <20260611061915.2354307-1-huangsj@hygon.cn> <20260611061915.2354307-4-huangsj@hygon.cn> Precedence: bulk X-Mailing-List: linux-trace-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: X-ClientProxiedBy: cncheex06.Hygon.cn (172.23.18.116) To cncheex04.Hygon.cn (172.23.18.114) On Thu, Jun 11, 2026 at 12:11:27PM +0100, Pedro Falcato wrote: > Hi, > > On Thu, Jun 11, 2026 at 02:18:59PM +0800, Huang Shijie wrote: > > In the UnixBench tests, there is a test "execl" which tests > > the execve system call. > > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs. > > When we test our server with "./Run -c 384 execl", > > the test result is not good enough. The i_mmap locks contended heavily on > > "libc.so" and "ld.so". The i_mmap tree for "libc.so" can be > > over 6000 VMAs, all the VMAs can be in different NUMA mode. The insert/remove > > operations do not run quickly enough. > > I _really_ would have appreciated some coordination here, because I said I was > going to take a look at it. I have something that I think is much simpler Okay, no problem. I waited for more then a month, I thought you are busy at other things. So I spent more then a week to finish the patch set v2. > in practice. These patches are also way too complex to be dropped just before > the merge window. > > Some comments: > > > > > In order to reduce the competition of the i_mmap lock, this patch does > > following: > > 1.) Split the single i_mmap tree into several sibling trees: > > Each tree has a lock. The CONFIG_SPLIT_I_MMAP is used to > > turn on/off this feature. > > There is no need for a config option. This needs to Just Work. > > > 2.) Introduce a new field "tree_idx" for vm_area_struct to save the > > sibling tree index for this VMA. > > This is possibly contentious, but there are holes in vm_area_struct. > So I think this is fine. > > > 3.) Introduce a new field "vma_count" for address_space. > > The new mapping_mapped() will use it. > > 4.) Rewrite the vma_interval_tree_foreach() > > 5.) Rewrite the lock functions. > > > > After this patch, the VMA insert/remove operations will work faster, > > and we can get over 400% performance improvement with the above test. > > > > Signed-off-by: Huang Shijie > > --- > > fs/Kconfig | 8 ++ > > fs/hugetlbfs/inode.c | 20 ++++- > > fs/inode.c | 75 ++++++++++++++++- > > include/linux/fs.h | 174 ++++++++++++++++++++++++++++++++++++++- > > include/linux/mm.h | 80 ++++++++++++++++++ > > include/linux/mm_types.h | 3 + > > mm/internal.h | 3 +- > > mm/mmap.c | 11 ++- > > mm/nommu.c | 23 ++++-- > > mm/pagewalk.c | 2 +- > > mm/vma.c | 72 +++++++++++----- > > mm/vma_init.c | 3 + > > 12 files changed, 436 insertions(+), 38 deletions(-) > > > > diff --git a/fs/Kconfig b/fs/Kconfig > > index 43cb06de297f..e24804f70432 100644 > > --- a/fs/Kconfig > > +++ b/fs/Kconfig > > @@ -9,6 +9,14 @@ menu "File systems" > > config DCACHE_WORD_ACCESS > > bool > > > > +config SPLIT_I_MMAP > > + bool "Split the file's i_mmap to several trees" > > + default n > > + help > > + Split the file's i_mmap to several trees, each tree has a separate > > + lock. This will reduce the lock contention of file's i_mmap tree, > > + but it will cost more memory for per inode. > > + > > config VALIDATE_FS_PARSER > > bool "Validate filesystem parameter description" > > help > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > > index da5b41ea5bdd..68d8308418dd 100644 > > --- a/fs/hugetlbfs/inode.c > > +++ b/fs/hugetlbfs/inode.c > > @@ -891,6 +891,23 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, > > */ > > static struct lock_class_key hugetlbfs_i_mmap_rwsem_key; > > > > +#ifdef CONFIG_SPLIT_I_MMAP > > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping) > > +{ > > + int i; > > + > > + for (i = 0; i < split_tree_num; i++) { > > + lockdep_set_class(&mapping->i_mmap[i].rwsem, > > + &hugetlbfs_i_mmap_rwsem_key); > > + } > > +} > > +#else > > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping) > > +{ > > + lockdep_set_class(&mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key); > > +} > > +#endif > > + > > static struct inode *hugetlbfs_get_inode(struct super_block *sb, > > struct mnt_idmap *idmap, > > struct inode *dir, > > @@ -915,8 +932,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, > > > > inode->i_ino = get_next_ino(); > > inode_init_owner(idmap, inode, dir, mode); > > - lockdep_set_class(&inode->i_mapping->i_mmap_rwsem, > > - &hugetlbfs_i_mmap_rwsem_key); > > + hugetlbfs_lockdep_set_class(inode->i_mapping); > > inode->i_mapping->a_ops = &hugetlbfs_aops; > > simple_inode_init_ts(inode); > > info->resv_map = resv_map; > > diff --git a/fs/inode.c b/fs/inode.c > > index 62c579a0cf7d..cb67ae83f5b3 100644 > > --- a/fs/inode.c > > +++ b/fs/inode.c > > @@ -214,6 +214,70 @@ static int no_open(struct inode *inode, struct file *file) > > return -ENXIO; > > } > > > > +#ifdef CONFIG_SPLIT_I_MMAP > > +int split_tree_num; > > +static int split_tree_align __maybe_unused = 32; > > + > > +static void __init init_split_tree_num(void) > > +{ > > +#ifdef CONFIG_NUMA > > + split_tree_num = nr_node_ids; > > +#else > > + split_tree_num = ALIGN(nr_cpu_ids, split_tree_align); > > +#endif > > +} > > Again, too configurable. I think you're too stuck up on the NUMA case - If you do not care about the NUMA. The performance will _NOT_ get improved in our NUMA server. I had ever tested code which do not care about the NUMA, and I got a bad performance. Avoid the remote access is a very important thing for the NUMA server. > which does not matter for many people - and may actively harm NUMA users. If > I have a 128 core 2 NUMA node system, what should I shard by? It is easy to extend the tree number for NUMA. :) For the 128 core 2 NUMA, we can extend to more trees, such as: Two trees for each NUMA node. > > > + > > +static void free_mapping_i_mmap(struct address_space *mapping) > > +{ > > + int i; > > + > > + if (!mapping->i_mmap) > > + return; > > + > > + for (i = 0; i < split_tree_num; i++) > > + kfree(mapping->i_mmap[i]); > > + > > + kfree(mapping->i_mmap); > > + mapping->i_mmap = NULL; > > +} > > + > > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp) > > +{ > > + struct i_mmap_tree *tree; > > + int i; > > + > > + /* The extra one is used as terminator in vma_interval_tree_foreach() */ > > + mapping->i_mmap = kzalloc(sizeof(tree) * (split_tree_num + 1), gfp); > > + if (!mapping->i_mmap) > > + return -ENOMEM; > > + > > + for (i = 0; i < split_tree_num; i++) { > > + tree = kzalloc_node(sizeof(*tree), gfp, i); > > + if (!tree) > > + goto nomem; > > + > > + tree->root = RB_ROOT_CACHED; > > + init_rwsem(&tree->rwsem); > > This (as-is) should blow up with lockdep + the locking loops down there. okay, I will check it later. thanks a lot. > > > + > > + mapping->i_mmap[i] = tree; > > + } > > + return 0; > > +nomem: > > + free_mapping_i_mmap(mapping); > > + return -ENOMEM; > > +} > > Honestly, it's likely that a simple static array in struct address_space The array size is not fixed, so we cannot add a static array in address_space. > suffices. I would not go through the trouble of getting everything very > tight and NUMA correct. > > > +#else > > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp) > > +{ > > + mapping->i_mmap = RB_ROOT_CACHED; > > + init_rwsem(&mapping->i_mmap_rwsem); > > + return 0; > > +} > > + > > +static void free_mapping_i_mmap(struct address_space *mapping) { } > > +static void __init init_split_tree_num(void) {} > > +#endif > > + > > /** > > * inode_init_always_gfp - perform inode structure initialisation > > * @sb: superblock inode belongs to > > @@ -302,9 +366,14 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp > > #endif > > inode->i_flctx = NULL; > > > > - if (unlikely(security_inode_alloc(inode, gfp))) > > + if (init_mapping_i_mmap(mapping, gfp)) > > return -ENOMEM; > > > > + if (unlikely(security_inode_alloc(inode, gfp))) { > > + free_mapping_i_mmap(mapping); > > + return -ENOMEM; > > + } > > + > > this_cpu_inc(nr_inodes); > > > > return 0; > > @@ -380,6 +449,7 @@ void __destroy_inode(struct inode *inode) > > if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl)) > > posix_acl_release(inode->i_default_acl); > > #endif > > + free_mapping_i_mmap(&inode->i_data); > > this_cpu_dec(nr_inodes); > > } > > EXPORT_SYMBOL(__destroy_inode); > > @@ -480,9 +550,7 @@ EXPORT_SYMBOL(inc_nlink); > > static void __address_space_init_once(struct address_space *mapping) > > { > > xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT); > > - init_rwsem(&mapping->i_mmap_rwsem); > > spin_lock_init(&mapping->i_private_lock); > > - mapping->i_mmap = RB_ROOT_CACHED; > > } > > > > void address_space_init_once(struct address_space *mapping) > > @@ -2619,6 +2687,7 @@ void __init inode_init(void) > > &i_hash_mask, > > 0, > > 0); > > + init_split_tree_num(); > > } > > > > void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index cd46615b8f53..f4b3645b61df 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -450,6 +450,25 @@ struct mapping_metadata_bhs { > > struct list_head list; /* The list of bhs (b_assoc_buffers) */ > > }; > > > > +#ifdef CONFIG_SPLIT_I_MMAP > > +/* > > + * struct i_mmap_tree - A single sibling tree of the file's split i_mmap. > > + * @root: The red/black interval tree root. > > + * @rwsem: Protects insert/remove operations on this sibling tree. > > + * @vma_count: Number of VMAs in this sibling tree. > > + * > > + * When CONFIG_SPLIT_I_MMAP is enabled, the file's single i_mmap tree is > > + * split into split_tree_num sibling trees, each with its own lock. This > > + * reduces lock contention by allowing concurrent VMA insert/remove > > + * operations on different sibling trees. > > + */ > > +struct i_mmap_tree { > > + struct rb_root_cached root; > > + struct rw_semaphore rwsem; > > + atomic_t vma_count; > > I don't see what you need this vma_count for? I get the one in address_space, > but this one does not seem useful. For non-NUMA case, we can use it to determine which tree we should put the new VMA. Round-robin is not good enough for a dynamic system. > > > +}; > > +#endif > > + > > /** > > * struct address_space - Contents of a cacheable, mappable object. > > * @host: Owner, either the inode or the block_device. > > @@ -461,8 +480,13 @@ struct mapping_metadata_bhs { > > * @gfp_mask: Memory allocation flags to use for allocating pages. > > * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings. > > * @nr_thps: Number of THPs in the pagecache (non-shmem only). > > - * @i_mmap: Tree of private and shared mappings. > > - * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable. > > + * @i_mmap: Tree of private and shared mappings. When CONFIG_SPLIT_I_MMAP > > + * is enabled, this is an array of split_tree_num struct i_mmap_tree > > + * pointers (plus a NULL terminator). > > NULL terminator wastes more memory, so I would really strongly avoid it as > well. any better idea? > > > + * @vma_count: Total number of VMAs across all sibling trees (only when > > + * CONFIG_SPLIT_I_MMAP is enabled). Used by mapping_mapped(). > > + * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable (only when > > + * CONFIG_SPLIT_I_MMAP is disabled; otherwise per-tree rwsem is used). > > So, there are very good reasons why you still need an i_mmap_rwsem protecting > state, even with split mmap trees. Which I'll go into later. > > > * @nrpages: Number of page entries, protected by the i_pages lock. > > * @writeback_index: Writeback starts here. > > * @a_ops: Methods. > > @@ -480,14 +504,19 @@ struct address_space { > > /* number of thp, only for non-shmem files */ > > atomic_t nr_thps; > > #endif > > +#ifdef CONFIG_SPLIT_I_MMAP > > + struct i_mmap_tree **i_mmap; > > + atomic_t vma_count; > > +#else > > struct rb_root_cached i_mmap; > > + struct rw_semaphore i_mmap_rwsem; > > +#endif > > unsigned long nrpages; > > pgoff_t writeback_index; > > const struct address_space_operations *a_ops; > > unsigned long flags; > > errseq_t wb_err; > > spinlock_t i_private_lock; > > - struct rw_semaphore i_mmap_rwsem; > > See d3b1a9a778e1 ("fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.") Got it. > > > } __attribute__((aligned(sizeof(long)))) __randomize_layout; > > /* > > * On most architectures that alignment is already the case; but > > @@ -508,6 +537,133 @@ static inline bool mapping_tagged(const struct address_space *mapping, xa_mark_t > > return xa_marked(&mapping->i_pages, tag); > > } > > > > +#ifdef CONFIG_SPLIT_I_MMAP > > +static inline int mapping_mapped(const struct address_space *mapping) > > +{ > > + return atomic_read(&mapping->vma_count); > > Now that I think of it, I don't think we need atomic_t, only unsigned long + > READ_ONCE() suffices. Increments can race just fine, we don't expect any > consistency there - if you want consistency you probably hold the i_mmap lock. > okay. I will check it. > > +} > > + > > +static inline void inc_mapping_vma(struct address_space *mapping, > > + struct vm_area_struct *vma) > > +{ > > + struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx]; > > + > > + atomic_inc(&tree->vma_count); > > + atomic_inc(&mapping->vma_count); > > +} > > + > > +static inline void dec_mapping_vma(struct address_space *mapping, > > + struct vm_area_struct *vma) > > +{ > > + struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx]; > > + > > + atomic_dec(&tree->vma_count); > > + atomic_dec(&mapping->vma_count); > > +} > > This probably shouldn't be in linux/fs.h. > > > + > > +static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping) > > +{ > > + return (struct rb_root_cached *)mapping->i_mmap; > > +} > > + > > +static inline void i_mmap_tree_lock_write(struct address_space *mapping, > > + struct vm_area_struct *vma) > > +{ > > + struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx]; > > + > > + down_write(&tree->rwsem); > > +} > > + > > +static inline void i_mmap_tree_unlock_write(struct address_space *mapping, > > + struct vm_area_struct *vma) > > +{ > > + struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx]; > > + > > + up_write(&tree->rwsem); > > +} > > + > > +#define i_mmap_lock_write_prepare(mapping) > > +#define i_mmap_unlock_write_complete(mapping) > > It's unclear to me why you added write_prepare() and write_complete(). > > > + > > +extern int split_tree_num; > > +static inline void i_mmap_lock_write(struct address_space *mapping) > > +{ > > + int i; > > + > > + for (i = 0; i < split_tree_num; i++) > > + down_write(&mapping->i_mmap[i]->rwsem); > > Oof, this is an incredibly large hammer. This is basically why I think keeping > i_mmap_rwsem (in a different form) is required. You do not want to take $nr_cpus > locks (read _or_ write). For my design, I keep i_mmap_rwsem, but I invert its > meaning - taking it in write = I'm reading from the tree; taking it in read = > I'm writing to the tree. This provides some lighter-weight exclusion between > rmap walks and rmap tree manipulation. okay, it seem your method is better. I am waiting for your patch. > > _Technically_, you shouldn't need to always take a lock when manipulating the > tree. A pattern like mnt_hold_writers()/mnt_get_write_access() can probably > work well. But it may be too complex ATM. > > > Also, note that you pretty much do not want i_mmap_lock_write() users after > the conversion is done. > > > +} > > + > > +static inline int i_mmap_trylock_write(struct address_space *mapping) > > +{ > > + int i; > > + > > + for (i = 0; i < split_tree_num; i++) { > > + if (!down_write_trylock(&mapping->i_mmap[i]->rwsem)) { > > + while (i--) > > + up_write(&mapping->i_mmap[i]->rwsem); > > + return 0; > > + } > > + } > > + return 1; > > +} > > + > > +static inline void i_mmap_unlock_write(struct address_space *mapping) > > +{ > > + int i; > > + > > + for (i = 0; i < split_tree_num; i++) > > + up_write(&mapping->i_mmap[i]->rwsem); > > +} > > + > > +static inline int i_mmap_trylock_read(struct address_space *mapping) > > +{ > > + int i; > > + > > + for (i = 0; i < split_tree_num; i++) { > > + if (!down_read_trylock(&mapping->i_mmap[i]->rwsem)) { > > + while (i--) > > + up_read(&mapping->i_mmap[i]->rwsem); > > + return 0; > > + } > > + } > > + return 1; > > +} > > + > > +static inline void i_mmap_lock_read(struct address_space *mapping) > > +{ > > + int i; > > + > > + for (i = 0; i < split_tree_num; i++) > > + down_read(&mapping->i_mmap[i]->rwsem); > > +} > > + > > +static inline void i_mmap_unlock_read(struct address_space *mapping) > > +{ > > + int i; > > + > > + for (i = 0; i < split_tree_num; i++) > > + up_read(&mapping->i_mmap[i]->rwsem); > > +} > > + > > +static inline void i_mmap_assert_locked(struct address_space *mapping) > > +{ > > + int i; > > + > > + for (i = 0; i < split_tree_num; i++) > > + lockdep_assert_held(&mapping->i_mmap[i]->rwsem); > > +} > > + > > +static inline void i_mmap_assert_write_locked(struct address_space *mapping) > > +{ > > + int i; > > + > > + for (i = 0; i < split_tree_num; i++) > > + lockdep_assert_held_write(&mapping->i_mmap[i]->rwsem); > > +} > > + > > +#else > > + > > static inline void i_mmap_lock_write(struct address_space *mapping) > > { > > down_write(&mapping->i_mmap_rwsem); > > @@ -561,6 +717,18 @@ static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mappi > > return &mapping->i_mmap; > > } > > > > +static inline void inc_mapping_vma(struct address_space *mapping, > > + struct vm_area_struct *vma) { } > > +static inline void dec_mapping_vma(struct address_space *mapping, > > + struct vm_area_struct *vma) { } > > + > > +#define i_mmap_lock_write_prepare(mapping) i_mmap_lock_write(mapping) > > +#define i_mmap_unlock_write_complete(mapping) i_mmap_unlock_write(mapping) > > +#define i_mmap_tree_lock_write(mapping, vma) > > +#define i_mmap_tree_unlock_write(mapping, vma) > > + > > +#endif > > + > > /* > > * Might pages of this file have been modified in userspace? > > * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 0a45c6a8b9f2..9aa8119fa9bf 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -4041,11 +4041,91 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root, > > struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node, > > unsigned long start, unsigned long last); > > > > +#ifdef CONFIG_SPLIT_I_MMAP > > +extern int split_tree_num; > > + > > +static inline int smallest_tree_idx(struct file *file) > > +{ > > + struct address_space *mapping = file->f_mapping; > > + int tmp = INT_MAX, count; > > + int i, j = 0; > > + > > + /* > > + * Since a not 100% accurate value is still okay, > > + * we do not need any lock here. > > + */ > > + for (i = 0; i < split_tree_num; i++) { > > + count = atomic_read(&mapping->i_mmap[i]->vma_count); > > + if (count < tmp) { > > + j = i; > > + tmp = count; > > + if (!tmp) > > + break; > > + } > > + } > > Ohh, I see why you want the per-subtree vma_count now. But is this a net-win? It keep the trees as even as possible. > I think doing something like vma-pointer-hashing or just smp_processor_id() > would work a-ok. > > > + return j; > > +} > > + > > +static inline void vma_set_tree_idx(struct vm_area_struct *vma) > > +{ > > +#ifdef CONFIG_NUMA > > + vma->tree_idx = numa_node_id(); > > +#else > > + vma->tree_idx = smallest_tree_idx(vma->vm_file); > > +#endif > > +} > > + > > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma, > > + struct address_space *mapping) > > +{ > > + return &mapping->i_mmap[vma->tree_idx]->root; > > +} > > + > > +/* Find the first valid VMA in the sibling trees */ > > +static inline struct vm_area_struct *first_vma(struct i_mmap_tree ***__r, > > + unsigned long start, unsigned long last) > > +{ > > + struct vm_area_struct *vma = NULL; > > + struct i_mmap_tree **tree = *__r; > > + struct rb_root_cached *root; > > + > > + while (*tree) { > > + root = &(*tree)->root; > > + tree++; > > + vma = vma_interval_tree_iter_first(root, start, last); > > + if (vma) > > + break; > > + } > > + > > + /* Save for the next loop */ > > + *__r = tree; > > + return vma; > > +} > > + > > +/* > > + * Please use get_i_mmap_root() to get the @root. > > + * @_tmp is referenced to avoid unused variable warning. > > + */ > > +#define vma_interval_tree_foreach(vma, root, start, last) \ > > + for (struct i_mmap_tree **_r = (struct i_mmap_tree **)(root), \ > > + **_tmp = (vma = first_vma(&_r, start, last)) ? _r : NULL;\ > > + ((_tmp && vma) || (vma = first_vma(&_r, start, last))); \ > > + vma = vma_interval_tree_iter_next(vma, start, last)) > > +#else > > /* Please use get_i_mmap_root() to get the @root */ > > #define vma_interval_tree_foreach(vma, root, start, last) \ > > for (vma = vma_interval_tree_iter_first(root, start, last); \ > > vma; vma = vma_interval_tree_iter_next(vma, start, last)) > > > > +static inline void vma_set_tree_idx(struct vm_area_struct *vma) { } > > + > > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma, > > + struct address_space *mapping) > > +{ > > + return &mapping->i_mmap; > > +} > > +#endif > > + > > void anon_vma_interval_tree_insert(struct anon_vma_chain *node, > > struct rb_root_cached *root); > > void anon_vma_interval_tree_remove(struct anon_vma_chain *node, > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index a308e2c23b82..8d6aab3346ce 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -1072,6 +1072,9 @@ struct vm_area_struct { > > #ifdef __HAVE_PFNMAP_TRACKING > > struct pfnmap_track_ctx *pfnmap_track_ctx; > > #endif > > +#ifdef CONFIG_SPLIT_I_MMAP > > + int tree_idx; /* The sibling tree index for the VMA */ > > +#endif > > FTR the struct hole isn't here, but right after vm_lock_seq or vm_refcnt in > most configs. okay, thanks. I did not notice the struct hole issue. > > > } __randomize_layout; > > > > /* Clears all bits in the VMA flags bitmap, non-atomically. */ > > diff --git a/mm/internal.h b/mm/internal.h > > index 5a2ddcf68e0b..2d35cacffd19 100644 > > --- a/mm/internal.h > > +++ b/mm/internal.h > > @@ -1888,7 +1888,8 @@ static inline void maybe_rmap_unlock_action(struct vm_area_struct *vma, > > > > VM_WARN_ON_ONCE(vma_is_anonymous(vma)); > > file = vma->vm_file; > > - i_mmap_unlock_write(file->f_mapping); > > + i_mmap_tree_unlock_write(file->f_mapping, vma); > > + i_mmap_unlock_write_complete(file->f_mapping); > > action->hide_from_rmap_until_complete = false; > > } > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index d714fdb357e5..70036ec9dcaa 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1825,15 +1825,20 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) > > struct address_space *mapping = file->f_mapping; > > > > get_file(file); > > - i_mmap_lock_write(mapping); > > + i_mmap_lock_write_prepare(mapping); > > + i_mmap_tree_lock_write(mapping, mpnt); > > + > > if (vma_is_shared_maywrite(tmp)) > > mapping_allow_writable(mapping); > > flush_dcache_mmap_lock(mapping); > > /* insert tmp into the share list, just after mpnt */ > > vma_interval_tree_insert_after(tmp, mpnt, > > - get_i_mmap_root(mapping)); > > + get_rb_root(mpnt, mapping)); > > + inc_mapping_vma(mapping, tmp); > > Honestly, would prefer to hide all of these details from mmap. yes, we can. But we need to change the functions in mm/interval_tree.c > > > flush_dcache_mmap_unlock(mapping); > > - i_mmap_unlock_write(mapping); > > + > > + i_mmap_tree_unlock_write(mapping, mpnt); > > + i_mmap_unlock_write_complete(mapping); > > } > > > > if (!(tmp->vm_flags & VM_WIPEONFORK)) > > diff --git a/mm/nommu.c b/mm/nommu.c > > index 0f18ffc658e9..1f2c60a220f6 100644 > > --- a/mm/nommu.c > > +++ b/mm/nommu.c > > @@ -567,11 +567,16 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm) > > if (vma->vm_file) { > > struct address_space *mapping = vma->vm_file->f_mapping; > > > > - i_mmap_lock_write(mapping); > > + i_mmap_lock_write_prepare(mapping); > > + i_mmap_tree_lock_write(mapping, vma); > > + > > flush_dcache_mmap_lock(mapping); > > - vma_interval_tree_insert(vma, get_i_mmap_root(mapping)); > > + vma_interval_tree_insert(vma, get_rb_root(vma, mapping)); > > + inc_mapping_vma(mapping, vma); > > flush_dcache_mmap_unlock(mapping); > > - i_mmap_unlock_write(mapping); > > + > > + i_mmap_tree_unlock_write(mapping, vma); > > + i_mmap_unlock_write_complete(mapping); > > } > > } > > > > @@ -583,11 +588,16 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma) > > struct address_space *mapping; > > mapping = vma->vm_file->f_mapping; > > > > - i_mmap_lock_write(mapping); > > + i_mmap_lock_write_prepare(mapping); > > + i_mmap_tree_lock_write(mapping, vma); > > + > > flush_dcache_mmap_lock(mapping); > > - vma_interval_tree_remove(vma, get_i_mmap_root(mapping)); > > + vma_interval_tree_remove(vma, get_rb_root(vma, mapping)); > > + dec_mapping_vma(mapping, vma); > > flush_dcache_mmap_unlock(mapping); > > - i_mmap_unlock_write(mapping); > > + > > + i_mmap_tree_unlock_write(mapping, vma); > > + i_mmap_unlock_write_complete(mapping); > > } > > } > > > > @@ -1063,6 +1073,7 @@ unsigned long do_mmap(struct file *file, > > if (file) { > > region->vm_file = get_file(file); > > vma->vm_file = get_file(file); > > + vma_set_tree_idx(vma); > > This is unrelated, shouldn't be done here. > > > } > > > > down_write(&nommu_region_sem); > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > > index 8df1b5077951..d5745519d95a 100644 > > --- a/mm/pagewalk.c > > +++ b/mm/pagewalk.c > > @@ -809,7 +809,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index, > > if (!check_ops_safe(ops)) > > return -EINVAL; > > > > - lockdep_assert_held(&mapping->i_mmap_rwsem); > > + i_mmap_assert_locked(mapping); > > This kind of conversion should be done in a separate step. > > > vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index, > > first_index + nr - 1) { > > /* Clip to the vma */ > > diff --git a/mm/vma.c b/mm/vma.c > > index 6159650c1b42..2055758064a9 100644 > > --- a/mm/vma.c > > +++ b/mm/vma.c > > @@ -234,22 +234,23 @@ static void __vma_link_file(struct vm_area_struct *vma, > > mapping_allow_writable(mapping); > > > > flush_dcache_mmap_lock(mapping); > > - vma_interval_tree_insert(vma, get_i_mmap_root(mapping)); > > + vma_interval_tree_insert(vma, get_rb_root(vma, mapping)); > > + inc_mapping_vma(mapping, vma); > > inc_mapping_vma() should probably be done implicitly by insertion? Yes, we can. It is more grace to hide it in vma_interval_tree_insert. > > > flush_dcache_mmap_unlock(mapping); > > } > > > > -/* > > - * Requires inode->i_mapping->i_mmap_rwsem > > - */ > > static void __remove_shared_vm_struct(struct vm_area_struct *vma, > > struct address_space *mapping) > > { > > + i_mmap_tree_lock_write(mapping, vma); > > if (vma_is_shared_maywrite(vma)) > > mapping_unmap_writable(mapping); > > > > flush_dcache_mmap_lock(mapping); > > - vma_interval_tree_remove(vma, get_i_mmap_root(mapping)); > > + vma_interval_tree_remove(vma, get_rb_root(vma, mapping)); > > + dec_mapping_vma(mapping, vma); > > flush_dcache_mmap_unlock(mapping); > > + i_mmap_tree_unlock_write(mapping, vma); > > } > > > > /* > > @@ -297,8 +298,9 @@ static void vma_prepare(struct vma_prepare *vp) > > uprobe_munmap(vp->adj_next, vp->adj_next->vm_start, > > vp->adj_next->vm_end); > > > > - i_mmap_lock_write(vp->mapping); > > + i_mmap_lock_write_prepare(vp->mapping); > > if (vp->insert && vp->insert->vm_file) { > > + i_mmap_tree_lock_write(vp->mapping, vp->insert); > > /* > > * Put into interval tree now, so instantiated pages > > * are visible to arm/parisc __flush_dcache_page > > @@ -307,6 +309,7 @@ static void vma_prepare(struct vma_prepare *vp) > > */ > > __vma_link_file(vp->insert, > > vp->insert->vm_file->f_mapping); > > + i_mmap_tree_unlock_write(vp->mapping, vp->insert); > > } > > } > > > > @@ -318,12 +321,17 @@ static void vma_prepare(struct vma_prepare *vp) > > } > > > > if (vp->file) { > > + i_mmap_tree_lock_write(vp->mapping, vp->vma); > > flush_dcache_mmap_lock(vp->mapping); > > vma_interval_tree_remove(vp->vma, > > - get_i_mmap_root(vp->mapping)); > > - if (vp->adj_next) > > + get_rb_root(vp->vma, vp->mapping)); > > + dec_mapping_vma(vp->mapping, vp->vma); > > + if (vp->adj_next) { > > + i_mmap_tree_lock_write(vp->mapping, vp->adj_next); > > vma_interval_tree_remove(vp->adj_next, > > - get_i_mmap_root(vp->mapping)); > > + get_rb_root(vp->adj_next, vp->mapping)); > > + dec_mapping_vma(vp->mapping, vp->adj_next); > > + } > > } > > > > } > > @@ -340,12 +348,17 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, > > struct mm_struct *mm) > > { > > if (vp->file) { > > - if (vp->adj_next) > > + if (vp->adj_next) { > > vma_interval_tree_insert(vp->adj_next, > > - get_i_mmap_root(vp->mapping)); > > + get_rb_root(vp->adj_next, vp->mapping)); > > + inc_mapping_vma(vp->mapping, vp->adj_next); > > + i_mmap_tree_unlock_write(vp->mapping, vp->adj_next); > > + } > > vma_interval_tree_insert(vp->vma, > > - get_i_mmap_root(vp->mapping)); > > + get_rb_root(vp->vma, vp->mapping)); > > + inc_mapping_vma(vp->mapping, vp->vma); > > flush_dcache_mmap_unlock(vp->mapping); > > + i_mmap_tree_unlock_write(vp->mapping, vp->vma); > > } > > > > if (vp->remove && vp->file) { > > @@ -370,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, > > } > > > > if (vp->file) { > > - i_mmap_unlock_write(vp->mapping); > > + i_mmap_unlock_write_complete(vp->mapping); > > > > if (!vp->skip_vma_uprobe) { > > uprobe_mmap(vp->vma); > > @@ -1799,12 +1812,12 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb) > > int i; > > > > mapping = vb->vmas[0]->vm_file->f_mapping; > > - i_mmap_lock_write(mapping); > > + i_mmap_lock_write_prepare(mapping); > > for (i = 0; i < vb->count; i++) { > > VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping); > > __remove_shared_vm_struct(vb->vmas[i], mapping); > > } > > - i_mmap_unlock_write(mapping); > > + i_mmap_unlock_write_complete(mapping); > > > > unlink_file_vma_batch_init(vb); > > } > > @@ -1836,10 +1849,13 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock) > > > > if (file) { > > mapping = file->f_mapping; > > - i_mmap_lock_write(mapping); > > + i_mmap_lock_write_prepare(mapping); > > + i_mmap_tree_lock_write(mapping, vma); > > __vma_link_file(vma, mapping); > > - if (!hold_rmap_lock) > > - i_mmap_unlock_write(mapping); > > + if (!hold_rmap_lock) { > > + i_mmap_tree_unlock_write(mapping, vma); > > + i_mmap_unlock_write_complete(mapping); > > + } > > } > > } > > > > @@ -2164,6 +2180,23 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) > > } > > } > > I can but hope that all of the above is quite simplified before we get to the > "making file rmap more complicated" bit. :( If we can do not care about the ARM device, we can make it simple. Thanks Huang Shijie