linux-kernel.vger.kernel.org archive mirror
* [PATCH v9 00/17] reimplement per-vma lock as a refcount
@ 2025-01-11  4:25 Suren Baghdasaryan
  2025-01-11  4:25 ` [PATCH v9 01/17] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
                   ` (19 more replies)
  0 siblings, 20 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When that split-away part is a lock, it complicates
things even further. With no performance benefits, there are no reasons
for this split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
This patchset:
1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned to minimize
cacheline sharing;
2. changes vm_area_struct initialization to mark new vma as detached until
it is inserted into vma tree;
3. replaces vm_lock and vma->detached flag with a reference counter;
4. regroups vm_area_struct members to fit them into 3 cachelines;
5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
reuse and to minimize call_rcu() calls.
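
Schematically, the end state of the points above looks as follows (a rough
sketch only; field order, types and annotations are simplified, the exact
layout is in the patches themselves):

    /* before this series (simplified) */
    struct vm_area_struct {
            ...
            unsigned int vm_lock_seq;
            struct vma_lock *vm_lock;  /* separately allocated rw_semaphore */
            bool detached;
            ...
    };

    /* after this series (simplified) */
    struct vm_area_struct {
            ...
            unsigned int vm_lock_seq;
            refcount_t vm_refcnt;      /* 0 == detached, top usable bit == writer */
            ...
    };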

Pagefault microbenchmarks show performance improvement, most pronounced at
higher core counts:
Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*

Changes since v8 [4]:
- Changed the subject of the cover letter, per Vlastimil Babka
- Added Reviewed-by and Acked-by, per Vlastimil Babka
- Added static check for no-limit case in __refcount_add_not_zero_limited,
per David Laight
- Fixed vma_refcount_put() to call rwsem_release() unconditionally,
per Hillf Danton and Vlastimil Babka
- Used a copy of vma->vm_mm in vma_refcount_put() in case the vma is freed
from under us, per Vlastimil Babka
- Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
per Vlastimil Babka
- Changed __vma_enter_locked() parameter to centralize refcount logic,
per Vlastimil Babka
- Amended description in vm_lock replacement patch explaining the effects
of the patch on vm_area_struct size, per Vlastimil Babka
- Added vm_area_struct member regrouping patch [5] into the series
- Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
- Added a comment for vm_area_struct to update vm_area_init_from() when
adding new members, per Vlastimil Babka
- Updated a comment about unstable src->shared.rb when copying a vma in
vm_area_init_from(), per Vlastimil Babka

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
[4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
[5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/

Patchset applies over mm-unstable after reverting v8
(current SHA range: 235b5129cb7b - 9e6b24c58985)

Suren Baghdasaryan (17):
  mm: introduce vma_start_read_locked{_nested} helpers
  mm: move per-vma lock into vm_area_struct
  mm: mark vma as detached until it's added into vma tree
  mm: introduce vma_iter_store_attached() to use with attached vmas
  mm: mark vmas detached upon exit
  types: move struct rcuwait into types.h
  mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
  mm: move mmap_init_lock() out of the header file
  mm: uninline the main body of vma_start_write()
  refcount: introduce __refcount_{add|inc}_not_zero_limited
  mm: replace vm_lock and detached flag with a reference count
  mm: move lesser used vma_area_struct members into the last cacheline
  mm/debug: print vm_refcnt state when dumping the vma
  mm: remove extra vma_numab_state_init() call
  mm: prepare lock_vma_under_rcu() for vma reuse possibility
  mm: make vma cache SLAB_TYPESAFE_BY_RCU
  docs/mm: document latest changes to vm_lock

 Documentation/mm/process_addrs.rst |  44 ++++----
 include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
 include/linux/mm_types.h           |  75 +++++++-------
 include/linux/mmap_lock.h          |   6 --
 include/linux/rcuwait.h            |  13 +--
 include/linux/refcount.h           |  24 ++++-
 include/linux/slab.h               |   6 --
 include/linux/types.h              |  12 +++
 kernel/fork.c                      | 129 +++++++++++-------------
 mm/debug.c                         |  12 +++
 mm/init-mm.c                       |   1 +
 mm/memory.c                        |  97 ++++++++++++++++--
 mm/mmap.c                          |   3 +-
 mm/userfaultfd.c                   |  32 +++---
 mm/vma.c                           |  23 ++---
 mm/vma.h                           |  15 ++-
 tools/testing/vma/linux/atomic.h   |   5 +
 tools/testing/vma/vma_internal.h   |  93 ++++++++---------
 18 files changed, 465 insertions(+), 281 deletions(-)

-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 01/17] mm: introduce vma_start_read_locked{_nested} helpers
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-11  4:25 ` [PATCH v9 02/17] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb, Liam R. Howlett

Introduce helper functions which can be used to read-lock a VMA when
holding mmap_lock for read.  Replace direct accesses to vma->vm_lock with
these new helpers.
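
For example, the intended lock-handoff pattern (a minimal sketch based on
the userfaultfd hunks below) is:

    mmap_read_lock(mm);
    vma = find_vma_and_prepare_anon(mm, address);
    if (!IS_ERR(vma))
            vma_start_read_locked(vma); /* cannot fail while mmap_lock is held */
    mmap_read_unlock(mm);
    /* ... use the vma under the per-VMA read lock ... */
    if (!IS_ERR(vma))
            vma_end_read(vma);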

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h | 24 ++++++++++++++++++++++++
 mm/userfaultfd.c   | 22 +++++-----------------
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..1c0250c187f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -735,6 +735,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	return true;
 }
 
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read_nested(&vma->vm_lock->lock, subclass);
+}
+
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked(struct vm_area_struct *vma)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read(&vma->vm_lock->lock);
+}
+
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af3dfc3633db..4527c385935b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -84,16 +84,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
 
 	mmap_read_lock(mm);
 	vma = find_vma_and_prepare_anon(mm, address);
-	if (!IS_ERR(vma)) {
-		/*
-		 * We cannot use vma_start_read() as it may fail due to
-		 * false locked (see comment in vma_start_read()). We
-		 * can avoid that by directly locking vm_lock under
-		 * mmap_lock, which guarantees that nobody can lock the
-		 * vma for write (vma_start_write()) under us.
-		 */
-		down_read(&vma->vm_lock->lock);
-	}
+	if (!IS_ERR(vma))
+		vma_start_read_locked(vma);
 
 	mmap_read_unlock(mm);
 	return vma;
@@ -1491,14 +1483,10 @@ static int uffd_move_lock(struct mm_struct *mm,
 	mmap_read_lock(mm);
 	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
 	if (!err) {
-		/*
-		 * See comment in uffd_lock_vma() as to why not using
-		 * vma_start_read() here.
-		 */
-		down_read(&(*dst_vmap)->vm_lock->lock);
+		vma_start_read_locked(*dst_vmap);
 		if (*dst_vmap != *src_vmap)
-			down_read_nested(&(*src_vmap)->vm_lock->lock,
-					 SINGLE_DEPTH_NESTING);
+			vma_start_read_locked_nested(*src_vmap,
+						SINGLE_DEPTH_NESTING);
 	}
 	mmap_read_unlock(mm);
 	return err;
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 02/17] mm: move per-vma lock into vm_area_struct
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
  2025-01-11  4:25 ` [PATCH v9 01/17] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-11  4:25 ` [PATCH v9 03/17] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb, Liam R. Howlett

Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing.  Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].

Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code.  When that split-away part is a lock, it complicates
things even further.  With no performance benefits, there are no reasons
for this split.  Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.  Move
vm_lock back into vm_area_struct, aligning it at the cacheline boundary
and changing the cache to be cacheline-aligned as well.  With kernel
compiled using defconfig, this causes VMA memory consumption to grow from
160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    160   51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256   32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs.  Note that the size of this structure is
dependent on the kernel configuration and typically the original size is
higher than 160 bytes.  Therefore these calculations are close to the
worst case scenario.  A more realistic vm_area_struct usage before this
change is:

     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    176   46    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
which is 3.9MB per 100000 VMAs.  This memory consumption growth can be
addressed later by optimizing the vm_lock.
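
For reference, the arithmetic behind these numbers (assuming 4KB pages and
the objs-per-slab / pages-per-slab values shown above):

    before: ceil(1000 / 51)  slabs * 2 pages ~= 40 pages (vm_area_struct)
          + ceil(1000 / 102) slabs * 1 page  ~= 10 pages (vma_lock)    -> ~50 pages
    after:  ceil(1000 / 32)  slabs * 2 pages  = 64 pages               -> ~64 pages
    delta:  ~14 pages per 1000 VMAs ~= 1400 pages ~= 5.5MB per 100000 VMAs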

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h               | 28 ++++++++++--------
 include/linux/mm_types.h         |  6 ++--
 kernel/fork.c                    | 49 ++++----------------------------
 tools/testing/vma/vma_internal.h | 33 +++++----------------
 4 files changed, 32 insertions(+), 84 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1c0250c187f6..ed739406b0a7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -697,6 +697,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vm_area_struct *vma)
+{
+	init_rwsem(&vma->vm_lock.lock);
+	vma->vm_lock_seq = UINT_MAX;
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
@@ -714,7 +720,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
 		return false;
 
 	/*
@@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock->lock);
+		up_read(&vma->vm_lock.lock);
 		return false;
 	}
 	return true;
@@ -744,7 +750,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock->lock, subclass);
+	down_read_nested(&vma->vm_lock.lock, subclass);
 }
 
 /*
@@ -756,13 +762,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
 static inline void vma_start_read_locked(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock->lock);
+	down_read(&vma->vm_lock.lock);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock->lock);
+	up_read(&vma->vm_lock.lock);
 	rcu_read_unlock();
 }
 
@@ -791,7 +797,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	down_write(&vma->vm_lock->lock);
+	down_write(&vma->vm_lock.lock);
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -799,7 +805,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock->lock);
+	up_write(&vma->vm_lock.lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -811,7 +817,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock->lock))
+	if (!rwsem_is_locked(&vma->vm_lock.lock))
 		vma_assert_write_locked(vma);
 }
 
@@ -844,6 +850,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
+static inline void vma_lock_init(struct vm_area_struct *vma) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -878,10 +885,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
 
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	memset(vma, 0, sizeof(*vma));
@@ -890,6 +893,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
 	vma_numab_state_init(vma);
+	vma_lock_init(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5f1b2dc788e2..6573d95f1d1e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -730,8 +730,6 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock *vm_lock;
 #endif
 
 	/*
@@ -784,6 +782,10 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Unstable RCU readers are allowed to read this. */
+	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index ded49f18cd95..40a8e615499f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
-	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
-	if (!vma->vm_lock)
-		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = UINT_MAX;
-
-	return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
-	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
 struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 		return NULL;
 
 	vma_init(vma, mm);
-	if (!vma_lock_alloc(vma)) {
-		kmem_cache_free(vm_area_cachep, vma);
-		return NULL;
-	}
 
 	return vma;
 }
@@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * will be reinitialized.
 	 */
 	data_race(memcpy(new, orig, sizeof(*new)));
-	if (!vma_lock_alloc(new)) {
-		kmem_cache_free(vm_area_cachep, new);
-		return NULL;
-	}
+	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
@@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
 {
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
-	vma_lock_free(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 						  vm_rcu);
 
 	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
 	__vm_area_free(vma);
 }
 #endif
@@ -3188,11 +3151,9 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-
-	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
-	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+	vm_area_cachep = KMEM_CACHE(vm_area_struct,
+			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+			SLAB_ACCOUNT);
 	mmap_init();
 	nsproxy_cache_init();
 }
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 2404347fa2c7..96aeb28c81f9 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -274,10 +274,10 @@ struct vm_area_struct {
 	/*
 	 * Can only be written (using WRITE_ONCE()) while holding both:
 	 *  - mmap_lock (in write mode)
-	 *  - vm_lock->lock (in write mode)
+	 *  - vm_lock.lock (in write mode)
 	 * Can be read reliably while holding one of:
 	 *  - mmap_lock (in read or write mode)
-	 *  - vm_lock->lock (in read or write mode)
+	 *  - vm_lock.lock (in read or write mode)
 	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
 	 * while holding nothing (except RCU to keep the VMA struct allocated).
 	 *
@@ -286,7 +286,7 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	struct vma_lock *vm_lock;
+	struct vma_lock vm_lock;
 #endif
 
 	/*
@@ -463,17 +463,10 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
 	return mas_find(&vmi->mas, ULONG_MAX);
 }
 
-static inline bool vma_lock_alloc(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma)
 {
-	vma->vm_lock = calloc(1, sizeof(struct vma_lock));
-
-	if (!vma->vm_lock)
-		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
+	init_rwsem(&vma->vm_lock.lock);
 	vma->vm_lock_seq = UINT_MAX;
-
-	return true;
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *);
@@ -496,6 +489,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
+	vma_lock_init(vma);
 }
 
 static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -506,10 +500,6 @@ static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 		return NULL;
 
 	vma_init(vma, mm);
-	if (!vma_lock_alloc(vma)) {
-		free(vma);
-		return NULL;
-	}
 
 	return vma;
 }
@@ -522,10 +512,7 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		return NULL;
 
 	memcpy(new, orig, sizeof(*new));
-	if (!vma_lock_alloc(new)) {
-		free(new);
-		return NULL;
-	}
+	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 
 	return new;
@@ -695,14 +682,8 @@ static inline void mpol_put(struct mempolicy *)
 {
 }
 
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
-	free(vma->vm_lock);
-}
-
 static inline void __vm_area_free(struct vm_area_struct *vma)
 {
-	vma_lock_free(vma);
 	free(vma);
 }
 
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 03/17] mm: mark vma as detached until it's added into vma tree
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
  2025-01-11  4:25 ` [PATCH v9 01/17] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
  2025-01-11  4:25 ` [PATCH v9 02/17] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-11  4:25 ` [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas Suren Baghdasaryan
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb, Liam R. Howlett

The current implementation does not set the detached flag when a VMA is
first allocated.  This does not represent the real state of the VMA, which
is detached until it is added into the mm's VMA tree.  Fix this by marking
new VMAs as detached and resetting the detached flag only after the VMA is
added into a tree.

Introduce vma_mark_attached() to make the API more readable and to
simplify possible future cleanup when vma->vm_mm might be used to indicate
detached vma and vma_mark_attached() will need an additional mm parameter.
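
The resulting lifecycle, in a simplified sketch mirroring the hunks below:

    vma = vm_area_alloc(mm);        /* vma_init() leaves vma->detached == true */
    ...
    vma_iter_store(&vmi, vma);      /* inserted into the tree, vma_mark_attached() */
    ...
    vma_start_write(vma);
    vma_mark_detached(vma);         /* detaching requires the vma write lock */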

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h               | 27 ++++++++++++++++++++-------
 kernel/fork.c                    |  4 ++++
 mm/memory.c                      |  2 +-
 mm/vma.c                         |  6 +++---
 mm/vma.h                         |  2 ++
 tools/testing/vma/vma_internal.h | 17 ++++++++++++-----
 6 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ed739406b0a7..2b322871da87 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,12 +821,21 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
 		vma_assert_write_locked(vma);
 }
 
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+	vma->detached = false;
+}
+
+static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
 	/* When detaching vma should be write-locked */
-	if (detached)
-		vma_assert_write_locked(vma);
-	vma->detached = detached;
+	vma_assert_write_locked(vma);
+	vma->detached = true;
+}
+
+static inline bool is_vma_detached(struct vm_area_struct *vma)
+{
+	return vma->detached;
 }
 
 static inline void release_fault_lock(struct vm_fault *vmf)
@@ -857,8 +866,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 		{ mmap_assert_write_locked(vma->vm_mm); }
-static inline void vma_mark_detached(struct vm_area_struct *vma,
-				     bool detached) {}
+static inline void vma_mark_attached(struct vm_area_struct *vma) {}
+static inline void vma_mark_detached(struct vm_area_struct *vma) {}
 
 static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		unsigned long address)
@@ -891,7 +900,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_mark_detached(vma, false);
+#ifdef CONFIG_PER_VMA_LOCK
+	/* vma is not locked, can't use vma_mark_detached() */
+	vma->detached = true;
+#endif
 	vma_numab_state_init(vma);
 	vma_lock_init(vma);
 }
@@ -1086,6 +1098,7 @@ static inline int vma_iter_bulk_store(struct vma_iterator *vmi,
 	if (unlikely(mas_is_err(&vmi->mas)))
 		return -ENOMEM;
 
+	vma_mark_attached(vma);
 	return 0;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 40a8e615499f..f2f9e7b427ad 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	data_race(memcpy(new, orig, sizeof(*new)));
 	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
+#ifdef CONFIG_PER_VMA_LOCK
+	/* vma is not locked, can't use vma_mark_detached() */
+	new->detached = true;
+#endif
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
diff --git a/mm/memory.c b/mm/memory.c
index 2a20e3810534..d0dee2282325 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6349,7 +6349,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		goto inval;
 
 	/* Check if the VMA got isolated after we found it */
-	if (vma->detached) {
+	if (is_vma_detached(vma)) {
 		vma_end_read(vma);
 		count_vm_vma_lock_event(VMA_LOCK_MISS);
 		/* The area was replaced with another one */
diff --git a/mm/vma.c b/mm/vma.c
index af1d549b179c..d603494e69d7 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -327,7 +327,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 
 	if (vp->remove) {
 again:
-		vma_mark_detached(vp->remove, true);
+		vma_mark_detached(vp->remove);
 		if (vp->file) {
 			uprobe_munmap(vp->remove, vp->remove->vm_start,
 				      vp->remove->vm_end);
@@ -1221,7 +1221,7 @@ static void reattach_vmas(struct ma_state *mas_detach)
 
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
-		vma_mark_detached(vma, false);
+		vma_mark_attached(vma);
 
 	__mt_destroy(mas_detach->tree);
 }
@@ -1296,7 +1296,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		if (error)
 			goto munmap_gather_failed;
 
-		vma_mark_detached(next, true);
+		vma_mark_detached(next);
 		nrpages = vma_pages(next);
 
 		vms->nr_pages += nrpages;
diff --git a/mm/vma.h b/mm/vma.h
index a2e8710b8c47..2a2668de8d2c 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -157,6 +157,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
 	if (unlikely(mas_is_err(&vmi->mas)))
 		return -ENOMEM;
 
+	vma_mark_attached(vma);
 	return 0;
 }
 
@@ -389,6 +390,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
 
 	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
 	mas_store_prealloc(&vmi->mas, vma);
+	vma_mark_attached(vma);
 }
 
 static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 96aeb28c81f9..47c8b03ffbbd 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -469,13 +469,17 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
 	vma->vm_lock_seq = UINT_MAX;
 }
 
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+	vma->detached = false;
+}
+
 static inline void vma_assert_write_locked(struct vm_area_struct *);
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
 	/* When detaching vma should be write-locked */
-	if (detached)
-		vma_assert_write_locked(vma);
-	vma->detached = detached;
+	vma_assert_write_locked(vma);
+	vma->detached = true;
 }
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -488,7 +492,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_mark_detached(vma, false);
+	/* vma is not locked, can't use vma_mark_detached() */
+	vma->detached = true;
 	vma_lock_init(vma);
 }
 
@@ -514,6 +519,8 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	memcpy(new, orig, sizeof(*new));
 	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
+	/* vma is not locked, can't use vma_mark_detached() */
+	new->detached = true;
 
 	return new;
 }
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 03/17] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-13 11:58   ` Lorenzo Stoakes
  2025-01-11  4:25 ` [PATCH v9 05/17] mm: mark vmas detached upon exit Suren Baghdasaryan
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

vma_iter_store() functions can be used both when adding a new vma and
when updating an existing one. However, for existing ones we do not need
to mark them attached, as they are already marked that way. Introduce
vma_iter_store_attached() to be used with already attached vmas.
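
A minimal illustration of the split (mirroring the hunks below):

    /* new vma: store it in the maple tree and mark it attached */
    vma_iter_store(&vmi, vma);

    /* already-attached vma whose range changed: only update the tree entry */
    vma->vm_end = address;
    vma_iter_store_attached(&vmi, vma); /* asserts the vma is attached */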

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 12 ++++++++++++
 mm/vma.c           |  8 ++++----
 mm/vma.h           | 11 +++++++++--
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2b322871da87..2f805f1a0176 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
 		vma_assert_write_locked(vma);
 }
 
+static inline void vma_assert_attached(struct vm_area_struct *vma)
+{
+	VM_BUG_ON_VMA(vma->detached, vma);
+}
+
+static inline void vma_assert_detached(struct vm_area_struct *vma)
+{
+	VM_BUG_ON_VMA(!vma->detached, vma);
+}
+
 static inline void vma_mark_attached(struct vm_area_struct *vma)
 {
 	vma->detached = false;
@@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 		{ mmap_assert_write_locked(vma->vm_mm); }
+static inline void vma_assert_attached(struct vm_area_struct *vma) {}
+static inline void vma_assert_detached(struct vm_area_struct *vma) {}
 static inline void vma_mark_attached(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma) {}
 
diff --git a/mm/vma.c b/mm/vma.c
index d603494e69d7..b9cf552e120c 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
 	vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
 
 	if (expanded)
-		vma_iter_store(vmg->vmi, vmg->vma);
+		vma_iter_store_attached(vmg->vmi, vmg->vma);
 
 	if (adj_start) {
 		adjust->vm_start += adj_start;
 		adjust->vm_pgoff += PHYS_PFN(adj_start);
 		if (adj_start < 0) {
 			WARN_ON(expanded);
-			vma_iter_store(vmg->vmi, adjust);
+			vma_iter_store_attached(vmg->vmi, adjust);
 		}
 	}
 
@@ -2845,7 +2845,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_end = address;
 				/* Overwrite old entry in mtree. */
-				vma_iter_store(&vmi, vma);
+				vma_iter_store_attached(&vmi, vma);
 				anon_vma_interval_tree_post_update_vma(vma);
 
 				perf_event_mmap(vma);
@@ -2925,7 +2925,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 				vma->vm_start = address;
 				vma->vm_pgoff -= grow;
 				/* Overwrite old entry in mtree. */
-				vma_iter_store(&vmi, vma);
+				vma_iter_store_attached(&vmi, vma);
 				anon_vma_interval_tree_post_update_vma(vma);
 
 				perf_event_mmap(vma);
diff --git a/mm/vma.h b/mm/vma.h
index 2a2668de8d2c..63dd38d5230c 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -365,9 +365,10 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
 }
 
 /* Store a VMA with preallocated memory */
-static inline void vma_iter_store(struct vma_iterator *vmi,
-				  struct vm_area_struct *vma)
+static inline void vma_iter_store_attached(struct vma_iterator *vmi,
+					   struct vm_area_struct *vma)
 {
+	vma_assert_attached(vma);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start &&
@@ -390,7 +391,13 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
 
 	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
 	mas_store_prealloc(&vmi->mas, vma);
+}
+
+static inline void vma_iter_store(struct vma_iterator *vmi,
+				  struct vm_area_struct *vma)
+{
 	vma_mark_attached(vma);
+	vma_iter_store_attached(vmi, vma);
 }
 
 static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 05/17] mm: mark vmas detached upon exit
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-13 12:05   ` Lorenzo Stoakes
  2025-01-11  4:25 ` [PATCH v9 06/17] types: move struct rcuwait into types.h Suren Baghdasaryan
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

When exit_mmap() removes vmas belonging to an exiting task, it does not
mark them as detached since they can't be reached by other tasks and they
will be freed shortly. Once we introduce vma reuse, all vmas will have to
be in a detached state before they are freed, to ensure that a vma is in
a consistent state when it is reused. Add the missing vma_mark_detached()
before freeing the vma.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vma.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index b9cf552e120c..93ff42ac2002 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -413,10 +413,12 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	if (unreachable)
+	if (unreachable) {
+		vma_mark_detached(vma);
 		__vm_area_free(vma);
-	else
+	} else {
 		vm_area_free(vma);
+	}
 }
 
 /*
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 06/17] types: move struct rcuwait into types.h
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (4 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 05/17] mm: mark vmas detached upon exit Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-13 14:46   ` Lorenzo Stoakes
  2025-01-11  4:25 ` [PATCH v9 07/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb, Liam R. Howlett

Move the rcuwait struct definition into types.h so that rcuwait can be used
without including rcuwait.h, which pulls in other headers. Without this
change, mm_types.h can't use rcuwait due to the following circular
dependency:

mm_types.h -> rcuwait.h -> signal.h -> mm_types.h

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/rcuwait.h | 13 +------------
 include/linux/types.h   | 12 ++++++++++++
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 27343424225c..9ad134a04b41 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -4,18 +4,7 @@
 
 #include <linux/rcupdate.h>
 #include <linux/sched/signal.h>
-
-/*
- * rcuwait provides a way of blocking and waking up a single
- * task in an rcu-safe manner.
- *
- * The only time @task is non-nil is when a user is blocked (or
- * checking if it needs to) on a condition, and reset as soon as we
- * know that the condition has succeeded and are awoken.
- */
-struct rcuwait {
-	struct task_struct __rcu *task;
-};
+#include <linux/types.h>
 
 #define __RCUWAIT_INITIALIZER(name)		\
 	{ .task = NULL, }
diff --git a/include/linux/types.h b/include/linux/types.h
index 2d7b9ae8714c..f1356a9a5730 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -248,5 +248,17 @@ typedef void (*swap_func_t)(void *a, void *b, int size);
 typedef int (*cmp_r_func_t)(const void *a, const void *b, const void *priv);
 typedef int (*cmp_func_t)(const void *a, const void *b);
 
+/*
+ * rcuwait provides a way of blocking and waking up a single
+ * task in an rcu-safe manner.
+ *
+ * The only time @task is non-nil is when a user is blocked (or
+ * checking if it needs to) on a condition, and reset as soon as we
+ * know that the condition has succeeded and are awoken.
+ */
+struct rcuwait {
+	struct task_struct __rcu *task;
+};
+
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 07/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (5 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 06/17] types: move struct rcuwait into types.h Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-13 15:25   ` Lorenzo Stoakes
  2025-01-11  4:25 ` [PATCH v9 08/17] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

With the upcoming replacement of vm_lock with vm_refcnt, we need to handle
the possibility of vma_start_read_locked/vma_start_read_locked_nested
failing due to refcount overflow. Prepare for this possibility by changing
these APIs and adjusting their users.
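
Callers now check the return value and bail out with -EAGAIN on failure,
for example (a condensed sketch of the uffd_lock_vma() change below):

    vma = find_vma_and_prepare_anon(mm, address);
    if (!IS_ERR(vma) && !vma_start_read_locked(vma))
            vma = ERR_PTR(-EAGAIN);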

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Lokesh Gidra <lokeshgidra@google.com>
---
 include/linux/mm.h |  6 ++++--
 mm/userfaultfd.c   | 18 +++++++++++++-----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2f805f1a0176..cbb4e3dbbaed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -747,10 +747,11 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
  * not be used in such cases because it might fail due to mm_lock_seq overflow.
  * This functionality is used to obtain vma read lock and drop the mmap read lock.
  */
-static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
 	mmap_assert_locked(vma->vm_mm);
 	down_read_nested(&vma->vm_lock.lock, subclass);
+	return true;
 }
 
 /*
@@ -759,10 +760,11 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
  * not be used in such cases because it might fail due to mm_lock_seq overflow.
  * This functionality is used to obtain vma read lock and drop the mmap read lock.
  */
-static inline void vma_start_read_locked(struct vm_area_struct *vma)
+static inline bool vma_start_read_locked(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
 	down_read(&vma->vm_lock.lock);
+	return true;
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4527c385935b..411a663932c4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -85,7 +85,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
 	mmap_read_lock(mm);
 	vma = find_vma_and_prepare_anon(mm, address);
 	if (!IS_ERR(vma))
-		vma_start_read_locked(vma);
+		if (!vma_start_read_locked(vma))
+			vma = ERR_PTR(-EAGAIN);
 
 	mmap_read_unlock(mm);
 	return vma;
@@ -1483,10 +1484,17 @@ static int uffd_move_lock(struct mm_struct *mm,
 	mmap_read_lock(mm);
 	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
 	if (!err) {
-		vma_start_read_locked(*dst_vmap);
-		if (*dst_vmap != *src_vmap)
-			vma_start_read_locked_nested(*src_vmap,
-						SINGLE_DEPTH_NESTING);
+		if (vma_start_read_locked(*dst_vmap)) {
+			if (*dst_vmap != *src_vmap) {
+				if (!vma_start_read_locked_nested(*src_vmap,
+							SINGLE_DEPTH_NESTING)) {
+					vma_end_read(*dst_vmap);
+					err = -EAGAIN;
+				}
+			}
+		} else {
+			err = -EAGAIN;
+		}
 	}
 	mmap_read_unlock(mm);
 	return err;
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 08/17] mm: move mmap_init_lock() out of the header file
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (6 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 07/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-13 15:27   ` Lorenzo Stoakes
  2025-01-11  4:25 ` [PATCH v9 09/17] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

mmap_init_lock() is used only from mm_init() in fork.c, therefore it does
not have to reside in the header file. This move lets us avoid including
additional headers in mmap_lock.h later, when mmap_init_lock() needs to
initialize an rcuwait object.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmap_lock.h | 6 ------
 kernel/fork.c             | 6 ++++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 45a21faa3ff6..4706c6769902 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -122,12 +122,6 @@ static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
-static inline void mmap_init_lock(struct mm_struct *mm)
-{
-	init_rwsem(&mm->mmap_lock);
-	mm_lock_seqcount_init(mm);
-}
-
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_start_locking(mm, true);
diff --git a/kernel/fork.c b/kernel/fork.c
index f2f9e7b427ad..d4c75428ccaf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1219,6 +1219,12 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 #endif
 }
 
+static inline void mmap_init_lock(struct mm_struct *mm)
+{
+	init_rwsem(&mm->mmap_lock);
+	mm_lock_seqcount_init(mm);
+}
+
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 09/17] mm: uninline the main body of vma_start_write()
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (7 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 08/17] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-13 15:52   ` Lorenzo Stoakes
  2025-01-11  4:25 ` [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

vma_start_write() is used in many places and will grow in size very soon.
It is not used in performance-critical paths, and uninlining it should
limit future code size growth.
No functional changes.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 12 +++---------
 mm/memory.c        | 14 ++++++++++++++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cbb4e3dbbaed..3432756d95e6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -787,6 +787,8 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
 	return (vma->vm_lock_seq == *mm_lock_seq);
 }
 
+void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq);
+
 /*
  * Begin writing to a VMA.
  * Exclude concurrent readers under the per-VMA lock until the currently
@@ -799,15 +801,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	down_write(&vma->vm_lock.lock);
-	/*
-	 * We should use WRITE_ONCE() here because we can have concurrent reads
-	 * from the early lockless pessimistic check in vma_start_read().
-	 * We don't really care about the correctness of that early check, but
-	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
-	 */
-	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock.lock);
+	__vma_start_write(vma, mm_lock_seq);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index d0dee2282325..236fdecd44d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6328,6 +6328,20 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_PER_VMA_LOCK
+void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
+{
+	down_write(&vma->vm_lock.lock);
+	/*
+	 * We should use WRITE_ONCE() here because we can have concurrent reads
+	 * from the early lockless pessimistic check in vma_start_read().
+	 * We don't really care about the correctness of that early check, but
+	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
+	 */
+	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
+	up_write(&vma->vm_lock.lock);
+}
+EXPORT_SYMBOL_GPL(__vma_start_write);
+
 /*
  * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
  * stable and not isolated. If the VMA is not found or is being modified the
-- 
2.47.1.613.gc27f4b7a9f-goog


* [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (8 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 09/17] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-11  6:31   ` Hillf Danton
  2025-01-11 12:39   ` David Laight
  2025-01-11  4:25 ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
                   ` (9 subsequent siblings)
  19 siblings, 2 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

Introduce functions to increase a refcount, but with a top limit above
which they will fail to increase it (the limit is inclusive). Setting the
limit to INT_MAX indicates no limit.
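
A hypothetical usage sketch (the object, field and limit below are
illustrative only, not taken from this series):

    struct obj {
            refcount_t ref;
    };

    #define OBJ_REF_LIMIT   (INT_MAX / 2)

    static bool obj_get(struct obj *o)
    {
            int old;

            /* Fails if the count is 0 or if adding 1 would exceed OBJ_REF_LIMIT. */
            return __refcount_inc_not_zero_limited(&o->ref, &old, OBJ_REF_LIMIT);
    }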

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/refcount.h | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..5072ba99f05e 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -137,13 +137,23 @@ static inline unsigned int refcount_read(const refcount_t *r)
 }
 
 static inline __must_check __signed_wrap
-bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
+bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
+				     int limit)
 {
 	int old = refcount_read(r);
 
 	do {
 		if (!old)
 			break;
+
+		if (statically_true(limit == INT_MAX))
+			continue;
+
+		if (i > limit - old) {
+			if (oldp)
+				*oldp = old;
+			return false;
+		}
 	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
 
 	if (oldp)
@@ -155,6 +165,12 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
 	return old;
 }
 
+static inline __must_check __signed_wrap
+bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
+{
+	return __refcount_add_not_zero_limited(i, r, oldp, INT_MAX);
+}
+
 /**
  * refcount_add_not_zero - add a value to a refcount unless it is 0
  * @i: the value to add to the refcount
@@ -213,6 +229,12 @@ static inline void refcount_add(int i, refcount_t *r)
 	__refcount_add(i, r, NULL);
 }
 
+static inline __must_check bool __refcount_inc_not_zero_limited(refcount_t *r,
+								int *oldp, int limit)
+{
+	return __refcount_add_not_zero_limited(1, r, oldp, limit);
+}
+
 static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp)
 {
 	return __refcount_add_not_zero(1, r, oldp);
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (9 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-11 11:24   ` Mateusz Guzik
                     ` (4 more replies)
  2025-01-11  4:25 ` [PATCH v9 12/17] mm: move lesser used vm_area_struct members into the last cacheline Suren Baghdasaryan
                   ` (8 subsequent siblings)
  19 siblings, 5 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space in each vm_area_struct. However, vma_lock has
two important properties that allow rw_semaphore to be replaced
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode.
Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore and the vma->detached flag with
a refcount (vm_refcnt).
When a vma is detached, vm_refcnt is 0 and only a call to
vma_mark_attached() can take it out of this state. Note that, unlike
before, both vma_mark_attached() and vma_mark_detached() are now
required to be done only after the vma has been write-locked.
vma_mark_attached() changes vm_refcnt to 1 to indicate that the vma has
been attached to the vma tree. When a reader takes the read lock, it
increments vm_refcnt, unless the top usable bit of vm_refcnt
(0x40000000) is set, indicating the presence of a writer. When a writer
takes the write lock, it sets the top usable bit to indicate its
presence. If there are readers, the writer will wait on the newly
introduced mm->vma_writer_wait. Since all writers take mmap_lock in
write mode first, there can be only one writer at a time. The last
reader to release the lock will signal the writer to wake up.
The refcount might overflow if there are many competing readers, in
which case read-locking will fail. Readers are expected to handle such
failures.
In summary:
1. all readers increment the vm_refcnt;
2. writer sets top usable (writer) bit of vm_refcnt;
3. readers cannot increment the vm_refcnt if the writer bit is set;
4. in the presence of readers, writer must wait for the vm_refcnt to drop
to 1 (ignoring the writer bit), indicating an attached vma with no readers;
5. vm_refcnt overflow is handled by the readers.

While this vm_lock replacement does not yet result in a smaller
vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
allows for further size optimization by structure member regrouping,
bringing the size of vm_area_struct down to 192 bytes (3 cachelines).
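
For reference, the vm_refcnt values this scheme produces (a summary
derived from the code below; VMA_LOCK_OFFSET is the writer bit,
0x40000000):

    detached, no readers:            0
    attached, no readers:            1
    attached, N readers:             1 + N
    write-locking an attached vma:   VMA_LOCK_OFFSET + 1 + N
                                     (writer waits for VMA_LOCK_OFFSET + 1)
    detaching with readers present:  VMA_LOCK_OFFSET + N
                                     (writer waits for VMA_LOCK_OFFSET)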

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h               | 102 +++++++++++++++++++++----------
 include/linux/mm_types.h         |  22 +++----
 kernel/fork.c                    |  13 ++--
 mm/init-mm.c                     |   1 +
 mm/memory.c                      |  80 +++++++++++++++++++++---
 tools/testing/vma/linux/atomic.h |   5 ++
 tools/testing/vma/vma_internal.h |  66 +++++++++++---------
 7 files changed, 198 insertions(+), 91 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3432756d95e6..a99b11ee1f66 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -32,6 +32,7 @@
 #include <linux/memremap.h>
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
+#include <linux/rcuwait.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -697,12 +698,43 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_lock_init(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
 {
-	init_rwsem(&vma->vm_lock.lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	static struct lock_class_key lockdep_key;
+
+	lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
+#endif
+	if (reset_refcnt)
+		refcount_set(&vma->vm_refcnt, 0);
 	vma->vm_lock_seq = UINT_MAX;
 }
 
+static inline bool is_vma_writer_only(int refcnt)
+{
+	/*
+	 * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma
+	 * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on
+	 * a detached vma happens only in vma_mark_detached() and is a rare
+	 * case, therefore most of the time there will be no unnecessary wakeup.
+	 */
+	return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1;
+}
+
+static inline void vma_refcount_put(struct vm_area_struct *vma)
+{
+	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
+	struct mm_struct *mm = vma->vm_mm;
+	int oldcnt;
+
+	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
+
+		if (is_vma_writer_only(oldcnt - 1))
+			rcuwait_wake_up(&mm->vma_writer_wait);
+	}
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
@@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
  */
 static inline bool vma_start_read(struct vm_area_struct *vma)
 {
+	int oldcnt;
+
 	/*
 	 * Check before locking. A race might cause false locked result.
 	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
@@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
+	/*
+	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
+	 * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
+	 */
+	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+						      VMA_REF_LIMIT)))
 		return false;
 
+	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
 	/*
-	 * Overflow might produce false locked result.
+	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
 	 * False unlocked result is impossible because we modify and check
-	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
+	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
 	 * modification invalidates all existing locks.
 	 *
 	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
@@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock.lock);
+		vma_refcount_put(vma);
 		return false;
 	}
+
 	return true;
 }
 
@@ -749,8 +790,14 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
  */
 static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
+	int oldcnt;
+
 	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock.lock, subclass);
+	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+						      VMA_REF_LIMIT)))
+		return false;
+
+	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
 	return true;
 }
 
@@ -762,16 +809,12 @@ static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int
  */
 static inline bool vma_start_read_locked(struct vm_area_struct *vma)
 {
-	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock.lock);
-	return true;
+	return vma_start_read_locked_nested(vma, 0);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
-	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock.lock);
-	rcu_read_unlock();
+	vma_refcount_put(vma);
 }
 
 /* WARNING! Can only be used if mmap_lock is expected to be write-locked */
@@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock.lock))
+	if (refcount_read(&vma->vm_refcnt) <= 1)
 		vma_assert_write_locked(vma);
 }
 
+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
 static inline void vma_assert_attached(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(vma->detached, vma);
+	VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
 }
 
 static inline void vma_assert_detached(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(!vma->detached, vma);
+	VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
 }
 
 static inline void vma_mark_attached(struct vm_area_struct *vma)
 {
-	vma->detached = false;
-}
-
-static inline void vma_mark_detached(struct vm_area_struct *vma)
-{
-	/* When detaching vma should be write-locked */
 	vma_assert_write_locked(vma);
-	vma->detached = true;
+	vma_assert_detached(vma);
+	refcount_set(&vma->vm_refcnt, 1);
 }
 
-static inline bool is_vma_detached(struct vm_area_struct *vma)
-{
-	return vma->detached;
-}
+void vma_mark_detached(struct vm_area_struct *vma);
 
 static inline void release_fault_lock(struct vm_fault *vmf)
 {
@@ -865,7 +905,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
-static inline void vma_lock_init(struct vm_area_struct *vma) {}
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -908,12 +948,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
-	/* vma is not locked, can't use vma_mark_detached() */
-	vma->detached = true;
-#endif
 	vma_numab_state_init(vma);
-	vma_lock_init(vma);
+	vma_lock_init(vma, false);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6573d95f1d1e..9228d19662c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -19,6 +19,7 @@
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
 #include <linux/percpu_counter.h>
+#include <linux/types.h>
 
 #include <asm/mmu.h>
 
@@ -629,9 +630,8 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
 }
 #endif
 
-struct vma_lock {
-	struct rw_semaphore lock;
-};
+#define VMA_LOCK_OFFSET	0x40000000
+#define VMA_REF_LIMIT	(VMA_LOCK_OFFSET - 1)
 
 struct vma_numab_state {
 	/*
@@ -709,19 +709,13 @@ struct vm_area_struct {
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
-	/*
-	 * Flag to indicate areas detached from the mm->mm_mt tree.
-	 * Unstable RCU readers are allowed to read this.
-	 */
-	bool detached;
-
 	/*
 	 * Can only be written (using WRITE_ONCE()) while holding both:
 	 *  - mmap_lock (in write mode)
-	 *  - vm_lock->lock (in write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set
 	 * Can be read reliably while holding one of:
 	 *  - mmap_lock (in read or write mode)
-	 *  - vm_lock->lock (in read or write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
 	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
 	 * while holding nothing (except RCU to keep the VMA struct allocated).
 	 *
@@ -784,7 +778,10 @@ struct vm_area_struct {
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_PER_VMA_LOCK
 	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+	refcount_t vm_refcnt ____cacheline_aligned_in_smp;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map vmlock_dep_map;
+#endif
 #endif
 } __randomize_layout;
 
@@ -919,6 +916,7 @@ struct mm_struct {
 					  * by mmlist_lock
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
+		struct rcuwait vma_writer_wait;
 		/*
 		 * This field has lock-like semantics, meaning it is sometimes
 		 * accessed with ACQUIRE/RELEASE semantics.
diff --git a/kernel/fork.c b/kernel/fork.c
index d4c75428ccaf..9d9275783cf8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -463,12 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * will be reinitialized.
 	 */
 	data_race(memcpy(new, orig, sizeof(*new)));
-	vma_lock_init(new);
+	vma_lock_init(new, true);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
-	/* vma is not locked, can't use vma_mark_detached() */
-	new->detached = true;
-#endif
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
@@ -477,6 +473,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 void __vm_area_free(struct vm_area_struct *vma)
 {
+	/* The vma should be detached while being destroyed. */
+	vma_assert_detached(vma);
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
@@ -488,8 +486,6 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
 						  vm_rcu);
 
-	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
 	__vm_area_free(vma);
 }
 #endif
@@ -1223,6 +1219,9 @@ static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
 	mm_lock_seqcount_init(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+	rcuwait_init(&mm->vma_writer_wait);
+#endif
 }
 
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 6af3ad675930..4600e7605cab 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -40,6 +40,7 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
+	.vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
 	.mm_lock_seq	= SEQCNT_ZERO(init_mm.mm_lock_seq),
 #endif
 	.user_ns	= &init_user_ns,
diff --git a/mm/memory.c b/mm/memory.c
index 236fdecd44d6..dc16b67beefa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6328,9 +6328,47 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching)
+{
+	unsigned int tgt_refcnt = VMA_LOCK_OFFSET;
+
+	/* Additional refcnt if the vma is attached. */
+	if (!detaching)
+		tgt_refcnt++;
+
+	/*
+	 * If vma is detached then only vma_mark_attached() can raise the
+	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
+	 */
+	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
+		return false;
+
+	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
+	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
+		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
+		   TASK_UNINTERRUPTIBLE);
+	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
+
+	return true;
+}
+
+static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
+{
+	*detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
+	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+}
+
 void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
 {
-	down_write(&vma->vm_lock.lock);
+	bool locked;
+
+	/*
+	 * __vma_enter_locked() returns false immediately if the vma is not
+	 * attached, otherwise it waits until refcnt is indicating that vma
+	 * is attached with no readers.
+	 */
+	locked = __vma_enter_locked(vma, false);
+
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -6338,10 +6376,40 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock.lock);
+
+	if (locked) {
+		bool detached;
+
+		__vma_exit_locked(vma, &detached);
+		VM_BUG_ON_VMA(detached, vma); /* vma should remain attached */
+	}
 }
 EXPORT_SYMBOL_GPL(__vma_start_write);
 
+void vma_mark_detached(struct vm_area_struct *vma)
+{
+	vma_assert_write_locked(vma);
+	vma_assert_attached(vma);
+
+	/*
+	 * We are the only writer, so no need to use vma_refcount_put().
+	 * The condition below is unlikely because the vma has been already
+	 * write-locked and readers can increment vm_refcnt only temporarily
+	 * before they check vm_lock_seq, realize the vma is locked and drop
+	 * back the vm_refcnt. That is a narrow window for observing a raised
+	 * vm_refcnt.
+	 */
+	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+		/* Wait until vma is detached with no readers. */
+		if (__vma_enter_locked(vma, true)) {
+			bool detached;
+
+			__vma_exit_locked(vma, &detached);
+			VM_BUG_ON_VMA(!detached, vma);
+		}
+	}
+}
+
 /*
  * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
  * stable and not isolated. If the VMA is not found or is being modified the
@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	struct vm_area_struct *vma;
 
 	rcu_read_lock();
-retry:
 	vma = mas_walk(&mas);
 	if (!vma)
 		goto inval;
@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma_start_read(vma))
 		goto inval;
 
-	/* Check if the VMA got isolated after we found it */
-	if (is_vma_detached(vma)) {
-		vma_end_read(vma);
-		count_vm_vma_lock_event(VMA_LOCK_MISS);
-		/* The area was replaced with another one */
-		goto retry;
-	}
 	/*
 	 * At this point, we have a stable reference to a VMA: The VMA is
 	 * locked and we know it hasn't already been isolated.
diff --git a/tools/testing/vma/linux/atomic.h b/tools/testing/vma/linux/atomic.h
index 3e1b6adc027b..788c597c4fde 100644
--- a/tools/testing/vma/linux/atomic.h
+++ b/tools/testing/vma/linux/atomic.h
@@ -9,4 +9,9 @@
 #define atomic_set(x, y) uatomic_set(x, y)
 #define U8_MAX UCHAR_MAX
 
+#ifndef atomic_cmpxchg_relaxed
+#define  atomic_cmpxchg_relaxed		uatomic_cmpxchg
+#define  atomic_cmpxchg_release         uatomic_cmpxchg
+#endif /* atomic_cmpxchg_relaxed */
+
 #endif	/* _LINUX_ATOMIC_H */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 47c8b03ffbbd..2ce032943861 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -25,7 +25,7 @@
 #include <linux/maple_tree.h>
 #include <linux/mm.h>
 #include <linux/rbtree.h>
-#include <linux/rwsem.h>
+#include <linux/refcount.h>
 
 extern unsigned long stack_guard_gap;
 #ifdef CONFIG_MMU
@@ -134,10 +134,6 @@ typedef __bitwise unsigned int vm_fault_t;
  */
 #define pr_warn_once pr_err
 
-typedef struct refcount_struct {
-	atomic_t refs;
-} refcount_t;
-
 struct kref {
 	refcount_t refcount;
 };
@@ -232,15 +228,12 @@ struct mm_struct {
 	unsigned long flags; /* Must use atomic bitops to access */
 };
 
-struct vma_lock {
-	struct rw_semaphore lock;
-};
-
-
 struct file {
 	struct address_space	*f_mapping;
 };
 
+#define VMA_LOCK_OFFSET	0x40000000
+
 struct vm_area_struct {
 	/* The first cache line has the info for VMA tree walking. */
 
@@ -268,16 +261,13 @@ struct vm_area_struct {
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
-	/* Flag to indicate areas detached from the mm->mm_mt tree */
-	bool detached;
-
 	/*
 	 * Can only be written (using WRITE_ONCE()) while holding both:
 	 *  - mmap_lock (in write mode)
-	 *  - vm_lock.lock (in write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set
 	 * Can be read reliably while holding one of:
 	 *  - mmap_lock (in read or write mode)
-	 *  - vm_lock.lock (in read or write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
 	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
 	 * while holding nothing (except RCU to keep the VMA struct allocated).
 	 *
@@ -286,7 +276,6 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	struct vma_lock vm_lock;
 #endif
 
 	/*
@@ -339,6 +328,10 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Unstable RCU readers are allowed to read this. */
+	refcount_t vm_refcnt;
+#endif
 } __randomize_layout;
 
 struct vm_fault {};
@@ -463,23 +456,41 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
 	return mas_find(&vmi->mas, ULONG_MAX);
 }
 
-static inline void vma_lock_init(struct vm_area_struct *vma)
+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
+static inline void vma_assert_attached(struct vm_area_struct *vma)
 {
-	init_rwsem(&vma->vm_lock.lock);
-	vma->vm_lock_seq = UINT_MAX;
+	VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
 }
 
-static inline void vma_mark_attached(struct vm_area_struct *vma)
+static inline void vma_assert_detached(struct vm_area_struct *vma)
 {
-	vma->detached = false;
+	VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *);
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+	vma_assert_write_locked(vma);
+	vma_assert_detached(vma);
+	refcount_set(&vma->vm_refcnt, 1);
+}
+
 static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
-	/* When detaching vma should be write-locked */
 	vma_assert_write_locked(vma);
-	vma->detached = true;
+	vma_assert_attached(vma);
+
+	/* We are the only writer, so no need to use vma_refcount_put(). */
+	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+		/*
+		 * Reader must have temporarily raised vm_refcnt but it will
+		 * drop it without using the vma since vma is write-locked.
+		 */
+	}
 }
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -492,9 +503,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	/* vma is not locked, can't use vma_mark_detached() */
-	vma->detached = true;
-	vma_lock_init(vma);
+	vma->vm_lock_seq = UINT_MAX;
 }
 
 static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -517,10 +526,9 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		return NULL;
 
 	memcpy(new, orig, sizeof(*new));
-	vma_lock_init(new);
+	refcount_set(&new->vm_refcnt, 0);
+	new->vm_lock_seq = UINT_MAX;
 	INIT_LIST_HEAD(&new->anon_vma_chain);
-	/* vma is not locked, can't use vma_mark_detached() */
-	new->detached = true;
 
 	return new;
 }
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v9 12/17] mm: move lesser used vm_area_struct members into the last cacheline
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (10 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
@ 2025-01-11  4:25 ` Suren Baghdasaryan
  2025-01-13 16:15   ` Lorenzo Stoakes
  2025-01-15 10:50   ` Peter Zijlstra
  2025-01-11  4:26 ` [PATCH v9 13/17] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
                   ` (7 subsequent siblings)
  19 siblings, 2 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:25 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

Move several vm_area_struct members which are rarely or never used
during page fault handling into the last cacheline to better pack
vm_area_struct. As a result, vm_area_struct will fit into 3 cachelines
instead of 4. The new typical vm_area_struct layout:

struct vm_area_struct {
    union {
        struct {
            long unsigned int vm_start;              /*     0     8 */
            long unsigned int vm_end;                /*     8     8 */
        };                                           /*     0    16 */
        freeptr_t          vm_freeptr;               /*     0     8 */
    };                                               /*     0    16 */
    struct mm_struct *         vm_mm;                /*    16     8 */
    pgprot_t                   vm_page_prot;         /*    24     8 */
    union {
        const vm_flags_t   vm_flags;                 /*    32     8 */
        vm_flags_t         __vm_flags;               /*    32     8 */
    };                                               /*    32     8 */
    unsigned int               vm_lock_seq;          /*    40     4 */

    /* XXX 4 bytes hole, try to pack */

    struct list_head           anon_vma_chain;       /*    48    16 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    struct anon_vma *          anon_vma;             /*    64     8 */
    const struct vm_operations_struct  * vm_ops;     /*    72     8 */
    long unsigned int          vm_pgoff;             /*    80     8 */
    struct file *              vm_file;              /*    88     8 */
    void *                     vm_private_data;      /*    96     8 */
    atomic_long_t              swap_readahead_info;  /*   104     8 */
    struct mempolicy *         vm_policy;            /*   112     8 */
    struct vma_numab_state *   numab_state;          /*   120     8 */
    /* --- cacheline 2 boundary (128 bytes) --- */
    refcount_t          vm_refcnt (__aligned__(64)); /*   128     4 */

    /* XXX 4 bytes hole, try to pack */

    struct {
        struct rb_node     rb (__aligned__(8));      /*   136    24 */
        long unsigned int  rb_subtree_last;          /*   160     8 */
    } __attribute__((__aligned__(8))) shared;        /*   136    32 */
    struct anon_vma_name *     anon_name;            /*   168     8 */
    struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */

    /* size: 192, cachelines: 3, members: 18 */
    /* sum members: 176, holes: 2, sum holes: 8 */
    /* padding: 8 */
    /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
} __attribute__((__aligned__(64)));

Memory consumption per 1000 VMAs becomes 48 pages:

    slabinfo after vm_area_struct changes:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    192   42    2 : ...
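
That is, with 42 objects per 2-page slab, 1000 VMAs need roughly
1000 / 42 = 24 slabs, i.e. 24 * 2 = 48 pages.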

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm_types.h | 38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9228d19662c6..d902e6730654 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -725,17 +725,6 @@ struct vm_area_struct {
 	 */
 	unsigned int vm_lock_seq;
 #endif
-
-	/*
-	 * For areas with an address space and backing store,
-	 * linkage into the address_space->i_mmap interval tree.
-	 *
-	 */
-	struct {
-		struct rb_node rb;
-		unsigned long rb_subtree_last;
-	} shared;
-
 	/*
 	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
 	 * list, after a COW of one of the file pages.	A MAP_SHARED vma
@@ -755,14 +744,6 @@ struct vm_area_struct {
 	struct file * vm_file;		/* File we map to (can be NULL). */
 	void * vm_private_data;		/* was vm_pte (shared mem) */
 
-#ifdef CONFIG_ANON_VMA_NAME
-	/*
-	 * For private and shared anonymous mappings, a pointer to a null
-	 * terminated string containing the name given to the vma, or NULL if
-	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
-	 */
-	struct anon_vma_name *anon_name;
-#endif
 #ifdef CONFIG_SWAP
 	atomic_long_t swap_readahead_info;
 #endif
@@ -775,7 +756,6 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA_BALANCING
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
-	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_PER_VMA_LOCK
 	/* Unstable RCU readers are allowed to read this. */
 	refcount_t vm_refcnt ____cacheline_aligned_in_smp;
@@ -783,6 +763,24 @@ struct vm_area_struct {
 	struct lockdep_map vmlock_dep_map;
 #endif
 #endif
+	/*
+	 * For areas with an address space and backing store,
+	 * linkage into the address_space->i_mmap interval tree.
+	 *
+	 */
+	struct {
+		struct rb_node rb;
+		unsigned long rb_subtree_last;
+	} shared;
+#ifdef CONFIG_ANON_VMA_NAME
+	/*
+	 * For private and shared anonymous mappings, a pointer to a null
+	 * terminated string containing the name given to the vma, or NULL if
+	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
+	 */
+	struct anon_vma_name *anon_name;
+#endif
+	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v9 13/17] mm/debug: print vm_refcnt state when dumping the vma
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (11 preceding siblings ...)
  2025-01-11  4:25 ` [PATCH v9 12/17] mm: move lesser used vm_area_struct members into the last cacheline Suren Baghdasaryan
@ 2025-01-11  4:26 ` Suren Baghdasaryan
  2025-01-13 16:21   ` Lorenzo Stoakes
  2025-01-11  4:26 ` [PATCH v9 14/17] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:26 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

vm_refcnt encodes a number of useful states:
- whether vma is attached or detached
- the number of current vma readers
- presence of a vma writer
Let's include it in the vma dump.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/debug.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/debug.c b/mm/debug.c
index 8d2acf432385..325d7bf22038 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -178,6 +178,17 @@ EXPORT_SYMBOL(dump_page);
 
 void dump_vma(const struct vm_area_struct *vma)
 {
+#ifdef CONFIG_PER_VMA_LOCK
+	pr_emerg("vma %px start %px end %px mm %px\n"
+		"prot %lx anon_vma %px vm_ops %px\n"
+		"pgoff %lx file %px private_data %px\n"
+		"flags: %#lx(%pGv) refcnt %x\n",
+		vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
+		(unsigned long)pgprot_val(vma->vm_page_prot),
+		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
+		vma->vm_file, vma->vm_private_data,
+		vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
+#else
 	pr_emerg("vma %px start %px end %px mm %px\n"
 		"prot %lx anon_vma %px vm_ops %px\n"
 		"pgoff %lx file %px private_data %px\n"
@@ -187,6 +198,7 @@ void dump_vma(const struct vm_area_struct *vma)
 		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
 		vma->vm_file, vma->vm_private_data,
 		vma->vm_flags, &vma->vm_flags);
+#endif
 }
 EXPORT_SYMBOL(dump_vma);
 
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v9 14/17] mm: remove extra vma_numab_state_init() call
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (12 preceding siblings ...)
  2025-01-11  4:26 ` [PATCH v9 13/17] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
@ 2025-01-11  4:26 ` Suren Baghdasaryan
  2025-01-13 16:28   ` Lorenzo Stoakes
  2025-01-11  4:26 ` [PATCH v9 15/17] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:26 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

vma_init() already memsets the whole vm_area_struct to 0, so there is
no need for an additional vma_numab_state_init() call.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a99b11ee1f66..c8da64b114d1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -948,7 +948,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_numab_state_init(vma);
 	vma_lock_init(vma, false);
 }
 
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v9 15/17] mm: prepare lock_vma_under_rcu() for vma reuse possibility
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (13 preceding siblings ...)
  2025-01-11  4:26 ` [PATCH v9 14/17] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
@ 2025-01-11  4:26 ` Suren Baghdasaryan
  2025-01-11  4:26 ` [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:26 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

Once the vma cache is made SLAB_TYPESAFE_BY_RCU, it will be possible for
a vma to be reused and attached to another mm after lock_vma_under_rcu()
locks the vma. lock_vma_under_rcu() should ensure that vma_start_read()
uses the original mm, and after locking the vma it should ensure that
vma->vm_mm has not changed from under us.
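
A conceptual timeline of the reuse race being guarded against
(illustration only; it assumes the SLAB_TYPESAFE_BY_RCU conversion done
later in this series):

	CPU0: lock_vma_under_rcu(mm, addr)     CPU1
	rcu_read_lock()
	vma = mas_walk(&mas)
	                                       vma detached, freed and
	                                       reused, attached to another mm
	vma_start_read(mm, vma) succeeds
	vma->vm_mm != mm -> goto inval_end_read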

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 10 ++++++----
 mm/memory.c        |  7 ++++---
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c8da64b114d1..cb29eb7360c5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -739,8 +739,10 @@ static inline void vma_refcount_put(struct vm_area_struct *vma)
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
  * using mmap_lock. The function should never yield false unlocked result.
+ * False locked result is possible if mm_lock_seq overflows or if vma gets
+ * reused and attached to a different mm before we lock it.
  */
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	int oldcnt;
 
@@ -751,7 +753,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * we don't rely on for anything - the mm_lock_seq read against which we
 	 * need ordering is below.
 	 */
-	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
 		return false;
 
 	/*
@@ -774,7 +776,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * after it has been unlocked.
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
-	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
+	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
 		vma_refcount_put(vma);
 		return false;
 	}
@@ -906,7 +908,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {}
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index dc16b67beefa..67cfcebb0f94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6426,7 +6426,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma)
 		goto inval;
 
-	if (!vma_start_read(vma))
+	if (!vma_start_read(mm, vma))
 		goto inval;
 
 	/*
@@ -6436,8 +6436,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	 * fields are accessible for RCU readers.
 	 */
 
-	/* Check since vm_start/vm_end might change before we lock the VMA */
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+	/* Check if the vma we locked is the right one. */
+	if (unlikely(vma->vm_mm != mm ||
+		     address < vma->vm_start || address >= vma->vm_end))
 		goto inval_end_read;
 
 	rcu_read_unlock();
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (14 preceding siblings ...)
  2025-01-11  4:26 ` [PATCH v9 15/17] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
@ 2025-01-11  4:26 ` Suren Baghdasaryan
  2025-01-15  2:27   ` Wei Yang
  2025-01-11  4:26 ` [PATCH v9 17/17] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:26 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb

To enable SLAB_TYPESAFE_BY_RCU for the vma cache we need to ensure that
object reuse before the RCU grace period is over will be detected by
lock_vma_under_rcu().
The current checks are sufficient as long as the vma is detached before
it is freed. The only place where this is not currently happening is in
exit_mmap(). Add the missing vma_mark_detached() in exit_mmap().
Another issue which might trick lock_vma_under_rcu() during vma reuse
is vm_area_dup(), which copies the entire content of the vma into a new
one, overwriting the new vma's vm_refcnt and temporarily making it
appear attached. This might trick a racing lock_vma_under_rcu() into
operating on a reused vma if it found the vma before it got reused. To
prevent this situation, ensure that vm_refcnt stays in the detached
state (0) when the vma is copied and advances to the attached state
only after the vma is added into the vma tree. Introduce
vm_area_init_from(), which preserves the new vma's vm_refcnt, and use
it in vm_area_dup(). Since all vmas are in the detached state with no
current readers when they are freed, lock_vma_under_rcu() will not be
able to take a vm_refcnt reference after the vma got detached, even if
the vma is reused.
Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
vm_area_struct reuse and will minimize the number of call_rcu() calls.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h               |  2 -
 include/linux/mm_types.h         | 13 ++++--
 include/linux/slab.h             |  6 ---
 kernel/fork.c                    | 73 ++++++++++++++++++++------------
 mm/mmap.c                        |  3 +-
 mm/vma.c                         | 11 ++---
 mm/vma.h                         |  2 +-
 tools/testing/vma/vma_internal.h |  7 +--
 8 files changed, 63 insertions(+), 54 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb29eb7360c5..ac78425e9838 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
 struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
-/* Use only if VMA has no other users */
-void __vm_area_free(struct vm_area_struct *vma);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d902e6730654..d366ec6302e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -574,6 +574,12 @@ static inline void *folio_get_private(struct folio *folio)
 
 typedef unsigned long vm_flags_t;
 
+/*
+ * freeptr_t represents a SLUB freelist pointer, which might be encoded
+ * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
+ */
+typedef struct { unsigned long v; } freeptr_t;
+
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
  * conditions.  These are held in a global tree and are pinned by the VMAs that
@@ -677,6 +683,9 @@ struct vma_numab_state {
  *
  * Only explicitly marked struct members may be accessed by RCU readers before
  * getting a stable reference.
+ *
+ * WARNING: when adding new members, please update vm_area_init_from() to copy
+ * them during vm_area_struct content duplication.
  */
 struct vm_area_struct {
 	/* The first cache line has the info for VMA tree walking. */
@@ -687,9 +696,7 @@ struct vm_area_struct {
 			unsigned long vm_start;
 			unsigned long vm_end;
 		};
-#ifdef CONFIG_PER_VMA_LOCK
-		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
-#endif
+		freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
 	};
 
 	/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 10a971c2bde3..681b685b6c4e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -234,12 +234,6 @@ enum _slab_flag_bits {
 #define SLAB_NO_OBJ_EXT		__SLAB_FLAG_UNUSED
 #endif
 
-/*
- * freeptr_t represents a SLUB freelist pointer, which might be encoded
- * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
- */
-typedef struct { unsigned long v; } freeptr_t;
-
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
diff --git a/kernel/fork.c b/kernel/fork.c
index 9d9275783cf8..151b40627c14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 	return vma;
 }
 
+static void vm_area_init_from(const struct vm_area_struct *src,
+			      struct vm_area_struct *dest)
+{
+	dest->vm_mm = src->vm_mm;
+	dest->vm_ops = src->vm_ops;
+	dest->vm_start = src->vm_start;
+	dest->vm_end = src->vm_end;
+	dest->anon_vma = src->anon_vma;
+	dest->vm_pgoff = src->vm_pgoff;
+	dest->vm_file = src->vm_file;
+	dest->vm_private_data = src->vm_private_data;
+	vm_flags_init(dest, src->vm_flags);
+	memcpy(&dest->vm_page_prot, &src->vm_page_prot,
+	       sizeof(dest->vm_page_prot));
+	/*
+	 * src->shared.rb may be modified concurrently when called from
+	 * dup_mmap(), but the clone will reinitialize it.
+	 */
+	data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
+	memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
+	       sizeof(dest->vm_userfaultfd_ctx));
+#ifdef CONFIG_ANON_VMA_NAME
+	dest->anon_name = src->anon_name;
+#endif
+#ifdef CONFIG_SWAP
+	memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
+	       sizeof(dest->swap_readahead_info));
+#endif
+#ifndef CONFIG_MMU
+	dest->vm_region = src->vm_region;
+#endif
+#ifdef CONFIG_NUMA
+	dest->vm_policy = src->vm_policy;
+#endif
+}
+
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 {
 	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
@@ -458,11 +494,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
-	/*
-	 * orig->shared.rb may be modified concurrently, but the clone
-	 * will be reinitialized.
-	 */
-	data_race(memcpy(new, orig, sizeof(*new)));
+	vm_area_init_from(orig, new);
 	vma_lock_init(new, true);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
@@ -471,7 +503,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	return new;
 }
 
-void __vm_area_free(struct vm_area_struct *vma)
+void vm_area_free(struct vm_area_struct *vma)
 {
 	/* The vma should be detached while being destroyed. */
 	vma_assert_detached(vma);
@@ -480,25 +512,6 @@ void __vm_area_free(struct vm_area_struct *vma)
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
-#ifdef CONFIG_PER_VMA_LOCK
-static void vm_area_free_rcu_cb(struct rcu_head *head)
-{
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
-
-	__vm_area_free(vma);
-}
-#endif
-
-void vm_area_free(struct vm_area_struct *vma)
-{
-#ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
-#else
-	__vm_area_free(vma);
-#endif
-}
-
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -3144,6 +3157,11 @@ void __init mm_cache_init(void)
 
 void __init proc_caches_init(void)
 {
+	struct kmem_cache_args args = {
+		.use_freeptr_offset = true,
+		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+	};
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3160,8 +3178,9 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-	vm_area_cachep = KMEM_CACHE(vm_area_struct,
-			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+	vm_area_cachep = kmem_cache_create("vm_area_struct",
+			sizeof(struct vm_area_struct), &args,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
 			SLAB_ACCOUNT);
 	mmap_init();
 	nsproxy_cache_init();
diff --git a/mm/mmap.c b/mm/mmap.c
index cda01071c7b1..7aa36216ecc0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1305,7 +1305,8 @@ void exit_mmap(struct mm_struct *mm)
 	do {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
-		remove_vma(vma, /* unreachable = */ true);
+		vma_mark_detached(vma);
+		remove_vma(vma);
 		count++;
 		cond_resched();
 		vma = vma_next(&vmi);
diff --git a/mm/vma.c b/mm/vma.c
index 93ff42ac2002..0a5158d611e3 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -406,19 +406,14 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg,
 /*
  * Close a vm structure and free it.
  */
-void remove_vma(struct vm_area_struct *vma, bool unreachable)
+void remove_vma(struct vm_area_struct *vma)
 {
 	might_sleep();
 	vma_close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	if (unreachable) {
-		vma_mark_detached(vma);
-		__vm_area_free(vma);
-	} else {
-		vm_area_free(vma);
-	}
+	vm_area_free(vma);
 }
 
 /*
@@ -1201,7 +1196,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	/* Remove and clean up vmas */
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
-		remove_vma(vma, /* unreachable = */ false);
+		remove_vma(vma);
 
 	vm_unacct_memory(vms->nr_accounted);
 	validate_mm(mm);
diff --git a/mm/vma.h b/mm/vma.h
index 63dd38d5230c..f51005b95b39 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -170,7 +170,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 		  unsigned long start, size_t len, struct list_head *uf,
 		  bool unlock);
 
-void remove_vma(struct vm_area_struct *vma, bool unreachable);
+void remove_vma(struct vm_area_struct *vma);
 
 void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct vm_area_struct *next);
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 2ce032943861..49a85ce0d45a 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -697,14 +697,9 @@ static inline void mpol_put(struct mempolicy *)
 {
 }
 
-static inline void __vm_area_free(struct vm_area_struct *vma)
-{
-	free(vma);
-}
-
 static inline void vm_area_free(struct vm_area_struct *vma)
 {
-	__vm_area_free(vma);
+	free(vma);
 }
 
 static inline void lru_add_drain(void)
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v9 17/17] docs/mm: document latest changes to vm_lock
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (15 preceding siblings ...)
  2025-01-11  4:26 ` [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2025-01-11  4:26 ` Suren Baghdasaryan
  2025-01-13 16:33   ` Lorenzo Stoakes
  2025-01-11  4:52 ` [PATCH v9 00/17] reimplement per-vma lock as a refcount Matthew Wilcox
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  4:26 UTC (permalink / raw)
  To: akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, surenb, Liam R. Howlett

Change the documentation to reflect that vm_lock is integrated into the
vma and replaced with vm_refcnt.
Document the newly introduced vma_start_read_locked{_nested} functions.
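
A short usage sketch of the documented read-lock API (illustration
only; my_fault_path() and my_mmap_lock_fallback() are made-up names):

	static vm_fault_t my_fault_path(struct mm_struct *mm, unsigned long address)
	{
		struct vm_area_struct *vma;

		vma = lock_vma_under_rcu(mm, address);
		if (!vma)
			return my_mmap_lock_fallback(mm, address);

		/* ... handle the fault under the per-VMA read lock ... */

		vma_end_read(vma);
		return 0;
	}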

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 Documentation/mm/process_addrs.rst | 44 ++++++++++++++++++------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 81417fa2ed20..f573de936b5d 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -716,9 +716,14 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
 critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
 before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
 
-VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
-their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
-via :c:func:`!vma_end_read`.
+In cases when the user already holds the mmap read lock, :c:func:`!vma_start_read_locked`
+and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
+fail due to lock contention but the caller should still check their return values
+in case they fail for other reasons.
+
+VMA read locks increment :c:member:`!vma.vm_refcnt` reference counter for their
+duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
+:c:func:`!vma_end_read`.
 
 VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
 VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
@@ -726,9 +731,9 @@ acquired. An mmap write lock **must** be held for the duration of the VMA write
 lock, releasing or downgrading the mmap write lock also releases the VMA write
 lock so there is no :c:func:`!vma_end_write` function.
 
-Note that a semaphore write lock is not held across a VMA lock. Rather, a
-sequence number is used for serialisation, and the write semaphore is only
-acquired at the point of write lock to update this.
+Note that when write-locking a VMA, the :c:member:`!vma.vm_refcnt` is temporarily
+modified so that readers can detect the presence of a writer. The reference counter is
+restored once the vma sequence number used for serialisation is updated.
 
 This ensures the semantics we require - VMA write locks provide exclusive write
 access to the VMA.
@@ -738,7 +743,7 @@ Implementation details
 
 The VMA lock mechanism is designed to be a lightweight means of avoiding the use
 of the heavily contended mmap lock. It is implemented using a combination of a
-read/write semaphore and sequence numbers belonging to the containing
+reference counter and sequence numbers belonging to the containing
 :c:struct:`!struct mm_struct` and the VMA.
 
 Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
@@ -779,28 +784,31 @@ release of any VMA locks on its release makes sense, as you would never want to
 keep VMAs locked across entirely separate write operations. It also maintains
 correct lock ordering.
 
-Each time a VMA read lock is acquired, we acquire a read lock on the
-:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
-the sequence count of the VMA does not match that of the mm.
+Each time a VMA read lock is acquired, we increment :c:member:`!vma.vm_refcnt`
+reference counter and check that the sequence count of the VMA does not match
+that of the mm.
 
-If it does, the read lock fails. If it does not, we hold the lock, excluding
-writers, but permitting other readers, who will also obtain this lock under RCU.
+If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
+If it does not, we keep the reference counter raised, excluding writers, but
+permitting other readers, who can also obtain this lock under RCU.
 
 Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
 are also RCU safe, so the whole read lock operation is guaranteed to function
 correctly.
 
-On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
-read/write semaphore, before setting the VMA's sequence number under this lock,
-also simultaneously holding the mmap write lock.
+On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
+modified by readers and wait for all readers to drop their reference count.
+Once there are no readers, the VMA's sequence number is set to match that of
+the mm. During this entire operation the mmap write lock is held.
 
 This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
 until these are finished and mutual exclusion is achieved.
 
-After setting the VMA's sequence number, the lock is released, avoiding
-complexity with a long-term held write lock.
+After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
+indicating a writer is cleared. From this point on, the VMA's sequence number will
+indicate the write-locked state until the mmap write lock is dropped or downgraded.
 
-This clever combination of a read/write semaphore and sequence count allows for
+This clever combination of a reference counter and sequence count allows for
 fast RCU-based per-VMA lock acquisition (especially on page fault, though
 utilised elsewhere) with minimal complexity around lock ordering.
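
Condensed into a sketch, based on the vma_start_read() changes discussed later
in this thread (illustrative only, not the exact merged code), the reader fast
path looks roughly like this:

	/* Illustrative sketch of the read-lock fast path described above. */
	static inline bool vma_start_read_sketch(struct vm_area_struct *vma)
	{
		int oldcnt;

		/*
		 * Fails if the vma is detached (vm_refcnt == 0) or a writer
		 * has set VMA_LOCK_OFFSET, pushing the count past
		 * VMA_REF_LIMIT.
		 */
		if (!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
						     VMA_REF_LIMIT))
			return false;

		/*
		 * A matching sequence number means this vma is write-locked;
		 * drop the reference and fall back to mmap_lock.
		 */
		if (vma->vm_lock_seq ==
		    raw_read_seqcount(&vma->vm_mm->mm_lock_seq)) {
			vma_end_read(vma);
			return false;
		}

		return true;
	}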
 
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (16 preceding siblings ...)
  2025-01-11  4:26 ` [PATCH v9 17/17] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
@ 2025-01-11  4:52 ` Matthew Wilcox
  2025-01-11  9:45   ` Suren Baghdasaryan
  2025-01-13 12:14 ` Lorenzo Stoakes
  2025-01-28  5:26 ` Shivank Garg
  19 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2025-01-11  4:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> - Added static check for no-limit case in __refcount_add_not_zero_limited,
> per David Laight

Ugh, no, don't listen to David.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11  4:25 ` [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
@ 2025-01-11  6:31   ` Hillf Danton
  2025-01-11  9:59     ` Suren Baghdasaryan
  2025-01-11 12:39   ` David Laight
  1 sibling, 1 reply; 140+ messages in thread
From: Hillf Danton @ 2025-01-11  6:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, hannes, linux-mm, linux-kernel, kernel-team

On Fri, 10 Jan 2025 20:25:57 -0800 Suren Baghdasaryan <surenb@google.com>
> -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> +				     int limit)
>  {
>  	int old = refcount_read(r);
>  
>  	do {
>  		if (!old)
>  			break;
> +
> +		if (statically_true(limit == INT_MAX))
> +			continue;
> +
> +		if (i > limit - old) {
> +			if (oldp)
> +				*oldp = old;
> +			return false;
> +		}
>  	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));

The acquire version should be used, see atomic_long_try_cmpxchg_acquire()
in kernel/locking/rwsem.c.

Why not use the atomic_long_t without bothering to add this limited version?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-11  4:52 ` [PATCH v9 00/17] reimplement per-vma lock as a refcount Matthew Wilcox
@ 2025-01-11  9:45   ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  9:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, peterz, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 8:52 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > per David Laight
>
> Ugh, no, don't listen to David.

I thought his suggestion to add a check which can be verified at
compile time made sense. Could you please explain why that's a bad
idea? I'm really curious.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11  6:31   ` Hillf Danton
@ 2025-01-11  9:59     ` Suren Baghdasaryan
  2025-01-11 10:00       ` Suren Baghdasaryan
  2025-01-11 12:13       ` Hillf Danton
  0 siblings, 2 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11  9:59 UTC (permalink / raw)
  To: Hillf Danton
  Cc: akpm, peterz, willy, hannes, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 10:32 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Fri, 10 Jan 2025 20:25:57 -0800 Suren Baghdasaryan <surenb@google.com>
> > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> > +                                  int limit)
> >  {
> >       int old = refcount_read(r);
> >
> >       do {
> >               if (!old)
> >                       break;
> > +
> > +             if (statically_true(limit == INT_MAX))
> > +                     continue;
> > +
> > +             if (i > limit - old) {
> > +                     if (oldp)
> > +                             *oldp = old;
> > +                     return false;
> > +             }
> >       } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
>
> The acquire version should be used, see atomic_long_try_cmpxchg_acquire()
> in kernel/locking/rwsem.c.

This is how __refcount_add_not_zero() is already implemented and I'm
only adding support for a limit. If you think it's implemented wrong
then IMHO it should be fixed separately.

>
> Why not use the atomic_long_t without bothering to add this limited version?

The check against the limit is not only for overflow protection but
also to avoid refcount increment when the writer bit is set. It makes
the locking code simpler if we have a function that prevents
refcounting when the vma is detached (vm_refcnt==0) or when it's
write-locked (vm_refcnt<VMA_REF_LIMIT).

>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11  9:59     ` Suren Baghdasaryan
@ 2025-01-11 10:00       ` Suren Baghdasaryan
  2025-01-11 12:13       ` Hillf Danton
  1 sibling, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11 10:00 UTC (permalink / raw)
  To: Hillf Danton
  Cc: akpm, peterz, willy, hannes, linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 1:59 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Jan 10, 2025 at 10:32 PM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Fri, 10 Jan 2025 20:25:57 -0800 Suren Baghdasaryan <surenb@google.com>
> > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> > > +                                  int limit)
> > >  {
> > >       int old = refcount_read(r);
> > >
> > >       do {
> > >               if (!old)
> > >                       break;
> > > +
> > > +             if (statically_true(limit == INT_MAX))
> > > +                     continue;
> > > +
> > > +             if (i > limit - old) {
> > > +                     if (oldp)
> > > +                             *oldp = old;
> > > +                     return false;
> > > +             }
> > >       } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> >
> > The acquire version should be used, see atomic_long_try_cmpxchg_acquire()
> > in kernel/locking/rwsem.c.
>
> This is how __refcount_add_not_zero() is already implemented and I'm
> only adding support for a limit. If you think it's implemented wrong
> then IMHO it should be fixed separately.
>
> >
> > Why not use the atomic_long_t without bothering to add this limited version?
>
> The check against the limit is not only for overflow protection but
> also to avoid refcount increment when the writer bit is set. It makes
> the locking code simpler if we have a function that prevents
> refcounting when the vma is detached (vm_refcnt==0) or when it's
> write-locked (vm_refcnt<VMA_REF_LIMIT).

s / vm_refcnt<VMA_REF_LIMIT / vm_refcnt>VMA_REF_LIMIT

>
> >
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11  4:25 ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
@ 2025-01-11 11:24   ` Mateusz Guzik
  2025-01-11 20:14     ` Suren Baghdasaryan
  2025-01-12  2:59   ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Wei Yang
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 140+ messages in thread
From: Mateusz Guzik @ 2025-01-11 11:24 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:

So there were quite a few iterations of the patch and I have not been
reading majority of the feedback, so it may be I missed something,
apologies upfront. :)

>  /*
>   * Try to read-lock a vma. The function is allowed to occasionally yield false
>   * locked result to avoid performance overhead, in which case we fall back to
> @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
>   */
>  static inline bool vma_start_read(struct vm_area_struct *vma)
>  {
> +	int oldcnt;
> +
>  	/*
>  	 * Check before locking. A race might cause false locked result.
>  	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
>  		return false;
>  
> -	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> +	/*
> +	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> +	 * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> +	 */
> +	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> +						      VMA_REF_LIMIT)))
>  		return false;
>  

Replacing down_read_trylock() with the new routine loses an acquire
fence. That alone is not a problem, but see below.

> +	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
>  	/*
> -	 * Overflow might produce false locked result.
> +	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
>  	 * False unlocked result is impossible because we modify and check
> -	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> +	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
>  	 * modification invalidates all existing locks.
>  	 *
>  	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	 * This pairs with RELEASE semantics in vma_end_write_all().
>  	 */
>  	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {

The previous modification of this spot to raw_read_seqcount loses the
acquire fence, making the above comment not line up with the code.

I don't know if the stock code (with down_read_trylock()) is correct as
is -- looks fine for cursory reading fwiw. However, if it indeed works,
the acquire fence stemming from the lock routine is a mandatory part of
it afaics.

I think the best way forward is to add a new refcount routine which
ships with an acquire fence.

Otherwise I would suggest:
1. a comment above __refcount_inc_not_zero_limited saying there is an
   acq fence issued later
2. smp_rmb() slapped between that and seq accesses

If the now removed fence is somehow not needed, I think a comment
explaining it is necessary.

> @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>  
>  static inline void vma_assert_locked(struct vm_area_struct *vma)
>  {
> -	if (!rwsem_is_locked(&vma->vm_lock.lock))
> +	if (refcount_read(&vma->vm_refcnt) <= 1)
>  		vma_assert_write_locked(vma);
>  }
>  

This now forces the compiler to emit a load from vm_refcnt even if
vma_assert_write_locked expands to nothing. iow this wants to hide
behind the same stuff as vma_assert_write_locked.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11  9:59     ` Suren Baghdasaryan
  2025-01-11 10:00       ` Suren Baghdasaryan
@ 2025-01-11 12:13       ` Hillf Danton
  2025-01-11 17:11         ` Suren Baghdasaryan
  1 sibling, 1 reply; 140+ messages in thread
From: Hillf Danton @ 2025-01-11 12:13 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, hannes, linux-mm, linux-kernel, kernel-team

On Sat, 11 Jan 2025 01:59:41 -0800 Suren Baghdasaryan <surenb@google.com>
> On Fri, Jan 10, 2025 at 10:32 PM Hillf Danton <hdanton@sina.com> wrote:
> > On Fri, 10 Jan 2025 20:25:57 -0800 Suren Baghdasaryan <surenb@google.com>
> > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> > > +                                  int limit)
> > >  {
> > >       int old = refcount_read(r);
> > >
> > >       do {
> > >               if (!old)
> > >                       break;
> > > +
> > > +             if (statically_true(limit == INT_MAX))
> > > +                     continue;
> > > +
> > > +             if (i > limit - old) {
> > > +                     if (oldp)
> > > +                             *oldp = old;
> > > +                     return false;
> > > +             }
> > >       } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> >
> > The acquire version should be used, see atomic_long_try_cmpxchg_acquire()
> > in kernel/locking/rwsem.c.
> 
> This is how __refcount_add_not_zero() is already implemented and I'm
> only adding support for a limit. If you think it's implemented wrong
> then IMHO it should be fixed separately.
> 
Two different things - refcount has nothing to do with locking in the
first place, while what you are adding to the mm directory is something
that replaces rwsem, so from the locking POV you have to mark the
boundaries of the locking section.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11  4:25 ` [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
  2025-01-11  6:31   ` Hillf Danton
@ 2025-01-11 12:39   ` David Laight
  2025-01-11 17:07     ` Matthew Wilcox
  2025-01-11 18:30     ` Paul E. McKenney
  1 sibling, 2 replies; 140+ messages in thread
From: David Laight @ 2025-01-11 12:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, 10 Jan 2025 20:25:57 -0800
Suren Baghdasaryan <surenb@google.com> wrote:

> Introduce functions to increase refcount but with a top limit above which
> they will fail to increase (the limit is inclusive). Setting the limit to
> INT_MAX indicates no limit.

This function has never worked as expected!
I've removed the update and added in the rest of the code.

> diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> index 35f039ecb272..5072ba99f05e 100644
> --- a/include/linux/refcount.h
> +++ b/include/linux/refcount.h
> @@ -137,13 +137,23 @@ static inline unsigned int refcount_read(const refcount_t *r)
>  }
>  
>  static inline __must_check __signed_wrap
> -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
>  {
>  	int old = refcount_read(r);
>  
>  	do {
>  		if (!old)
>  			break;
>
>  	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
>  
>  	if (oldp)
>		*oldp = old;
?
>	if (unlikely(old < 0 || old + i < 0))
>		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
>
>  	return old;
>  }

The saturate test just doesn't work as expected.
In C signed integer overflow is undefined (probably so that cpu that saturate/trap
signed overflow can be conformant) and gcc uses that to optimise code.

So if you compile (https://www.godbolt.org/z/WYWo84Weq):
int inc_wraps(int i)
{
    return i < 0 || i + 1 < 0;
}
the second test is optimised away.
I don't think the kernel compiles disable this optimisation.

	David


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11 12:39   ` David Laight
@ 2025-01-11 17:07     ` Matthew Wilcox
  2025-01-11 18:30     ` Paul E. McKenney
  1 sibling, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2025-01-11 17:07 UTC (permalink / raw)
  To: David Laight
  Cc: Suren Baghdasaryan, akpm, peterz, liam.howlett, lorenzo.stoakes,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 12:39:00PM +0000, David Laight wrote:
> I don't think the kernel compiles disable this optimisation.

You're wrong.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11 12:13       ` Hillf Danton
@ 2025-01-11 17:11         ` Suren Baghdasaryan
  2025-01-11 23:44           ` Hillf Danton
  2025-01-15  9:39           ` Peter Zijlstra
  0 siblings, 2 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11 17:11 UTC (permalink / raw)
  To: Hillf Danton
  Cc: akpm, peterz, willy, hannes, linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 4:13 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Sat, 11 Jan 2025 01:59:41 -0800 Suren Baghdasaryan <surenb@google.com>
> > On Fri, Jan 10, 2025 at 10:32 PM Hillf Danton <hdanton@sina.com> wrote:
> > > On Fri, 10 Jan 2025 20:25:57 -0800 Suren Baghdasaryan <surenb@google.com>
> > > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > > +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> > > > +                                  int limit)
> > > >  {
> > > >       int old = refcount_read(r);
> > > >
> > > >       do {
> > > >               if (!old)
> > > >                       break;
> > > > +
> > > > +             if (statically_true(limit == INT_MAX))
> > > > +                     continue;
> > > > +
> > > > +             if (i > limit - old) {
> > > > +                     if (oldp)
> > > > +                             *oldp = old;
> > > > +                     return false;
> > > > +             }
> > > >       } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > >
> > > The acquire version should be used, see atomic_long_try_cmpxchg_acquire()
> > > in kernel/locking/rwsem.c.
> >
> > This is how __refcount_add_not_zero() is already implemented and I'm
> > only adding support for a limit. If you think it's implemented wrong
> > then IMHO it should be fixed separately.
> >
> Two different things - refcount has nothing to do with locking in the
> first place, while what you are adding to the mm directory is something
> that replaces rwsem, so from the locking POV you have to mark the
> boundaries of the locking section.

I see your point. I think it's a strong argument to use atomic
directly instead of refcount for this locking. I'll try that and see
how it looks. Thanks for the feedback!

>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11 12:39   ` David Laight
  2025-01-11 17:07     ` Matthew Wilcox
@ 2025-01-11 18:30     ` Paul E. McKenney
  2025-01-11 22:19       ` David Laight
  1 sibling, 1 reply; 140+ messages in thread
From: Paul E. McKenney @ 2025-01-11 18:30 UTC (permalink / raw)
  To: David Laight
  Cc: Suren Baghdasaryan, akpm, peterz, willy, liam.howlett,
	lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 12:39:00PM +0000, David Laight wrote:
> On Fri, 10 Jan 2025 20:25:57 -0800
> Suren Baghdasaryan <surenb@google.com> wrote:
> 
> > Introduce functions to increase refcount but with a top limit above which
> > they will fail to increase (the limit is inclusive). Setting the limit to
> > INT_MAX indicates no limit.
> 
> This function has never worked as expected!
> I've removed the update and added in the rest of the code.
> 
> > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > index 35f039ecb272..5072ba99f05e 100644
> > --- a/include/linux/refcount.h
> > +++ b/include/linux/refcount.h
> > @@ -137,13 +137,23 @@ static inline unsigned int refcount_read(const refcount_t *r)
> >  }
> >  
> >  static inline __must_check __signed_wrap
> > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> >  {
> >  	int old = refcount_read(r);
> >  
> >  	do {
> >  		if (!old)
> >  			break;
> >
> >  	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> >  
> >  	if (oldp)
> >		*oldp = old;
> ?
> >	if (unlikely(old < 0 || old + i < 0))
> >		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
> >
> >  	return old;
> >  }
> 
> The saturate test just doesn't work as expected.
> In C signed integer overflow is undefined (probably so that cpu that saturate/trap
> signed overflow can be conformant) and gcc uses that to optimise code.
> 
> So if you compile (https://www.godbolt.org/z/WYWo84Weq):
> int inc_wraps(int i)
> {
>     return i < 0 || i + 1 < 0;
> }
> the second test is optimised away.
> I don't think the kernel compiles disable this optimisation.

Last I checked, my kernel compiles specified -fno-strict-overflow.
What happens if you try that in godbolt?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11 11:24   ` Mateusz Guzik
@ 2025-01-11 20:14     ` Suren Baghdasaryan
  2025-01-11 20:16       ` Suren Baghdasaryan
                         ` (4 more replies)
  0 siblings, 5 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11 20:14 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>
> So there were quite a few iterations of the patch and I have not been
> reading majority of the feedback, so it may be I missed something,
> apologies upfront. :)
>
> >  /*
> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> >   * locked result to avoid performance overhead, in which case we fall back to
> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> >   */
> >  static inline bool vma_start_read(struct vm_area_struct *vma)
> >  {
> > +     int oldcnt;
> > +
> >       /*
> >        * Check before locking. A race might cause false locked result.
> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> >               return false;
> >
> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > +     /*
> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> > +      */
> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > +                                                   VMA_REF_LIMIT)))
> >               return false;
> >
>
> Replacing down_read_trylock() with the new routine loses an acquire
> fence. That alone is not a problem, but see below.

Hmm. I think this acquire fence is actually necessary. We don't want
the later vm_lock_seq check to be reordered and happen before we take
the refcount. Otherwise this might happen:

reader             writer
if (vm_lock_seq == mm_lock_seq) // check got reordered
        return false;
                       vm_refcnt += VMA_LOCK_OFFSET
                       vm_lock_seq == mm_lock_seq
                       vm_refcnt -= VMA_LOCK_OFFSET
if (!__refcount_inc_not_zero_limited())
        return false;

Both reader's checks will pass and the reader would read-lock a vma
that was write-locked.

>
> > +     rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> >       /*
> > -      * Overflow might produce false locked result.
> > +      * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> >        * False unlocked result is impossible because we modify and check
> > -      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > +      * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> >        * modification invalidates all existing locks.
> >        *
> >        * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >        * This pairs with RELEASE semantics in vma_end_write_all().
> >        */
> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
>
> The previous modification of this spot to raw_read_seqcount loses the
> acquire fence, making the above comment not line up with the code.

Is it? From reading the seqcount code
(https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):

raw_read_seqcount()
    seqprop_sequence()
        __seqprop(s, sequence)
            __seqprop_sequence()
                smp_load_acquire()

smp_load_acquire() still provides the acquire fence. Am I missing something?

>
> I don't know if the stock code (with down_read_trylock()) is correct as
> is -- looks fine for cursory reading fwiw. However, if it indeed works,
> the acquire fence stemming from the lock routine is a mandatory part of
> it afaics.
>
> I think the best way forward is to add a new refcount routine which
> ships with an acquire fence.

I plan on replacing refcount_t usage here with an atomic since, as
Hillf noted, refcount is not designed to be used for locking. And will
make sure the down_read_trylock() replacement will provide an acquire
fence.

>
> Otherwise I would suggest:
> 1. a comment above __refcount_inc_not_zero_limited saying there is an
>    acq fence issued later
> 2. smp_rmb() slapped between that and seq accesses
>
> If the now removed fence is somehow not needed, I think a comment
> explaining it is necessary.
>
> > @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >
> >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> >  {
> > -     if (!rwsem_is_locked(&vma->vm_lock.lock))
> > +     if (refcount_read(&vma->vm_refcnt) <= 1)
> >               vma_assert_write_locked(vma);
> >  }
> >
>
> This now forces the compiler to emit a load from vm_refcnt even if
> vma_assert_write_locked expands to nothing. iow this wants to hide
> behind the same stuff as vma_assert_write_locked.

True. I guess I'll have to avoid using vma_assert_write_locked() like this:

static inline void vma_assert_locked(struct vm_area_struct *vma)
{
        unsigned int mm_lock_seq;

        VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
                      !__is_vma_write_locked(vma, &mm_lock_seq), vma);
}

Will make the change.

Thanks for the feedback!
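
As a rough illustration of that plan, an acquire-ordered trylock on a plain
atomic_t might look like the sketch below. The helper name is invented for
illustration; only atomic_read() and atomic_try_cmpxchg_acquire() are existing
primitives, and the limit check mirrors __refcount_add_not_zero_limited():

	/* Sketch only: acquire-ordered, limited increment on an atomic_t. */
	static inline bool vma_refcnt_try_get(atomic_t *cnt, int limit)
	{
		int old = atomic_read(cnt);

		do {
			if (!old)		/* 0 means the vma is detached */
				return false;
			if (old + 1 > limit)	/* writer bit exceeds the limit */
				return false;
		} while (!atomic_try_cmpxchg_acquire(cnt, &old, old + 1));

		/* acquire semantics order the later vm_lock_seq read after this */
		return true;
	}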

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11 20:14     ` Suren Baghdasaryan
@ 2025-01-11 20:16       ` Suren Baghdasaryan
  2025-01-11 20:31       ` Mateusz Guzik
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11 20:16 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 12:14 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >
> > So there were quite a few iterations of the patch and I have not been
> > reading majority of the feedback, so it may be I missed something,
> > apologies upfront. :)
> >
> > >  /*
> > >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> > >   * locked result to avoid performance overhead, in which case we fall back to
> > > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > >   */
> > >  static inline bool vma_start_read(struct vm_area_struct *vma)
> > >  {
> > > +     int oldcnt;
> > > +
> > >       /*
> > >        * Check before locking. A race might cause false locked result.
> > >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > >               return false;
> > >
> > > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > > +     /*
> > > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> > > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> > > +      */
> > > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > > +                                                   VMA_REF_LIMIT)))
> > >               return false;
> > >
> >
> > Replacing down_read_trylock() with the new routine loses an acquire
> > fence. That alone is not a problem, but see below.
>
> Hmm. I think this acquire fence is actually necessary. We don't want
> the later vm_lock_seq check to be reordered and happen before we take
> the refcount. Otherwise this might happen:
>
> reader             writer
> if (vm_lock_seq == mm_lock_seq) // check got reordered
>         return false;
>                        vm_refcnt += VMA_LOCK_OFFSET
>                        vm_lock_seq == mm_lock_seq

s/vm_lock_seq == mm_lock_seq/vm_lock_seq = mm_lock_seq

>                        vm_refcnt -= VMA_LOCK_OFFSET
> if (!__refcount_inc_not_zero_limited())
>         return false;
>
> Both reader's checks will pass and the reader would read-lock a vma
> that was write-locked.
>
> >
> > > +     rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> > >       /*
> > > -      * Overflow might produce false locked result.
> > > +      * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > >        * False unlocked result is impossible because we modify and check
> > > -      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > +      * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> > >        * modification invalidates all existing locks.
> > >        *
> > >        * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > > @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > >        * This pairs with RELEASE semantics in vma_end_write_all().
> > >        */
> > >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> >
> > The previous modification of this spot to raw_read_seqcount loses the
> > acquire fence, making the above comment not line up with the code.
>
> Is it? From reading the seqcount code
> (https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
>
> raw_read_seqcount()
>     seqprop_sequence()
>         __seqprop(s, sequence)
>             __seqprop_sequence()
>                 smp_load_acquire()
>
> smp_load_acquire() still provides the acquire fence. Am I missing something?
>
> >
> > I don't know if the stock code (with down_read_trylock()) is correct as
> > is -- looks fine for cursory reading fwiw. However, if it indeed works,
> > the acquire fence stemming from the lock routine is a mandatory part of
> > it afaics.
> >
> > I think the best way forward is to add a new refcount routine which
> > ships with an acquire fence.
>
> I plan on replacing refcount_t usage here with an atomic since, as
> Hillf noted, refcount is not designed to be used for locking. And will
> make sure the down_read_trylock() replacement will provide an acquire
> fence.
>
> >
> > Otherwise I would suggest:
> > 1. a comment above __refcount_inc_not_zero_limited saying there is an
> >    acq fence issued later
> > 2. smp_rmb() slapped between that and seq accesses
> >
> > If the now removed fence is somehow not needed, I think a comment
> > explaining it is necessary.
> >
> > > @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > >
> > >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> > >  {
> > > -     if (!rwsem_is_locked(&vma->vm_lock.lock))
> > > +     if (refcount_read(&vma->vm_refcnt) <= 1)
> > >               vma_assert_write_locked(vma);
> > >  }
> > >
> >
> > This now forces the compiler to emit a load from vm_refcnt even if
> > vma_assert_write_locked expands to nothing. iow this wants to hide
> > behind the same stuff as vma_assert_write_locked.
>
> True. I guess I'll have to avoid using vma_assert_write_locked() like this:
>
> static inline void vma_assert_locked(struct vm_area_struct *vma)
> {
>         unsigned int mm_lock_seq;
>
>         VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
>                       !__is_vma_write_locked(vma, &mm_lock_seq), vma);
> }
>
> Will make the change.
>
> Thanks for the feedback!

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11 20:14     ` Suren Baghdasaryan
  2025-01-11 20:16       ` Suren Baghdasaryan
@ 2025-01-11 20:31       ` Mateusz Guzik
  2025-01-11 20:58         ` Suren Baghdasaryan
  2025-01-11 20:38       ` Vlastimil Babka
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 140+ messages in thread
From: Mateusz Guzik @ 2025-01-11 20:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 9:14 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > The previous modification of this spot to raw_read_seqcount loses the
> > acquire fence, making the above comment not line up with the code.
>
> Is it? From reading the seqcount code
> (https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
>
> raw_read_seqcount()
>     seqprop_sequence()
>         __seqprop(s, sequence)
>             __seqprop_sequence()
>                 smp_load_acquire()
>
> smp_load_acquire() still provides the acquire fence. Am I missing something?
>

That's fine indeed.

In a different project there is an equivalent API which skips
barriers, too quick glance suggested this is what's going on here. My
bad, sorry for false alarm on this front. :)

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11 20:14     ` Suren Baghdasaryan
  2025-01-11 20:16       ` Suren Baghdasaryan
  2025-01-11 20:31       ` Mateusz Guzik
@ 2025-01-11 20:38       ` Vlastimil Babka
  2025-01-13  1:47       ` Wei Yang
  2025-01-15 10:48       ` Peter Zijlstra
  4 siblings, 0 replies; 140+ messages in thread
From: Vlastimil Babka @ 2025-01-11 20:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Mateusz Guzik, Hillf Danton
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, hannes, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On 1/11/25 21:14, Suren Baghdasaryan wrote:
> I plan on replacing refcount_t usage here with an atomic since, as
> Hillf noted, refcount is not designed to be used for locking. And will
> make sure the down_read_trylock() replacement will provide an acquire
> fence.

Could Hillf stop reducing the Cc list on his replies? The whole subthread
went to only a few people :(

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11 20:31       ` Mateusz Guzik
@ 2025-01-11 20:58         ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-11 20:58 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 12:31 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Sat, Jan 11, 2025 at 9:14 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > The previous modification of this spot to raw_read_seqcount loses the
> > > acquire fence, making the above comment not line up with the code.
> >
> > Is it? From reading the seqcount code
> > (https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
> >
> > raw_read_seqcount()
> >     seqprop_sequence()
> >         __seqprop(s, sequence)
> >             __seqprop_sequence()
> >                 smp_load_acquire()
> >
> > smp_load_acquire() still provides the acquire fence. Am I missing something?
> >
>
> That's fine indeed.
>
> In a different project there is an equivalent API which skips
> barriers, too quick glance suggested this is what's going on here. My
> bad, sorry for false alarm on this front. :)

No worries. Better to double-check than to merge a bug.
Thanks,
Suren.

>
> --
> Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11 18:30     ` Paul E. McKenney
@ 2025-01-11 22:19       ` David Laight
  2025-01-11 22:50         ` [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited - clang 17.0.1 bug David Laight
  0 siblings, 1 reply; 140+ messages in thread
From: David Laight @ 2025-01-11 22:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Suren Baghdasaryan, akpm, peterz, willy, liam.howlett,
	lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, 11 Jan 2025 10:30:40 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Sat, Jan 11, 2025 at 12:39:00PM +0000, David Laight wrote:
> > On Fri, 10 Jan 2025 20:25:57 -0800
> > Suren Baghdasaryan <surenb@google.com> wrote:
> >   
> > > Introduce functions to increase refcount but with a top limit above which
> > > they will fail to increase (the limit is inclusive). Setting the limit to
> > > INT_MAX indicates no limit.  
> > 
> > This function has never worked as expected!
> > I've removed the update and added in the rest of the code.
> >   
> > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > index 35f039ecb272..5072ba99f05e 100644
> > > --- a/include/linux/refcount.h
> > > +++ b/include/linux/refcount.h
> > > @@ -137,13 +137,23 @@ static inline unsigned int refcount_read(const refcount_t *r)
> > >  }
> > >  
> > >  static inline __must_check __signed_wrap
> > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > >  {
> > >  	int old = refcount_read(r);
> > >  
> > >  	do {
> > >  		if (!old)
> > >  			break;
> > >
> > >  	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > >  
> > >  	if (oldp)
> > >		*oldp = old;  
> > ?  
> > >	if (unlikely(old < 0 || old + i < 0))
> > >		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
> > >
> > >  	return old;
> > >  }  
> > 
> > The saturate test just doesn't work as expected.
> > In C signed integer overflow is undefined (probably so that cpu that saturate/trap
> > signed overflow can be conformant) and gcc uses that to optimise code.
> > 
> > So if you compile (https://www.godbolt.org/z/WYWo84Weq):
> > int inc_wraps(int i)
> > {
> >     return i < 0 || i + 1 < 0;
> > }
> > the second test is optimised away.
> > I don't think the kernel compiles disable this optimisation.  
> 
> Last I checked, my kernel compiles specified -fno-strict-overflow.
> What happens if you try that in godbolt?

That does make gcc generate the wanted object code.
I know that compilation option has come up before, but I couldn't remember the
name or whether it was disabled :-(

You do get much better object code from return (i | i + 1) < 0;
And that is likely to be much better still if you need a conditional jump.
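
For reference, the two equivalent forms being compared, as a standalone sketch
(function names made up):

	/*
	 * Both report "i is negative, or i + 1 wraps negative"; with the
	 * kernel's -fno-strict-overflow the signed wrap-around is defined.
	 */
	int wraps_with_branches(int i)
	{
		return i < 0 || i + 1 < 0;
	}

	int wraps_branchless(int i)
	{
		/* the OR is negative iff either value has its sign bit set */
		return (i | (i + 1)) < 0;
	}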

	David



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited - clang 17.0.1 bug
  2025-01-11 22:19       ` David Laight
@ 2025-01-11 22:50         ` David Laight
  2025-01-12 11:37           ` David Laight
  0 siblings, 1 reply; 140+ messages in thread
From: David Laight @ 2025-01-11 22:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Suren Baghdasaryan, akpm, peterz, willy, liam.howlett,
	lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, nathan

On Sat, 11 Jan 2025 22:19:39 +0000
David Laight <david.laight.linux@gmail.com> wrote:

> On Sat, 11 Jan 2025 10:30:40 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Sat, Jan 11, 2025 at 12:39:00PM +0000, David Laight wrote:  
> > > On Fri, 10 Jan 2025 20:25:57 -0800
> > > Suren Baghdasaryan <surenb@google.com> wrote:
> > >     
> > > > Introduce functions to increase refcount but with a top limit above which
> > > > they will fail to increase (the limit is inclusive). Setting the limit to
> > > > INT_MAX indicates no limit.    
> > > 
> > > This function has never worked as expected!
> > > I've removed the update and added in the rest of the code.
> > >     
> > > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > > index 35f039ecb272..5072ba99f05e 100644
> > > > --- a/include/linux/refcount.h
> > > > +++ b/include/linux/refcount.h
> > > > @@ -137,13 +137,23 @@ static inline unsigned int refcount_read(const refcount_t *r)
> > > >  }
> > > >  
> > > >  static inline __must_check __signed_wrap
> > > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > >  {
> > > >  	int old = refcount_read(r);
> > > >  
> > > >  	do {
> > > >  		if (!old)
> > > >  			break;
> > > >
> > > >  	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > >  
> > > >  	if (oldp)
> > > >		*oldp = old;    
> > > ?    
> > > >	if (unlikely(old < 0 || old + i < 0))
> > > >		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
> > > >
> > > >  	return old;
> > > >  }    
> > > 
> > > The saturate test just doesn't work as expected.
> > > In C signed integer overflow is undefined (probably so that cpu that saturate/trap
> > > signed overflow can be conformant) and gcc uses that to optimise code.
> > > 
> > > So if you compile (https://www.godbolt.org/z/WYWo84Weq):
> > > int inc_wraps(int i)
> > > {
> > >     return i < 0 || i + 1 < 0;
> > > }
> > > the second test is optimised away.
> > > I don't think the kernel compiles disable this optimisation.    
> > 
> > Last I checked, my kernel compiles specified -fno-strict-overflow.
> > What happens if you try that in godbolt?  
> 
> That does make gcc generate the wanted object code.
> I know that compilation option has come up before, but I couldn't remember the
> name or whether it was disabled :-(
> 
> You do get much better object code from return (i | i + 1) < 0;
> And that is likely to be much better still if you need a conditional jump.

I've just checked some more cases (see https://www.godbolt.org/z/YoM9odTbe).
gcc 11 onwards generates the same code for the two expressions.

Rather more worryingly clang 17.0.1 is getting this one wrong:
   return i < 0 || i + 1 < 0 ? foo(i) : bar(i);
It ignores the 'i + 1' test even with -fno-strict-overflow.
That is more representative of the actual code.

What have I missed now?

	David


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11 17:11         ` Suren Baghdasaryan
@ 2025-01-11 23:44           ` Hillf Danton
  2025-01-12  0:31             ` Suren Baghdasaryan
  2025-01-15  9:39           ` Peter Zijlstra
  1 sibling, 1 reply; 140+ messages in thread
From: Hillf Danton @ 2025-01-11 23:44 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, hannes, linux-mm, linux-kernel,
	Vlastimil Babka

On Sat, 11 Jan 2025 09:11:52 -0800 Suren Baghdasaryan <surenb@google.com>
> I see your point. I think it's a strong argument to use atomic
> directly instead of refcount for this locking. I'll try that and see
> how it looks. Thanks for the feedback!
>
Better not before having a clear answer to why it is sane to invent
anything like rwsem in 2025. What, the 40 bytes? Nope, it is the
fair price paid for finer locking granularity.

BTW Vlastimil, the cc list is cut down because I have to work around
the spam check on the mail agent side.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11 23:44           ` Hillf Danton
@ 2025-01-12  0:31             ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-12  0:31 UTC (permalink / raw)
  To: Hillf Danton
  Cc: akpm, peterz, willy, hannes, linux-mm, linux-kernel,
	Vlastimil Babka

On Sat, Jan 11, 2025 at 3:45 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Sat, 11 Jan 2025 09:11:52 -0800 Suren Baghdasaryan <surenb@google.com>
> > I see your point. I think it's a strong argument to use atomic
> > directly instead of refcount for this locking. I'll try that and see
> > how it looks. Thanks for the feedback!
> >
> Better not before having a clear answer to why it is sane to invent
> anything like rwsem in 2025. What, the 40 bytes? Nope, it is the
> fair price paid for finer locking granularity.

It's not just about the 40 bytes. It allows us to fold the separate
vma->detached flag nicely into the same refcounter, which consolidates
the vma state in one place. Later that makes it much easier to add
SLAB_TYPESAFE_BY_RCU because now we have to preserve only this
refcounter during the vma reuse.
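
As a sketch of the state encoding that makes this folding possible (helper
names invented for illustration; VMA_LOCK_OFFSET is the top usable bit from
the series):

	/*
	 * vm_refcnt encoding, per the patch changelog:
	 *   0                 - vma is detached
	 *   1                 - attached, no readers
	 *   1 + N             - attached, N readers hold the read lock
	 *   VMA_LOCK_OFFSET   - writer bit, set while a writer drains readers
	 */
	static inline bool vma_is_detached_sketch(struct vm_area_struct *vma)
	{
		return refcount_read(&vma->vm_refcnt) == 0;
	}

	static inline bool vma_writer_present_sketch(struct vm_area_struct *vma)
	{
		return refcount_read(&vma->vm_refcnt) & VMA_LOCK_OFFSET;
	}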

>
> BTW Vlastimil, the cc list is cut down because I have to walk around
> the spam check on the mail agent side.
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11  4:25 ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
  2025-01-11 11:24   ` Mateusz Guzik
@ 2025-01-12  2:59   ` Wei Yang
  2025-01-12 17:35     ` Suren Baghdasaryan
  2025-01-13  2:37   ` Wei Yang
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 140+ messages in thread
From: Wei Yang @ 2025-01-12  2:59 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>rw_semaphore is a sizable structure of 40 bytes and consumes
>considerable space for each vm_area_struct. However vma_lock has
>two important specifics which can be used to replace rw_semaphore
>with a simpler structure:
>1. Readers never wait. They try to take the vma_lock and fall back to
>mmap_lock if that fails.
>2. Only one writer at a time will ever try to write-lock a vma_lock
>because writers first take mmap_lock in write mode.
>Because of these requirements, full rw_semaphore functionality is not
>needed and we can replace rw_semaphore and the vma->detached flag with
>a refcount (vm_refcnt).

This paragraph is merged into the above one in the commit log, which may not
be what you expect.

Just a format issue, not sure why they are not separated.

>When vma is in detached state, vm_refcnt is 0 and only a call to
>vma_mark_attached() can take it out of this state. Note that unlike
>before, now we enforce both vma_mark_attached() and vma_mark_detached()
>to be done only after vma has been write-locked. vma_mark_attached()
>changes vm_refcnt to 1 to indicate that it has been attached to the vma
>tree. When a reader takes read lock, it increments vm_refcnt, unless the
>top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
>a writer. When writer takes write lock, it sets the top usable bit to
>indicate its presence. If there are readers, writer will wait using newly
>introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
>mode first, there can be only one writer at a time. The last reader to
>release the lock will signal the writer to wake up.
>refcount might overflow if there are many competing readers, in which case
>read-locking will fail. Readers are expected to handle such failures.
>In summary:
>1. all readers increment the vm_refcnt;
>2. writer sets top usable (writer) bit of vm_refcnt;
>3. readers cannot increment the vm_refcnt if the writer bit is set;
>4. in the presence of readers, writer must wait for the vm_refcnt to drop
>to 1 (ignoring the writer bit), indicating an attached vma with no readers;

It waits until (VMA_LOCK_OFFSET + 1), as indicated in __vma_start_write(),
if I am right.

>5. vm_refcnt overflow is handled by the readers.
>
>While this vm_lock replacement does not yet result in a smaller
>vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
>allows for further size optimization by structure member regrouping
>to bring the size of vm_area_struct below 192 bytes.
>
-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited - clang 17.0.1 bug
  2025-01-11 22:50         ` [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited - clang 17.0.1 bug David Laight
@ 2025-01-12 11:37           ` David Laight
  2025-01-12 17:56             ` Paul E. McKenney
  0 siblings, 1 reply; 140+ messages in thread
From: David Laight @ 2025-01-12 11:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Suren Baghdasaryan, akpm, peterz, willy, liam.howlett,
	lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, nathan

On Sat, 11 Jan 2025 22:50:16 +0000
David Laight <david.laight.linux@gmail.com> wrote:

> On Sat, 11 Jan 2025 22:19:39 +0000
> David Laight <david.laight.linux@gmail.com> wrote:
> 
> > On Sat, 11 Jan 2025 10:30:40 -0800
> > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >   
> > > On Sat, Jan 11, 2025 at 12:39:00PM +0000, David Laight wrote:    
> > > > On Fri, 10 Jan 2025 20:25:57 -0800
> > > > Suren Baghdasaryan <surenb@google.com> wrote:
> > > >       
> > > > > Introduce functions to increase refcount but with a top limit above which
> > > > > they will fail to increase (the limit is inclusive). Setting the limit to
> > > > > INT_MAX indicates no limit.      
> > > > 
> > > > This function has never worked as expected!
> > > > I've removed the update and added in the rest of the code.
> > > >       
> > > > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > > > index 35f039ecb272..5072ba99f05e 100644
> > > > > --- a/include/linux/refcount.h
> > > > > +++ b/include/linux/refcount.h
> > > > > @@ -137,13 +137,23 @@ static inline unsigned int refcount_read(const refcount_t *r)
> > > > >  }
> > > > >  
> > > > >  static inline __must_check __signed_wrap
> > > > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > > >  {
> > > > >  	int old = refcount_read(r);
> > > > >  
> > > > >  	do {
> > > > >  		if (!old)
> > > > >  			break;
> > > > >
> > > > >  	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > > >  
> > > > >  	if (oldp)
> > > > >		*oldp = old;      
> > > > ?      
> > > > >	if (unlikely(old < 0 || old + i < 0))
> > > > >		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
> > > > >
> > > > >  	return old;
> > > > >  }      
> > > > 
> > > > The saturate test just doesn't work as expected.
> > > > In C signed integer overflow is undefined (probably so that cpu that saturate/trap
> > > > signed overflow can be conformant) and gcc uses that to optimise code.
> > > > 
> > > > So if you compile (https://www.godbolt.org/z/WYWo84Weq):
> > > > int inc_wraps(int i)
> > > > {
> > > >     return i < 0 || i + 1 < 0;
> > > > }
> > > > the second test is optimised away.
> > > > I don't think the kernel compiles disable this optimisation.      
> > > 
> > > Last I checked, my kernel compiles specified -fno-strict-overflow.
> > > What happens if you try that in godbolt?    
> > 
> > That does make gcc generate the wanted object code.
> > I know that compilation option has come up before, but I couldn't remember the
> > name or whether it was disabled :-(
> > 
> > You do get much better object code from return (i | i + 1) < 0;
> > And that is likely to be much better still if you need a conditional jump.  
> 
> I've just checked some more cases (see https://www.godbolt.org/z/YoM9odTbe).
> gcc 11 onwards generates the same code for the two expressions.
> 
> Rather more worryingly clang 17.0.1 is getting this one wrong:
>    return i < 0 || i + 1 < 0 ? foo(i) : bar(i);
> It ignores the 'i + 1' test even with -fno-strict-overflow.
> That is more representative of the actual code.
> 
> What have I missed now?

A different optimisation :-(
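
For reference, a small stand-alone harness for the two forms compared
earlier in the thread; this is only a sketch of the expressions under
discussion (the function names are made up), not kernel code. Note both
variants compute i + 1 and so still rely on -fno-strict-overflow (or
-fwrapv) to make the signed wrap defined:

	/* build e.g. with: gcc -O2 -fno-strict-overflow wrap.c */
	#include <limits.h>
	#include <stdio.h>

	/* the comparison-based check */
	static int inc_wraps_cmp(int i)
	{
		return i < 0 || i + 1 < 0;
	}

	/* the (i | i + 1) < 0 form mentioned above */
	static int inc_wraps_or(int i)
	{
		return (i | (i + 1)) < 0;
	}

	int main(void)
	{
		int samples[] = { 0, 1, INT_MAX - 1, INT_MAX, -1, INT_MIN };
		size_t n;

		for (n = 0; n < sizeof(samples) / sizeof(samples[0]); n++)
			printf("%d: %d %d\n", samples[n],
			       inc_wraps_cmp(samples[n]), inc_wraps_or(samples[n]));
		return 0;
	}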

> 
> 	David
> 


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-12  2:59   ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Wei Yang
@ 2025-01-12 17:35     ` Suren Baghdasaryan
  2025-01-13  0:59       ` Wei Yang
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-12 17:35 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 6:59 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >rw_semaphore is a sizable structure of 40 bytes and consumes
> >considerable space for each vm_area_struct. However vma_lock has
> >two important specifics which can be used to replace rw_semaphore
> >with a simpler structure:
> >1. Readers never wait. They try to take the vma_lock and fall back to
> >mmap_lock if that fails.
> >2. Only one writer at a time will ever try to write-lock a vma_lock
> >because writers first take mmap_lock in write mode.
> >Because of these requirements, full rw_semaphore functionality is not
> >needed and we can replace rw_semaphore and the vma->detached flag with
> >a refcount (vm_refcnt).
>
> This paragraph is merged into the above one in the commit log, which may not
> be what you expect.
>
> Just a format issue, not sure why they are not separated.

I'll double-check the formatting. Thanks!

>
> >When vma is in detached state, vm_refcnt is 0 and only a call to
> >vma_mark_attached() can take it out of this state. Note that unlike
> >before, now we enforce both vma_mark_attached() and vma_mark_detached()
> >to be done only after vma has been write-locked. vma_mark_attached()
> >changes vm_refcnt to 1 to indicate that it has been attached to the vma
> >tree. When a reader takes read lock, it increments vm_refcnt, unless the
> >top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> >a writer. When writer takes write lock, it sets the top usable bit to
> >indicate its presence. If there are readers, writer will wait using newly
> >introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> >mode first, there can be only one writer at a time. The last reader to
> >release the lock will signal the writer to wake up.
> >refcount might overflow if there are many competing readers, in which case
> >read-locking will fail. Readers are expected to handle such failures.
> >In summary:
> >1. all readers increment the vm_refcnt;
> >2. writer sets top usable (writer) bit of vm_refcnt;
> >3. readers cannot increment the vm_refcnt if the writer bit is set;
> >4. in the presence of readers, writer must wait for the vm_refcnt to drop
> >to 1 (ignoring the writer bit), indicating an attached vma with no readers;
>
> It waits until (VMA_LOCK_OFFSET + 1), as indicated in __vma_start_write(),
> if I am right.

Yeah, that's why I mentioned "(ignoring the writer bit)" but maybe
that's too confusing. How about "drop to 1 (plus the VMA_LOCK_OFFSET
writer bit)"?

>
> >5. vm_refcnt overflow is handled by the readers.
> >
> >While this vm_lock replacement does not yet result in a smaller
> >vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
> >allows for further size optimization by structure member regrouping
> >to bring the size of vm_area_struct below 192 bytes.
> >
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited - clang 17.0.1 bug
  2025-01-12 11:37           ` David Laight
@ 2025-01-12 17:56             ` Paul E. McKenney
  0 siblings, 0 replies; 140+ messages in thread
From: Paul E. McKenney @ 2025-01-12 17:56 UTC (permalink / raw)
  To: David Laight
  Cc: Suren Baghdasaryan, akpm, peterz, willy, liam.howlett,
	lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team, nathan

On Sun, Jan 12, 2025 at 11:37:56AM +0000, David Laight wrote:
> On Sat, 11 Jan 2025 22:50:16 +0000
> David Laight <david.laight.linux@gmail.com> wrote:
> 
> > On Sat, 11 Jan 2025 22:19:39 +0000
> > David Laight <david.laight.linux@gmail.com> wrote:
> > 
> > > On Sat, 11 Jan 2025 10:30:40 -0800
> > > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > >   
> > > > On Sat, Jan 11, 2025 at 12:39:00PM +0000, David Laight wrote:    
> > > > > On Fri, 10 Jan 2025 20:25:57 -0800
> > > > > Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >       
> > > > > > Introduce functions to increase refcount but with a top limit above which
> > > > > > they will fail to increase (the limit is inclusive). Setting the limit to
> > > > > > INT_MAX indicates no limit.      
> > > > > 
> > > > > This function has never worked as expected!
> > > > > I've removed the update and added in the rest of the code.
> > > > >       
> > > > > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > > > > index 35f039ecb272..5072ba99f05e 100644
> > > > > > --- a/include/linux/refcount.h
> > > > > > +++ b/include/linux/refcount.h
> > > > > > @@ -137,13 +137,23 @@ static inline unsigned int refcount_read(const refcount_t *r)
> > > > > >  }
> > > > > >  
> > > > > >  static inline __must_check __signed_wrap
> > > > > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > > > >  {
> > > > > >  	int old = refcount_read(r);
> > > > > >  
> > > > > >  	do {
> > > > > >  		if (!old)
> > > > > >  			break;
> > > > > >
> > > > > >  	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > > > >  
> > > > > >  	if (oldp)
> > > > > >		*oldp = old;      
> > > > > ?      
> > > > > >	if (unlikely(old < 0 || old + i < 0))
> > > > > >		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
> > > > > >
> > > > > >  	return old;
> > > > > >  }      
> > > > > 
> > > > > The saturate test just doesn't work as expected.
> > > > > In C signed integer overflow is undefined (probably so that cpu that saturate/trap
> > > > > signed overflow can be conformant) and gcc uses that to optimise code.
> > > > > 
> > > > > So if you compile (https://www.godbolt.org/z/WYWo84Weq):
> > > > > int inc_wraps(int i)
> > > > > {
> > > > >     return i < 0 || i + 1 < 0;
> > > > > }
> > > > > the second test is optimised away.
> > > > > I don't think the kernel compiles disable this optimisation.      
> > > > 
> > > > Last I checked, my kernel compiles specified -fno-strict-overflow.
> > > > What happens if you try that in godbolt?    
> > > 
> > > That does make gcc generate the wanted object code.
> > > I know that compilation option has come up before, but I couldn't remember the
> > > name or whether it was disabled :-(
> > > 
> > > You do get much better object code from return (i | i + 1) < 0;
> > > And that is likely to be much better still if you need a conditional jump.  
> > 
> > I've just checked some more cases (see https://www.godbolt.org/z/YoM9odTbe).
> > gcc 11 onwards generates the same code for the two expressions.
> > 
> > Rather more worryingly clang 17.0.1 is getting this one wrong:
> >    return i < 0 || i + 1 < 0 ? foo(i) : bar(i);
> > It ignores the 'i + 1' test even with -fno-strict-overflow.
> > That is more representative of the actual code.
> > 
> > What have I missed now?
> 
> A different optimisation :-(

So the Linux kernel is good with signed integer overflow, right?

(Give or take compiler bugs, of course...)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-12 17:35     ` Suren Baghdasaryan
@ 2025-01-13  0:59       ` Wei Yang
  0 siblings, 0 replies; 140+ messages in thread
From: Wei Yang @ 2025-01-13  0:59 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Wei Yang, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sun, Jan 12, 2025 at 09:35:25AM -0800, Suren Baghdasaryan wrote:
>On Sat, Jan 11, 2025 at 6:59 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>> >rw_semaphore is a sizable structure of 40 bytes and consumes
>> >considerable space for each vm_area_struct. However vma_lock has
>> >two important specifics which can be used to replace rw_semaphore
>> >with a simpler structure:
>> >1. Readers never wait. They try to take the vma_lock and fall back to
>> >mmap_lock if that fails.
>> >2. Only one writer at a time will ever try to write-lock a vma_lock
>> >because writers first take mmap_lock in write mode.
>> >Because of these requirements, full rw_semaphore functionality is not
>> >needed and we can replace rw_semaphore and the vma->detached flag with
>> >a refcount (vm_refcnt).
>>
>> This paragraph is merged into the above one in the commit log, which may not
>> be what you expect.
>>
>> Just a format issue, not sure why they are not separated.
>
>I'll double-check the formatting. Thanks!
>
>>
>> >When vma is in detached state, vm_refcnt is 0 and only a call to
>> >vma_mark_attached() can take it out of this state. Note that unlike
>> >before, now we enforce both vma_mark_attached() and vma_mark_detached()
>> >to be done only after vma has been write-locked. vma_mark_attached()
>> >changes vm_refcnt to 1 to indicate that it has been attached to the vma
>> >tree. When a reader takes read lock, it increments vm_refcnt, unless the
>> >top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
>> >a writer. When writer takes write lock, it sets the top usable bit to
>> >indicate its presence. If there are readers, writer will wait using newly
>> >introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
>> >mode first, there can be only one writer at a time. The last reader to
>> >release the lock will signal the writer to wake up.
>> >refcount might overflow if there are many competing readers, in which case
>> >read-locking will fail. Readers are expected to handle such failures.
>> >In summary:
>> >1. all readers increment the vm_refcnt;
>> >2. writer sets top usable (writer) bit of vm_refcnt;
>> >3. readers cannot increment the vm_refcnt if the writer bit is set;
>> >4. in the presence of readers, writer must wait for the vm_refcnt to drop
>> >to 1 (ignoring the writer bit), indicating an attached vma with no readers;
>>
>> It waits until (VMA_LOCK_OFFSET + 1), as indicated in __vma_start_write(),
>> if I am right.
>
>Yeah, that's why I mentioned "(ignoring the writer bit)" but maybe
>that's too confusing. How about "drop to 1 (plus the VMA_LOCK_OFFSET
>writer bit)"?
>

Hmm.. hard to say. It is a little confusing, but I don't have a better one :-(
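
Maybe spelling out the values helps; a rough sketch of the condition being
described, assuming VMA_LOCK_OFFSET = 0x40000000 as in the quoted commit
message (the helper name here is made up):

	#define VMA_LOCK_OFFSET	0x40000000

	/*
	 * vm_refcnt values, per the commit message above:
	 *   0                       - detached
	 *   1                       - attached, no readers
	 *   1 + N                   - attached, N readers
	 *   VMA_LOCK_OFFSET + 1 + N - writer bit set, N readers still draining
	 */
	static inline bool vma_writer_wait_done(int refcnt)
	{
		/* "drop to 1 plus the writer bit": attached, writer, no readers */
		return refcnt == VMA_LOCK_OFFSET + 1;
	}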

>>
>> >5. vm_refcnt overflow is handled by the readers.
>> >
>> >While this vm_lock replacement does not yet result in a smaller
>> >vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
>> >allows for further size optimization by structure member regrouping
>> >to bring the size of vm_area_struct below 192 bytes.
>> >
>> --
>> Wei Yang
>> Help you, Help me

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11 20:14     ` Suren Baghdasaryan
                         ` (2 preceding siblings ...)
  2025-01-11 20:38       ` Vlastimil Babka
@ 2025-01-13  1:47       ` Wei Yang
  2025-01-13  2:25         ` Wei Yang
  2025-01-13 21:08         ` Suren Baghdasaryan
  2025-01-15 10:48       ` Peter Zijlstra
  4 siblings, 2 replies; 140+ messages in thread
From: Wei Yang @ 2025-01-13  1:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Mateusz Guzik, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
>On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>>
>> So there were quite a few iterations of the patch and I have not been
>> reading majority of the feedback, so it may be I missed something,
>> apologies upfront. :)
>>

Hi, I am new to memory barriers. Hope I am not bothering.

>> >  /*
>> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
>> >   * locked result to avoid performance overhead, in which case we fall back to
>> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
>> >   */
>> >  static inline bool vma_start_read(struct vm_area_struct *vma)
>> >  {
>> > +     int oldcnt;
>> > +
>> >       /*
>> >        * Check before locking. A race might cause false locked result.
>> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
>> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
>> >               return false;
>> >
>> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
>> > +     /*
>> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
>> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
>> > +      */
>> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
>> > +                                                   VMA_REF_LIMIT)))
>> >               return false;
>> >
>>
>> Replacing down_read_trylock() with the new routine loses an acquire
>> fence. That alone is not a problem, but see below.
>
>Hmm. I think this acquire fence is actually necessary. We don't want
>the later vm_lock_seq check to be reordered and happen before we take
>the refcount. Otherwise this might happen:
>
>reader             writer
>if (vm_lock_seq == mm_lock_seq) // check got reordered
>        return false;
>                       vm_refcnt += VMA_LOCK_OFFSET
>                       vm_lock_seq == mm_lock_seq
>                       vm_refcnt -= VMA_LOCK_OFFSET
>if (!__refcount_inc_not_zero_limited())
>        return false;
>
>Both reader's checks will pass and the reader would read-lock a vma
>that was write-locked.
>

What we plan to do here is define __refcount_inc_not_zero_limited() with an
acquire fence, e.g. with atomic_try_cmpxchg_acquire(), right?
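
Just as a sketch (not the posted code), the acquire variant could mirror the
quoted helper with the relaxed cmpxchg swapped for
atomic_try_cmpxchg_acquire(), e.g.:

	static inline __must_check __signed_wrap
	bool __refcount_add_not_zero_limited_acquire(int i, refcount_t *r,
						     int *oldp, int limit)
	{
		int old = refcount_read(r);

		do {
			if (!old)
				break;
			/* illustrative limit check; the posted patch may differ */
			if (i > limit - old)
				return false;
			/* acquire on success pairs with the writer's release */
		} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));

		if (oldp)
			*oldp = old;

		if (unlikely(old < 0 || old + i < 0))
			refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);

		return old;
	}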

>>
>> > +     rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
>> >       /*
>> > -      * Overflow might produce false locked result.
>> > +      * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
>> >        * False unlocked result is impossible because we modify and check
>> > -      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
>> > +      * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
>> >        * modification invalidates all existing locks.
>> >        *
>> >        * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
>> > @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>> >        * This pairs with RELEASE semantics in vma_end_write_all().
>> >        */
>> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {

One question here is: would the compiler optimize away the read of
vm_lock_seq here, since we have already read it at the beginning?

Or, with the acquire fence added above, the compiler won't optimize it away.
Or should we use READ_ONCE(vma->vm_lock_seq) here?

>>
>> The previous modification of this spot to raw_read_seqcount loses the
>> acquire fence, making the above comment not line up with the code.
>
>Is it? From reading the seqcount code
>(https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
>
>raw_read_seqcount()
>    seqprop_sequence()
>        __seqprop(s, sequence)
>            __seqprop_sequence()
>                smp_load_acquire()
>
>smp_load_acquire() still provides the acquire fence. Am I missing something?
>
>>
>> I don't know if the stock code (with down_read_trylock()) is correct as
>> is -- looks fine for cursory reading fwiw. However, if it indeed works,
>> the acquire fence stemming from the lock routine is a mandatory part of
>> it afaics.
>>
>> I think the best way forward is to add a new refcount routine which
>> ships with an acquire fence.
>
>I plan on replacing refcount_t usage here with an atomic since, as
>Hillf noted, refcount is not designed to be used for locking. And will
>make sure the down_read_trylock() replacement will provide an acquire
>fence.
>

Hmm.. refcount_t is defined with atomic_t. I am lost as to why replacing refcount_t
with atomic_t would help.

>>
>> Otherwise I would suggest:
>> 1. a comment above __refcount_inc_not_zero_limited saying there is an
>>    acq fence issued later
>> 2. smp_rmb() slapped between that and seq accesses
>>
>> If the now removed fence is somehow not needed, I think a comment
>> explaining it is necessary.
>>
>> > @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>> >
>> >  static inline void vma_assert_locked(struct vm_area_struct *vma)
>> >  {
>> > -     if (!rwsem_is_locked(&vma->vm_lock.lock))
>> > +     if (refcount_read(&vma->vm_refcnt) <= 1)
>> >               vma_assert_write_locked(vma);
>> >  }
>> >
>>
>> This now forces the compiler to emit a load from vm_refcnt even if
>> vma_assert_write_locked expands to nothing. iow this wants to hide
>> behind the same stuff as vma_assert_write_locked.
>
>True. I guess I'll have to avoid using vma_assert_write_locked() like this:
>
>static inline void vma_assert_locked(struct vm_area_struct *vma)
>{
>        unsigned int mm_lock_seq;
>
>        VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
>                      !__is_vma_write_locked(vma, &mm_lock_seq), vma);
>}
>
>Will make the change.
>
>Thanks for the feedback!

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-13  1:47       ` Wei Yang
@ 2025-01-13  2:25         ` Wei Yang
  2025-01-13 21:14           ` Suren Baghdasaryan
  2025-01-13 21:08         ` Suren Baghdasaryan
  1 sibling, 1 reply; 140+ messages in thread
From: Wei Yang @ 2025-01-13  2:25 UTC (permalink / raw)
  To: Wei Yang
  Cc: Suren Baghdasaryan, Mateusz Guzik, akpm, peterz, willy,
	liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko, vbabka,
	hannes, oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, lokeshgidra, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, klarasmodin, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 01:47:29AM +0000, Wei Yang wrote:
>On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
>>On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>>>
>>> So there were quite a few iterations of the patch and I have not been
>>> reading majority of the feedback, so it may be I missed something,
>>> apologies upfront. :)
>>>
>
>Hi, I am new to memory barriers. Hope I am not bothering.
>
>>> >  /*
>>> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
>>> >   * locked result to avoid performance overhead, in which case we fall back to
>>> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
>>> >   */
>>> >  static inline bool vma_start_read(struct vm_area_struct *vma)
>>> >  {
>>> > +     int oldcnt;
>>> > +
>>> >       /*
>>> >        * Check before locking. A race might cause false locked result.
>>> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
>>> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>>> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
>>> >               return false;
>>> >
>>> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
>>> > +     /*
>>> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
>>> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
>>> > +      */
>>> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
>>> > +                                                   VMA_REF_LIMIT)))
>>> >               return false;
>>> >
>>>
>>> Replacing down_read_trylock() with the new routine loses an acquire
>>> fence. That alone is not a problem, but see below.
>>
>>Hmm. I think this acquire fence is actually necessary. We don't want
>>the later vm_lock_seq check to be reordered and happen before we take
>>the refcount. Otherwise this might happen:
>>
>>reader             writer
>>if (vm_lock_seq == mm_lock_seq) // check got reordered
>>        return false;
>>                       vm_refcnt += VMA_LOCK_OFFSET
>>                       vm_lock_seq == mm_lock_seq
>>                       vm_refcnt -= VMA_LOCK_OFFSET
>>if (!__refcount_inc_not_zero_limited())
>>        return false;
>>
>>Both reader's checks will pass and the reader would read-lock a vma
>>that was write-locked.
>>
>
>What we plan to do here is define __refcount_inc_not_zero_limited() with an
>acquire fence, e.g. with atomic_try_cmpxchg_acquire(), right?
>

BTW, usually we pair acquire with release.

The __vma_start_write() provides a release fence when locked, so for this part
we are ok, right?  
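
For background, a generic illustration of that pairing using the kernel's
barrier helpers (this is not the vma code, just the usual pattern):

	static int data;
	static int flag;

	static void producer(void)
	{
		WRITE_ONCE(data, 1);
		/* release: the store to data is ordered before flag becomes 1 */
		smp_store_release(&flag, 1);
	}

	static int consumer(void)
	{
		/* acquire: pairs with the producer's release above */
		if (smp_load_acquire(&flag))
			return READ_ONCE(data);	/* guaranteed to observe 1 */
		return 0;
	}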


-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11  4:25 ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
  2025-01-11 11:24   ` Mateusz Guzik
  2025-01-12  2:59   ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Wei Yang
@ 2025-01-13  2:37   ` Wei Yang
  2025-01-13 21:16     ` Suren Baghdasaryan
  2025-01-13  9:36   ` Wei Yang
  2025-01-15  2:58   ` Wei Yang
  4 siblings, 1 reply; 140+ messages in thread
From: Wei Yang @ 2025-01-13  2:37 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> static inline void vma_end_read(struct vm_area_struct *vma) {}
>@@ -908,12 +948,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> 	vma->vm_mm = mm;
> 	vma->vm_ops = &vma_dummy_vm_ops;
> 	INIT_LIST_HEAD(&vma->anon_vma_chain);
>-#ifdef CONFIG_PER_VMA_LOCK
>-	/* vma is not locked, can't use vma_mark_detached() */
>-	vma->detached = true;
>-#endif
> 	vma_numab_state_init(vma);
>-	vma_lock_init(vma);
>+	vma_lock_init(vma, false);

vma_init(vma, mm)
  memset(vma, 0, sizeof(*vma))
  ...
  vma_lock_init(vma, false);

It looks like the vm_refcnt must be reset.

BTW, I can't figure out why we want to skip the reset of vm_refcnt. Is this
related to SLAB_TYPESAFE_BY_RCU?
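
Presumably the skip here is just because vma_init() has already zeroed the
whole structure with the memset() above, so vm_refcnt is 0 anyway. From the
call site the helper would look roughly like the sketch below (the parameter
name and body are guesses, not the actual patch):

	/* guessed shape, based only on the vma_lock_init(vma, false) call above */
	static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
	{
		if (reset_refcnt)
			refcount_set(&vma->vm_refcnt, 0);
	}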

> }
> 

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11  4:25 ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
                     ` (2 preceding siblings ...)
  2025-01-13  2:37   ` Wei Yang
@ 2025-01-13  9:36   ` Wei Yang
  2025-01-13 21:18     ` Suren Baghdasaryan
  2025-01-15  2:58   ` Wei Yang
  4 siblings, 1 reply; 140+ messages in thread
From: Wei Yang @ 2025-01-13  9:36 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
[...]
> 
>+static inline bool is_vma_writer_only(int refcnt)
>+{
>+	/*
>+	 * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma
>+	 * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on
>+	 * a detached vma happens only in vma_mark_detached() and is a rare
>+	 * case, therefore most of the time there will be no unnecessary wakeup.
>+	 */
>+	return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1;

It looks equivalent to 

	return (refcnt == VMA_LOCK_OFFSET) || (refcnt == VMA_LOCK_OFFSET + 1);

And its generated code looks a little simpler.
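
A quick user-space check agrees for the non-negative values of interest
(the harness below is just for illustration, with the offset taken from the
commit message):

	#include <stdio.h>

	#define VMA_LOCK_OFFSET 0x40000000

	static int writer_only_orig(int refcnt)
	{
		return (refcnt & VMA_LOCK_OFFSET) && refcnt <= VMA_LOCK_OFFSET + 1;
	}

	static int writer_only_alt(int refcnt)
	{
		return refcnt == VMA_LOCK_OFFSET || refcnt == VMA_LOCK_OFFSET + 1;
	}

	int main(void)
	{
		int samples[] = { 0, 1, 2, VMA_LOCK_OFFSET - 1, VMA_LOCK_OFFSET,
				  VMA_LOCK_OFFSET + 1, VMA_LOCK_OFFSET + 2 };
		size_t n;

		for (n = 0; n < sizeof(samples) / sizeof(samples[0]); n++)
			printf("%#x: %d %d\n", samples[n],
			       writer_only_orig(samples[n]), writer_only_alt(samples[n]));
		return 0;
	}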

>+}
>+
>+static inline void vma_refcount_put(struct vm_area_struct *vma)
>+{
>+	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
>+	struct mm_struct *mm = vma->vm_mm;
>+	int oldcnt;
>+
>+	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
>+	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
>+
>+		if (is_vma_writer_only(oldcnt - 1))
>+			rcuwait_wake_up(&mm->vma_writer_wait);
>+	}
>+}
>+

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas
  2025-01-11  4:25 ` [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas Suren Baghdasaryan
@ 2025-01-13 11:58   ` Lorenzo Stoakes
  2025-01-13 16:31     ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 11:58 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:51PM -0800, Suren Baghdasaryan wrote:
> vma_iter_store() functions can be used both when adding a new vma and
> when updating an existing one. However for existing ones we do not need
> to mark them attached as they are already marked that way. Introduce
> vma_iter_store_attached() to be used with already attached vmas.

OK I guess the intent of this is to reinstate the previously existing
asserts, only explicitly checking those places where we attach.

I'm a little concerned that by doing this, somebody might simply invoke
this function without realising the implications.

Can we have something functional like

vma_iter_store_new() and vma_iter_store_overwrite()

?

I don't like us just leaving vma_iter_store() quietly making an assumption
that a caller doesn't necessarily realise.

Also it's more greppable this way.
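
Something like the below, say, sketched on top of the helpers in the quoted
hunk (the two names are just the suggestion above, nothing more):

	/* overwrite an entry for a VMA that is already attached */
	static inline void vma_iter_store_overwrite(struct vma_iterator *vmi,
						    struct vm_area_struct *vma)
	{
		vma_iter_store_attached(vmi, vma);
	}

	/* insert a new VMA and mark it attached */
	static inline void vma_iter_store_new(struct vma_iterator *vmi,
					      struct vm_area_struct *vma)
	{
		vma_mark_attached(vma);
		vma_iter_store_overwrite(vmi, vma);
	}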

I had a look through callers and it does seem you've snagged them all
correctly.

>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mm.h | 12 ++++++++++++
>  mm/vma.c           |  8 ++++----
>  mm/vma.h           | 11 +++++++++--
>  3 files changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2b322871da87..2f805f1a0176 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
>  		vma_assert_write_locked(vma);
>  }
>
> +static inline void vma_assert_attached(struct vm_area_struct *vma)
> +{
> +	VM_BUG_ON_VMA(vma->detached, vma);
> +}
> +
> +static inline void vma_assert_detached(struct vm_area_struct *vma)
> +{
> +	VM_BUG_ON_VMA(!vma->detached, vma);
> +}
> +
>  static inline void vma_mark_attached(struct vm_area_struct *vma)
>  {
>  	vma->detached = false;
> @@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
>  static inline void vma_start_write(struct vm_area_struct *vma) {}
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>  		{ mmap_assert_write_locked(vma->vm_mm); }
> +static inline void vma_assert_attached(struct vm_area_struct *vma) {}
> +static inline void vma_assert_detached(struct vm_area_struct *vma) {}
>  static inline void vma_mark_attached(struct vm_area_struct *vma) {}
>  static inline void vma_mark_detached(struct vm_area_struct *vma) {}
>
> diff --git a/mm/vma.c b/mm/vma.c
> index d603494e69d7..b9cf552e120c 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
>  	vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
>
>  	if (expanded)
> -		vma_iter_store(vmg->vmi, vmg->vma);
> +		vma_iter_store_attached(vmg->vmi, vmg->vma);
>
>  	if (adj_start) {
>  		adjust->vm_start += adj_start;
>  		adjust->vm_pgoff += PHYS_PFN(adj_start);
>  		if (adj_start < 0) {
>  			WARN_ON(expanded);
> -			vma_iter_store(vmg->vmi, adjust);
> +			vma_iter_store_attached(vmg->vmi, adjust);
>  		}
>  	}

I kind of feel this whole function (that yes, I added :>, though derived
from existing logic) needs rework, as it's necessarily rather confusing.

But hey, that's on me :)

But this does look right... OK see this as a note-to-self...

>
> @@ -2845,7 +2845,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  				anon_vma_interval_tree_pre_update_vma(vma);
>  				vma->vm_end = address;
>  				/* Overwrite old entry in mtree. */
> -				vma_iter_store(&vmi, vma);
> +				vma_iter_store_attached(&vmi, vma);
>  				anon_vma_interval_tree_post_update_vma(vma);
>
>  				perf_event_mmap(vma);
> @@ -2925,7 +2925,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
>  				vma->vm_start = address;
>  				vma->vm_pgoff -= grow;
>  				/* Overwrite old entry in mtree. */
> -				vma_iter_store(&vmi, vma);
> +				vma_iter_store_attached(&vmi, vma);
>  				anon_vma_interval_tree_post_update_vma(vma);
>
>  				perf_event_mmap(vma);
> diff --git a/mm/vma.h b/mm/vma.h
> index 2a2668de8d2c..63dd38d5230c 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -365,9 +365,10 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
>  }
>
>  /* Store a VMA with preallocated memory */
> -static inline void vma_iter_store(struct vma_iterator *vmi,
> -				  struct vm_area_struct *vma)
> +static inline void vma_iter_store_attached(struct vma_iterator *vmi,
> +					   struct vm_area_struct *vma)
>  {
> +	vma_assert_attached(vma);
>
>  #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
>  	if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start &&
> @@ -390,7 +391,13 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
>
>  	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
>  	mas_store_prealloc(&vmi->mas, vma);
> +}
> +
> +static inline void vma_iter_store(struct vma_iterator *vmi,
> +				  struct vm_area_struct *vma)
> +{
>  	vma_mark_attached(vma);
> +	vma_iter_store_attached(vmi, vma);
>  }
>

See comment at top, and we need some comments here to explain why we're
going to pains to do this.

What about mm/nommu.c? I guess these cases are always new VMAs.

We definitely need to check this series in a nommu setup, have you
done this? As I can see this breaking things. Then again I suppose you'd
have expected bots to moan by now...

>  static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 05/17] mm: mark vmas detached upon exit
  2025-01-11  4:25 ` [PATCH v9 05/17] mm: mark vmas detached upon exit Suren Baghdasaryan
@ 2025-01-13 12:05   ` Lorenzo Stoakes
  2025-01-13 17:02     ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 12:05 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:52PM -0800, Suren Baghdasaryan wrote:
> When exit_mmap() removes vmas belonging to an exiting task, it does not
> mark them as detached since they can't be reached by other tasks and they
> will be freed shortly. Once we introduce vma reuse, all vmas will have to
> be in detached state before they are freed, to ensure a vma when reused is
> in a consistent state. Add missing vma_mark_detached() before freeing the
> vma.

Hmm this really makes me worry that we'll see bugs from this detached
stuff, do we make this assumption anywhere else I wonder?

>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

But regardless, prima facie, this looks fine, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/vma.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index b9cf552e120c..93ff42ac2002 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -413,10 +413,12 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
>  	if (vma->vm_file)
>  		fput(vma->vm_file);
>  	mpol_put(vma_policy(vma));
> -	if (unreachable)
> +	if (unreachable) {
> +		vma_mark_detached(vma);
>  		__vm_area_free(vma);
> -	else
> +	} else {
>  		vm_area_free(vma);
> +	}
>  }
>
>  /*
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (17 preceding siblings ...)
  2025-01-11  4:52 ` [PATCH v9 00/17] reimplement per-vma lock as a refcount Matthew Wilcox
@ 2025-01-13 12:14 ` Lorenzo Stoakes
  2025-01-13 16:58   ` Suren Baghdasaryan
  2025-01-14  1:49   ` Andrew Morton
  2025-01-28  5:26 ` Shivank Garg
  19 siblings, 2 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 12:14 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

A nit on subject, I mean this is part of what this series does, and hey -
we have only so much text to put in here - but isn't this both
reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
using the RCU typesafe mechanism?

Do we have to do both in one series? Can we split this out? I mean maybe
that's just churny and unnecessary, but to me this series is 'allocate VMAs
RCU safe and refcount VMA lock' or something like this. Maybe this is
nitty... but still :)

One general comment here - this is a really major change in how this stuff
works, and yet I don't see any tests anywhere in the series.

I know it's tricky to write tests for this, but the new VMA testing
environment should make it possible to test a _lot_ more than we previously
could.

However due to some (*ahem*) interesting distribution of where functions
are, most notably stuff in kernel/fork.c, I guess we can't test
_everything_ there effectively.

But I do feel like we should be able to do better than having absolutely no
testing added for this?

I think there's definitely quite a bit you could test now, at least in
asserting fundamentals in tools/testing/vma/vma.c.

This can cover at least detached state asserts in various scenarios.

But that won't cover off the really gnarly stuff here around RCU slab
allocation, and determining precisely how to test that in a sensible way is
maybe less clear.

But I'd like to see _something_ here please, this is more or less
fundamentally changing how all VMAs are allocated and to just have nothing
feels unfortunate.

I'm already nervous because we've hit issues coming up to v9 and we're not
100% sure if a recent syzkaller is related to these changes or not, I'm not
sure how much we can get assurances with tests but I'd like something.

Thanks!

On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> This patchset:
> 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned to minimize
> cacheline sharing;
> 2. changes vm_area_struct initialization to mark new vma as detached until
> it is inserted into vma tree;
> 3. replaces vm_lock and vma->detached flag with a reference counter;
> 4. regroups vm_area_struct members to fit them into 3 cachelines;
> 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> reuse and to minimize call_rcu() calls.
>
> Pagefault microbenchmarks show performance improvement:
> Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
>
> Changes since v8 [4]:
> - Change subject for the cover letter, per Vlastimil Babka
> - Added Reviewed-by and Acked-by, per Vlastimil Babka
> - Added static check for no-limit case in __refcount_add_not_zero_limited,
> per David Laight
> - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> per Hillf Danton and Vlastimil Babka
> - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> under us, per Vlastimil Babka
> - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> per Vlastimil Babka
> - Changed __vma_enter_locked() parameter to centralize refcount logic,
> per Vlastimil Babka
> - Amended description in vm_lock replacement patch explaining the effects
> of the patch on vm_area_struct size, per Vlastimil Babka
> - Added vm_area_struct member regrouping patch [5] into the series
> - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> - Added a comment for vm_area_struct to update vm_area_init_from() when
> adding new members, per Vlastimil Babka
> - Updated a comment about unstable src->shared.rb when copying a vma in
> vm_area_init_from(), per Vlastimil Babka
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
>
> Patchset applies over mm-unstable after reverting v8
> (current SHA range: 235b5129cb7b - 9e6b24c58985)
>
> Suren Baghdasaryan (17):
>   mm: introduce vma_start_read_locked{_nested} helpers
>   mm: move per-vma lock into vm_area_struct
>   mm: mark vma as detached until it's added into vma tree
>   mm: introduce vma_iter_store_attached() to use with attached vmas
>   mm: mark vmas detached upon exit
>   types: move struct rcuwait into types.h
>   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
>   mm: move mmap_init_lock() out of the header file
>   mm: uninline the main body of vma_start_write()
>   refcount: introduce __refcount_{add|inc}_not_zero_limited
>   mm: replace vm_lock and detached flag with a reference count
>   mm: move lesser used vma_area_struct members into the last cacheline
>   mm/debug: print vm_refcnt state when dumping the vma
>   mm: remove extra vma_numab_state_init() call
>   mm: prepare lock_vma_under_rcu() for vma reuse possibility
>   mm: make vma cache SLAB_TYPESAFE_BY_RCU
>   docs/mm: document latest changes to vm_lock
>
>  Documentation/mm/process_addrs.rst |  44 ++++----
>  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
>  include/linux/mm_types.h           |  75 +++++++-------
>  include/linux/mmap_lock.h          |   6 --
>  include/linux/rcuwait.h            |  13 +--
>  include/linux/refcount.h           |  24 ++++-
>  include/linux/slab.h               |   6 --
>  include/linux/types.h              |  12 +++
>  kernel/fork.c                      | 129 +++++++++++-------------
>  mm/debug.c                         |  12 +++
>  mm/init-mm.c                       |   1 +
>  mm/memory.c                        |  97 ++++++++++++++++--
>  mm/mmap.c                          |   3 +-
>  mm/userfaultfd.c                   |  32 +++---
>  mm/vma.c                           |  23 ++---
>  mm/vma.h                           |  15 ++-
>  tools/testing/vma/linux/atomic.h   |   5 +
>  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
>  18 files changed, 465 insertions(+), 281 deletions(-)
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 06/17] types: move struct rcuwait into types.h
  2025-01-11  4:25 ` [PATCH v9 06/17] types: move struct rcuwait into types.h Suren Baghdasaryan
@ 2025-01-13 14:46   ` Lorenzo Stoakes
  0 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 14:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:53PM -0800, Suren Baghdasaryan wrote:
> Move rcuwait struct definition into types.h so that rcuwait can be used
> without including rcuwait.h which includes other headers. Without this
> change mm_types.h can't use rcuwait due to the following circular
> dependency:
>
> mm_types.h -> rcuwait.h -> signal.h -> mm_types.h

Thanks for including details of motivation for this move :)

>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Davidlohr Bueso <dave@stgolabs.net>
> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>

Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/rcuwait.h | 13 +------------
>  include/linux/types.h   | 12 ++++++++++++
>  2 files changed, 13 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
> index 27343424225c..9ad134a04b41 100644
> --- a/include/linux/rcuwait.h
> +++ b/include/linux/rcuwait.h
> @@ -4,18 +4,7 @@
>
>  #include <linux/rcupdate.h>
>  #include <linux/sched/signal.h>
> -
> -/*
> - * rcuwait provides a way of blocking and waking up a single
> - * task in an rcu-safe manner.
> - *
> - * The only time @task is non-nil is when a user is blocked (or
> - * checking if it needs to) on a condition, and reset as soon as we
> - * know that the condition has succeeded and are awoken.
> - */
> -struct rcuwait {
> -	struct task_struct __rcu *task;
> -};
> +#include <linux/types.h>
>
>  #define __RCUWAIT_INITIALIZER(name)		\
>  	{ .task = NULL, }
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 2d7b9ae8714c..f1356a9a5730 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -248,5 +248,17 @@ typedef void (*swap_func_t)(void *a, void *b, int size);
>  typedef int (*cmp_r_func_t)(const void *a, const void *b, const void *priv);
>  typedef int (*cmp_func_t)(const void *a, const void *b);
>
> +/*
> + * rcuwait provides a way of blocking and waking up a single
> + * task in an rcu-safe manner.
> + *
> + * The only time @task is non-nil is when a user is blocked (or
> + * checking if it needs to) on a condition, and reset as soon as we
> + * know that the condition has succeeded and are awoken.
> + */
> +struct rcuwait {
> +	struct task_struct __rcu *task;
> +};
> +
>  #endif /*  __ASSEMBLY__ */
>  #endif /* _LINUX_TYPES_H */
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 07/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
  2025-01-11  4:25 ` [PATCH v9 07/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
@ 2025-01-13 15:25   ` Lorenzo Stoakes
  2025-01-13 17:53     ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 15:25 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:54PM -0800, Suren Baghdasaryan wrote:
> With upcoming replacement of vm_lock with vm_refcnt, we need to handle a
> possibility of vma_start_read_locked/vma_start_read_locked_nested failing
> due to refcount overflow. Prepare for such possibility by changing these
> APIs and adjusting their users.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> ---
>  include/linux/mm.h |  6 ++++--
>  mm/userfaultfd.c   | 18 +++++++++++++-----
>  2 files changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2f805f1a0176..cbb4e3dbbaed 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -747,10 +747,11 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>   * not be used in such cases because it might fail due to mm_lock_seq overflow.
>   * This functionality is used to obtain vma read lock and drop the mmap read lock.
>   */
> -static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> +static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
>  {
>  	mmap_assert_locked(vma->vm_mm);
>  	down_read_nested(&vma->vm_lock.lock, subclass);
> +	return true;
>  }
>
>  /*
> @@ -759,10 +760,11 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
>   * not be used in such cases because it might fail due to mm_lock_seq overflow.
>   * This functionality is used to obtain vma read lock and drop the mmap read lock.
>   */
> -static inline void vma_start_read_locked(struct vm_area_struct *vma)
> +static inline bool vma_start_read_locked(struct vm_area_struct *vma)
>  {
>  	mmap_assert_locked(vma->vm_mm);
>  	down_read(&vma->vm_lock.lock);
> +	return true;
>  }
>
>  static inline void vma_end_read(struct vm_area_struct *vma)
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 4527c385935b..411a663932c4 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -85,7 +85,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
>  	mmap_read_lock(mm);
>  	vma = find_vma_and_prepare_anon(mm, address);
>  	if (!IS_ERR(vma))
> -		vma_start_read_locked(vma);
> +		if (!vma_start_read_locked(vma))
> +			vma = ERR_PTR(-EAGAIN);

Nit but this kind of reads a bit weirdly now:

	if (!IS_ERR(vma))
		if (!vma_start_read_locked(vma))
			vma = ERR_PTR(-EAGAIN);

Wouldn't this be nicer as:

	if (!IS_ERR(vma) && !vma_start_read_locked(vma))
		vma = ERR_PTR(-EAGAIN);

On the other hand, this embeds an action in an expression, but then it sort of
still looks weird.

	if (!IS_ERR(vma)) {
		bool ok = vma_start_read_locked(vma);

		if (!ok)
			vma = ERR_PTR(-EAGAIN);
	}

This makes me wonder, now yes, we are truly bikeshedding, sorry, but maybe we
could just have vma_start_read_locked return a VMA pointer that could be an
error?

Then this becomes:

	if (!IS_ERR(vma))
		vma = vma_start_read_locked(vma);

>
>  	mmap_read_unlock(mm);
>  	return vma;
> @@ -1483,10 +1484,17 @@ static int uffd_move_lock(struct mm_struct *mm,
>  	mmap_read_lock(mm);
>  	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
>  	if (!err) {
> -		vma_start_read_locked(*dst_vmap);
> -		if (*dst_vmap != *src_vmap)
> -			vma_start_read_locked_nested(*src_vmap,
> -						SINGLE_DEPTH_NESTING);
> +		if (vma_start_read_locked(*dst_vmap)) {
> +			if (*dst_vmap != *src_vmap) {
> +				if (!vma_start_read_locked_nested(*src_vmap,
> +							SINGLE_DEPTH_NESTING)) {
> +					vma_end_read(*dst_vmap);

Hmm, why do we end read if the lock failed here but not above?

> +					err = -EAGAIN;
> +				}
> +			}
> +		} else {
> +			err = -EAGAIN;
> +		}
>  	}

This whole block is really ugly now, this really needs refactoring.

How about (on assumption the vma_end_read() is correct):


	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
	if (err)
		goto out;

	if (!vma_start_read_locked(*dst_vmap)) {
		err = -EAGAIN;
		goto out;
	}

	/* Nothing further to do. */
	if (*dst_vmap == *src_vmap)
		goto out;

	if (!vma_start_read_locked_nested(*src_vmap,
				SINGLE_DEPTH_NESTING)) {
		vma_end_read(*dst_vmap);
		err = -EAGAIN;
	}

out:
	mmap_read_unlock(mm);
	return err;
}

>  	mmap_read_unlock(mm);
>  	return err;
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 08/17] mm: move mmap_init_lock() out of the header file
  2025-01-11  4:25 ` [PATCH v9 08/17] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
@ 2025-01-13 15:27   ` Lorenzo Stoakes
  2025-01-13 17:53     ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 15:27 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:55PM -0800, Suren Baghdasaryan wrote:
> mmap_init_lock() is used only from mm_init() in fork.c, therefore it does
> not have to reside in the header file. This move lets us avoid including
> additional headers in mmap_lock.h later, when mmap_init_lock() needs to
> initialize rcuwait object.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Aside from nit below, LGTM:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/mmap_lock.h | 6 ------
>  kernel/fork.c             | 6 ++++++
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 45a21faa3ff6..4706c6769902 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -122,12 +122,6 @@ static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int
>
>  #endif /* CONFIG_PER_VMA_LOCK */
>
> -static inline void mmap_init_lock(struct mm_struct *mm)
> -{
> -	init_rwsem(&mm->mmap_lock);
> -	mm_lock_seqcount_init(mm);
> -}
> -
>  static inline void mmap_write_lock(struct mm_struct *mm)
>  {
>  	__mmap_lock_trace_start_locking(mm, true);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f2f9e7b427ad..d4c75428ccaf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1219,6 +1219,12 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
>  #endif
>  }
>
> +static inline void mmap_init_lock(struct mm_struct *mm)

we don't need inline here, please drop it.

> +{
> +	init_rwsem(&mm->mmap_lock);
> +	mm_lock_seqcount_init(mm);
> +}
> +
>  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	struct user_namespace *user_ns)
>  {
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 09/17] mm: uninline the main body of vma_start_write()
  2025-01-11  4:25 ` [PATCH v9 09/17] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
@ 2025-01-13 15:52   ` Lorenzo Stoakes
  0 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 15:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:56PM -0800, Suren Baghdasaryan wrote:
> vma_start_write() is used in many places and will grow in size very soon.
> It is not used in performance critical paths and uninlining it should
> limit the future code size growth.
> No functional changes.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

LGTM,

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/mm.h | 12 +++---------
>  mm/memory.c        | 14 ++++++++++++++
>  2 files changed, 17 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cbb4e3dbbaed..3432756d95e6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -787,6 +787,8 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
>  	return (vma->vm_lock_seq == *mm_lock_seq);
>  }
>
> +void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq);
> +
>  /*
>   * Begin writing to a VMA.
>   * Exclude concurrent readers under the per-VMA lock until the currently
> @@ -799,15 +801,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
>  	if (__is_vma_write_locked(vma, &mm_lock_seq))
>  		return;
>
> -	down_write(&vma->vm_lock.lock);
> -	/*
> -	 * We should use WRITE_ONCE() here because we can have concurrent reads
> -	 * from the early lockless pessimistic check in vma_start_read().
> -	 * We don't really care about the correctness of that early check, but
> -	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> -	 */
> -	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> -	up_write(&vma->vm_lock.lock);
> +	__vma_start_write(vma, mm_lock_seq);
>  }
>
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> diff --git a/mm/memory.c b/mm/memory.c
> index d0dee2282325..236fdecd44d6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6328,6 +6328,20 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
>  #endif
>
>  #ifdef CONFIG_PER_VMA_LOCK
> +void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> +{
> +	down_write(&vma->vm_lock.lock);
> +	/*
> +	 * We should use WRITE_ONCE() here because we can have concurrent reads
> +	 * from the early lockless pessimistic check in vma_start_read().
> +	 * We don't really care about the correctness of that early check, but
> +	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> +	 */
> +	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> +	up_write(&vma->vm_lock.lock);
> +}
> +EXPORT_SYMBOL_GPL(__vma_start_write);
> +
>  /*
>   * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
>   * stable and not isolated. If the VMA is not found or is being modified the
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 12/17] mm: move lesser used vma_area_struct members into the last cacheline
  2025-01-11  4:25 ` [PATCH v9 12/17] mm: move lesser used vma_area_struct members into the last cacheline Suren Baghdasaryan
@ 2025-01-13 16:15   ` Lorenzo Stoakes
  2025-01-15 10:50   ` Peter Zijlstra
  1 sibling, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 16:15 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:59PM -0800, Suren Baghdasaryan wrote:
> Move several vm_area_struct members which are rarely or never used
> during page fault handling into the last cacheline to better pack
> vm_area_struct. As a result vm_area_struct will fit into 3 as opposed
> to 4 cachelines. New typical vm_area_struct layout:
>
> struct vm_area_struct {
>     union {
>         struct {
>             long unsigned int vm_start;              /*     0     8 */
>             long unsigned int vm_end;                /*     8     8 */
>         };                                           /*     0    16 */
>         freeptr_t          vm_freeptr;               /*     0     8 */
>     };                                               /*     0    16 */
>     struct mm_struct *         vm_mm;                /*    16     8 */
>     pgprot_t                   vm_page_prot;         /*    24     8 */
>     union {
>         const vm_flags_t   vm_flags;                 /*    32     8 */
>         vm_flags_t         __vm_flags;               /*    32     8 */
>     };                                               /*    32     8 */
>     unsigned int               vm_lock_seq;          /*    40     4 */
>
>     /* XXX 4 bytes hole, try to pack */
>
>     struct list_head           anon_vma_chain;       /*    48    16 */
>     /* --- cacheline 1 boundary (64 bytes) --- */
>     struct anon_vma *          anon_vma;             /*    64     8 */
>     const struct vm_operations_struct  * vm_ops;     /*    72     8 */
>     long unsigned int          vm_pgoff;             /*    80     8 */
>     struct file *              vm_file;              /*    88     8 */
>     void *                     vm_private_data;      /*    96     8 */
>     atomic_long_t              swap_readahead_info;  /*   104     8 */
>     struct mempolicy *         vm_policy;            /*   112     8 */
>     struct vma_numab_state *   numab_state;          /*   120     8 */
>     /* --- cacheline 2 boundary (128 bytes) --- */
>     refcount_t          vm_refcnt (__aligned__(64)); /*   128     4 */
>
>     /* XXX 4 bytes hole, try to pack */
>
>     struct {
>         struct rb_node     rb (__aligned__(8));      /*   136    24 */
>         long unsigned int  rb_subtree_last;          /*   160     8 */
>     } __attribute__((__aligned__(8))) shared;        /*   136    32 */
>     struct anon_vma_name *     anon_name;            /*   168     8 */
>     struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */
>
>     /* size: 192, cachelines: 3, members: 18 */
>     /* sum members: 176, holes: 2, sum holes: 8 */
>     /* padding: 8 */
>     /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
> } __attribute__((__aligned__(64)));
>
> Memory consumption per 1000 VMAs becomes 48 pages:
>
>     slabinfo after vm_area_struct changes:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    192   42    2 : ...
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Looks sensible to me:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/mm_types.h | 38 ++++++++++++++++++--------------------
>  1 file changed, 18 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 9228d19662c6..d902e6730654 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -725,17 +725,6 @@ struct vm_area_struct {
>  	 */
>  	unsigned int vm_lock_seq;
>  #endif
> -
> -	/*
> -	 * For areas with an address space and backing store,
> -	 * linkage into the address_space->i_mmap interval tree.
> -	 *
> -	 */
> -	struct {
> -		struct rb_node rb;
> -		unsigned long rb_subtree_last;
> -	} shared;
> -
>  	/*
>  	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
>  	 * list, after a COW of one of the file pages.	A MAP_SHARED vma
> @@ -755,14 +744,6 @@ struct vm_area_struct {
>  	struct file * vm_file;		/* File we map to (can be NULL). */
>  	void * vm_private_data;		/* was vm_pte (shared mem) */
>
> -#ifdef CONFIG_ANON_VMA_NAME
> -	/*
> -	 * For private and shared anonymous mappings, a pointer to a null
> -	 * terminated string containing the name given to the vma, or NULL if
> -	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
> -	 */
> -	struct anon_vma_name *anon_name;
> -#endif
>  #ifdef CONFIG_SWAP
>  	atomic_long_t swap_readahead_info;
>  #endif
> @@ -775,7 +756,6 @@ struct vm_area_struct {
>  #ifdef CONFIG_NUMA_BALANCING
>  	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
>  #endif
> -	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>  #ifdef CONFIG_PER_VMA_LOCK
>  	/* Unstable RCU readers are allowed to read this. */
>  	refcount_t vm_refcnt ____cacheline_aligned_in_smp;
> @@ -783,6 +763,24 @@ struct vm_area_struct {
>  	struct lockdep_map vmlock_dep_map;
>  #endif
>  #endif
> +	/*
> +	 * For areas with an address space and backing store,
> +	 * linkage into the address_space->i_mmap interval tree.
> +	 *
> +	 */
> +	struct {
> +		struct rb_node rb;
> +		unsigned long rb_subtree_last;
> +	} shared;
> +#ifdef CONFIG_ANON_VMA_NAME
> +	/*
> +	 * For private and shared anonymous mappings, a pointer to a null
> +	 * terminated string containing the name given to the vma, or NULL if
> +	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
> +	 */
> +	struct anon_vma_name *anon_name;
> +#endif
> +	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>  } __randomize_layout;
>
>  #ifdef CONFIG_NUMA
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 13/17] mm/debug: print vm_refcnt state when dumping the vma
  2025-01-11  4:26 ` [PATCH v9 13/17] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
@ 2025-01-13 16:21   ` Lorenzo Stoakes
  2025-01-13 16:35     ` Liam R. Howlett
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 16:21 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:26:00PM -0800, Suren Baghdasaryan wrote:
> vm_refcnt encodes a number of useful states:
> - whether vma is attached or detached
> - the number of current vma readers
> - presence of a vma writer
> Let's include it in the vma dump.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/debug.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/mm/debug.c b/mm/debug.c
> index 8d2acf432385..325d7bf22038 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -178,6 +178,17 @@ EXPORT_SYMBOL(dump_page);
>
>  void dump_vma(const struct vm_area_struct *vma)
>  {
> +#ifdef CONFIG_PER_VMA_LOCK
> +	pr_emerg("vma %px start %px end %px mm %px\n"
> +		"prot %lx anon_vma %px vm_ops %px\n"
> +		"pgoff %lx file %px private_data %px\n"
> +		"flags: %#lx(%pGv) refcnt %x\n",
> +		vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> +		(unsigned long)pgprot_val(vma->vm_page_prot),
> +		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> +		vma->vm_file, vma->vm_private_data,
> +		vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> +#else
>  	pr_emerg("vma %px start %px end %px mm %px\n"
>  		"prot %lx anon_vma %px vm_ops %px\n"
>  		"pgoff %lx file %px private_data %px\n"
> @@ -187,6 +198,7 @@ void dump_vma(const struct vm_area_struct *vma)
>  		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
>  		vma->vm_file, vma->vm_private_data,
>  		vma->vm_flags, &vma->vm_flags);
> +#endif
>  }

This is pretty horribly duplicative and not in line with how this kind of
thing is done in the rest of the file. You're just adding one entry, so why
not:

void dump_vma(const struct vm_area_struct *vma)
{
	pr_emerg("vma %px start %px end %px mm %px\n"
		"prot %lx anon_vma %px vm_ops %px\n"
		"pgoff %lx file %px private_data %px\n"
#ifdef CONFIG_PER_VMA_LOCK
		"refcnt %x\n"
#endif
		"flags: %#lx(%pGv)\n",
		vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
		(unsigned long)pgprot_val(vma->vm_page_prot),
		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
		vma->vm_file, vma->vm_private_data,
#ifdef CONFIG_PER_VMA_LOCK
		refcount_read(&vma->vm_refcnt),
#endif
		vma->vm_flags, &vma->vm_flags);
}

?

>  EXPORT_SYMBOL(dump_vma);
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 14/17] mm: remove extra vma_numab_state_init() call
  2025-01-11  4:26 ` [PATCH v9 14/17] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
@ 2025-01-13 16:28   ` Lorenzo Stoakes
  2025-01-13 17:56     ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 16:28 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:26:01PM -0800, Suren Baghdasaryan wrote:
> vma_init() already memsets the whole vm_area_struct to 0, so there is
> no need for an additional vma_numab_state_init().

Hm strangely random change :) I'm guessing this was a pre-existing thing?

>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

I mean this looks fine, so fair enough, it just feels a bit incongruous with
the series. But regardless:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/mm.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a99b11ee1f66..c8da64b114d1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -948,7 +948,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	vma->vm_mm = mm;
>  	vma->vm_ops = &vma_dummy_vm_ops;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> -	vma_numab_state_init(vma);
>  	vma_lock_init(vma, false);
>  }

This leaves one other caller in vm_area_dup() (I _hate_ that this lives in
the fork code... - might very well look at churning some VMA stuff over
from there to an appropriate place).

While we're here, I mean this thing seems a bit out of scope for the series
but if we're doing it, can we just remove vma_numab_state_init() and
instead edit vm_area_init_from() to #ifdef ... this like the other fields
now?
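
(Roughly something like the below, I'd guess - just a sketch, with "dest"
standing in for whatever the destination vma parameter of
vm_area_init_from() is actually called, and assuming the same
CONFIG_NUMA_BALANCING guard the field itself uses:)

#ifdef CONFIG_NUMA_BALANCING
	/* a freshly initialised vma carries no NUMA balancing state */
	dest->numab_state = NULL;
#endif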

It's not exactly urgent though as this stuff in the fork code is a bit of a
mess anyway...

>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas
  2025-01-13 11:58   ` Lorenzo Stoakes
@ 2025-01-13 16:31     ` Suren Baghdasaryan
  2025-01-13 16:44       ` Lorenzo Stoakes
  2025-01-13 16:47       ` Lorenzo Stoakes
  0 siblings, 2 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 16:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 3:58 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:51PM -0800, Suren Baghdasaryan wrote:
> > vma_iter_store() functions can be used both when adding a new vma and
> > when updating an existing one. However for existing ones we do not need
> > to mark them attached as they are already marked that way. Introduce
> > vma_iter_store_attached() to be used with already attached vmas.
>
> OK I guess the intent of this is to reinstate the previously existing
> asserts, only explicitly checking those places where we attach.

No, the motivation is to prevent re-attaching an already attached vma
or re-detaching an already detached vma for state consistency. I guess
I should amend the description to make that clear.

>
> I'm a little concerned that by doing this, somebody might simply invoke
> this function without realising the implications.

Well, in that case somebody should get an assertion. If
vma_iter_store() is called against already attached vma, we get this
assertion:

vma_iter_store()
  vma_mark_attached()
    vma_assert_detached()

If vma_iter_store_attached() is called against a detached vma, we get this one:

vma_iter_store_attached()
  vma_assert_attached()

Does that address your concern?

>
> Can we have something functional like
>
> vma_iter_store_new() and vma_iter_store_overwrite()

Ok. A bit more churn but should not be too bad.

>
> ?
>
> I don't like us just leaving vma_iter_store() quietly making an assumption
> that a caller doesn't necessarily realise.
>
> Also it's more greppable this way.
>
> I had a look through callers and it does seem you've snagged them all
> correctly.
>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  include/linux/mm.h | 12 ++++++++++++
> >  mm/vma.c           |  8 ++++----
> >  mm/vma.h           | 11 +++++++++--
> >  3 files changed, 25 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2b322871da87..2f805f1a0176 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> >               vma_assert_write_locked(vma);
> >  }
> >
> > +static inline void vma_assert_attached(struct vm_area_struct *vma)
> > +{
> > +     VM_BUG_ON_VMA(vma->detached, vma);
> > +}
> > +
> > +static inline void vma_assert_detached(struct vm_area_struct *vma)
> > +{
> > +     VM_BUG_ON_VMA(!vma->detached, vma);
> > +}
> > +
> >  static inline void vma_mark_attached(struct vm_area_struct *vma)
> >  {
> >       vma->detached = false;
> > @@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> >  static inline void vma_start_write(struct vm_area_struct *vma) {}
> >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >               { mmap_assert_write_locked(vma->vm_mm); }
> > +static inline void vma_assert_attached(struct vm_area_struct *vma) {}
> > +static inline void vma_assert_detached(struct vm_area_struct *vma) {}
> >  static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> >  static inline void vma_mark_detached(struct vm_area_struct *vma) {}
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index d603494e69d7..b9cf552e120c 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
> >       vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
> >
> >       if (expanded)
> > -             vma_iter_store(vmg->vmi, vmg->vma);
> > +             vma_iter_store_attached(vmg->vmi, vmg->vma);
> >
> >       if (adj_start) {
> >               adjust->vm_start += adj_start;
> >               adjust->vm_pgoff += PHYS_PFN(adj_start);
> >               if (adj_start < 0) {
> >                       WARN_ON(expanded);
> > -                     vma_iter_store(vmg->vmi, adjust);
> > +                     vma_iter_store_attached(vmg->vmi, adjust);
> >               }
> >       }
>
> I kind of feel this whole function (that yes, I added :>) though derived
> from existing logic) needs rework, as it's necessarily rather confusing.
>
> But hey, that's on me :)
>
> But this does look right... OK see this as a note-to-self...
>
> >
> > @@ -2845,7 +2845,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> >                               anon_vma_interval_tree_pre_update_vma(vma);
> >                               vma->vm_end = address;
> >                               /* Overwrite old entry in mtree. */
> > -                             vma_iter_store(&vmi, vma);
> > +                             vma_iter_store_attached(&vmi, vma);
> >                               anon_vma_interval_tree_post_update_vma(vma);
> >
> >                               perf_event_mmap(vma);
> > @@ -2925,7 +2925,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> >                               vma->vm_start = address;
> >                               vma->vm_pgoff -= grow;
> >                               /* Overwrite old entry in mtree. */
> > -                             vma_iter_store(&vmi, vma);
> > +                             vma_iter_store_attached(&vmi, vma);
> >                               anon_vma_interval_tree_post_update_vma(vma);
> >
> >                               perf_event_mmap(vma);
> > diff --git a/mm/vma.h b/mm/vma.h
> > index 2a2668de8d2c..63dd38d5230c 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -365,9 +365,10 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
> >  }
> >
> >  /* Store a VMA with preallocated memory */
> > -static inline void vma_iter_store(struct vma_iterator *vmi,
> > -                               struct vm_area_struct *vma)
> > +static inline void vma_iter_store_attached(struct vma_iterator *vmi,
> > +                                        struct vm_area_struct *vma)
> >  {
> > +     vma_assert_attached(vma);
> >
> >  #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> >       if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start &&
> > @@ -390,7 +391,13 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
> >
> >       __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> >       mas_store_prealloc(&vmi->mas, vma);
> > +}
> > +
> > +static inline void vma_iter_store(struct vma_iterator *vmi,
> > +                               struct vm_area_struct *vma)
> > +{
> >       vma_mark_attached(vma);
> > +     vma_iter_store_attached(vmi, vma);
> >  }
> >
>
> See comment at top, and we need some comments here to explain why we're
> going to pains to do this.

Ack. I'll amend the patch description to make that clear.

>
> What about mm/nommu.c? I guess these cases are always new VMAs.

CONFIG_PER_VMA_LOCK depends on !CONFIG_NOMMU, so for nommu case all
these attach/detach functions become NOPs.

>
> We probably definitely need to check this series in a nommu setup, have you
> done this? As I can see this breaking things. Then again I suppose you'd
> have expected bots to moan by now...
>
> >  static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 17/17] docs/mm: document latest changes to vm_lock
  2025-01-11  4:26 ` [PATCH v9 17/17] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
@ 2025-01-13 16:33   ` Lorenzo Stoakes
  2025-01-13 17:56     ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 16:33 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:26:04PM -0800, Suren Baghdasaryan wrote:
> Change the documentation to reflect that vm_lock is integrated into vma
> and replaced with vm_refcnt.
> Document newly introduced vma_start_read_locked{_nested} functions.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>

Apart from small nit, LGTM, thanks for doing this!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  Documentation/mm/process_addrs.rst | 44 ++++++++++++++++++------------
>  1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> index 81417fa2ed20..f573de936b5d 100644
> --- a/Documentation/mm/process_addrs.rst
> +++ b/Documentation/mm/process_addrs.rst
> @@ -716,9 +716,14 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
>  critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
>  before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
>
> -VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> -their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> -via :c:func:`!vma_end_read`.
> +In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
> +and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
> +fail due to lock contention but the caller should still check their return values
> +in case they fail for other reasons.
> +
> +VMA read locks increment :c:member:`!vma.vm_refcnt` reference counter for their
> +duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
> +:c:func:`!vma_end_read`.
>
>  VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
>  VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
> @@ -726,9 +731,9 @@ acquired. An mmap write lock **must** be held for the duration of the VMA write
>  lock, releasing or downgrading the mmap write lock also releases the VMA write
>  lock so there is no :c:func:`!vma_end_write` function.
>
> -Note that a semaphore write lock is not held across a VMA lock. Rather, a
> -sequence number is used for serialisation, and the write semaphore is only
> -acquired at the point of write lock to update this.
> +Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
> +modified so that readers can detect the presence of a writer. The reference counter is
> +restored once the vma sequence number used for serialisation is updated.
>
>  This ensures the semantics we require - VMA write locks provide exclusive write
>  access to the VMA.
> @@ -738,7 +743,7 @@ Implementation details
>
>  The VMA lock mechanism is designed to be a lightweight means of avoiding the use
>  of the heavily contended mmap lock. It is implemented using a combination of a
> -read/write semaphore and sequence numbers belonging to the containing
> +reference counter and sequence numbers belonging to the containing
>  :c:struct:`!struct mm_struct` and the VMA.
>
>  Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> @@ -779,28 +784,31 @@ release of any VMA locks on its release makes sense, as you would never want to
>  keep VMAs locked across entirely separate write operations. It also maintains
>  correct lock ordering.
>
> -Each time a VMA read lock is acquired, we acquire a read lock on the
> -:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
> -the sequence count of the VMA does not match that of the mm.
> +Each time a VMA read lock is acquired, we increment :c:member:`!vma.vm_refcnt`
> +reference counter and check that the sequence count of the VMA does not match
> +that of the mm.
>
> -If it does, the read lock fails. If it does not, we hold the lock, excluding
> -writers, but permitting other readers, who will also obtain this lock under RCU.
> +If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
> +If it does not, we keep the reference counter raised, excluding writers, but
> +permitting other readers, who can also obtain this lock under RCU.
>
>  Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
>  are also RCU safe, so the whole read lock operation is guaranteed to function
>  correctly.
>
> -On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
> -read/write semaphore, before setting the VMA's sequence number under this lock,
> -also simultaneously holding the mmap write lock.
> +On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
> +modified by readers and wait for all readers to drop their reference count.
> +Once there are no readers, VMA's sequence number is set to match that of the

Nit: 'the VMA's sequence number' seems to read better here.

> +mm. During this entire operation mmap write lock is held.
>
>  This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
>  until these are finished and mutual exclusion is achieved.
>
> -After setting the VMA's sequence number, the lock is released, avoiding
> -complexity with a long-term held write lock.
> +After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
> +indicating a writer is cleared. From this point on, VMA's sequence number will
> +indicate VMA's write-locked state until mmap write lock is dropped or downgraded.
>
> -This clever combination of a read/write semaphore and sequence count allows for
> +This clever combination of a reference counter and sequence count allows for
>  fast RCU-based per-VMA lock acquisition (especially on page fault, though
>  utilised elsewhere) with minimal complexity around lock ordering.
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
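
(As a rough mental model of the read/write scheme described above - a
simplified sketch only, not the actual mm implementation; the
writer-exclusion bit in vm_refcnt and the refcount upper limit are omitted
for brevity, and vma_read_trylock_model() is a made-up name:)

static inline bool vma_read_trylock_model(struct vm_area_struct *vma,
					  unsigned int mm_lock_seq)
{
	/* Raise vm_refcnt; this fails if the vma is detached (refcnt == 0). */
	if (!refcount_inc_not_zero(&vma->vm_refcnt))
		return false;
	/* Matching sequence numbers mean the vma is write-locked: back off. */
	if (READ_ONCE(vma->vm_lock_seq) == mm_lock_seq) {
		refcount_dec(&vma->vm_refcnt);
		return false;
	}
	return true;
}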

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 13/17] mm/debug: print vm_refcnt state when dumping the vma
  2025-01-13 16:21   ` Lorenzo Stoakes
@ 2025-01-13 16:35     ` Liam R. Howlett
  2025-01-13 17:57       ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Liam R. Howlett @ 2025-01-13 16:35 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, akpm, peterz, willy, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250113 11:21]:
> On Fri, Jan 10, 2025 at 08:26:00PM -0800, Suren Baghdasaryan wrote:
> > vm_refcnt encodes a number of useful states:
> > - whether vma is attached or detached
> > - the number of current vma readers
> > - presence of a vma writer
> > Let's include it in the vma dump.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  mm/debug.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/mm/debug.c b/mm/debug.c
> > index 8d2acf432385..325d7bf22038 100644
> > --- a/mm/debug.c
> > +++ b/mm/debug.c
> > @@ -178,6 +178,17 @@ EXPORT_SYMBOL(dump_page);
> >
> >  void dump_vma(const struct vm_area_struct *vma)
> >  {
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +	pr_emerg("vma %px start %px end %px mm %px\n"
> > +		"prot %lx anon_vma %px vm_ops %px\n"
> > +		"pgoff %lx file %px private_data %px\n"
> > +		"flags: %#lx(%pGv) refcnt %x\n",
> > +		vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> > +		(unsigned long)pgprot_val(vma->vm_page_prot),
> > +		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> > +		vma->vm_file, vma->vm_private_data,
> > +		vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> > +#else
> >  	pr_emerg("vma %px start %px end %px mm %px\n"
> >  		"prot %lx anon_vma %px vm_ops %px\n"
> >  		"pgoff %lx file %px private_data %px\n"
> > @@ -187,6 +198,7 @@ void dump_vma(const struct vm_area_struct *vma)
> >  		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> >  		vma->vm_file, vma->vm_private_data,
> >  		vma->vm_flags, &vma->vm_flags);
> > +#endif
> >  }
> 
> This is pretty horribly duplicative and not in line with how this kind of
> thing is done in the rest of the file. You're just adding one entry, so why
> not:
> 
> void dump_vma(const struct vm_area_struct *vma)
> {
> 	pr_emerg("vma %px start %px end %px mm %px\n"
> 		"prot %lx anon_vma %px vm_ops %px\n"
> 		"pgoff %lx file %px private_data %px\n"
> #ifdef CONFIG_PER_VMA_LOCK
> 		"refcnt %x\n"
> #endif
> 		"flags: %#lx(%pGv)\n",
> 		vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> 		(unsigned long)pgprot_val(vma->vm_page_prot),
> 		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> 		vma->vm_file, vma->vm_private_data,
> #ifdef CONFIG_PER_VMA_LOCK
> 		refcount_read(&vma->vm_refcnt),
> #endif
> 		vma->vm_flags, &vma->vm_flags);
> }

right, I had an issue with this as well.

Another option would be:

 	pr_emerg("vma %px start %px end %px mm %px\n"
 		"prot %lx anon_vma %px vm_ops %px\n"
 		"pgoff %lx file %px private_data %px\n",
		<big mess here>);
	dump_vma_refcnt();
	pr_emerg("flags:...", vma_vm_flags);


Then dump_vma_refcnt() either dumps the refcnt or does nothing,
depending on the config option.

Either way is good with me.  Lorenzo's suggestion is in line with the
file and it's clear as to why the refcnt might be missing, but I don't
really see this being an issue in practice.
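
(FWIW, one possible shape for that helper - just a sketch, passing the vma
explicitly and reusing the dump_vma_refcnt() name floated above:)

#ifdef CONFIG_PER_VMA_LOCK
static void dump_vma_refcnt(const struct vm_area_struct *vma)
{
	pr_emerg("refcnt %x\n", refcount_read(&vma->vm_refcnt));
}
#else
static void dump_vma_refcnt(const struct vm_area_struct *vma) { }
#endif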

Thanks,
Liam


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas
  2025-01-13 16:31     ` Suren Baghdasaryan
@ 2025-01-13 16:44       ` Lorenzo Stoakes
  2025-01-13 16:47       ` Lorenzo Stoakes
  1 sibling, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 16:44 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 08:31:45AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 3:58 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:25:51PM -0800, Suren Baghdasaryan wrote:
> > > vma_iter_store() functions can be used both when adding a new vma and
> > > when updating an existing one. However for existing ones we do not need
> > > to mark them attached as they are already marked that way. Introduce
> > > vma_iter_store_attached() to be used with already attached vmas.
> >
> > OK I guess the intent of this is to reinstate the previously existing
> > asserts, only explicitly checking those places where we attach.
>
> No, the motivation is to prevent re-attaching an already attached vma
> or re-detaching an already detached vma for state consistency. I guess
> I should amend the description to make that clear.
>
> >
> > I'm a little concerned that by doing this, somebody might simply invoke
> > this function without realising the implications.
>
> Well, in that case somebody should get an assertion. If
> vma_iter_store() is called against already attached vma, we get this
> assertion:
>
> vma_iter_store()
>   vma_mark_attached()
>     vma_assert_detached()
>
> If vma_iter_store_attached() is called against a detached vma, we get this one:
>
> vma_iter_store_attached()
>   vma_assert_attached()
>
> Does that address your concern?

Well the issue is that you might only get that assertion in some code path
that isn't immediately exercised by code the bots run :)

See my comment re: testing to 00/17 (though I absolutely accept testing
this is a giant pain).

But yes we are protected in cases where a user has CONFIG_DEBUG_VM turned
on.

I honestly wonder if we need to be stronger than that though, it's really
serious if we do this wrong isn't it? Maybe it should be a WARN_ON_ONCE()
or something?
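
(Purely to illustrate the thought - a sketch of what a stronger check could
look like, not a concrete ask:)

static inline void vma_assert_attached(struct vm_area_struct *vma)
{
	/* Warn even without CONFIG_DEBUG_VM, but only once per boot. */
	WARN_ON_ONCE(vma->detached);
}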

In any case, a rename means somebody isn't going to do this by mistake.

>
> >
> > Can we have something functional like
> >
> > vma_iter_store_new() and vma_iter_store_overwrite()
>
> Ok. A bit more churn but should not be too bad.

Yeah sorry for churn (though hey - I _am_ the churn king right? ;) - but I
think in this case it's really valuable for understanding.

>
> >
> > ?
> >
> > I don't like us just leaving vma_iter_store() quietly making an assumption
> > that a caller doesn't necessarily realise.
> >
> > Also it's more greppable this way.
> >
> > I had a look through callers and it does seem you've snagged them all
> > correctly.
> >
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  include/linux/mm.h | 12 ++++++++++++
> > >  mm/vma.c           |  8 ++++----
> > >  mm/vma.h           | 11 +++++++++--
> > >  3 files changed, 25 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 2b322871da87..2f805f1a0176 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> > >               vma_assert_write_locked(vma);
> > >  }
> > >
> > > +static inline void vma_assert_attached(struct vm_area_struct *vma)
> > > +{
> > > +     VM_BUG_ON_VMA(vma->detached, vma);
> > > +}
> > > +
> > > +static inline void vma_assert_detached(struct vm_area_struct *vma)
> > > +{
> > > +     VM_BUG_ON_VMA(!vma->detached, vma);
> > > +}
> > > +
> > >  static inline void vma_mark_attached(struct vm_area_struct *vma)
> > >  {
> > >       vma->detached = false;
> > > @@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> > >  static inline void vma_start_write(struct vm_area_struct *vma) {}
> > >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > >               { mmap_assert_write_locked(vma->vm_mm); }
> > > +static inline void vma_assert_attached(struct vm_area_struct *vma) {}
> > > +static inline void vma_assert_detached(struct vm_area_struct *vma) {}
> > >  static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> > >  static inline void vma_mark_detached(struct vm_area_struct *vma) {}
> > >
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index d603494e69d7..b9cf552e120c 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
> > >       vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
> > >
> > >       if (expanded)
> > > -             vma_iter_store(vmg->vmi, vmg->vma);
> > > +             vma_iter_store_attached(vmg->vmi, vmg->vma);
> > >
> > >       if (adj_start) {
> > >               adjust->vm_start += adj_start;
> > >               adjust->vm_pgoff += PHYS_PFN(adj_start);
> > >               if (adj_start < 0) {
> > >                       WARN_ON(expanded);
> > > -                     vma_iter_store(vmg->vmi, adjust);
> > > +                     vma_iter_store_attached(vmg->vmi, adjust);
> > >               }
> > >       }
> >
> > I kind of feel this whole function (that yes, I added :>) though derived
> > from existing logic) needs rework, as it's necessarily rather confusing.
> >
> > But hey, that's on me :)
> >
> > But this does look right... OK see this as a note-to-self...
> >
> > >
> > > @@ -2845,7 +2845,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > >                               anon_vma_interval_tree_pre_update_vma(vma);
> > >                               vma->vm_end = address;
> > >                               /* Overwrite old entry in mtree. */
> > > -                             vma_iter_store(&vmi, vma);
> > > +                             vma_iter_store_attached(&vmi, vma);
> > >                               anon_vma_interval_tree_post_update_vma(vma);
> > >
> > >                               perf_event_mmap(vma);
> > > @@ -2925,7 +2925,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > >                               vma->vm_start = address;
> > >                               vma->vm_pgoff -= grow;
> > >                               /* Overwrite old entry in mtree. */
> > > -                             vma_iter_store(&vmi, vma);
> > > +                             vma_iter_store_attached(&vmi, vma);
> > >                               anon_vma_interval_tree_post_update_vma(vma);
> > >
> > >                               perf_event_mmap(vma);
> > > diff --git a/mm/vma.h b/mm/vma.h
> > > index 2a2668de8d2c..63dd38d5230c 100644
> > > --- a/mm/vma.h
> > > +++ b/mm/vma.h
> > > @@ -365,9 +365,10 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
> > >  }
> > >
> > >  /* Store a VMA with preallocated memory */
> > > -static inline void vma_iter_store(struct vma_iterator *vmi,
> > > -                               struct vm_area_struct *vma)
> > > +static inline void vma_iter_store_attached(struct vma_iterator *vmi,
> > > +                                        struct vm_area_struct *vma)
> > >  {
> > > +     vma_assert_attached(vma);
> > >
> > >  #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > >       if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start &&
> > > @@ -390,7 +391,13 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
> > >
> > >       __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > >       mas_store_prealloc(&vmi->mas, vma);
> > > +}
> > > +
> > > +static inline void vma_iter_store(struct vma_iterator *vmi,
> > > +                               struct vm_area_struct *vma)
> > > +{
> > >       vma_mark_attached(vma);
> > > +     vma_iter_store_attached(vmi, vma);
> > >  }
> > >
> >
> > See comment at top, and we need some comments here to explain why we're
> > going to pains to do this.
>
> Ack. I'll amend the patch description to make that clear.

Thanks!

>
> >
> > What about mm/nommu.c? I guess these cases are always new VMAs.
>
> CONFIG_PER_VMA_LOCK depends on !CONFIG_NOMMU, so for nommu case all
> these attach/detach functions become NOPs.

Ack. OK good, I usually like to pretend nommu doesn't exist, but sometimes
have to ack that it does, now in this case I can go back to not
caring... :>)

>
> >
> > We probably definitely need to check this series in a nommu setup, have you
> > done this? As I can see this breaking things. Then again I suppose you'd
> > have expected bots to moan by now...
> >
> > >  static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> > > --
> > > 2.47.1.613.gc27f4b7a9f-goog
> > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas
  2025-01-13 16:31     ` Suren Baghdasaryan
  2025-01-13 16:44       ` Lorenzo Stoakes
@ 2025-01-13 16:47       ` Lorenzo Stoakes
  2025-01-13 19:09         ` Suren Baghdasaryan
  1 sibling, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 16:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 08:31:45AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 3:58 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:25:51PM -0800, Suren Baghdasaryan wrote:
> > > vma_iter_store() functions can be used both when adding a new vma and
> > > when updating an existing one. However for existing ones we do not need
> > > to mark them attached as they are already marked that way. Introduce
> > > vma_iter_store_attached() to be used with already attached vmas.
> >
> > OK I guess the intent of this is to reinstate the previously existing
> > asserts, only explicitly checking those places where we attach.
>
> No, the motivation is to prevent re-attaching an already attached vma
> or re-detaching an already detached vma for state consistency. I guess
> I should amend the description to make that clear.

Sorry for noise, missed this reply.

What I mean by this is, in a past iteration of this series I reviewed code
where you did this but did _not_ differentiate between cases of new VMAs
vs. existing, which caused an assert in your series which I reported.

So I"m saying - now you _are_ differentiating between the two cases.

It's certainly worth belabouring the point of exactly what it is you are
trying to catch here, however! :) So yes, please do add a little more to
the commit msg, that'd be great, thanks!

>
> >
> > I'm a little concerned that by doing this, somebody might simply invoke
> > this function without realising the implications.
>
> Well, in that case somebody should get an assertion. If
> vma_iter_store() is called against already attached vma, we get this
> assertion:
>
> vma_iter_store()
>   vma_mark_attached()
>     vma_assert_detached()
>
> If vma_iter_store_attached() is called against a detached vma, we get this one:
>
> vma_iter_store_attached()
>   vma_assert_attached()
>
> Does that address your concern?
>
> >
> > Can we have something functional like
> >
> > vma_iter_store_new() and vma_iter_store_overwrite()
>
> Ok. A bit more churn but should not be too bad.
>
> >
> > ?
> >
> > I don't like us just leaving vma_iter_store() quietly making an assumption
> > that a caller doesn't necessarily realise.
> >
> > Also it's more greppable this way.
> >
> > I had a look through callers and it does seem you've snagged them all
> > correctly.
> >
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  include/linux/mm.h | 12 ++++++++++++
> > >  mm/vma.c           |  8 ++++----
> > >  mm/vma.h           | 11 +++++++++--
> > >  3 files changed, 25 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 2b322871da87..2f805f1a0176 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> > >               vma_assert_write_locked(vma);
> > >  }
> > >
> > > +static inline void vma_assert_attached(struct vm_area_struct *vma)
> > > +{
> > > +     VM_BUG_ON_VMA(vma->detached, vma);
> > > +}
> > > +
> > > +static inline void vma_assert_detached(struct vm_area_struct *vma)
> > > +{
> > > +     VM_BUG_ON_VMA(!vma->detached, vma);
> > > +}
> > > +
> > >  static inline void vma_mark_attached(struct vm_area_struct *vma)
> > >  {
> > >       vma->detached = false;
> > > @@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> > >  static inline void vma_start_write(struct vm_area_struct *vma) {}
> > >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > >               { mmap_assert_write_locked(vma->vm_mm); }
> > > +static inline void vma_assert_attached(struct vm_area_struct *vma) {}
> > > +static inline void vma_assert_detached(struct vm_area_struct *vma) {}
> > >  static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> > >  static inline void vma_mark_detached(struct vm_area_struct *vma) {}
> > >
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index d603494e69d7..b9cf552e120c 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
> > >       vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
> > >
> > >       if (expanded)
> > > -             vma_iter_store(vmg->vmi, vmg->vma);
> > > +             vma_iter_store_attached(vmg->vmi, vmg->vma);
> > >
> > >       if (adj_start) {
> > >               adjust->vm_start += adj_start;
> > >               adjust->vm_pgoff += PHYS_PFN(adj_start);
> > >               if (adj_start < 0) {
> > >                       WARN_ON(expanded);
> > > -                     vma_iter_store(vmg->vmi, adjust);
> > > +                     vma_iter_store_attached(vmg->vmi, adjust);
> > >               }
> > >       }
> >
> > I kind of feel this whole function (that yes, I added :>) though derived
> > from existing logic) needs rework, as it's necessarily rather confusing.
> >
> > But hey, that's on me :)
> >
> > But this does look right... OK see this as a note-to-self...
> >
> > >
> > > @@ -2845,7 +2845,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > >                               anon_vma_interval_tree_pre_update_vma(vma);
> > >                               vma->vm_end = address;
> > >                               /* Overwrite old entry in mtree. */
> > > -                             vma_iter_store(&vmi, vma);
> > > +                             vma_iter_store_attached(&vmi, vma);
> > >                               anon_vma_interval_tree_post_update_vma(vma);
> > >
> > >                               perf_event_mmap(vma);
> > > @@ -2925,7 +2925,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > >                               vma->vm_start = address;
> > >                               vma->vm_pgoff -= grow;
> > >                               /* Overwrite old entry in mtree. */
> > > -                             vma_iter_store(&vmi, vma);
> > > +                             vma_iter_store_attached(&vmi, vma);
> > >                               anon_vma_interval_tree_post_update_vma(vma);
> > >
> > >                               perf_event_mmap(vma);
> > > diff --git a/mm/vma.h b/mm/vma.h
> > > index 2a2668de8d2c..63dd38d5230c 100644
> > > --- a/mm/vma.h
> > > +++ b/mm/vma.h
> > > @@ -365,9 +365,10 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
> > >  }
> > >
> > >  /* Store a VMA with preallocated memory */
> > > -static inline void vma_iter_store(struct vma_iterator *vmi,
> > > -                               struct vm_area_struct *vma)
> > > +static inline void vma_iter_store_attached(struct vma_iterator *vmi,
> > > +                                        struct vm_area_struct *vma)
> > >  {
> > > +     vma_assert_attached(vma);
> > >
> > >  #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > >       if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start &&
> > > @@ -390,7 +391,13 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
> > >
> > >       __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > >       mas_store_prealloc(&vmi->mas, vma);
> > > +}
> > > +
> > > +static inline void vma_iter_store(struct vma_iterator *vmi,
> > > +                               struct vm_area_struct *vma)
> > > +{
> > >       vma_mark_attached(vma);
> > > +     vma_iter_store_attached(vmi, vma);
> > >  }
> > >
> >
> > See comment at top, and we need some comments here to explain why we're
> > going to pains to do this.
>
> Ack. I'll amend the patch description to make that clear.
>
> >
> > What about mm/nommu.c? I guess these cases are always new VMAs.
>
> CONFIG_PER_VMA_LOCK depends on !CONFIG_NOMMU, so for nommu case all
> these attach/detach functions become NOPs.
>
> >
> > We probably definitely need to check this series in a nommu setup, have you
> > done this? As I can see this breaking things. Then again I suppose you'd
> > have expected bots to moan by now...
> >
> > >  static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> > > --
> > > 2.47.1.613.gc27f4b7a9f-goog
> > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-13 12:14 ` Lorenzo Stoakes
@ 2025-01-13 16:58   ` Suren Baghdasaryan
  2025-01-13 17:11     ` Lorenzo Stoakes
  2025-01-14  1:49   ` Andrew Morton
  1 sibling, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 16:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 4:14 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> A nit on subject, I mean this is part of what this series does, and hey -
> we have only so much text to put in here - but isn't this both
> reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> using the RCU typesafe mechanism?
>
> Do we have to do both in one series? Can we split this out? I mean maybe
> that's just churny and unnecessary, but to me this series is 'allocate VMAs
> RCU safe and refcount VMA lock' or something like this. Maybe this is
> nitty... but still :)

There is "motivational dependency" because one of the main reasons I'm
converting the vm_lock into vm_refcnt is to make it easier to add
SLAB_TYPESAFE_BY_RCU (see my last reply to Hillf). But technically we
can leave the SLAB_TYPESAFE_BY_RCU out of this series if that makes
things easier. That would be the 2 patches at the end:

mm: prepare lock_vma_under_rcu() for vma reuse possibility
mm: make vma cache SLAB_TYPESAFE_BY_RCU

I made sure that each patch is bisectable, so there should not be a
problem with tracking issues.

>
> One general comment here - this is a really major change in how this stuff
> works, and yet I don't see any tests anywhere in the series.

Hmm. I was diligently updating the tests to reflect the replacement of
vm_lock with vm_refcnt and adding assertions for detach/attach cases.
This actually reminds me that I missed updating vm_area_struct in
vma_internal.h for the member regrouping patch; will add that. I think
the only part that did not affect tests is SLAB_TYPESAFE_BY_RCU but I
was not sure what kind of testing I can add for that. Any suggestions
would be welcomed.

>
> I know it's tricky to write tests for this, but the new VMA testing
> environment should make it possible to test a _lot_ more than we previously
> could.
>
> However due to some (*ahem*) interesting distribution of where functions
> are, most notably stuff in kernel/fork.c, I guess we can't test
> _everything_ there effectively.
>
> But I do feel like we should be able to do better than having absolutely no
> testing added for this?

Again, I'm open to suggestions for SLAB_TYPESAFE_BY_RCU testing but
for the rest I thought the tests were modified accordingly.

>
> I think there's definitely quite a bit you could test now, at least in
> asserting fundamentals in tools/testing/vma/vma.c.
>
> This can cover at least detached state asserts in various scenarios.

Ok, you mean to check that VMA re-attachment/re-detachment would
trigger assertions? I'll look into adding tests for that.

>
> But that won't cover off the really gnarly stuff here around RCU slab
> allocation, and determining precisely how to test that in a sensible way is
> maybe less clear.
>
> But I'd like to see _something_ here please, this is more or less
> fundamentally changing how all VMAs are allocated and to just have nothing
> feels unfortunate.

Again, I'm open to suggestions on what kind of testing I can add for
SLAB_TYPESAFE_BY_RCU change.

>
> I'm already nervous because we've hit issues coming up to v9 and we're not
> 100% sure if a recent syzkaller report is related to these changes or not. I'm
> not sure how much we can get assurances with tests, but I'd like something.

If you are referring to the issue at [1], I think David ran the
syzcaller against mm-stable that does not contain this patchset and
the issue still triggered (see [2]). This of course does not guarantee
that this patchset has no other issues :) I'll try adding tests for
re-attaching, re-detaching and welcome ideas on how to test
SLAB_TYPESAFE_BY_RCU transition.
Thanks,
Suren.

[1] https://lore.kernel.org/all/6758f0cc.050a0220.17f54a.0001.GAE@google.com/
[2] https://lore.kernel.org/all/67823fba.050a0220.216c54.001c.GAE@google.com/

>
> Thanks!
>
> On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > [ cover letter quote trimmed ]

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 05/17] mm: mark vmas detached upon exit
  2025-01-13 12:05   ` Lorenzo Stoakes
@ 2025-01-13 17:02     ` Suren Baghdasaryan
  2025-01-13 17:13       ` Lorenzo Stoakes
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 17:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 4:05 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:52PM -0800, Suren Baghdasaryan wrote:
> > When exit_mmap() removes vmas belonging to an exiting task, it does not
> > mark them as detached since they can't be reached by other tasks and they
> > will be freed shortly. Once we introduce vma reuse, all vmas will have to
> > be in a detached state before they are freed, to ensure that a reused vma is
> > in a consistent state. Add the missing vma_mark_detached() before freeing the
> > vma.
>
> Hmm this really makes me worry that we'll see bugs from this detached
> stuff, do we make this assumption anywhere else I wonder?

This is the only place which does not currently detach the vma before
freeing it. If someone tries adding a case like that in the future,
they will be met with vma_assert_detached() inside vm_area_free().

>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> But regardless, prima facie, this looks fine, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> >  mm/vma.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index b9cf552e120c..93ff42ac2002 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -413,10 +413,12 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> >       if (vma->vm_file)
> >               fput(vma->vm_file);
> >       mpol_put(vma_policy(vma));
> > -     if (unreachable)
> > +     if (unreachable) {
> > +             vma_mark_detached(vma);
> >               __vm_area_free(vma);
> > -     else
> > +     } else {
> >               vm_area_free(vma);
> > +     }
> >  }
> >
> >  /*
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-13 16:58   ` Suren Baghdasaryan
@ 2025-01-13 17:11     ` Lorenzo Stoakes
  2025-01-13 19:00       ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 17:11 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 08:58:37AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 4:14 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > A nit on subject, I mean this is part of what this series does, and hey -
> > we have only so much text to put in here - but isn't this both
> > reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> > using the RCU typesafe mechanism?
> >
> > Do we have to do both in one series? Can we split this out? I mean maybe
> > that's just churny and unnecessary, but to me this series is 'allocate VMAs
> > RCU safe and refcount VMA lock' or something like this. Maybe this is
> > nitty... but still :)
>
> There is "motivational dependency" because one of the main reasons I'm
> converting the vm_lock into vm_refcnt is to make it easier to add
> SLAB_TYPESAFE_BY_RCU (see my last reply to Hillf). But technically we
> can leave the SLAB_TYPESAFE_BY_RCU out of this series if that makes
> things easier. That would be the 2 patches at the end:

Right yeah... maybe it's better to do it in one hit.

>
> mm: prepare lock_vma_under_rcu() for vma reuse possibility
> mm: make vma cache SLAB_TYPESAFE_BY_RCU
>
> I made sure that each patch is bisectable, so there should not be a
> problem with tracking issues.
>
> >
> > One general comment here - this is a really major change in how this stuff
> > works, and yet I don't see any tests anywhere in the series.
>
> Hmm. I was diligently updating the tests to reflect the replacement of
> vm_lock with vm_refcnt and adding assertions for detach/attach cases.
> This actually reminds me that I missed updating vm_area_struct in
> vma_internal.h for the member regrouping patch; will add that. I think
> the only part that did not affect tests is SLAB_TYPESAFE_BY_RCU but I
> was not sure what kind of testing I can add for that. Any suggestions
> would be welcomed.

And to be clear I'm super grateful you did that :) thanks, be good to
change the member regrouping thing also.

But that doesn't change the fact that this series has exactly zero tests
for it. And for something so broad, it feels like a big issue, we really
want to be careful with something so big here.

You've also noticed that I've cleverly failed to _actually_ suggest
SLAB_TYPESAFE_BY_RCU tests, and mea culpa - it's super hard to think of how
to test that.

Liam has experience doing this kind of RCU testing for the maple tree stuff,
but it wasn't pretty or easy, and it would probably require massive rework to
expose this stuff to a viable testing environment - in other words, it is
unworkable.

HOWEVER, I feel like maybe we could try to create scenarios where we might
trigger reuse bugs?

Perhaps some userland code, perhaps even constrained by cgroup, that maps a
ton of stuff and unmaps in a loop in parallel?

Perhaps create scenarios with shared memory where we up refcounts a lot too?
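
Something very rough along these lines is what I have in mind - a
self-contained userland sketch only, with arbitrary thread counts and sizes,
not anything from the series:

#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

#define NR_THREADS	16
#define NR_LOOPS	100000
#define MAP_LEN		(64 * 4096UL)

/* Illustrative stress loop: map, fault in and unmap in parallel. */
static void *stress(void *arg)
{
	(void)arg;

	for (long i = 0; i < NR_LOOPS; i++) {
		char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			continue;
		/* Fault the pages in so the VMA actually gets used. */
		for (unsigned long off = 0; off < MAP_LEN; off += 4096)
			p[off] = 1;
		munmap(p, MAP_LEN);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];

	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, stress, NULL);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	puts("done");
	return 0;
}

Running it under a tight memcg, or alongside fork/exit heavy workloads, might
at least shake out the obvious reuse problems.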

Anyway, this is necessarily nebulous without further investigation; what I
was thinking more concretely is:

Using the VMA userland testing:

1. Assert reference count correctness across locking scenarios and various
   VMA operations.
2. Assert correct detached/not detached state across different scenarios.

This won't quite be complete as not everything is separated out quite
enough to allow things like process tear down/forking etc. to be explicitly
tested but you can unit tests the VMA bits at least.

One note on this, I intend to split the vma.c file into multiple files in
tools/testing/vma/, so if you add tests here it'd probably be worth putting
them into a new file.

I'm happy to help with this if you need any assistance, feel free to ping!

Sorry to put this on you so late in the series, I realise it's annoying,
but things have changed a lot and, with what is in effect two series in one,
these are genuine concerns that I feel we need to at least make some headway
on at this stage.

>
> >
> > I know it's tricky to write tests for this, but the new VMA testing
> > environment should make it possible to test a _lot_ more than we previously
> > could.
> >
> > However due to some (*ahem*) interesting distribution of where functions
> > are, most notably stuff in kernel/fork.c, I guess we can't test
> > _everything_ there effectively.
> >
> > But I do feel like we should be able to do better than having absolutely no
> > testing added for this?
>
> Again, I'm open to suggestions for SLAB_TYPESAFE_BY_RCU testing but
> for the rest I thought the tests were modified accordingly.

See above ^

>
> >
> > I think there's definitely quite a bit you could test now, at least in
> > asserting fundamentals in tools/testing/vma/vma.c.
> >
> > This can cover at least detached state asserts in various scenarios.
>
> Ok, you mean to check that VMA re-attachment/re-detachment would
> trigger assertions? I'll look into adding tests for that.

Yeah this is one, see above :)

>
> >
> > But that won't cover off the really gnarly stuff here around RCU slab
> > allocation, and determining precisely how to test that in a sensible way is
> > maybe less clear.
> >
> > But I'd like to see _something_ here please, this is more or less
> > fundamentally changing how all VMAs are allocated and to just have nothing
> > feels unfortunate.
>
> Again, I'm open to suggestions on what kind of testing I can add for
> SLAB_TYPESAFE_BY_RCU change.

See above

>
> >
> > I'm already nervous because we've hit issues coming up to v9 and we're not
> > 100% sure if a recent syzkaller is related to these changes or not, I'm not
> > sure how much we can get assurances with tests but I'd like something.
>
> If you are referring to the issue at [1], I think David ran the
> syzkaller against mm-stable that does not contain this patchset and
> the issue still triggered (see [2]). This of course does not guarantee
> that this patchset has no other issues :) I'll try adding tests for
> re-attaching, re-detaching and welcome ideas on how to test
> SLAB_TYPESAFE_BY_RCU transition.
> Thanks,
> Suren.

OK that's reassuring!

>
> [1] https://lore.kernel.org/all/6758f0cc.050a0220.17f54a.0001.GAE@google.com/
> [2] https://lore.kernel.org/all/67823fba.050a0220.216c54.001c.GAE@google.com/
>
> >
> > Thanks!
> >
> > On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > > [ cover letter quote trimmed ]

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 05/17] mm: mark vmas detached upon exit
  2025-01-13 17:02     ` Suren Baghdasaryan
@ 2025-01-13 17:13       ` Lorenzo Stoakes
  2025-01-13 19:11         ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-13 17:13 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 09:02:50AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 4:05 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:25:52PM -0800, Suren Baghdasaryan wrote:
> > > When exit_mmap() removes vmas belonging to an exiting task, it does not
> > > mark them as detached since they can't be reached by other tasks and they
> > > will be freed shortly. Once we introduce vma reuse, all vmas will have to
> > > be in a detached state before they are freed, to ensure that a reused vma is
> > > in a consistent state. Add the missing vma_mark_detached() before freeing the
> > > vma.
> >
> > Hmm this really makes me worry that we'll see bugs from this detached
> > stuff, do we make this assumption anywhere else I wonder?
>
> This is the only place which does not currently detach the vma before
> freeing it. If someone tries adding a case like that in the future,
> they will be met with vma_assert_detached() inside vm_area_free().

OK good to know!

Again, I wonder if we should make these assertions stronger as commented
elsewhere, because if we see them in production isn't that worth an actual
non-debug WARN_ON_ONCE()?

>
> >
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > But regardless, prima facie, this looks fine, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > > ---
> > >  mm/vma.c | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index b9cf552e120c..93ff42ac2002 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -413,10 +413,12 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > >       if (vma->vm_file)
> > >               fput(vma->vm_file);
> > >       mpol_put(vma_policy(vma));
> > > -     if (unreachable)
> > > +     if (unreachable) {
> > > +             vma_mark_detached(vma);
> > >               __vm_area_free(vma);
> > > -     else
> > > +     } else {
> > >               vm_area_free(vma);
> > > +     }
> > >  }
> > >
> > >  /*
> > > --
> > > 2.47.1.613.gc27f4b7a9f-goog
> > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 07/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
  2025-01-13 15:25   ` Lorenzo Stoakes
@ 2025-01-13 17:53     ` Suren Baghdasaryan
  2025-01-14 11:48       ` Lorenzo Stoakes
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 17:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 7:25 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:54PM -0800, Suren Baghdasaryan wrote:
> > With upcoming replacement of vm_lock with vm_refcnt, we need to handle a
> > possibility of vma_start_read_locked/vma_start_read_locked_nested failing
> > due to refcount overflow. Prepare for such possibility by changing these
> > APIs and adjusting their users.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > ---
> >  include/linux/mm.h |  6 ++++--
> >  mm/userfaultfd.c   | 18 +++++++++++++-----
> >  2 files changed, 17 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2f805f1a0176..cbb4e3dbbaed 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -747,10 +747,11 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >   * not be used in such cases because it might fail due to mm_lock_seq overflow.
> >   * This functionality is used to obtain vma read lock and drop the mmap read lock.
> >   */
> > -static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> > +static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> >  {
> >       mmap_assert_locked(vma->vm_mm);
> >       down_read_nested(&vma->vm_lock.lock, subclass);
> > +     return true;
> >  }
> >
> >  /*
> > @@ -759,10 +760,11 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
> >   * not be used in such cases because it might fail due to mm_lock_seq overflow.
> >   * This functionality is used to obtain vma read lock and drop the mmap read lock.
> >   */
> > -static inline void vma_start_read_locked(struct vm_area_struct *vma)
> > +static inline bool vma_start_read_locked(struct vm_area_struct *vma)
> >  {
> >       mmap_assert_locked(vma->vm_mm);
> >       down_read(&vma->vm_lock.lock);
> > +     return true;
> >  }
> >
> >  static inline void vma_end_read(struct vm_area_struct *vma)
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 4527c385935b..411a663932c4 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -85,7 +85,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
> >       mmap_read_lock(mm);
> >       vma = find_vma_and_prepare_anon(mm, address);
> >       if (!IS_ERR(vma))
> > -             vma_start_read_locked(vma);
> > +             if (!vma_start_read_locked(vma))
> > +                     vma = ERR_PTR(-EAGAIN);
>
> Nit but this kind of reads a bit weirdly now:
>
>         if (!IS_ERR(vma))
>                 if (!vma_start_read_locked(vma))
>                         vma = ERR_PTR(-EAGAIN);
>
> Wouldn't this be nicer as:
>
>         if (!IS_ERR(vma) && !vma_start_read_locked(vma))
>                 vma = ERR_PTR(-EAGAIN);
>
> On the other hand, this embeds an action in an expression, but then it sort of
> still looks weird.
>
>         if (!IS_ERR(vma)) {
>                 bool ok = vma_start_read_locked(vma);
>
>                 if (!ok)
>                         vma = ERR_PTR(-EAGAIN);
>         }
>
> This makes me wonder, now yes, we are truly bikeshedding, sorry, but maybe we
> could just have vma_start_read_locked return a VMA pointer that could be an
> error?
>
> Then this becomes:
>
>         if (!IS_ERR(vma))
>                 vma = vma_start_read_locked(vma);

No, I think it would be wrong for vma_start_read_locked() to always
return EAGAIN when it can't lock the vma. The error code here is
context-dependent, so while EAGAIN is the right thing here, it might
not work for other future users.

>
> >
> >       mmap_read_unlock(mm);
> >       return vma;
> > @@ -1483,10 +1484,17 @@ static int uffd_move_lock(struct mm_struct *mm,
> >       mmap_read_lock(mm);
> >       err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
> >       if (!err) {
> > -             vma_start_read_locked(*dst_vmap);
> > -             if (*dst_vmap != *src_vmap)
> > -                     vma_start_read_locked_nested(*src_vmap,
> > -                                             SINGLE_DEPTH_NESTING);
> > +             if (vma_start_read_locked(*dst_vmap)) {
> > +                     if (*dst_vmap != *src_vmap) {
> > +                             if (!vma_start_read_locked_nested(*src_vmap,
> > +                                                     SINGLE_DEPTH_NESTING)) {
> > +                                     vma_end_read(*dst_vmap);
>
> Hmm, why do we end read if the lock failed here but not above?

We have successfully done vma_start_read_locked(dst_vmap) (we locked
dest vma) but we failed to do vma_start_read_locked_nested(src_vmap)
(we could not lock src vma). So we should undo the dest vma locking.
Does that clarify the logic?

>
> > +                                     err = -EAGAIN;
> > +                             }
> > +                     }
> > +             } else {
> > +                     err = -EAGAIN;
> > +             }
> >       }
>
> This whole block is really ugly now, this really needs refactoring.
>
> How about (on assumption the vma_end_read() is correct):
>
>
>         err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
>         if (err)
>                 goto out;
>
>         if (!vma_start_read_locked(*dst_vmap)) {
>                 err = -EAGAIN;
>                 goto out;
>         }
>
>         /* Nothing further to do. */
>         if (*dst_vmap == *src_vmap)
>                 goto out;
>
>         if (!vma_start_read_locked_nested(*src_vmap,
>                                 SINGLE_DEPTH_NESTING)) {
>                 vma_end_read(*dst_vmap);
>                 err = -EAGAIN;
>         }
>
> out:
>         mmap_read_unlock(mm);
>         return err;
> }

Ok, that looks good to me. Will change this way.
Thanks!

>
> >       mmap_read_unlock(mm);
> >       return err;
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 08/17] mm: move mmap_init_lock() out of the header file
  2025-01-13 15:27   ` Lorenzo Stoakes
@ 2025-01-13 17:53     ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 17:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 7:27 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:55PM -0800, Suren Baghdasaryan wrote:
> > mmap_init_lock() is used only from mm_init() in fork.c, therefore it does
> > not have to reside in the header file. This move lets us avoid including
> > additional headers in mmap_lock.h later, when mmap_init_lock() needs to
> > initialize rcuwait object.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> Aside from nit below, LGTM:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> >  include/linux/mmap_lock.h | 6 ------
> >  kernel/fork.c             | 6 ++++++
> >  2 files changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index 45a21faa3ff6..4706c6769902 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -122,12 +122,6 @@ static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int
> >
> >  #endif /* CONFIG_PER_VMA_LOCK */
> >
> > -static inline void mmap_init_lock(struct mm_struct *mm)
> > -{
> > -     init_rwsem(&mm->mmap_lock);
> > -     mm_lock_seqcount_init(mm);
> > -}
> > -
> >  static inline void mmap_write_lock(struct mm_struct *mm)
> >  {
> >       __mmap_lock_trace_start_locking(mm, true);
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index f2f9e7b427ad..d4c75428ccaf 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1219,6 +1219,12 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
> >  #endif
> >  }
> >
> > +static inline void mmap_init_lock(struct mm_struct *mm)
>
> we don't need inline here, please drop it.

Ack.

>
> > +{
> > +     init_rwsem(&mm->mmap_lock);
> > +     mm_lock_seqcount_init(mm);
> > +}
> > +
> >  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >       struct user_namespace *user_ns)
> >  {
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 14/17] mm: remove extra vma_numab_state_init() call
  2025-01-13 16:28   ` Lorenzo Stoakes
@ 2025-01-13 17:56     ` Suren Baghdasaryan
  2025-01-14 11:45       ` Lorenzo Stoakes
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 17:56 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 8:28 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:26:01PM -0800, Suren Baghdasaryan wrote:
> > vma_init() already memsets the whole vm_area_struct to 0, so there is
> > no need for an additional vma_numab_state_init().
>
> Hm strangely random change :) I'm guessing this was a pre-existing thing?

Yeah, I stumbled on it while working on an earlier version of this
patchset which involved ctor usage.

>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> I mean this looks fine, so fair enough, it just feels a bit incongruous with
> the series. But regardless:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> >  include/linux/mm.h | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a99b11ee1f66..c8da64b114d1 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -948,7 +948,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       vma->vm_mm = mm;
> >       vma->vm_ops = &vma_dummy_vm_ops;
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> > -     vma_numab_state_init(vma);
> >       vma_lock_init(vma, false);
> >  }
>
> This leaves one other caller in vm_area_dup() (I _hate_ that this lives in
> the fork code... - might very well look at churning some VMA stuff over
> from there to an appropriate place).
>
> While we're here, I mean this thing seems a bit out of scope for the series
> but if we're doing it, can we just remove vma_numab_state_init() and
> instead edit vm_area_init_from() to #ifdef ... this like the other fields
> now?
>
> It's not exactly urgent though as this stuff in the fork code is a bit of a
> mess anyway...

Yeah, let's keep the cleanup out for now. The series is already quite
big. I included this one-line cleanup since it was uncontroversial and
simple.

>
> >
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 17/17] docs/mm: document latest changes to vm_lock
  2025-01-13 16:33   ` Lorenzo Stoakes
@ 2025-01-13 17:56     ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 17:56 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 8:33 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:26:04PM -0800, Suren Baghdasaryan wrote:
> > Change the documentation to reflect that vm_lock is integrated into vma
> > and replaced with vm_refcnt.
> > Document newly introduced vma_start_read_locked{_nested} functions.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
>
> Apart from small nit, LGTM, thanks for doing this!
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> >  Documentation/mm/process_addrs.rst | 44 ++++++++++++++++++------------
> >  1 file changed, 26 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> > index 81417fa2ed20..f573de936b5d 100644
> > --- a/Documentation/mm/process_addrs.rst
> > +++ b/Documentation/mm/process_addrs.rst
> > @@ -716,9 +716,14 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
> >  critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
> >  before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
> >
> > -VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> > -their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> > -via :c:func:`!vma_end_read`.
> > +In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
> > +and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
> > +fail due to lock contention but the caller should still check their return values
> > +in case they fail for other reasons.
> > +
> > +VMA read locks increment :c:member:`!vma.vm_refcnt` reference counter for their
> > +duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
> > +:c:func:`!vma_end_read`.
> >
> >  VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
> >  VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
> > @@ -726,9 +731,9 @@ acquired. An mmap write lock **must** be held for the duration of the VMA write
> >  lock, releasing or downgrading the mmap write lock also releases the VMA write
> >  lock so there is no :c:func:`!vma_end_write` function.
> >
> > -Note that a semaphore write lock is not held across a VMA lock. Rather, a
> > -sequence number is used for serialisation, and the write semaphore is only
> > -acquired at the point of write lock to update this.
> > +Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
> > +modified so that readers can detect the presence of a writer. The reference counter is
> > +restored once the vma sequence number used for serialisation is updated.
> >
> >  This ensures the semantics we require - VMA write locks provide exclusive write
> >  access to the VMA.
> > @@ -738,7 +743,7 @@ Implementation details
> >
> >  The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> >  of the heavily contended mmap lock. It is implemented using a combination of a
> > -read/write semaphore and sequence numbers belonging to the containing
> > +reference counter and sequence numbers belonging to the containing
> >  :c:struct:`!struct mm_struct` and the VMA.
> >
> >  Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> > @@ -779,28 +784,31 @@ release of any VMA locks on its release makes sense, as you would never want to
> >  keep VMAs locked across entirely separate write operations. It also maintains
> >  correct lock ordering.
> >
> > -Each time a VMA read lock is acquired, we acquire a read lock on the
> > -:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
> > -the sequence count of the VMA does not match that of the mm.
> > +Each time a VMA read lock is acquired, we increment :c:member:`!vma.vm_refcnt`
> > +reference counter and check that the sequence count of the VMA does not match
> > +that of the mm.
> >
> > -If it does, the read lock fails. If it does not, we hold the lock, excluding
> > -writers, but permitting other readers, who will also obtain this lock under RCU.
> > +If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
> > +If it does not, we keep the reference counter raised, excluding writers, but
> > +permitting other readers, who can also obtain this lock under RCU.
> >
> >  Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
> >  are also RCU safe, so the whole read lock operation is guaranteed to function
> >  correctly.
> >
> > -On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
> > -read/write semaphore, before setting the VMA's sequence number under this lock,
> > -also simultaneously holding the mmap write lock.
> > +On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
> > +modified by readers and wait for all readers to drop their reference count.
> > +Once there are no readers, VMA's sequence number is set to match that of the
>
> Nit: 'the VMA's sequence number' seems to read better here.

Ack.

>
> > +mm. During this entire operation mmap write lock is held.
> >
> >  This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
> >  until these are finished and mutual exclusion is achieved.
> >
> > -After setting the VMA's sequence number, the lock is released, avoiding
> > -complexity with a long-term held write lock.
> > +After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
> > +indicating a writer is cleared. From this point on, VMA's sequence number will
> > +indicate VMA's write-locked state until mmap write lock is dropped or downgraded.
> >
> > -This clever combination of a read/write semaphore and sequence count allows for
> > +This clever combination of a reference counter and sequence count allows for
> >  fast RCU-based per-VMA lock acquisition (especially on page fault, though
> >  utilised elsewhere) with minimal complexity around lock ordering.
> >
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 13/17] mm/debug: print vm_refcnt state when dumping the vma
  2025-01-13 16:35     ` Liam R. Howlett
@ 2025-01-13 17:57       ` Suren Baghdasaryan
  2025-01-14 11:41         ` Lorenzo Stoakes
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 17:57 UTC (permalink / raw)
  To: Liam R. Howlett, Lorenzo Stoakes, Suren Baghdasaryan, akpm,
	peterz, willy, david.laight.linux, mhocko, vbabka, hannes,
	mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, lokeshgidra, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Mon, Jan 13, 2025 at 8:35 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250113 11:21]:
> > On Fri, Jan 10, 2025 at 08:26:00PM -0800, Suren Baghdasaryan wrote:
> > > vm_refcnt encodes a number of useful states:
> > > - whether vma is attached or detached
> > > - the number of current vma readers
> > > - presence of a vma writer
> > > Let's include it in the vma dump.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  mm/debug.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > >
> > > diff --git a/mm/debug.c b/mm/debug.c
> > > index 8d2acf432385..325d7bf22038 100644
> > > --- a/mm/debug.c
> > > +++ b/mm/debug.c
> > > @@ -178,6 +178,17 @@ EXPORT_SYMBOL(dump_page);
> > >
> > >  void dump_vma(const struct vm_area_struct *vma)
> > >  {
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +   pr_emerg("vma %px start %px end %px mm %px\n"
> > > +           "prot %lx anon_vma %px vm_ops %px\n"
> > > +           "pgoff %lx file %px private_data %px\n"
> > > +           "flags: %#lx(%pGv) refcnt %x\n",
> > > +           vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> > > +           (unsigned long)pgprot_val(vma->vm_page_prot),
> > > +           vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> > > +           vma->vm_file, vma->vm_private_data,
> > > +           vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> > > +#else
> > >     pr_emerg("vma %px start %px end %px mm %px\n"
> > >             "prot %lx anon_vma %px vm_ops %px\n"
> > >             "pgoff %lx file %px private_data %px\n"
> > > @@ -187,6 +198,7 @@ void dump_vma(const struct vm_area_struct *vma)
> > >             vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> > >             vma->vm_file, vma->vm_private_data,
> > >             vma->vm_flags, &vma->vm_flags);
> > > +#endif
> > >  }
> >
> > This is pretty horribly duplicative and not in line with how this kind of
> > thing is done in the rest of the file. You're just adding one entry, so why
> > not:
> >
> > void dump_vma(const struct vm_area_struct *vma)
> > {
> >       pr_emerg("vma %px start %px end %px mm %px\n"
> >               "prot %lx anon_vma %px vm_ops %px\n"
> >               "pgoff %lx file %px private_data %px\n"
> > #ifdef CONFIG_PER_VMA_LOCK
> >               "refcnt %x\n"
> > #endif
> >               "flags: %#lx(%pGv)\n",
> >               vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> >               (unsigned long)pgprot_val(vma->vm_page_prot),
> >               vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> >               vma->vm_file, vma->vm_private_data,
> >               vma->vm_flags,
> > #ifdef CONFIG_PER_VMA_LOCK
> >               refcount_read(&vma->vm_refcnt),
> > #endif
> >               &vma->vm_flags);
> > }
>
> right, I had an issue with this as well.
>
> Another option would be:
>
>         pr_emerg("vma %px start %px end %px mm %px\n"
>                 "prot %lx anon_vma %px vm_ops %px\n"
>                 "pgoff %lx file %px private_data %px\n",
>                 <big mess here>);
>         dump_vma_refcnt();
>         pr_emerg("flags:...", vma_vm_flags);
>
>
> Then dump_vma_refcnt() either dumps the refcnt or does nothing,
> depending on the config option.
>
> Either way is good with me.  Lorenzo's suggestion is in line with the
> file and it's clear as to why the refcnt might be missing, but I don't
> really see this being an issue in practice.

Thanks for clarifying! Lorenzo's suggestion LGTM too. I'll adopt it. Thanks!
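
For reference, that option could look roughly like the sketch below;
dump_vma_refcnt() is just the name suggested above, not an existing helper,
and the final form may differ:

#ifdef CONFIG_PER_VMA_LOCK
/* Sketch of the config-dependent helper discussed above. */
static void dump_vma_refcnt(const struct vm_area_struct *vma)
{
	pr_emerg("refcnt %x\n", refcount_read(&vma->vm_refcnt));
}
#else
static void dump_vma_refcnt(const struct vm_area_struct *vma)
{
	/* No vm_refcnt without CONFIG_PER_VMA_LOCK; print nothing. */
}
#endif
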

>
> Thanks,
> Liam
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-13 17:11     ` Lorenzo Stoakes
@ 2025-01-13 19:00       ` Suren Baghdasaryan
  2025-01-14 11:35         ` Lorenzo Stoakes
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 19:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 9:11 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Jan 13, 2025 at 08:58:37AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 13, 2025 at 4:14 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > A nit on subject, I mean this is part of what this series does, and hey -
> > > we have only so much text to put in here - but isn't this both
> > > reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> > > using the RCU typesafe mechanism?
> > >
> > > Do we have to do both in one series? Can we split this out? I mean maybe
> > > that's just churny and unnecessary, but to me this series is 'allocate VMAs
> > > RCU safe and refcount VMA lock' or something like this. Maybe this is
> > > nitty... but still :)
> >
> > There is "motivational dependency" because one of the main reasons I'm
> > converting the vm_lock into vm_refcnt is to make it easier to add
> > SLAB_TYPESAFE_BY_RCU (see my last reply to Hillf). But technically we
> > can leave the SLAB_TYPESAFE_BY_RCU out of this series if that makes
> > things easier. That would be the 2 patches at the end:
>
> Right yeah... maybe it's better to do it in one hit.
>
> >
> > mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > mm: make vma cache SLAB_TYPESAFE_BY_RCU
> >
> > I made sure that each patch is bisectable, so there should not be a
> > problem with tracking issues.
> >
> > >
> > > One general comment here - this is a really major change in how this stuff
> > > works, and yet I don't see any tests anywhere in the series.
> >
> > Hmm. I was diligently updating the tests to reflect the replacement of
> > vm_lock with vm_refcnt and adding assertions for detach/attach cases.
> > This actually reminds me that I missed updating vm_area_struct in
> > vma_internal.h for the member regrouping patch; will add that. I think
> > the only part that did not affect tests is SLAB_TYPESAFE_BY_RCU but I
> > was not sure what kind of testing I can add for that. Any suggestions
> > would be welcomed.
>
> And to be clear I'm super grateful you did that :) thanks, be good to
> change the member regrouping thing also.
>
> But that doesn't change the fact that this series has exactly zero tests
> for it. And for something so broad, it feels like a big issue, we really
> want to be careful with something so big here.
>
> You've also noticed that I've cleverly failed to _actually_ suggest
> SLAB_TYPESAFE_BY_RCU tests, and mea culpa - it's super hard to think of how
> to test that.
>
> Liam has experience doing this kind of RCU testing for the maple tree stuff,
> but it wasn't pretty or easy, and it would probably require massive rework to
> expose this stuff to a viable testing environment - in other words, it is
> unworkable.
>
> HOWEVER, I feel like maybe we could try to create scenarios where we might
> trigger reuse bugs?
>
> Perhaps some userland code, perhaps even constrained by cgroup, that maps a
> ton of stuff and unmaps in a loop in parallel?
>
> Perhaps create scenarios with shared memory where we up refcounts a lot too?

I have this old spf_test
(https://github.com/surenbaghdasaryan/spf_test/blob/main/spf_test.c)
which I often use to weed out vma locking issues because it starts
multiple threads doing mmap + page faults. Perhaps we can repackage it
into a test/benchmark for testing contention on mmap/vma locks?

>
> Anyway, this is necessarily nebulous without further investigation; what I
> was thinking more concretely is:
>
> Using the VMA userland testing:
>
> 1. Assert reference count correctness across locking scenarios and various
>    VMA operations.
> 2. Assert correct detached/not detached state across different scenarios.
>
> This won't quite be complete as not everything is separated out quite
> enough to allow things like process tear down/forking etc. to be explicitly
> tested but you can unit tests the VMA bits at least.
>
> One note on this, I intend to split the vma.c file into multiple files in
> tools/testing/vma/, so if you add tests here it'd probably be worth putting
> them into a new file.
>
> I'm happy to help with this if you need any assistance, feel free to ping!

As a starting point I was thinking of changing
vma_assert_attached()/vma_assert_detached() and
vma_mark_attached()/vma_mark_detached() to return a bool and use
WARN_ON_ONCE() (to address your concern about asserts being dependent
on CONFIG_DEBUG_VM) like this:

static inline bool vma_assert_detached(struct vm_area_struct *vma)
{
    return !WARN_ON_ONCE(atomic_read(&vma->vm_refcnt));
}

static inline bool vma_mark_attached(struct vm_area_struct *vma)
{
    vma_assert_write_locked(vma);
    if (!vma_assert_detached(vma))
        return false;

    atomic_set(&vma->vm_refcnt, 1);
    return true;
}

With that we can add correctness checks in tools/testing/vma/vma.c for
different states; for example, in alloc_and_link_vma() we can check that
after vma_link() the vma is indeed attached:

ASSERT_TRUE(vma_assert_attached(vma));

This might not cover all states but is probably a good starting point. WDYT?
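
A first test along those lines might look roughly like the sketch below,
assuming the harness's existing helpers (alloc_and_link_vma(), VMA_ITERATOR(),
cleanup_mm()) and the bool-returning asserts above; names and details are
illustrative rather than final:

static bool test_vma_attached_after_link(void)
{
	struct mm_struct mm = {};
	VMA_ITERATOR(vmi, &mm, 0);
	struct vm_area_struct *vma;

	/* alloc_and_link_vma() calls vma_link(), so the vma must be attached. */
	vma = alloc_and_link_vma(&mm, 0x1000, 0x2000, 1, VM_READ | VM_WRITE);
	ASSERT_TRUE(vma_assert_attached(vma));

	cleanup_mm(&mm, &vmi);
	return true;
}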

>
> Sorry to put this on you so late in the series, I realise it's annoying,
> but things have changed a lot and, with what is in effect two series in one,
> these are genuine concerns that I feel we need to at least make some headway
> on at this stage.
>
> >
> > >
> > > I know it's tricky to write tests for this, but the new VMA testing
> > > environment should make it possible to test a _lot_ more than we previously
> > > could.
> > >
> > > However due to some (*ahem*) interesting distribution of where functions
> > > are, most notably stuff in kernel/fork.c, I guess we can't test
> > > _everything_ there effectively.
> > >
> > > But I do feel like we should be able to do better than having absolutely no
> > > testing added for this?
> >
> > Again, I'm open to suggestions for SLAB_TYPESAFE_BY_RCU testing but
> > for the rest I thought the tests were modified accordingly.
>
> See above ^
>
> >
> > >
> > > I think there's definitely quite a bit you could test now, at least in
> > > asserting fundamentals in tools/testing/vma/vma.c.
> > >
> > > This can cover at least detached state asserts in various scenarios.
> >
> > Ok, you mean to check that VMA re-attachment/re-detachment would
> > trigger assertions? I'll look into adding tests for that.
>
> Yeah this is one, see above :)
>
> >
> > >
> > > But that won't cover off the really gnarly stuff here around RCU slab
> > > allocation, and determining precisely how to test that in a sensible way is
> > > maybe less clear.
> > >
> > > But I'd like to see _something_ here please, this is more or less
> > > fundamentally changing how all VMAs are allocated and to just have nothing
> > > feels unfortunate.
> >
> > Again, I'm open to suggestions on what kind of testing I can add for
> > SLAB_TYPESAFE_BY_RCU change.
>
> See above
>
> >
> > >
> > > I'm already nervous because we've hit issues coming up to v9 and we're not
> > > 100% sure if a recent syzkaller is related to these changes or not, I'm not
> > > sure how much we can get assurances with tests but I'd like something.
> >
> > If you are referring to the issue at [1], I think David ran the
> > syzkaller against mm-stable that does not contain this patchset and
> > the issue still triggered (see [2]). This of course does not guarantee
> > that this patchset has no other issues :) I'll try adding tests for
> > re-attaching, re-detaching and welcome ideas on how to test
> > SLAB_TYPESAFE_BY_RCU transition.
> > Thanks,
> > Suren.
>
> OK that's reassuring!
>
> >
> > [1] https://lore.kernel.org/all/6758f0cc.050a0220.17f54a.0001.GAE@google.com/
> > [2] https://lore.kernel.org/all/67823fba.050a0220.216c54.001c.GAE@google.com/
> >
> > >
> > > Thanks!
> > >
> > > On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > > > Back when per-vma locks were introduces, vm_lock was moved out of
> > > > vm_area_struct in [1] because of the performance regression caused by
> > > > false cacheline sharing. Recent investigation [2] revealed that the
> > > > regressions is limited to a rather old Broadwell microarchitecture and
> > > > even there it can be mitigated by disabling adjacent cacheline
> > > > prefetching, see [3].
> > > > Splitting single logical structure into multiple ones leads to more
> > > > complicated management, extra pointer dereferences and overall less
> > > > maintainable code. When that split-away part is a lock, it complicates
> > > > things even further. With no performance benefits, there are no reasons
> > > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > > This patchset:
> > > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > > boundary and changing the cache to be cacheline-aligned to minimize
> > > > cacheline sharing;
> > > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > > it is inserted into vma tree;
> > > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > > 4. regroups vm_area_struct members to fit them into 3 cachelines;
> > > > 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > > reuse and to minimize call_rcu() calls.
> > > >
> > > > Pagefault microbenchmarks show performance improvement:
> > > > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > > > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > > > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > > > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > > > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > > > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > > > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > > > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > > > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > > > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > > > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > > > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > > > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > > > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > > > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > > > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
> > > >
> > > > Changes since v8 [4]:
> > > > - Change subject for the cover letter, per Vlastimil Babka
> > > > - Added Reviewed-by and Acked-by, per Vlastimil Babka
> > > > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > > > per David Laight
> > > > - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> > > > per Hillf Danton and Vlastimil Babka
> > > > - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> > > > under us, per Vlastimil Babka
> > > > - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> > > > per Vlastimil Babka
> > > > - Changed __vma_enter_locked() parameter to centralize refcount logic,
> > > > per Vlastimil Babka
> > > > - Amended description in vm_lock replacement patch explaining the effects
> > > > of the patch on vm_area_struct size, per Vlastimil Babka
> > > > - Added vm_area_struct member regrouping patch [5] into the series
> > > > - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> > > > - Added a comment for vm_area_struct to update vm_area_init_from() when
> > > > adding new members, per Vlastimil Babka
> > > > - Updated a comment about unstable src->shared.rb when copying a vma in
> > > > vm_area_init_from(), per Vlastimil Babka
> > > >
> > > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > > > [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> > > > [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> > > >
> > > > Patchset applies over mm-unstable after reverting v8
> > > > (current SHA range: 235b5129cb7b - 9e6b24c58985)
> > > >
> > > > Suren Baghdasaryan (17):
> > > >   mm: introduce vma_start_read_locked{_nested} helpers
> > > >   mm: move per-vma lock into vm_area_struct
> > > >   mm: mark vma as detached until it's added into vma tree
> > > >   mm: introduce vma_iter_store_attached() to use with attached vmas
> > > >   mm: mark vmas detached upon exit
> > > >   types: move struct rcuwait into types.h
> > > >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> > > >   mm: move mmap_init_lock() out of the header file
> > > >   mm: uninline the main body of vma_start_write()
> > > >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> > > >   mm: replace vm_lock and detached flag with a reference count
> > > >   mm: move lesser used vma_area_struct members into the last cacheline
> > > >   mm/debug: print vm_refcnt state when dumping the vma
> > > >   mm: remove extra vma_numab_state_init() call
> > > >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > > >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > > >   docs/mm: document latest changes to vm_lock
> > > >
> > > >  Documentation/mm/process_addrs.rst |  44 ++++----
> > > >  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
> > > >  include/linux/mm_types.h           |  75 +++++++-------
> > > >  include/linux/mmap_lock.h          |   6 --
> > > >  include/linux/rcuwait.h            |  13 +--
> > > >  include/linux/refcount.h           |  24 ++++-
> > > >  include/linux/slab.h               |   6 --
> > > >  include/linux/types.h              |  12 +++
> > > >  kernel/fork.c                      | 129 +++++++++++-------------
> > > >  mm/debug.c                         |  12 +++
> > > >  mm/init-mm.c                       |   1 +
> > > >  mm/memory.c                        |  97 ++++++++++++++++--
> > > >  mm/mmap.c                          |   3 +-
> > > >  mm/userfaultfd.c                   |  32 +++---
> > > >  mm/vma.c                           |  23 ++---
> > > >  mm/vma.h                           |  15 ++-
> > > >  tools/testing/vma/linux/atomic.h   |   5 +
> > > >  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
> > > >  18 files changed, 465 insertions(+), 281 deletions(-)
> > > >
> > > > --
> > > > 2.47.1.613.gc27f4b7a9f-goog
> > > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas
  2025-01-13 16:47       ` Lorenzo Stoakes
@ 2025-01-13 19:09         ` Suren Baghdasaryan
  2025-01-14 11:38           ` Lorenzo Stoakes
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 19:09 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 8:48 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Jan 13, 2025 at 08:31:45AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 13, 2025 at 3:58 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Fri, Jan 10, 2025 at 08:25:51PM -0800, Suren Baghdasaryan wrote:
> > > > vma_iter_store() functions can be used both when adding a new vma and
> > > > when updating an existing one. However for existing ones we do not need
> > > > to mark them attached as they are already marked that way. Introduce
> > > > vma_iter_store_attached() to be used with already attached vmas.
> > >
> > > OK I guess the intent of this is to reinstate the previously existing
> > > asserts, only explicitly checking those places where we attach.
> >
> > No, the motivation is to prevent re-attaching an already attached vma
> > or re-detaching an already detached vma for state consistency. I guess
> > I should amend the description to make that clear.
>
> Sorry for noise, missed this reply.
>
> What I mean by this is, in a past iteration of this series I reviewed code
> where you did this but did _not_ differentiate between cases of new VMAs
> vs. existing, which caused an assert in your series which I reported.
>
> So I"m saying - now you _are_ differentiating between the two cases.
>
> It's certainly worth belabouring the point of exactly what it is you are
> trying to catch here, however! :) So yes please do add a little more to
> commit msg that'd be great, thanks!

Sure. How about:

With vma->detached being a separate flag, double-marking a vma as
attached or detached is not an issue because the flag will simply be
overwritten with the same value. However, once we fold this flag into
the refcount later in this series, re-attaching or re-detaching a vma
becomes an issue since these operations will be incrementing or
decrementing a refcount. Fix the places where we currently re-attach a
vma during vma update and add assertions in
vma_mark_attached()/vma_mark_detached() to catch invalid usage.

>
> >
> > >
> > > I'm a little concerned that by doing this, somebody might simply invoke
> > > this function without realising the implications.
> >
> > Well, in that case somebody should get an assertion. If
> > vma_iter_store() is called against an already attached vma, we get this
> > assertion:
> >
> > vma_iter_store()
> >   vma_mark_attached()
> >     vma_assert_detached()
> >
> > If vma_iter_store_attached() is called against a detached vma, we get this one:
> >
> > vma_iter_store_attached()
> >   vma_assert_attached()
> >
> > Does that address your concern?
> >
> > >
> > > Can we have something functional like
> > >
> > > vma_iter_store_new() and vma_iter_store_overwrite()
> >
> > Ok. A bit more churn but should not be too bad.
> >
> > >
> > > ?
> > >
> > > I don't like us just leaving vma_iter_store() quietly making an assumption
> > > that a caller doesn't necessarily realise.
> > >
> > > Also it's more greppable this way.
> > >
> > > I had a look through callers and it does seem you've snagged them all
> > > correctly.
> > >
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > > ---
> > > >  include/linux/mm.h | 12 ++++++++++++
> > > >  mm/vma.c           |  8 ++++----
> > > >  mm/vma.h           | 11 +++++++++--
> > > >  3 files changed, 25 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 2b322871da87..2f805f1a0176 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> > > >               vma_assert_write_locked(vma);
> > > >  }
> > > >
> > > > +static inline void vma_assert_attached(struct vm_area_struct *vma)
> > > > +{
> > > > +     VM_BUG_ON_VMA(vma->detached, vma);
> > > > +}
> > > > +
> > > > +static inline void vma_assert_detached(struct vm_area_struct *vma)
> > > > +{
> > > > +     VM_BUG_ON_VMA(!vma->detached, vma);
> > > > +}
> > > > +
> > > >  static inline void vma_mark_attached(struct vm_area_struct *vma)
> > > >  {
> > > >       vma->detached = false;
> > > > @@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> > > >  static inline void vma_start_write(struct vm_area_struct *vma) {}
> > > >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > >               { mmap_assert_write_locked(vma->vm_mm); }
> > > > +static inline void vma_assert_attached(struct vm_area_struct *vma) {}
> > > > +static inline void vma_assert_detached(struct vm_area_struct *vma) {}
> > > >  static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> > > >  static inline void vma_mark_detached(struct vm_area_struct *vma) {}
> > > >
> > > > diff --git a/mm/vma.c b/mm/vma.c
> > > > index d603494e69d7..b9cf552e120c 100644
> > > > --- a/mm/vma.c
> > > > +++ b/mm/vma.c
> > > > @@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
> > > >       vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
> > > >
> > > >       if (expanded)
> > > > -             vma_iter_store(vmg->vmi, vmg->vma);
> > > > +             vma_iter_store_attached(vmg->vmi, vmg->vma);
> > > >
> > > >       if (adj_start) {
> > > >               adjust->vm_start += adj_start;
> > > >               adjust->vm_pgoff += PHYS_PFN(adj_start);
> > > >               if (adj_start < 0) {
> > > >                       WARN_ON(expanded);
> > > > -                     vma_iter_store(vmg->vmi, adjust);
> > > > +                     vma_iter_store_attached(vmg->vmi, adjust);
> > > >               }
> > > >       }
> > >
> > > I kind of feel this whole function (that yes, I added :>) though derived
> > > from existing logic) needs rework, as it's necessarily rather confusing.
> > >
> > > But hey, that's on me :)
> > >
> > > But this does look right... OK see this as a note-to-self...
> > >
> > > >
> > > > @@ -2845,7 +2845,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > > >                               anon_vma_interval_tree_pre_update_vma(vma);
> > > >                               vma->vm_end = address;
> > > >                               /* Overwrite old entry in mtree. */
> > > > -                             vma_iter_store(&vmi, vma);
> > > > +                             vma_iter_store_attached(&vmi, vma);
> > > >                               anon_vma_interval_tree_post_update_vma(vma);
> > > >
> > > >                               perf_event_mmap(vma);
> > > > @@ -2925,7 +2925,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > > >                               vma->vm_start = address;
> > > >                               vma->vm_pgoff -= grow;
> > > >                               /* Overwrite old entry in mtree. */
> > > > -                             vma_iter_store(&vmi, vma);
> > > > +                             vma_iter_store_attached(&vmi, vma);
> > > >                               anon_vma_interval_tree_post_update_vma(vma);
> > > >
> > > >                               perf_event_mmap(vma);
> > > > diff --git a/mm/vma.h b/mm/vma.h
> > > > index 2a2668de8d2c..63dd38d5230c 100644
> > > > --- a/mm/vma.h
> > > > +++ b/mm/vma.h
> > > > @@ -365,9 +365,10 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
> > > >  }
> > > >
> > > >  /* Store a VMA with preallocated memory */
> > > > -static inline void vma_iter_store(struct vma_iterator *vmi,
> > > > -                               struct vm_area_struct *vma)
> > > > +static inline void vma_iter_store_attached(struct vma_iterator *vmi,
> > > > +                                        struct vm_area_struct *vma)
> > > >  {
> > > > +     vma_assert_attached(vma);
> > > >
> > > >  #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > > >       if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start &&
> > > > @@ -390,7 +391,13 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
> > > >
> > > >       __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > > >       mas_store_prealloc(&vmi->mas, vma);
> > > > +}
> > > > +
> > > > +static inline void vma_iter_store(struct vma_iterator *vmi,
> > > > +                               struct vm_area_struct *vma)
> > > > +{
> > > >       vma_mark_attached(vma);
> > > > +     vma_iter_store_attached(vmi, vma);
> > > >  }
> > > >
> > >
> > > See comment at top, and we need some comments here to explain why we're
> > > going to pains to do this.
> >
> > Ack. I'll amend the patch description to make that clear.
> >
> > >
> > > What about mm/nommu.c? I guess these cases are always new VMAs.
> >
> > CONFIG_PER_VMA_LOCK depends on !CONFIG_NOMMU, so for nommu case all
> > these attach/detach functions become NOPs.
> >
> > >
> > > We probably definitely need to check this series in a nommu setup, have you
> > > done this? As I can see this breaking things. Then again I suppose you'd
> > > have expected bots to moan by now...
> > >
> > > >  static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> > > > --
> > > > 2.47.1.613.gc27f4b7a9f-goog
> > > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 05/17] mm: mark vmas detached upon exit
  2025-01-13 17:13       ` Lorenzo Stoakes
@ 2025-01-13 19:11         ` Suren Baghdasaryan
  2025-01-13 20:32           ` Vlastimil Babka
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 19:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 9:13 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Jan 13, 2025 at 09:02:50AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 13, 2025 at 4:05 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Fri, Jan 10, 2025 at 08:25:52PM -0800, Suren Baghdasaryan wrote:
> > > > When exit_mmap() removes vmas belonging to an exiting task, it does not
> > > > mark them as detached since they can't be reached by other tasks and they
> > > > will be freed shortly. Once we introduce vma reuse, all vmas will have to
> > > > be in detached state before they are freed to ensure vma when reused is
> > > > in a consistent state. Add missing vma_mark_detached() before freeing the
> > > > vma.
> > >
> > > Hmm this really makes me worry that we'll see bugs from this detached
> > > stuff, do we make this assumption anywhere else I wonder?
> >
> > This is the only place which does not currently detach the vma before
> > freeing it. If someone tries adding a case like that in the future,
> > they will be met with vma_assert_detached() inside vm_area_free().
>
> OK good to know!
>
> Again, I wonder if we should make these assertions stronger as commented
> elsewhere, because if we see them in production isn't that worth an actual
> non-debug WARN_ON_ONCE()?

Sure. I'll change vma_assert_attached()/vma_assert_detached() to use
WARN_ON_ONCE() and to return a bool (see also my reply in the patch
[0/17]).

>
> >
> > >
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > >
> > > But regardless, prima facie, this looks fine, so:
> > >
> > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > >
> > > > ---
> > > >  mm/vma.c | 6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/vma.c b/mm/vma.c
> > > > index b9cf552e120c..93ff42ac2002 100644
> > > > --- a/mm/vma.c
> > > > +++ b/mm/vma.c
> > > > @@ -413,10 +413,12 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > > >       if (vma->vm_file)
> > > >               fput(vma->vm_file);
> > > >       mpol_put(vma_policy(vma));
> > > > -     if (unreachable)
> > > > +     if (unreachable) {
> > > > +             vma_mark_detached(vma);
> > > >               __vm_area_free(vma);
> > > > -     else
> > > > +     } else {
> > > >               vm_area_free(vma);
> > > > +     }
> > > >  }
> > > >
> > > >  /*
> > > > --
> > > > 2.47.1.613.gc27f4b7a9f-goog
> > > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 05/17] mm: mark vmas detached upon exit
  2025-01-13 19:11         ` Suren Baghdasaryan
@ 2025-01-13 20:32           ` Vlastimil Babka
  2025-01-13 20:42             ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Vlastimil Babka @ 2025-01-13 20:32 UTC (permalink / raw)
  To: Suren Baghdasaryan, Lorenzo Stoakes
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On 1/13/25 20:11, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 9:13 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> On Mon, Jan 13, 2025 at 09:02:50AM -0800, Suren Baghdasaryan wrote:
>> > On Mon, Jan 13, 2025 at 4:05 AM Lorenzo Stoakes
>> > <lorenzo.stoakes@oracle.com> wrote:
>> > >
>> > > On Fri, Jan 10, 2025 at 08:25:52PM -0800, Suren Baghdasaryan wrote:
>> > > > When exit_mmap() removes vmas belonging to an exiting task, it does not
>> > > > mark them as detached since they can't be reached by other tasks and they
>> > > > will be freed shortly. Once we introduce vma reuse, all vmas will have to
>> > > > be in detached state before they are freed to ensure vma when reused is
>> > > > in a consistent state. Add missing vma_mark_detached() before freeing the
>> > > > vma.
>> > >
>> > > Hmm this really makes me worry that we'll see bugs from this detached
>> > > stuff, do we make this assumption anywhere else I wonder?
>> >
>> > This is the only place which does not currently detach the vma before
>> > freeing it. If someone tries adding a case like that in the future,
>> > they will be met with vma_assert_detached() inside vm_area_free().
>>
>> OK good to know!
>>
>> Again, I wonder if we should make these assertions stronger as commented
>> elsewhere, because if we see them in production isn't that worth an actual
>> non-debug WARN_ON_ONCE()?
> 
> Sure. I'll change vma_assert_attached()/vma_assert_detached() to use
> WARN_ON_ONCE() and to return a bool (see also my reply in the patch
> [0/17]).

So is this a case of "someone might introduce code later that will violate
them" as alluded to above? Unconditional WARN_ON_ONCE seems too much then.

In general it's not easy to determine how paranoid we should be in non-debug
code, but I'm not sure what's the need here specifically.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 05/17] mm: mark vmas detached upon exit
  2025-01-13 20:32           ` Vlastimil Babka
@ 2025-01-13 20:42             ` Suren Baghdasaryan
  2025-01-14 11:36               ` Lorenzo Stoakes
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 20:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Lorenzo Stoakes, akpm, peterz, willy, liam.howlett,
	david.laight.linux, mhocko, hannes, mjguzik, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 12:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/13/25 20:11, Suren Baghdasaryan wrote:
> > On Mon, Jan 13, 2025 at 9:13 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> >>
> >> On Mon, Jan 13, 2025 at 09:02:50AM -0800, Suren Baghdasaryan wrote:
> >> > On Mon, Jan 13, 2025 at 4:05 AM Lorenzo Stoakes
> >> > <lorenzo.stoakes@oracle.com> wrote:
> >> > >
> >> > > On Fri, Jan 10, 2025 at 08:25:52PM -0800, Suren Baghdasaryan wrote:
> >> > > > When exit_mmap() removes vmas belonging to an exiting task, it does not
> >> > > > mark them as detached since they can't be reached by other tasks and they
> >> > > > will be freed shortly. Once we introduce vma reuse, all vmas will have to
> >> > > > be in detached state before they are freed to ensure vma when reused is
> >> > > > in a consistent state. Add missing vma_mark_detached() before freeing the
> >> > > > vma.
> >> > >
> >> > > Hmm this really makes me worry that we'll see bugs from this detached
> >> > > stuff, do we make this assumption anywhere else I wonder?
> >> >
> >> > This is the only place which does not currently detach the vma before
> >> > freeing it. If someone tries adding a case like that in the future,
> >> > they will be met with vma_assert_detached() inside vm_area_free().
> >>
> >> OK good to know!
> >>
> >> Again, I wonder if we should make these assertions stronger as commented
> >> elsewhere, because if we see them in production isn't that worth an actual
> >> non-debug WARN_ON_ONCE()?
> >
> > Sure. I'll change vma_assert_attached()/vma_assert_detached() to use
> > WARN_ON_ONCE() and to return a bool (see also my reply in the patch
> > [0/17]).
>
> So is this a case of "someone might introduce code later that will violate
> them" as alluded to above? Unconditional WARN_ON_ONCE seems too much then.

Yes, I wanted to make sure refcounting will not be broken by someone
doing re-attach/re-detach.

>
> In general it's not easy to determine how paranoid we should be in non-debug
> code, but I'm not sure what's the need here specifically.

I'm not sure how strict we should be but we definitely should try to
catch refcounting mistakes and that's my goal here.

>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-13  1:47       ` Wei Yang
  2025-01-13  2:25         ` Wei Yang
@ 2025-01-13 21:08         ` Suren Baghdasaryan
  1 sibling, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 21:08 UTC (permalink / raw)
  To: Wei Yang
  Cc: Mateusz Guzik, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Sun, Jan 12, 2025 at 5:47 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> >On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >>
> >> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >>
> >> So there were quite a few iterations of the patch and I have not been
> >> reading majority of the feedback, so it may be I missed something,
> >> apologies upfront. :)
> >>
>
> Hi, I am new to memory barriers. Hope I'm not bothering you.
>
> >> >  /*
> >> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> >> >   * locked result to avoid performance overhead, in which case we fall back to
> >> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> >> >   */
> >> >  static inline bool vma_start_read(struct vm_area_struct *vma)
> >> >  {
> >> > +     int oldcnt;
> >> > +
> >> >       /*
> >> >        * Check before locking. A race might cause false locked result.
> >> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> >> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> >> >               return false;
> >> >
> >> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> >> > +     /*
> >> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> >> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> >> > +      */
> >> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> >> > +                                                   VMA_REF_LIMIT)))
> >> >               return false;
> >> >
> >>
> >> Replacing down_read_trylock() with the new routine loses an acquire
> >> fence. That alone is not a problem, but see below.
> >
> >Hmm. I think this acquire fence is actually necessary. We don't want
> >the later vm_lock_seq check to be reordered and happen before we take
> >the refcount. Otherwise this might happen:
> >
> >reader             writer
> >if (vm_lock_seq == mm_lock_seq) // check got reordered
> >        return false;
> >                       vm_refcnt += VMA_LOCK_OFFSET
> >                       vm_lock_seq = mm_lock_seq
> >                       vm_refcnt -= VMA_LOCK_OFFSET
> >if (!__refcount_inc_not_zero_limited())
> >        return false;
> >
> >Both reader's checks will pass and the reader would read-lock a vma
> >that was write-locked.
> >
>
> Here what we plan to do is define __refcount_inc_not_zero_limited() with
> acquire fence, e.g. with atomic_try_cmpxchg_acquire(), right?

Correct. __refcount_inc_not_zero_limited() does not do that in this
version but I'll fix that.

>
> >>
> >> > +     rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> >> >       /*
> >> > -      * Overflow might produce false locked result.
> >> > +      * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> >> >        * False unlocked result is impossible because we modify and check
> >> > -      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >> > +      * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> >> >        * modification invalidates all existing locks.
> >> >        *
> >> >        * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> >> > @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >> >        * This pairs with RELEASE semantics in vma_end_write_all().
> >> >        */
> >> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
>
> One question here is whether the compiler would optimize away the read of
> vm_lock_seq here, since we have already read it at the beginning?
>
> Or, with the acquire fence added above, the compiler won't optimize it.

Correct. See "ACQUIRE operations" section in
https://www.kernel.org/doc/Documentation/memory-barriers.txt,
specifically this: "It guarantees that all memory operations after the
ACQUIRE operation will appear to happen after the ACQUIRE operation
with respect to the other components of the system.".

> Or should we use READ_ONCE(vma->vm_lock_seq) here?
>
> >>
> >> The previous modification of this spot to raw_read_seqcount loses the
> >> acquire fence, making the above comment not line up with the code.
> >
> >Is it? From reading the seqcount code
> >(https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
> >
> >raw_read_seqcount()
> >    seqprop_sequence()
> >        __seqprop(s, sequence)
> >            __seqprop_sequence()
> >                smp_load_acquire()
> >
> >smp_load_acquire() still provides the acquire fence. Am I missing something?
> >
> >>
> >> I don't know if the stock code (with down_read_trylock()) is correct as
> >> is -- looks fine for cursory reading fwiw. However, if it indeed works,
> >> the acquire fence stemming from the lock routine is a mandatory part of
> >> it afaics.
> >>
> >> I think the best way forward is to add a new refcount routine which
> >> ships with an acquire fence.
> >
> >I plan on replacing refcount_t usage here with an atomic since, as
> >Hillf noted, refcount is not designed to be used for locking. And will
> >make sure the down_read_trylock() replacement will provide an acquire
> >fence.
> >
>
> Hmm.. refcount_t is defined in terms of atomic_t. I am lost as to why replacing refcount_t
> with atomic_t would help.

My point is that refcount_t is not designed for locking, so changing
refcount-related functions and adding fences there would be wrong. So,
I'll move to using more generic atomic_t and will implement the
functionality I need without affecting refcounting functions.
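
As a rough sketch of the direction (the helper name and the limit
handling below are just placeholders, not the final API), an
atomic_t-based try-increment that provides acquire ordering on success
could look like:

static inline bool vma_refcnt_inc_not_zero_acquire(atomic_t *refcnt,
						   int *oldp, int limit)
{
	int old = atomic_read(refcnt);

	do {
		if (!old || old > limit)
			return false;
		/*
		 * Acquire ordering on success keeps the later
		 * vm_lock_seq read from being reordered before the
		 * refcount update and pairs with the release in the
		 * writer path.
		 */
	} while (!atomic_try_cmpxchg_acquire(refcnt, &old, old + 1));

	if (oldp)
		*oldp = old;
	return true;
}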

>
> >>
> >> Otherwise I would suggest:
> >> 1. a comment above __refcount_inc_not_zero_limited saying there is an
> >>    acq fence issued later
> >> 2. smp_rmb() slapped between that and seq accesses
> >>
> >> If the now removed fence is somehow not needed, I think a comment
> >> explaining it is necessary.
> >>
> >> > @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >> >
> >> >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> >> >  {
> >> > -     if (!rwsem_is_locked(&vma->vm_lock.lock))
> >> > +     if (refcount_read(&vma->vm_refcnt) <= 1)
> >> >               vma_assert_write_locked(vma);
> >> >  }
> >> >
> >>
> >> This now forces the compiler to emit a load from vm_refcnt even if
> >> vma_assert_write_locked expands to nothing. iow this wants to hide
> >> behind the same stuff as vma_assert_write_locked.
> >
> >True. I guess I'll have to avoid using vma_assert_write_locked() like this:
> >
> >static inline void vma_assert_locked(struct vm_area_struct *vma)
> >{
> >        unsigned int mm_lock_seq;
> >
> >        VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
> >                      !__is_vma_write_locked(vma, &mm_lock_seq), vma);
> >}
> >
> >Will make the change.
> >
> >Thanks for the feedback!
>
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-13  2:25         ` Wei Yang
@ 2025-01-13 21:14           ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 21:14 UTC (permalink / raw)
  To: Wei Yang
  Cc: Mateusz Guzik, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Sun, Jan 12, 2025 at 6:25 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, Jan 13, 2025 at 01:47:29AM +0000, Wei Yang wrote:
> >On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> >>On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >>>
> >>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >>>
> >>> So there were quite a few iterations of the patch and I have not been
> >>> reading majority of the feedback, so it may be I missed something,
> >>> apologies upfront. :)
> >>>
> >
> >Hi, I am new to memory barriers. Hope not bothering.
> >
> >>> >  /*
> >>> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> >>> >   * locked result to avoid performance overhead, in which case we fall back to
> >>> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> >>> >   */
> >>> >  static inline bool vma_start_read(struct vm_area_struct *vma)
> >>> >  {
> >>> > +     int oldcnt;
> >>> > +
> >>> >       /*
> >>> >        * Check before locking. A race might cause false locked result.
> >>> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> >>> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >>> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> >>> >               return false;
> >>> >
> >>> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> >>> > +     /*
> >>> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> >>> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> >>> > +      */
> >>> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> >>> > +                                                   VMA_REF_LIMIT)))
> >>> >               return false;
> >>> >
> >>>
> >>> Replacing down_read_trylock() with the new routine loses an acquire
> >>> fence. That alone is not a problem, but see below.
> >>
> >>Hmm. I think this acquire fence is actually necessary. We don't want
> >>the later vm_lock_seq check to be reordered and happen before we take
> >>the refcount. Otherwise this might happen:
> >>
> >>reader             writer
> >>if (vm_lock_seq == mm_lock_seq) // check got reordered
> >>        return false;
> >>                       vm_refcnt += VMA_LOCK_OFFSET
> >>                       vm_lock_seq = mm_lock_seq
> >>                       vm_refcnt -= VMA_LOCK_OFFSET
> >>if (!__refcount_inc_not_zero_limited())
> >>        return false;
> >>
> >>Both reader's checks will pass and the reader would read-lock a vma
> >>that was write-locked.
> >>
> >
> >Here what we plan to do is define __refcount_inc_not_zero_limited() with
> >acquire fence, e.g. with atomic_try_cmpxchg_acquire(), right?
> >
>
> BTW, usually we pair acquire with release.
>
> The __vma_start_write() provide release fence when locked, so for this part
> we are ok, right?

Yes, __vma_start_write() -> __vma_exit_locked() ->
refcount_sub_and_test() and this function provides release memory
ordering, see https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/refcount.h#L289

>
>
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-13  2:37   ` Wei Yang
@ 2025-01-13 21:16     ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 21:16 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sun, Jan 12, 2025 at 6:38 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> > static inline void vma_end_read(struct vm_area_struct *vma) {}
> >@@ -908,12 +948,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       vma->vm_mm = mm;
> >       vma->vm_ops = &vma_dummy_vm_ops;
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> >-#ifdef CONFIG_PER_VMA_LOCK
> >-      /* vma is not locked, can't use vma_mark_detached() */
> >-      vma->detached = true;
> >-#endif
> >       vma_numab_state_init(vma);
> >-      vma_lock_init(vma);
> >+      vma_lock_init(vma, false);
>
> vma_init(vma, mm)
>   memset(vma, 0, sizeof(*vma))
>   ...
>   vma_lock_init(vma, false);
>
> It looks like the vm_refcnt must be reset.
>
> BTW, I can't figure out why we want to skip the reset of vm_refcnt. Is this
> related to SLAB_TYPESAFE_BY_RCU?

Earlier memset(vma, 0, sizeof(*vma)) already zeroes the entire
structure, so vm_refcnt is already 0 and does not need to be reset
again.
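
To spell out the intent (a sketch only, not the exact patch, and the
parameter name is a guess), the flag just tells vma_lock_init() whether
the refcount still needs to be cleared:

static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
{
	if (reset_refcnt)
		refcount_set(&vma->vm_refcnt, 0); /* vma starts out detached */
	/* ... the rest of the lock initialization ... */
}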

>
> > }
> >
>
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-13  9:36   ` Wei Yang
@ 2025-01-13 21:18     ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-13 21:18 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 1:36 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> [...]
> >
> >+static inline bool is_vma_writer_only(int refcnt)
> >+{
> >+      /*
> >+       * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma
> >+       * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on
> >+       * a detached vma happens only in vma_mark_detached() and is a rare
> >+       * case, therefore most of the time there will be no unnecessary wakeup.
> >+       */
> >+      return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1;
>
> It looks equivalent to
>
>         return (refcnt == VMA_LOCK_OFFSET) || (refcnt == VMA_LOCK_OFFSET + 1);
>
> And its generated code looks a little simpler.

Yeah, but I think the original version is a bit more descriptive,
checking that (1) there is a writer and (2) there are no readers.

>
> >+}
> >+
> >+static inline void vma_refcount_put(struct vm_area_struct *vma)
> >+{
> >+      /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> >+      struct mm_struct *mm = vma->vm_mm;
> >+      int oldcnt;
> >+
> >+      rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> >+      if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> >+
> >+              if (is_vma_writer_only(oldcnt - 1))
> >+                      rcuwait_wake_up(&mm->vma_writer_wait);
> >+      }
> >+}
> >+
>
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-13 12:14 ` Lorenzo Stoakes
  2025-01-13 16:58   ` Suren Baghdasaryan
@ 2025-01-14  1:49   ` Andrew Morton
  2025-01-14  2:53     ` Suren Baghdasaryan
  1 sibling, 1 reply; 140+ messages in thread
From: Andrew Morton @ 2025-01-14  1:49 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, peterz, willy, liam.howlett,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

On Mon, 13 Jan 2025 12:14:19 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> A nit on subject, I mean this is part of what this series does, and hey -
> we have only so much text to put in here - but isn't this both
> reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> using the RCU typesafe mechanism?
> 
> Do we have to do both in one series? Can we split this out? I mean maybe
> that's just churny and unnecessary, but to me this series is 'allocate VMAs
> RCU safe and refcount VMA lock' or something like this. Maybe this is
> nitty... but still :)
> 
> One general comment here - this is a really major change in how this stuff
> works, and yet I don't see any tests anywhere in the series.
> 
> I know it's tricky to write tests for this, but the new VMA testing
> environment should make it possible to test a _lot_ more than we previously
> could.
> 
> However due to some (*ahem*) interesting distribution of where functions
> are, most notably stuff in kernel/fork.c, I guess we can't test
> _everything_ there effectively.
> 
> But I do feel like we should be able to do better than having absolutely no
> testing added for this?
> 
> I think there's definitely quite a bit you could test now, at least in
> asserting fundamentals in tools/testing/vma/vma.c.
> 
> This can cover at least detached state asserts in various scenarios.
> 
> But that won't cover off the really gnarly stuff here around RCU slab
> allocation, and determining precisely how to test that in a sensible way is
> maybe less clear.
> 
> But I'd like to see _something_ here please, this is more or less
> fundamentally changing how all VMAs are allocated and to just have nothing
> feels unfortunate.
> 
> I'm already nervous because we've hit issues coming up to v9 and we're not
> 100% sure if a recent syzkaller is related to these changes or not, I'm not
> sure how much we can get assurances with tests but I'd like something.

Thanks.

Yes, we're at -rc7 and this series is rather in panic mode and it seems
unnecessarily risky so I'm inclined to set it aside for this cycle.

If the series is considered super desirable and if people are confident
that we can address any remaining glitches during two months of -rc
then sure, we could push the envelope a bit.  But I don't believe this
is the case so I'm thinking let's give ourselves another cycle to get
this all sorted out?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-14  1:49   ` Andrew Morton
@ 2025-01-14  2:53     ` Suren Baghdasaryan
  2025-01-14  4:09       ` Andrew Morton
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-14  2:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lorenzo Stoakes, peterz, willy, liam.howlett, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 13 Jan 2025 12:14:19 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > A nit on subject, I mean this is part of what this series does, and hey -
> > we have only so much text to put in here - but isn't this both
> > reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> > using the RCU typesafe mechanism?
> >
> > Do we have to do both in one series? Can we split this out? I mean maybe
> > that's just churny and unnecessary, but to me this series is 'allocate VMAs
> > RCU safe and refcount VMA lock' or something like this. Maybe this is
> > nitty... but still :)
> >
> > One general comment here - this is a really major change in how this stuff
> > works, and yet I don't see any tests anywhere in the series.
> >
> > I know it's tricky to write tests for this, but the new VMA testing
> > environment should make it possible to test a _lot_ more than we previously
> > could.
> >
> > However due to some (*ahem*) interesting distribution of where functions
> > are, most notably stuff in kernel/fork.c, I guess we can't test
> > _everything_ there effectively.
> >
> > But I do feel like we should be able to do better than having absolutely no
> > testing added for this?
> >
> > I think there's definitely quite a bit you could test now, at least in
> > asserting fundamentals in tools/testing/vma/vma.c.
> >
> > This can cover at least detached state asserts in various scenarios.
> >
> > But that won't cover off the really gnarly stuff here around RCU slab
> > allocation, and determining precisely how to test that in a sensible way is
> > maybe less clear.
> >
> > But I'd like to see _something_ here please, this is more or less
> > fundamentally changing how all VMAs are allocated and to just have nothing
> > feels unfortunate.
> >
> > I'm already nervous because we've hit issues coming up to v9 and we're not
> > 100% sure if a recent syzkaller is related to these changes or not, I'm not
> > sure how much we can get assurances with tests but I'd like something.
>
> Thanks.
>
> Yes, we're at -rc7 and this series is rather in panic mode and it seems
> unnecessarily risky so I'm inclined to set it aside for this cycle.
>
> If the series is considered super desirable and if people are confident
> that we can address any remaining glitches during two months of -rc
> then sure, we could push the envelope a bit.  But I don't believe this
> is the case so I'm thinking let's give ourselves another cycle to get
> this all sorted out?

I didn't think this series was in panic mode with one real issue that
is not hard to address (memory ordering in
__refcount_inc_not_zero_limited()) but I'm obviously biased and might
be missing the big picture. In any case, if it makes people nervous I
have no objections to your plan.
Thanks,
Suren.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-14  2:53     ` Suren Baghdasaryan
@ 2025-01-14  4:09       ` Andrew Morton
  2025-01-14  9:09         ` Vlastimil Babka
                           ` (2 more replies)
  0 siblings, 3 replies; 140+ messages in thread
From: Andrew Morton @ 2025-01-14  4:09 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Lorenzo Stoakes, peterz, willy, liam.howlett, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:

> On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> >
> > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > unnecessarily risky so I'm inclined to set it aside for this cycle.
> >
> > If the series is considered super desirable and if people are confident
> > that we can address any remaining glitches during two months of -rc
> > then sure, we could push the envelope a bit.  But I don't believe this
> > is the case so I'm thinking let's give ourselves another cycle to get
> > this all sorted out?
> 
> I didn't think this series was in panic mode with one real issue that
> is not hard to address (memory ordering in
> __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> be missing the big picture. In any case, if it makes people nervous I
> have no objections to your plan.

Well, I'm soliciting opinions here.  What do others think?

And do you see much urgency with these changes?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-14  4:09       ` Andrew Morton
@ 2025-01-14  9:09         ` Vlastimil Babka
  2025-01-14 10:27           ` Hillf Danton
  2025-01-14  9:47         ` Lorenzo Stoakes
  2025-01-14 14:59         ` Liam R. Howlett
  2 siblings, 1 reply; 140+ messages in thread
From: Vlastimil Babka @ 2025-01-14  9:09 UTC (permalink / raw)
  To: Andrew Morton, Suren Baghdasaryan
  Cc: Lorenzo Stoakes, peterz, willy, liam.howlett, david.laight.linux,
	mhocko, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On 1/14/25 05:09, Andrew Morton wrote:
> On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> 
>> On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> >
>> >
>> > Yes, we're at -rc7 and this series is rather in panic mode and it seems
>> > unnecessarily risky so I'm inclined to set it aside for this cycle.
>> >
>> > If the series is considered super desirable and if people are confident
>> > that we can address any remaining glitches during two months of -rc
>> > then sure, we could push the envelope a bit.  But I don't believe this
>> > is the case so I'm thinking let's give ourselves another cycle to get
>> > this all sorted out?
>> 
>> I didn't think this series was in panic mode with one real issue that
>> is not hard to address (memory ordering in
>> __refcount_inc_not_zero_limited()) but I'm obviously biased and might
>> be missing the big picture. In any case, if it makes people nervous I
>> have no objections to your plan.
> 
> Well, I'm soliciting opinions here.  What do others think?
> 
> And do you see much urgency with these changes?

I don't see the urgency and at this point giving it more time seems wise.
Seems like v10 won't be exactly trivial as we'll change from refcount_t to
atomic_t? And I'd like to see PeterZ review the lockdep parts too.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-14  4:09       ` Andrew Morton
  2025-01-14  9:09         ` Vlastimil Babka
@ 2025-01-14  9:47         ` Lorenzo Stoakes
  2025-01-14 14:59         ` Liam R. Howlett
  2 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-14  9:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, peterz, willy, liam.howlett,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 08:09:08PM -0800, Andrew Morton wrote:
> On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > >
> > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > >
> > > If the series is considered super desirable and if people are confident
> > > that we can address any remaining glitches during two months of -rc
> > > then sure, we could push the envelope a bit.  But I don't believe this
> > > is the case so I'm thinking let's give ourselves another cycle to get
> > > this all sorted out?
> >
> > I didn't think this series was in panic mode with one real issue that
> > is not hard to address (memory ordering in
> > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > be missing the big picture. In any case, if it makes people nervous I
> > have no objections to your plan.
>
> Well, I'm soliciting opinions here.  What do others think?
>
> And do you see much urgency with these changes?

With apologies to Suren (genuinely!) who is doing great work and is
super-responsive here, this really needs another cycle in my opinion.

As Vlastimil points out there's some non-trivial bits to go, but I am also
firmly of the opinion we need to have as much testing as is practical here.

I don't think this is urgent on any timeline so I'd like to join Vlastimil
to firmly but politely push for this to land in 6.15 rather than 6.14.

Just to reiterate - this is absolutely no reflection on Suren who has been
really great here - it is purely a product of the complexity and scope of
this change.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-14  9:09         ` Vlastimil Babka
@ 2025-01-14 10:27           ` Hillf Danton
  0 siblings, 0 replies; 140+ messages in thread
From: Hillf Danton @ 2025-01-14 10:27 UTC (permalink / raw)
  To: Andrew Morton, Suren Baghdasaryan
  Cc: Lorenzo Stoakes, peterz, hannes, linux-mm, linux-kernel,
	Vlastimil Babka

On Tue, 14 Jan 2025 10:09:42 +0100 Vlastimil Babka <vbabka@suse.cz>
> On 1/14/25 05:09, Andrew Morton wrote:
> > 
> > Well, I'm soliciting opinions here.  What do others think?
> > 
> > And do you see much urgency with these changes?
> 
> I don't see the urgency and at this point giving it more time seems wise.
> Seems like v10 won't be exactly trivial as we'll change from refcount_t to
> atomic_t? And I'd like to see PeterZ review the lockdep parts too.

+1

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-13 19:00       ` Suren Baghdasaryan
@ 2025-01-14 11:35         ` Lorenzo Stoakes
  0 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-14 11:35 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 11:00:07AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 9:11 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Jan 13, 2025 at 08:58:37AM -0800, Suren Baghdasaryan wrote:
> > > On Mon, Jan 13, 2025 at 4:14 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > A nit on subject, I mean this is part of what this series does, and hey -
> > > > we have only so much text to put in here - but isn't this both
> > > > reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> > > > using the RCU typesafe mechanism?
> > > >
> > > > Do we have to do both in one series? Can we split this out? I mean maybe
> > > > that's just churny and unnecessary, but to me this series is 'allocate VMAs
> > > > RCU safe and refcount VMA lock' or something like this. Maybe this is
> > > > nitty... but still :)
> > >
> > > There is "motivational dependency" because one of the main reasons I'm
> > > converting the vm_lock into vm_refcnt is to make it easier to add
> > > SLAB_TYPESAFE_BY_RCU (see my last reply to Hillf). But technically we
> > > can leave the SLAB_TYPESAFE_BY_RCU out of this series if that makes
> > > things easier. That would be the 2 patches at the end:
> >
> > Right yeah... maybe it's better to do it in one hit.
> >
> > >
> > > mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > > mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > >
> > > I made sure that each patch is bisectable, so there should not be a
> > > problem with tracking issues.
> > >
> > > >
> > > > One general comment here - this is a really major change in how this stuff
> > > > works, and yet I don't see any tests anywhere in the series.
> > >
> > > Hmm. I was diligently updating the tests to reflect the replacement of
> > > vm_lock with vm_refcnt and adding assertions for detach/attach cases.
> > > This actually reminds me that I missed updating vm_area_struct in
> > > vma_internal.h for the member regrouping patch; will add that. I think
> > > the only part that did not affect tests is SLAB_TYPESAFE_BY_RCU but I
> > > was not sure what kind of testing I can add for that. Any suggestions
> > > would be welcomed.
> >
> > And to be clear I'm super grateful you did that :) thanks, be good to
> > change the member regrouping thing also.
> >
> > But that doesn't change the fact that this series has exactly zero tests
> > for it. And for something so broad, it feels like a big issue, we really
> > want to be careful with something so big here.
> >
> > You've also noticed that I've cleverly failed to _actually_ suggest
> > SLAB_TYPESAFE_BY_RCU tests, and mea culpa - it's super hard to think of how
> > to test that.
> >
> > Liam has experience doing this kind of RCU testing for the maple tree stuff, but it
> > wasn't pretty and wasn't easy and would probably require massive rework to
> > expose this stuff to some viable testing environment, or in other words -
> > is unworkable.
> >
> > HOWEVER, I feel like maybe we could try to create scenarios where we might
> > trigger reuse bugs?
> >
> > Perhaps some userland code, perhaps even constrained by cgroup, that maps a
> > ton of stuff and unmaps in a loop in parallel?
> >
> > Perhaps create scenarios with shared memory where we up refcounts a lot too?
>
> I have this old spf_test
> (https://github.com/surenbaghdasaryan/spf_test/blob/main/spf_test.c)
> which I often use to weed out vma locking issues because it starts
> multiple threads doing mmap + page faults. Perhaps we can repackage it
> into a test/benchmark for testing contention on mmap/vma locks?

Ah nice yeah that sounds good!
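
For the archive, roughly the shape of thing I had in mind - a minimal
sketch only, with made-up thread/iteration counts and none of spf_test's
actual fault-measuring logic:

#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#define NR_THREADS      8
#define NR_ITERATIONS   10000
#define MAP_SIZE        (64UL * 4096)

/* Each thread repeatedly maps, faults in and unmaps an anonymous region,
 * so vma allocation/freeing races with page faults in other threads. */
static void *stress(void *arg)
{
        (void)arg;

        for (int i = 0; i < NR_ITERATIONS; i++) {
                char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        continue;
                /* Touch every page so we take the fault path under vma locks. */
                for (size_t off = 0; off < MAP_SIZE; off += 4096)
                        p[off] = 1;
                munmap(p, MAP_SIZE);
        }
        return NULL;
}

int main(void)
{
        pthread_t threads[NR_THREADS];

        for (int i = 0; i < NR_THREADS; i++)
                pthread_create(&threads[i], NULL, stress, NULL);
        for (int i = 0; i < NR_THREADS; i++)
                pthread_join(threads[i], NULL);
        return 0;
}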

>
> >
> > Anyway, this is necessarily nebulous without further investigation, what I
> > was thinking more concretely is:
> >
> > Using the VMA userland testing:
> >
> > 1. Assert reference count correctness across locking scenarios and various
> >    VMA operations.
> > 2. Assert correct detached/not detached state across different scenarios.
> >
> > This won't quite be complete as not everything is separated out quite
> > enough to allow things like process tear down/forking etc. to be explicitly
> > tested but you can unit tests the VMA bits at least.
> >
> > One note on this, I intend to split the vma.c file into multiple files in
> > tools/testing/vma/ so if you add tests here it'd be worth probably putting
> > them into a new file.
> >
> > I'm happy to help with this if you need any assistance, feel free to ping!
>
> As a starting point I was thinking of changing
> vma_assert_attached()/vma_assert_detached() and
> vma_mark_attached()/vma_mark_detached() to return a bool and use
> WARN_ON_ONCE() (to address your concern about asserts being dependent
> on CONFIG_DEBUG_VM) like this:
>
> static inline bool vma_assert_detached(struct vm_area_struct *vma)
> {
>     return !WARN_ON_ONCE(atomic_read(&vma->vm_refcnt));
> }
>
> static inline bool vma_mark_attached(struct vm_area_struct *vma)
> {
>     vma_assert_write_locked(vma);
>     if (!vma_assert_detached(vma))
>         return false;
>
>     atomic_set(&vma->vm_refcnt, 1);
>     return true;
> }
>

Sounds good!

> With that we can add correctness checks in the tools/testing/vma/vma.c
> for different states, for example in the alloc_and_link_vma() we can
> check that after vma_link() the vma is indeed attached:
>
> ASSERT_TRUE(vma_assert_attached(vma));
>
> This might not cover all states but is probably a good starting point. WDYT?

Yeah, this is a good starting point.

I think also we should add explicit asserts in the merge tests to ensure
attachment.

I mean part of this is adding more tests in general for standard
operations, but I don't want to be silly and suggest you need to do that.

I think this forms a decent basis.
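
Just to make the unit-test side concrete, in tools/testing/vma/vma.c I'm
picturing something roughly like the below - a sketch only, assuming the
bool-returning vma_assert_attached()/vma_assert_detached() variants you
describe above and harness helpers along the lines of the existing
alloc_vma()/alloc_and_link_vma()/cleanup_mm()/ASSERT_TRUE():

static bool test_vma_attach_state(void)
{
        unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
        struct mm_struct mm = {};
        VMA_ITERATOR(vmi, &mm, 0);
        struct vm_area_struct *vma;

        /* A freshly allocated vma should start out detached... */
        vma = alloc_vma(&mm, 0x1000, 0x2000, 1, flags);
        ASSERT_TRUE(vma_assert_detached(vma));
        vm_area_free(vma);

        /* ...while a vma linked into the tree must be attached. */
        vma = alloc_and_link_vma(&mm, 0x1000, 0x2000, 1, flags);
        ASSERT_TRUE(vma_assert_attached(vma));

        cleanup_mm(&mm, &vmi);
        return true;
}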

>
> >
> > Sorry to put this on you so late in the series, I realise it's annoying,
> > but I feel like things have changed a lot and obviously aggregated with two
> > series in one in effect and these are genuine concerns that at this stage I
> > feel like we need to try to at least make some headway on.
> >
> > >
> > > >
> > > > I know it's tricky to write tests for this, but the new VMA testing
> > > > environment should make it possible to test a _lot_ more than we previously
> > > > could.
> > > >
> > > > However due to some (*ahem*) interesting distribution of where functions
> > > > are, most notably stuff in kernel/fork.c, I guess we can't test
> > > > _everything_ there effectively.
> > > >
> > > > But I do feel like we should be able to do better than having absolutely no
> > > > testing added for this?
> > >
> > > Again, I'm open to suggestions for SLAB_TYPESAFE_BY_RCU testing but
> > > for the rest I thought the tests were modified accordingly.
> >
> > See above ^
> >
> > >
> > > >
> > > > I think there's definitely quite a bit you could test now, at least in
> > > > asserting fundamentals in tools/testing/vma/vma.c.
> > > >
> > > > This can cover at least detached state asserts in various scenarios.
> > >
> > > Ok, you mean to check that VMA re-attachment/re-detachment would
> > > trigger assertions? I'll look into adding tests for that.
> >
> > Yeah this is one, see above :)
> >
> > >
> > > >
> > > > But that won't cover off the really gnarly stuff here around RCU slab
> > > > allocation, and determining precisely how to test that in a sensible way is
> > > > maybe less clear.
> > > >
> > > > But I'd like to see _something_ here please, this is more or less
> > > > fundamentally changing how all VMAs are allocated and to just have nothing
> > > > feels unfortunate.
> > >
> > > Again, I'm open to suggestions on what kind of testing I can add for
> > > SLAB_TYPESAFE_BY_RCU change.
> >
> > See above
> >
> > >
> > > >
> > > > I'm already nervous because we've hit issues coming up to v9 and we're not
> > > > 100% sure whether a recent syzkaller report is related to these changes. I'm
> > > > not sure how much assurance we can get from tests, but I'd like something.
> > >
> > > If you are referring to the issue at [1], I think David ran the
> > > syzkaller against mm-stable that does not contain this patchset and
> > > the issue still triggered (see [2]). This of course does not guarantee
> > > that this patchset has no other issues :) I'll try adding tests for
> > > re-attaching, re-detaching and welcome ideas on how to test
> > > SLAB_TYPESAFE_BY_RCU transition.
> > > Thanks,
> > > Suren.
> >
> > OK that's reassuring!
> >
> > >
> > > [1] https://lore.kernel.org/all/6758f0cc.050a0220.17f54a.0001.GAE@google.com/
> > > [2] https://lore.kernel.org/all/67823fba.050a0220.216c54.001c.GAE@google.com/
> > >
> > > >
> > > > Thanks!
> > > >
> > > > On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > > > > Back when per-vma locks were introduces, vm_lock was moved out of
> > > > > vm_area_struct in [1] because of the performance regression caused by
> > > > > false cacheline sharing. Recent investigation [2] revealed that the
> > > > > regressions is limited to a rather old Broadwell microarchitecture and
> > > > > even there it can be mitigated by disabling adjacent cacheline
> > > > > prefetching, see [3].
> > > > > Splitting single logical structure into multiple ones leads to more
> > > > > complicated management, extra pointer dereferences and overall less
> > > > > maintainable code. When that split-away part is a lock, it complicates
> > > > > things even further. With no performance benefits, there are no reasons
> > > > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > > > This patchset:
> > > > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > > > boundary and changing the cache to be cacheline-aligned to minimize
> > > > > cacheline sharing;
> > > > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > > > it is inserted into vma tree;
> > > > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > > > 4. regroups vm_area_struct members to fit them into 3 cachelines;
> > > > > 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > > > reuse and to minimize call_rcu() calls.
> > > > >
> > > > > Pagefault microbenchmarks show performance improvement:
> > > > > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > > > > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > > > > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > > > > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > > > > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > > > > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > > > > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > > > > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > > > > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > > > > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > > > > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > > > > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > > > > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > > > > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > > > > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > > > > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
> > > > >
> > > > > Changes since v8 [4]:
> > > > > - Change subject for the cover letter, per Vlastimil Babka
> > > > > - Added Reviewed-by and Acked-by, per Vlastimil Babka
> > > > > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > > > > per David Laight
> > > > > - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> > > > > per Hillf Danton and Vlastimil Babka
> > > > > - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> > > > > under us, per Vlastimil Babka
> > > > > - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> > > > > per Vlastimil Babka
> > > > > - Changed __vma_enter_locked() parameter to centralize refcount logic,
> > > > > per Vlastimil Babka
> > > > > - Amended description in vm_lock replacement patch explaining the effects
> > > > > of the patch on vm_area_struct size, per Vlastimil Babka
> > > > > - Added vm_area_struct member regrouping patch [5] into the series
> > > > > - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> > > > > - Added a comment for vm_area_struct to update vm_area_init_from() when
> > > > > adding new members, per Vlastimil Babka
> > > > > - Updated a comment about unstable src->shared.rb when copying a vma in
> > > > > vm_area_init_from(), per Vlastimil Babka
> > > > >
> > > > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > > > > [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> > > > > [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> > > > >
> > > > > Patchset applies over mm-unstable after reverting v8
> > > > > (current SHA range: 235b5129cb7b - 9e6b24c58985)
> > > > >
> > > > > Suren Baghdasaryan (17):
> > > > >   mm: introduce vma_start_read_locked{_nested} helpers
> > > > >   mm: move per-vma lock into vm_area_struct
> > > > >   mm: mark vma as detached until it's added into vma tree
> > > > >   mm: introduce vma_iter_store_attached() to use with attached vmas
> > > > >   mm: mark vmas detached upon exit
> > > > >   types: move struct rcuwait into types.h
> > > > >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> > > > >   mm: move mmap_init_lock() out of the header file
> > > > >   mm: uninline the main body of vma_start_write()
> > > > >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> > > > >   mm: replace vm_lock and detached flag with a reference count
> > > > >   mm: move lesser used vma_area_struct members into the last cacheline
> > > > >   mm/debug: print vm_refcnt state when dumping the vma
> > > > >   mm: remove extra vma_numab_state_init() call
> > > > >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > > > >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > > > >   docs/mm: document latest changes to vm_lock
> > > > >
> > > > >  Documentation/mm/process_addrs.rst |  44 ++++----
> > > > >  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
> > > > >  include/linux/mm_types.h           |  75 +++++++-------
> > > > >  include/linux/mmap_lock.h          |   6 --
> > > > >  include/linux/rcuwait.h            |  13 +--
> > > > >  include/linux/refcount.h           |  24 ++++-
> > > > >  include/linux/slab.h               |   6 --
> > > > >  include/linux/types.h              |  12 +++
> > > > >  kernel/fork.c                      | 129 +++++++++++-------------
> > > > >  mm/debug.c                         |  12 +++
> > > > >  mm/init-mm.c                       |   1 +
> > > > >  mm/memory.c                        |  97 ++++++++++++++++--
> > > > >  mm/mmap.c                          |   3 +-
> > > > >  mm/userfaultfd.c                   |  32 +++---
> > > > >  mm/vma.c                           |  23 ++---
> > > > >  mm/vma.h                           |  15 ++-
> > > > >  tools/testing/vma/linux/atomic.h   |   5 +
> > > > >  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
> > > > >  18 files changed, 465 insertions(+), 281 deletions(-)
> > > > >
> > > > > --
> > > > > 2.47.1.613.gc27f4b7a9f-goog
> > > > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 05/17] mm: mark vmas detached upon exit
  2025-01-13 20:42             ` Suren Baghdasaryan
@ 2025-01-14 11:36               ` Lorenzo Stoakes
  0 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-14 11:36 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, akpm, peterz, willy, liam.howlett,
	david.laight.linux, mhocko, hannes, mjguzik, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 12:42:53PM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 12:32 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 1/13/25 20:11, Suren Baghdasaryan wrote:
> > > On Mon, Jan 13, 2025 at 9:13 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > >>
> > >> On Mon, Jan 13, 2025 at 09:02:50AM -0800, Suren Baghdasaryan wrote:
> > >> > On Mon, Jan 13, 2025 at 4:05 AM Lorenzo Stoakes
> > >> > <lorenzo.stoakes@oracle.com> wrote:
> > >> > >
> > >> > > On Fri, Jan 10, 2025 at 08:25:52PM -0800, Suren Baghdasaryan wrote:
> > >> > > > When exit_mmap() removes vmas belonging to an exiting task, it does not
> > >> > > > mark them as detached since they can't be reached by other tasks and they
> > >> > > > will be freed shortly. Once we introduce vma reuse, all vmas will have to
> > >> > > > be in detached state before they are freed, to ensure a vma is in a consistent
> > >> > > > state when reused. Add the missing vma_mark_detached() before freeing the
> > >> > > > vma.
> > >> > >
> > >> > > Hmm this really makes me worry that we'll see bugs from this detached
> > >> > > stuff, do we make this assumption anywhere else I wonder?
> > >> >
> > >> > This is the only place which does not currently detach the vma before
> > >> > freeing it. If someone tries adding a case like that in the future,
> > >> > they will be met with vma_assert_detached() inside vm_area_free().
> > >>
> > >> OK good to know!
> > >>
> > >> Again, I wonder if we should make these assertions stronger as commented
> > >> elsewhere, because if we see them in production isn't that worth an actual
> > >> non-debug WARN_ON_ONCE()?
> > >
> > > Sure. I'll change vma_assert_attached()/vma_assert_detached() to use
> > > WARN_ON_ONCE() and to return a bool (see also my reply in the patch
> > > [0/17]).
> >
> > So is this a case of "someone might introduce code later that will violate
> > them" as alluded to above? Unconditional WARN_ON_ONCE seems too much then.

My concern is that there is a broken case that remains hidden because
nothing is actually checked in production, which would then become really
difficult to debug should somebody report it.

We intend the WARN_ONxxx() functions to be asserting things that -should
not be- for precisely this kind of purpose so I think it makes sense here.
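
To spell out why the bool return is handy: WARN_ON_ONCE() evaluates to the
condition it checks, so the assert can shout once in production *and* let
the caller bail out cleanly - roughly, and with a made-up error value:

        if (!vma_assert_attached(vma))  /* fires WARN_ON_ONCE() internally */
                return -EINVAL;         /* hypothetical recovery path */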

>
> Yes, I wanted to make sure refcounting will not be broken by someone
> doing re-attach/re-detach.

Yes, and debugging this without it could be really horrible.

>
> >
> > In general it's not easy to determine how paranoid we should be in non-debug
> > code, but I'm not sure what's the need here specifically.
>
> I'm not sure how strict we should be but we definitely should try to
> catch refcounting mistakes and that's my goal here.

Yes I think it is worth it here (obviously :)

>
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas
  2025-01-13 19:09         ` Suren Baghdasaryan
@ 2025-01-14 11:38           ` Lorenzo Stoakes
  0 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-14 11:38 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 11:09:25AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 8:48 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Jan 13, 2025 at 08:31:45AM -0800, Suren Baghdasaryan wrote:
> > > On Mon, Jan 13, 2025 at 3:58 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Fri, Jan 10, 2025 at 08:25:51PM -0800, Suren Baghdasaryan wrote:
> > > > > vma_iter_store() functions can be used both when adding a new vma and
> > > > > when updating an existing one. However for existing ones we do not need
> > > > > to mark them attached as they are already marked that way. Introduce
> > > > > vma_iter_store_attached() to be used with already attached vmas.
> > > >
> > > > OK I guess the intent of this is to reinstate the previously existing
> > > > asserts, only explicitly checking those places where we attach.
> > >
> > > No, the motivation is to prevent re-attaching an already attached vma
> > > or re-detaching an already detached vma for state consistency. I guess
> > > I should amend the description to make that clear.
> >
> > Sorry for noise, missed this reply.
> >
> > What I mean by this is, in a past iteration of this series I reviewed code
> > where you did this but did _not_ differentiate between cases of new VMAs
> > vs. existing, which caused an assert in your series which I reported.
> >
> > So I'm saying - now you _are_ differentiating between the two cases.
> >
> > It's certainly worth belabouring the point of exactly what it is you are
> > trying to catch here, however! :) So yes please do add a little more to
> > commit msg that'd be great, thanks!
>
> Sure. How about:
>
> With vma->detached being a separate flag, double-marking a vma as
> attached or detached is not an issue because the flag will simply be
> overwritten with the same value. However once we fold this flag into
> the refcount later in this series, re-attaching or re-detaching a vma
> becomes an issue since these operations will be
> incrementing/decrementing a refcount. Fix the places where we
> currently re-attach a vma during a vma update and add assertions in
> vma_mark_attached()/vma_mark_detached() to catch invalid usage.

That's awesome, thanks!

>
> >
> > >
> > > >
> > > > I'm a little concerned that by doing this, somebody might simply invoke
> > > > this function without realising the implications.
> > >
> > > Well, in that case somebody should get an assertion. If
> > > vma_iter_store() is called against already attached vma, we get this
> > > assertion:
> > >
> > > vma_iter_store()
> > >   vma_mark_attached()
> > >     vma_assert_detached()
> > >
> > > If vma_iter_store_attached() is called against a detached vma, we get this one:
> > >
> > > vma_iter_store_attached()
> > >   vma_assert_attached()
> > >
> > > Does that address your concern?
> > >
> > > >
> > > > Can we have something functional like
> > > >
> > > > vma_iter_store_new() and vma_iter_store_overwrite()
> > >
> > > Ok. A bit more churn but should not be too bad.
> > >
> > > >
> > > > ?
> > > >
> > > > I don't like us just leaving vma_iter_store() quietly making an assumption
> > > > that a caller doesn't necessarily realise.
> > > >
> > > > Also it's more greppable this way.
> > > >
> > > > I had a look through callers and it does seem you've snagged them all
> > > > correctly.
> > > >
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > > > ---
> > > > >  include/linux/mm.h | 12 ++++++++++++
> > > > >  mm/vma.c           |  8 ++++----
> > > > >  mm/vma.h           | 11 +++++++++--
> > > > >  3 files changed, 25 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 2b322871da87..2f805f1a0176 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -821,6 +821,16 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> > > > >               vma_assert_write_locked(vma);
> > > > >  }
> > > > >
> > > > > +static inline void vma_assert_attached(struct vm_area_struct *vma)
> > > > > +{
> > > > > +     VM_BUG_ON_VMA(vma->detached, vma);
> > > > > +}
> > > > > +
> > > > > +static inline void vma_assert_detached(struct vm_area_struct *vma)
> > > > > +{
> > > > > +     VM_BUG_ON_VMA(!vma->detached, vma);
> > > > > +}
> > > > > +
> > > > >  static inline void vma_mark_attached(struct vm_area_struct *vma)
> > > > >  {
> > > > >       vma->detached = false;
> > > > > @@ -866,6 +876,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> > > > >  static inline void vma_start_write(struct vm_area_struct *vma) {}
> > > > >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > > >               { mmap_assert_write_locked(vma->vm_mm); }
> > > > > +static inline void vma_assert_attached(struct vm_area_struct *vma) {}
> > > > > +static inline void vma_assert_detached(struct vm_area_struct *vma) {}
> > > > >  static inline void vma_mark_attached(struct vm_area_struct *vma) {}
> > > > >  static inline void vma_mark_detached(struct vm_area_struct *vma) {}
> > > > >
> > > > > diff --git a/mm/vma.c b/mm/vma.c
> > > > > index d603494e69d7..b9cf552e120c 100644
> > > > > --- a/mm/vma.c
> > > > > +++ b/mm/vma.c
> > > > > @@ -660,14 +660,14 @@ static int commit_merge(struct vma_merge_struct *vmg,
> > > > >       vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff);
> > > > >
> > > > >       if (expanded)
> > > > > -             vma_iter_store(vmg->vmi, vmg->vma);
> > > > > +             vma_iter_store_attached(vmg->vmi, vmg->vma);
> > > > >
> > > > >       if (adj_start) {
> > > > >               adjust->vm_start += adj_start;
> > > > >               adjust->vm_pgoff += PHYS_PFN(adj_start);
> > > > >               if (adj_start < 0) {
> > > > >                       WARN_ON(expanded);
> > > > > -                     vma_iter_store(vmg->vmi, adjust);
> > > > > +                     vma_iter_store_attached(vmg->vmi, adjust);
> > > > >               }
> > > > >       }
> > > >
> > > > I kind of feel this whole function (that yes, I added :>), though derived
> > > > from existing logic, needs rework, as it's necessarily rather confusing.
> > > >
> > > > But hey, that's on me :)
> > > >
> > > > But this does look right... OK see this as a note-to-self...
> > > >
> > > > >
> > > > > @@ -2845,7 +2845,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > > > >                               anon_vma_interval_tree_pre_update_vma(vma);
> > > > >                               vma->vm_end = address;
> > > > >                               /* Overwrite old entry in mtree. */
> > > > > -                             vma_iter_store(&vmi, vma);
> > > > > +                             vma_iter_store_attached(&vmi, vma);
> > > > >                               anon_vma_interval_tree_post_update_vma(vma);
> > > > >
> > > > >                               perf_event_mmap(vma);
> > > > > @@ -2925,7 +2925,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > > > >                               vma->vm_start = address;
> > > > >                               vma->vm_pgoff -= grow;
> > > > >                               /* Overwrite old entry in mtree. */
> > > > > -                             vma_iter_store(&vmi, vma);
> > > > > +                             vma_iter_store_attached(&vmi, vma);
> > > > >                               anon_vma_interval_tree_post_update_vma(vma);
> > > > >
> > > > >                               perf_event_mmap(vma);
> > > > > diff --git a/mm/vma.h b/mm/vma.h
> > > > > index 2a2668de8d2c..63dd38d5230c 100644
> > > > > --- a/mm/vma.h
> > > > > +++ b/mm/vma.h
> > > > > @@ -365,9 +365,10 @@ static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
> > > > >  }
> > > > >
> > > > >  /* Store a VMA with preallocated memory */
> > > > > -static inline void vma_iter_store(struct vma_iterator *vmi,
> > > > > -                               struct vm_area_struct *vma)
> > > > > +static inline void vma_iter_store_attached(struct vma_iterator *vmi,
> > > > > +                                        struct vm_area_struct *vma)
> > > > >  {
> > > > > +     vma_assert_attached(vma);
> > > > >
> > > > >  #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > > > >       if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start &&
> > > > > @@ -390,7 +391,13 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
> > > > >
> > > > >       __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > > > >       mas_store_prealloc(&vmi->mas, vma);
> > > > > +}
> > > > > +
> > > > > +static inline void vma_iter_store(struct vma_iterator *vmi,
> > > > > +                               struct vm_area_struct *vma)
> > > > > +{
> > > > >       vma_mark_attached(vma);
> > > > > +     vma_iter_store_attached(vmi, vma);
> > > > >  }
> > > > >
> > > >
> > > > See comment at top, and we need some comments here to explain why we're
> > > > going to pains to do this.
> > >
> > > Ack. I'll amend the patch description to make that clear.
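
e.g. a block comment above the two helpers, roughly along these lines -
wording mine, purely a sketch:

/*
 * vma_iter_store() is for vmas newly added to the tree: it marks the vma
 * attached before storing it.  Use vma_iter_store_attached() when
 * overwriting the range of a vma that is already in the tree, so that we
 * never re-attach an attached vma - which would corrupt the refcount once
 * vma->detached is folded into vm_refcnt.
 */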
> > >
> > > >
> > > > What about mm/nommu.c? I guess these cases are always new VMAs.
> > >
> > > CONFIG_PER_VMA_LOCK depends on !CONFIG_NOMMU, so for nommu case all
> > > these attach/detach functions become NOPs.
> > >
> > > >
> > > > We probably definitely need to check this series in a nommu setup, have you
> > > > done this? As I can see this breaking things. Then again I suppose you'd
> > > > have expected bots to moan by now...
> > > >
> > > > >  static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
> > > > > --
> > > > > 2.47.1.613.gc27f4b7a9f-goog
> > > > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 13/17] mm/debug: print vm_refcnt state when dumping the vma
  2025-01-13 17:57       ` Suren Baghdasaryan
@ 2025-01-14 11:41         ` Lorenzo Stoakes
  0 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-14 11:41 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, akpm, peterz, willy, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 09:57:54AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 8:35 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250113 11:21]:
> > > On Fri, Jan 10, 2025 at 08:26:00PM -0800, Suren Baghdasaryan wrote:
> > > > vm_refcnt encodes a number of useful states:
> > > > - whether vma is attached or detached
> > > > - the number of current vma readers
> > > > - presence of a vma writer
> > > > Let's include it in the vma dump.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > > ---
> > > >  mm/debug.c | 12 ++++++++++++
> > > >  1 file changed, 12 insertions(+)
> > > >
> > > > diff --git a/mm/debug.c b/mm/debug.c
> > > > index 8d2acf432385..325d7bf22038 100644
> > > > --- a/mm/debug.c
> > > > +++ b/mm/debug.c
> > > > @@ -178,6 +178,17 @@ EXPORT_SYMBOL(dump_page);
> > > >
> > > >  void dump_vma(const struct vm_area_struct *vma)
> > > >  {
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +   pr_emerg("vma %px start %px end %px mm %px\n"
> > > > +           "prot %lx anon_vma %px vm_ops %px\n"
> > > > +           "pgoff %lx file %px private_data %px\n"
> > > > +           "flags: %#lx(%pGv) refcnt %x\n",
> > > > +           vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> > > > +           (unsigned long)pgprot_val(vma->vm_page_prot),
> > > > +           vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> > > > +           vma->vm_file, vma->vm_private_data,
> > > > +           vma->vm_flags, &vma->vm_flags, refcount_read(&vma->vm_refcnt));
> > > > +#else
> > > >     pr_emerg("vma %px start %px end %px mm %px\n"
> > > >             "prot %lx anon_vma %px vm_ops %px\n"
> > > >             "pgoff %lx file %px private_data %px\n"
> > > > @@ -187,6 +198,7 @@ void dump_vma(const struct vm_area_struct *vma)
> > > >             vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> > > >             vma->vm_file, vma->vm_private_data,
> > > >             vma->vm_flags, &vma->vm_flags);
> > > > +#endif
> > > >  }
> > >
> > > This is pretty horribly duplicative and not in line with how this kind of
> > > thing is done in the rest of the file. You're just adding one entry, so why
> > > not:
> > >
> > > void dump_vma(const struct vm_area_struct *vma)
> > > {
> > >       pr_emerg("vma %px start %px end %px mm %px\n"
> > >               "prot %lx anon_vma %px vm_ops %px\n"
> > >               "pgoff %lx file %px private_data %px\n"
> > > #ifdef CONFIG_PER_VMA_LOCK
> > >               "refcnt %x\n"
> > > #endif
> > >               "flags: %#lx(%pGv)\n",
> > >               vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
> > >               (unsigned long)pgprot_val(vma->vm_page_prot),
> > >               vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
> > >               vma->vm_file, vma->vm_private_data,
> > > #ifdef CONFIG_PER_VMA_LOCK
> > >               refcount_read(&vma->vm_refcnt),
> > > #endif
> > >               vma->vm_flags, &vma->vm_flags);
> > > }
> >
> > right, I had an issue with this as well.
> >
> > Another option would be:
> >
> >         pr_emerg("vma %px start %px end %px mm %px\n"
> >                 "prot %lx anon_vma %px vm_ops %px\n"
> >                 "pgoff %lx file %px private_data %px\n",
> >                 <big mess here>);
> >         dump_vma_refcnt();
> >         pr_emerg("flags:...", vma_vm_flags);
> >
> >
> > Then dump_vma_refcnt() either dumps the refcnt or does nothing,
> > depending on the config option.
> >
> > Either way is good with me.  Lorenzo's suggestion is in line with the
> > file and it's clear as to why the refcnt might be missing, but I don't
> > really see this being an issue in practice.
>
> Thanks for clarifying! Lorenzo's suggestion LGTM too. I'll adopt it. Thanks!
>

Cheers guys!
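
For completeness, Liam's alternative would be roughly the below -
dump_vma_refcnt() is just the hypothetical helper name from his
description, not anything in the series:

#ifdef CONFIG_PER_VMA_LOCK
/* Dump the refcount when per-vma locks are built in... */
static void dump_vma_refcnt(const struct vm_area_struct *vma)
{
        pr_emerg("refcnt %x\n", refcount_read(&vma->vm_refcnt));
}
#else
/* ...and compile to nothing otherwise. */
static void dump_vma_refcnt(const struct vm_area_struct *vma) { }
#endif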

> >
> > Thanks,
> > Liam
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 14/17] mm: remove extra vma_numab_state_init() call
  2025-01-13 17:56     ` Suren Baghdasaryan
@ 2025-01-14 11:45       ` Lorenzo Stoakes
  0 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-14 11:45 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 09:56:17AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 8:28 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:26:01PM -0800, Suren Baghdasaryan wrote:
> > > vma_init() already memset's the whole vm_area_struct to 0, so there is
> > > no need for an additional vma_numab_state_init().
> >
> > Hm strangely random change :) I'm guessing this was a pre-existing thing?
>
> Yeah, I stumbled on it while working on an earlier version of this
> patchset which involved ctor usage.
>
> >
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > I mean this looks fine, so fair enough just feels a bit incongruous with
> > series. But regardless:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > > ---
> > >  include/linux/mm.h | 1 -
> > >  1 file changed, 1 deletion(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index a99b11ee1f66..c8da64b114d1 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -948,7 +948,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > >       vma->vm_mm = mm;
> > >       vma->vm_ops = &vma_dummy_vm_ops;
> > >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> > > -     vma_numab_state_init(vma);
> > >       vma_lock_init(vma, false);
> > >  }
> >
> > This leaves one other caller in vm_area_dup() (I _hate_ that this lives in
> > the fork code... - might very well look at churning some VMA stuff over
> > from there to an appropriate place).
> >
> > While we're here, I mean this thing seems a bit out of scope for the series
> > but if we're doing it, can we just remove vma_numab_state_init() and
> > instead edit vm_area_init_from() to #ifdef ... this like the other fields
> > now?
> >
> > It's not exactly urgent though as this stuff in the fork code is a bit of a
> > mess anyway...
>
> Yeah, let's keep the cleanup out for now. The series is already quite
> big. I included this one-line cleanup since it was uncontroversial and
> simple.

Yeah it's fine not a big deal. We can address this trivia later.

>
> >
> > >
> > > --
> > > 2.47.1.613.gc27f4b7a9f-goog
> > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 07/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
  2025-01-13 17:53     ` Suren Baghdasaryan
@ 2025-01-14 11:48       ` Lorenzo Stoakes
  0 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-14 11:48 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Mon, Jan 13, 2025 at 09:53:01AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 7:25 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:25:54PM -0800, Suren Baghdasaryan wrote:
> > > With upcoming replacement of vm_lock with vm_refcnt, we need to handle a
> > > possibility of vma_start_read_locked/vma_start_read_locked_nested failing
> > > due to refcount overflow. Prepare for such possibility by changing these
> > > APIs and adjusting their users.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > > ---
> > >  include/linux/mm.h |  6 ++++--
> > >  mm/userfaultfd.c   | 18 +++++++++++++-----
> > >  2 files changed, 17 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 2f805f1a0176..cbb4e3dbbaed 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -747,10 +747,11 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > >   * not be used in such cases because it might fail due to mm_lock_seq overflow.
> > >   * This functionality is used to obtain vma read lock and drop the mmap read lock.
> > >   */
> > > -static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> > > +static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> > >  {
> > >       mmap_assert_locked(vma->vm_mm);
> > >       down_read_nested(&vma->vm_lock.lock, subclass);
> > > +     return true;
> > >  }
> > >
> > >  /*
> > > @@ -759,10 +760,11 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
> > >   * not be used in such cases because it might fail due to mm_lock_seq overflow.
> > >   * This functionality is used to obtain vma read lock and drop the mmap read lock.
> > >   */
> > > -static inline void vma_start_read_locked(struct vm_area_struct *vma)
> > > +static inline bool vma_start_read_locked(struct vm_area_struct *vma)
> > >  {
> > >       mmap_assert_locked(vma->vm_mm);
> > >       down_read(&vma->vm_lock.lock);
> > > +     return true;
> > >  }
> > >
> > >  static inline void vma_end_read(struct vm_area_struct *vma)
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index 4527c385935b..411a663932c4 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -85,7 +85,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
> > >       mmap_read_lock(mm);
> > >       vma = find_vma_and_prepare_anon(mm, address);
> > >       if (!IS_ERR(vma))
> > > -             vma_start_read_locked(vma);
> > > +             if (!vma_start_read_locked(vma))
> > > +                     vma = ERR_PTR(-EAGAIN);
> >
> > Nit but this kind of reads a bit weirdly now:
> >
> >         if (!IS_ERR(vma))
> >                 if (!vma_start_read_locked(vma))
> >                         vma = ERR_PTR(-EAGAIN);
> >
> > Wouldn't this be nicer as:
> >
> >         if (!IS_ERR(vma) && !vma_start_read_locked(vma))
> >                 vma = ERR_PTR(-EAGAIN);
> >
> > On the other hand, this embeds an action in an expression, but then it sort of
> > still looks weird.
> >
> >         if (!IS_ERR(vma)) {
> >                 bool ok = vma_start_read_locked(vma);
> >
> >                 if (!ok)
> >                         vma = ERR_PTR(-EAGAIN);
> >         }
> >
> > This makes me wonder, now yes, we are truly bikeshedding, sorry, but maybe we
> > could just have vma_start_read_locked return a VMA pointer that could be an
> > error?
> >
> > Then this becomes:
> >
> >         if (!IS_ERR(vma))
> >                 vma = vma_start_read_locked(vma);
>
> No, I think it would be wrong for vma_start_read_locked() to always
> return EAGAIN when it can't lock the vma. The error code here is
> context-dependent, so while EAGAIN is the right thing here, it might
> not work for other future users.

Ack, makes sense.

But it'd be nice to clean this up so it isn't this arrow-shaped-code
thing. I mean obviously this is subjective and sorry to bikeshed this late
in a series... but :)

Are you OK with:

	if (!IS_ERR(vma)) {
		bool ok = vma_start_read_locked(vma);

		if (!ok)
			vma = ERR_PTR(-EAGAIN);
	}

?

I think this reads better.

Sorry to be a pain! :)

>
> >
> > >
> > >       mmap_read_unlock(mm);
> > >       return vma;
> > > @@ -1483,10 +1484,17 @@ static int uffd_move_lock(struct mm_struct *mm,
> > >       mmap_read_lock(mm);
> > >       err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
> > >       if (!err) {
> > > -             vma_start_read_locked(*dst_vmap);
> > > -             if (*dst_vmap != *src_vmap)
> > > -                     vma_start_read_locked_nested(*src_vmap,
> > > -                                             SINGLE_DEPTH_NESTING);
> > > +             if (vma_start_read_locked(*dst_vmap)) {
> > > +                     if (*dst_vmap != *src_vmap) {
> > > +                             if (!vma_start_read_locked_nested(*src_vmap,
> > > +                                                     SINGLE_DEPTH_NESTING)) {
> > > +                                     vma_end_read(*dst_vmap);
> >
> > Hmm, why do we end read if the lock failed here but not above?
>
> We have successfully done vma_start_read_locked(dst_vmap) (we locked
> dest vma) but we failed to do vma_start_read_locked_nested(src_vmap)
> (we could not lock src vma). So we should undo the dest vma locking.
> Does that clarify the logic?

Ahh right makes sense. Maybe a quick cheeky comment to that effect here too?
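
Something like this is all I mean, illustrative only:

	if (!vma_start_read_locked_nested(*src_vmap,
					  SINGLE_DEPTH_NESTING)) {
		/* Locking src failed; undo the dst lock taken above. */
		vma_end_read(*dst_vmap);
		err = -EAGAIN;
	}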

>
> >
> > > +                                     err = -EAGAIN;
> > > +                             }
> > > +                     }
> > > +             } else {
> > > +                     err = -EAGAIN;
> > > +             }
> > >       }
> >
> > This whole block is really ugly now, this really needs refactoring.
> >
> > How about (on assumption the vma_end_read() is correct):
> >
> >
> >         err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
> >         if (err)
> >                 goto out;
> >
> >         if (!vma_start_read_locked(*dst_vmap)) {
> >                 err = -EAGAIN;
> >                 goto out;
> >         }
> >
> >         /* Nothing further to do. */
> >         if (*dst_vmap == *src_vmap)
> >                 goto out;
> >
> >         if (!vma_start_read_locked_nested(*src_vmap,
> >                                 SINGLE_DEPTH_NESTING)) {
> >                 vma_end_read(*dst_vmap);
> >                 err = -EAGAIN;
> >         }
> >
> > out:
> >         mmap_read_unlock(mm);
> >         return err;
> > }
>
> Ok, that looks good to me. Will change this way.
> Thanks!
>

Thanks!

> >
> > >       mmap_read_unlock(mm);
> > >       return err;
> > > --
> > > 2.47.1.613.gc27f4b7a9f-goog
> > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-14  4:09       ` Andrew Morton
  2025-01-14  9:09         ` Vlastimil Babka
  2025-01-14  9:47         ` Lorenzo Stoakes
@ 2025-01-14 14:59         ` Liam R. Howlett
  2025-01-14 15:54           ` Suren Baghdasaryan
  2 siblings, 1 reply; 140+ messages in thread
From: Liam R. Howlett @ 2025-01-14 14:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Lorenzo Stoakes, peterz, willy,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

* Andrew Morton <akpm@linux-foundation.org> [250113 23:09]:
> On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> 
> > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > >
> > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > >
> > > If the series is considered super desirable and if people are confident
> > > that we can address any remaining glitches during two months of -rc
> > > then sure, we could push the envelope a bit.  But I don't believe this
> > > is the case so I'm thinking let's give ourselves another cycle to get
> > > this all sorted out?
> > 
> > I didn't think this series was in panic mode with one real issue that
> > is not hard to address (memory ordering in
> > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > be missing the big picture. In any case, if it makes people nervous I
> > have no objections to your plan.
> 
> Well, I'm soliciting opinions here.  What do others think?
> 
> And do you see much urgency with these changes?
> 

I think it's in good shape, but more time for this change is probably
the right approach.

I don't think it's had enough testing time with the changes since v7.
The series has had significant changes, with the side effect of
invalidating some of the test time.

I really like what it does, but if Suren doesn't need it upstream for
some reason then I'd say we leave it to soak longer.

If he does need it upstream, we can deal with any fallout and fixes - it
will have minimum long term effects as it's not an LTS.

Thanks,
Liam

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-14 14:59         ` Liam R. Howlett
@ 2025-01-14 15:54           ` Suren Baghdasaryan
  2025-01-15 11:34             ` Lorenzo Stoakes
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-14 15:54 UTC (permalink / raw)
  To: Liam R. Howlett, Andrew Morton, Suren Baghdasaryan,
	Lorenzo Stoakes, peterz, willy, david.laight.linux, mhocko,
	vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 6:59 AM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> [250113 23:09]:
> > On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > >
> > > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > > >
> > > > If the series is considered super desirable and if people are confident
> > > > that we can address any remaining glitches during two months of -rc
> > > > then sure, we could push the envelope a bit.  But I don't believe this
> > > > is the case so I'm thinking let's give ourselves another cycle to get
> > > > this all sorted out?
> > >
> > > I didn't think this series was in panic mode with one real issue that
> > > is not hard to address (memory ordering in
> > > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > > be missing the big picture. In any case, if it makes people nervous I
> > > have no objections to your plan.
> >
> > Well, I'm soliciting opinions here.  What do others think?
> >
> > And do you see much urgency with these changes?
> >
>
> I think it's in good shape, but more time for this change is probably
> the right approach.
>
> I don't think it's had enough testing time with the changes since v7.
> The series has had significant changes, with the side effect of
> invalidating some of the test time.
>
> I really like what it does, but if Suren doesn't need it upstream for
> some reason then I'd say we leave it to soak longer.
>
> If he does need it upstream, we can deal with any fallout and fixes - it
> will have minimum long term effects as it's not an LTS.

Thanks for voicing your opinion, folks! There is no real urgency and
no objections from me to wait until the next cycle.
I'll be posting v10 shortly purely for reviews while this is fresh on
people's mind, and with the understanding that it won't be picked up
by Andrew.
Thanks,
Suren.

>
> Thanks,
> Liam
>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-11  4:26 ` [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2025-01-15  2:27   ` Wei Yang
  2025-01-15  3:15     ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Wei Yang @ 2025-01-15  2:27 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:

>diff --git a/kernel/fork.c b/kernel/fork.c
>index 9d9275783cf8..151b40627c14 100644
>--- a/kernel/fork.c
>+++ b/kernel/fork.c
>@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> 	return vma;
> }
> 
>+static void vm_area_init_from(const struct vm_area_struct *src,
>+			      struct vm_area_struct *dest)
>+{
>+	dest->vm_mm = src->vm_mm;
>+	dest->vm_ops = src->vm_ops;
>+	dest->vm_start = src->vm_start;
>+	dest->vm_end = src->vm_end;
>+	dest->anon_vma = src->anon_vma;
>+	dest->vm_pgoff = src->vm_pgoff;
>+	dest->vm_file = src->vm_file;
>+	dest->vm_private_data = src->vm_private_data;
>+	vm_flags_init(dest, src->vm_flags);
>+	memcpy(&dest->vm_page_prot, &src->vm_page_prot,
>+	       sizeof(dest->vm_page_prot));
>+	/*
>+	 * src->shared.rb may be modified concurrently when called from
>+	 * dup_mmap(), but the clone will reinitialize it.
>+	 */
>+	data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
>+	memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
>+	       sizeof(dest->vm_userfaultfd_ctx));
>+#ifdef CONFIG_ANON_VMA_NAME
>+	dest->anon_name = src->anon_name;
>+#endif
>+#ifdef CONFIG_SWAP
>+	memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
>+	       sizeof(dest->swap_readahead_info));
>+#endif
>+#ifndef CONFIG_MMU
>+	dest->vm_region = src->vm_region;
>+#endif
>+#ifdef CONFIG_NUMA
>+	dest->vm_policy = src->vm_policy;
>+#endif
>+}

Would this be difficult to maintain? We should make sure not to miss or overwrite
anything.

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11  4:25 ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
                     ` (3 preceding siblings ...)
  2025-01-13  9:36   ` Wei Yang
@ 2025-01-15  2:58   ` Wei Yang
  2025-01-15  3:12     ` Suren Baghdasaryan
  4 siblings, 1 reply; 140+ messages in thread
From: Wei Yang @ 2025-01-15  2:58 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> 	struct vm_area_struct *vma;
> 
> 	rcu_read_lock();
>-retry:
> 	vma = mas_walk(&mas);
> 	if (!vma)
> 		goto inval;
>@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> 	if (!vma_start_read(vma))
> 		goto inval;
> 
>-	/* Check if the VMA got isolated after we found it */
>-	if (is_vma_detached(vma)) {
>-		vma_end_read(vma);
>-		count_vm_vma_lock_event(VMA_LOCK_MISS);
>-		/* The area was replaced with another one */
>-		goto retry;
>-	}

We have a little behavior change here.

Originally, if we found a detached vma, we would retry. But now, we would go to
the slow path directly. 

Maybe we can compare the VMA_LOCK_MISS and VMA_LOCK_ABORT events
to see what percentage of cases this covers. If it turns out to be too rare
to impact performance, we can ignore it.

Also, the VMA_LOCK_MISS event is no longer recorded, but its definition is
still there. We could record it in vma_start_read() when oldcnt is 0.

BTW, the name VMA_LOCK_SUCCESS confuses me a little. I thought it indicated
that lock_vma_under_rcu() successfully got a valid vma, but it seems not. It
sounds like we don't have an overall success/failure statistic in vmstat.

> 	/*
> 	 * At this point, we have a stable reference to a VMA: The VMA is
> 	 * locked and we know it hasn't already been isolated.

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-15  2:58   ` Wei Yang
@ 2025-01-15  3:12     ` Suren Baghdasaryan
  2025-01-15 12:05       ` Wei Yang
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15  3:12 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       struct vm_area_struct *vma;
> >
> >       rcu_read_lock();
> >-retry:
> >       vma = mas_walk(&mas);
> >       if (!vma)
> >               goto inval;
> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       if (!vma_start_read(vma))
> >               goto inval;
> >
> >-      /* Check if the VMA got isolated after we found it */
> >-      if (is_vma_detached(vma)) {
> >-              vma_end_read(vma);
> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
> >-              /* The area was replaced with another one */
> >-              goto retry;
> >-      }
>
> We have a little behavior change here.
>
> Originally, if we found an detached vma, we may retry. But now, we would go to
> the slow path directly.

Hmm. Good point. I think the easiest way to keep the same
functionality is to make vma_start_read() return vm_area_struct* on
success, NULL on locking failure and EAGAIN if vma was detached
(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
the case of EAGAIN.
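
For concreteness, a rough caller-side sketch of that idea against the
hunk quoted above; the ERR_PTR(-EAGAIN)/IS_ERR() convention is only an
assumption here, not code that has been posted:

        rcu_read_lock();
retry:
        vma = mas_walk(&mas);
        if (!vma)
                goto inval;

        /* Hypothetical three-way return from vma_start_read(). */
        vma = vma_start_read(vma);
        if (!vma)
                goto inval;     /* lock attempt failed, fall back to mmap_lock */
        if (IS_ERR(vma)) {      /* ERR_PTR(-EAGAIN): the vma was detached */
                count_vm_vma_lock_event(VMA_LOCK_MISS);
                /* The area was replaced with another one */
                goto retry;
        }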

>
> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
> to see the percentage of this case. If it shows this is a too rare
> case to impact performance, we can ignore it.
>
> Also the event VMA_LOCK_MISS recording is removed, but the definition is
> there. We may record it in the vma_start_read() when oldcnt is 0.
>
> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
> don't have an overall success/failure statistic in vmstat.

Are you referring to the fact that we do not increment
VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
page fault (in which we increment VMA_LOCK_RETRY instead)?

>
> >       /*
> >        * At this point, we have a stable reference to a VMA: The VMA is
> >        * locked and we know it hasn't already been isolated.
>
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  2:27   ` Wei Yang
@ 2025-01-15  3:15     ` Suren Baghdasaryan
  2025-01-15  3:58       ` Liam R. Howlett
                         ` (3 more replies)
  0 siblings, 4 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15  3:15 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
>
> >diff --git a/kernel/fork.c b/kernel/fork.c
> >index 9d9275783cf8..151b40627c14 100644
> >--- a/kernel/fork.c
> >+++ b/kernel/fork.c
> >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >       return vma;
> > }
> >
> >+static void vm_area_init_from(const struct vm_area_struct *src,
> >+                            struct vm_area_struct *dest)
> >+{
> >+      dest->vm_mm = src->vm_mm;
> >+      dest->vm_ops = src->vm_ops;
> >+      dest->vm_start = src->vm_start;
> >+      dest->vm_end = src->vm_end;
> >+      dest->anon_vma = src->anon_vma;
> >+      dest->vm_pgoff = src->vm_pgoff;
> >+      dest->vm_file = src->vm_file;
> >+      dest->vm_private_data = src->vm_private_data;
> >+      vm_flags_init(dest, src->vm_flags);
> >+      memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> >+             sizeof(dest->vm_page_prot));
> >+      /*
> >+       * src->shared.rb may be modified concurrently when called from
> >+       * dup_mmap(), but the clone will reinitialize it.
> >+       */
> >+      data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> >+      memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> >+             sizeof(dest->vm_userfaultfd_ctx));
> >+#ifdef CONFIG_ANON_VMA_NAME
> >+      dest->anon_name = src->anon_name;
> >+#endif
> >+#ifdef CONFIG_SWAP
> >+      memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> >+             sizeof(dest->swap_readahead_info));
> >+#endif
> >+#ifndef CONFIG_MMU
> >+      dest->vm_region = src->vm_region;
> >+#endif
> >+#ifdef CONFIG_NUMA
> >+      dest->vm_policy = src->vm_policy;
> >+#endif
> >+}
>
> Would this be difficult to maintain? We should make sure not miss or overwrite
> anything.

Yeah, it is less maintainable than a simple memcpy() but I did not
find a better alternative. I added a warning above the struct
vm_area_struct definition to update this function every time we change
that structure. Not sure if there is anything else I can do to help
with this.

>
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  3:15     ` Suren Baghdasaryan
@ 2025-01-15  3:58       ` Liam R. Howlett
  2025-01-15  5:41         ` Suren Baghdasaryan
  2025-01-15  3:59       ` Mateusz Guzik
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 140+ messages in thread
From: Liam R. Howlett @ 2025-01-15  3:58 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Wei Yang, akpm, peterz, willy, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

* Suren Baghdasaryan <surenb@google.com> [250114 22:15]:
> On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
> >
> > >diff --git a/kernel/fork.c b/kernel/fork.c
> > >index 9d9275783cf8..151b40627c14 100644
> > >--- a/kernel/fork.c
> > >+++ b/kernel/fork.c
> > >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > >       return vma;
> > > }
> > >
> > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > >+                            struct vm_area_struct *dest)
> > >+{
> > >+      dest->vm_mm = src->vm_mm;
> > >+      dest->vm_ops = src->vm_ops;
> > >+      dest->vm_start = src->vm_start;
> > >+      dest->vm_end = src->vm_end;
> > >+      dest->anon_vma = src->anon_vma;
> > >+      dest->vm_pgoff = src->vm_pgoff;
> > >+      dest->vm_file = src->vm_file;
> > >+      dest->vm_private_data = src->vm_private_data;
> > >+      vm_flags_init(dest, src->vm_flags);
> > >+      memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > >+             sizeof(dest->vm_page_prot));
> > >+      /*
> > >+       * src->shared.rb may be modified concurrently when called from
> > >+       * dup_mmap(), but the clone will reinitialize it.
> > >+       */
> > >+      data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > >+      memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > >+             sizeof(dest->vm_userfaultfd_ctx));
> > >+#ifdef CONFIG_ANON_VMA_NAME
> > >+      dest->anon_name = src->anon_name;
> > >+#endif
> > >+#ifdef CONFIG_SWAP
> > >+      memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > >+             sizeof(dest->swap_readahead_info));
> > >+#endif
> > >+#ifndef CONFIG_MMU
> > >+      dest->vm_region = src->vm_region;
> > >+#endif
> > >+#ifdef CONFIG_NUMA
> > >+      dest->vm_policy = src->vm_policy;
> > >+#endif
> > >+}
> >
> > Would this be difficult to maintain? We should make sure not miss or overwrite
> > anything.
> 
> Yeah, it is less maintainable than a simple memcpy() but I did not
> find a better alternative. I added a warning above the struct
> vm_area_struct definition to update this function every time we change
> that structure. Not sure if there is anything else I can do to help
> with this.

Here's a horrible idea... if we put the ref count at the end or start of
the struct, we could set the ref count to zero and copy the other area
in one memcpy().
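
Roughly, for illustration only: this assumes vm_refcnt were made the
very last member of vm_area_struct, which is not how the series
currently lays the struct out, so treat it as a sketch rather than a
drop-in replacement.

static void vm_area_init_from(const struct vm_area_struct *src,
                              struct vm_area_struct *dest)
{
        /* Copy everything that precedes the trailing refcount... */
        data_race(memcpy(dest, src, offsetof(struct vm_area_struct, vm_refcnt)));
        /* ...and reinitialize the refcount itself as "detached". */
        refcount_set(&dest->vm_refcnt, 0);
}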

Even worse idea, we could use an offsetof()-like macro to get the position
of the ref count in the vma struct, set the ref count to zero and
carefully copy the other two parts in two memcpy() operations.

Feel free to disregard these ideas as it is late here and I'm having
fun thinking up bad ways to make this "more" maintainable.

Either of these would make updating the struct easier, but very painful
to debug when it goes wrong (or when reading the function).

Thanks,
Liam

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  3:15     ` Suren Baghdasaryan
  2025-01-15  3:58       ` Liam R. Howlett
@ 2025-01-15  3:59       ` Mateusz Guzik
  2025-01-15  5:47         ` Suren Baghdasaryan
  2025-01-15  7:58       ` Vlastimil Babka
  2025-01-15 12:17       ` Wei Yang
  3 siblings, 1 reply; 140+ messages in thread
From: Mateusz Guzik @ 2025-01-15  3:59 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Wei Yang, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 4:15 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
> >
> > >diff --git a/kernel/fork.c b/kernel/fork.c
> > >index 9d9275783cf8..151b40627c14 100644
> > >--- a/kernel/fork.c
> > >+++ b/kernel/fork.c
> > >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > >       return vma;
> > > }
> > >
> > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > >+                            struct vm_area_struct *dest)
> > >+{
[snip]
> > Would this be difficult to maintain? We should make sure not miss or overwrite
> > anything.
>
> Yeah, it is less maintainable than a simple memcpy() but I did not
> find a better alternative. I added a warning above the struct
> vm_area_struct definition to update this function every time we change
> that structure. Not sure if there is anything else I can do to help
> with this.
>

At a bare minimum this could have a BUILD_BUG_ON below the function for
the known-covered size. But it would have to be conditional on the arch
and some config macros, which is somewhat nasty.
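
For illustration, a file-scope static_assert() below the function would
be the usual way to express that; the 192-byte figure matches the x86-64
layout quoted elsewhere in this thread, and the arch condition is only
an assumption about where that layout holds:

#ifdef CONFIG_X86_64
static_assert(sizeof(struct vm_area_struct) == 192,
              "vm_area_struct layout changed: audit vm_area_init_from()");
#endif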

KASAN or KMSAN (I don't remember which) can be used to find missing
copies. To that end the target struct could be marked as fully
uninitialized before the copy and have a full read performed from it
afterwards -- guaranteed to trip over any field not explicitly covered
(including padding, though). I don't know what magic macros can be used
to do this in Linux; I am just saying the support to get this result is
there. I understand most people don't use this, but it still should be
enough to trip over buggy patches in -next.

Finally, the struct could have macros delimiting copy/non-copy
sections (with macros expanding to field names), for illustrative
purposes:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 332cee285662..25063a3970c8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -677,6 +677,7 @@ struct vma_numab_state {
  * getting a stable reference.
  */
 struct vm_area_struct {
+#define vma_start_copy0 vm_rcu
        /* The first cache line has the info for VMA tree walking. */

        union {
@@ -731,6 +732,7 @@ struct vm_area_struct {
        /* Unstable RCU readers are allowed to read this. */
        struct vma_lock *vm_lock;
 #endif
+#define vma_end_copy1 vm_lock

        /*
         * For areas with an address space and backing store,

then you would do the whole copy with a series of calls over the
sections delimited by those markers

however, the __randomize_layout annotation whacks that idea (maybe it
can be retired?)

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  3:58       ` Liam R. Howlett
@ 2025-01-15  5:41         ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15  5:41 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, Wei Yang, akpm, peterz,
	willy, lorenzo.stoakes, david.laight.linux, mhocko, vbabka,
	hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	corbet, linux-doc, linux-mm, linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 7:58 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250114 22:15]:
> > On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> > >
> > > On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
> > >
> > > >diff --git a/kernel/fork.c b/kernel/fork.c
> > > >index 9d9275783cf8..151b40627c14 100644
> > > >--- a/kernel/fork.c
> > > >+++ b/kernel/fork.c
> > > >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > >       return vma;
> > > > }
> > > >
> > > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > > >+                            struct vm_area_struct *dest)
> > > >+{
> > > >+      dest->vm_mm = src->vm_mm;
> > > >+      dest->vm_ops = src->vm_ops;
> > > >+      dest->vm_start = src->vm_start;
> > > >+      dest->vm_end = src->vm_end;
> > > >+      dest->anon_vma = src->anon_vma;
> > > >+      dest->vm_pgoff = src->vm_pgoff;
> > > >+      dest->vm_file = src->vm_file;
> > > >+      dest->vm_private_data = src->vm_private_data;
> > > >+      vm_flags_init(dest, src->vm_flags);
> > > >+      memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > > >+             sizeof(dest->vm_page_prot));
> > > >+      /*
> > > >+       * src->shared.rb may be modified concurrently when called from
> > > >+       * dup_mmap(), but the clone will reinitialize it.
> > > >+       */
> > > >+      data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > > >+      memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > > >+             sizeof(dest->vm_userfaultfd_ctx));
> > > >+#ifdef CONFIG_ANON_VMA_NAME
> > > >+      dest->anon_name = src->anon_name;
> > > >+#endif
> > > >+#ifdef CONFIG_SWAP
> > > >+      memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > > >+             sizeof(dest->swap_readahead_info));
> > > >+#endif
> > > >+#ifndef CONFIG_MMU
> > > >+      dest->vm_region = src->vm_region;
> > > >+#endif
> > > >+#ifdef CONFIG_NUMA
> > > >+      dest->vm_policy = src->vm_policy;
> > > >+#endif
> > > >+}
> > >
> > > Would this be difficult to maintain? We should make sure not miss or overwrite
> > > anything.
> >
> > Yeah, it is less maintainable than a simple memcpy() but I did not
> > find a better alternative. I added a warning above the struct
> > vm_area_struct definition to update this function every time we change
> > that structure. Not sure if there is anything else I can do to help
> > with this.
>
> Here's a horrible idea.. if we put the ref count at the end or start of
> the struct, we could set the ref count to zero and copy the other area
> in one mmecpy().
>
> Even worse idea, we could use a pointer_of like macro to get the position
> of the ref count in the vma struct, set the ref count to zero and
> carefully copy the other two parts in two memcpy() operations.

I implemented this approach in v3 of this patchset here:
https://lore.kernel.org/all/20241117080931.600731-5-surenb@google.com/
like this:

#define VMA_BEFORE_LOCK offsetof(struct vm_area_struct, vm_lock)
#define VMA_LOCK_END(vma) \
        (((void *)(vma)) + offsetofend(struct vm_area_struct, vm_lock))
#define VMA_AFTER_LOCK \
        (sizeof(struct vm_area_struct) - offsetofend(struct vm_area_struct, vm_lock))

static inline void vma_copy(struct vm_area_struct *new, struct vm_area_struct *orig)
{
        /* Preserve vma->vm_lock */
        data_race(memcpy(new, orig, VMA_BEFORE_LOCK));
        data_race(memcpy(VMA_LOCK_END(new), VMA_LOCK_END(orig), VMA_AFTER_LOCK));
}

If this looks more maintainable I can revive it.
Maybe introduce a more generic function to copy any structure
excluding a specific field and use it like this:

copy_struct_except(new, orig, struct vm_area_struct, vm_refcnt);

Would that be better?
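
A minimal sketch of what such a generic helper could look like, reusing
the offsetof()/offsetofend() trick from the v3 code above (just an
illustration, nothing like this exists in the tree today):

/*
 * Copy @orig into @new while leaving @new's @field untouched.
 * Relies on GNU void-pointer arithmetic, like the v3 helpers above.
 */
#define copy_struct_except(new, orig, type, field)                         \
        do {                                                               \
                data_race(memcpy((new), (orig), offsetof(type, field)));   \
                data_race(memcpy((void *)(new) + offsetofend(type, field), \
                                 (void *)(orig) + offsetofend(type, field),\
                                 sizeof(type) - offsetofend(type, field)));\
        } while (0)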

>
> Feel free to disregard these ideas as it is late here and I'm having
> fun thinking up bad ways to make this "more" maintainable.
>
> Either of these would make updating the struct easier, but very painful
> to debug when it goes wrong (or reading the function).
>
> Thanks,
> Liam

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  3:59       ` Mateusz Guzik
@ 2025-01-15  5:47         ` Suren Baghdasaryan
  2025-01-15  5:51           ` Mateusz Guzik
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15  5:47 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Wei Yang, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 8:00 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 4:15 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> > >
> > > On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
> > >
> > > >diff --git a/kernel/fork.c b/kernel/fork.c
> > > >index 9d9275783cf8..151b40627c14 100644
> > > >--- a/kernel/fork.c
> > > >+++ b/kernel/fork.c
> > > >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > >       return vma;
> > > > }
> > > >
> > > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > > >+                            struct vm_area_struct *dest)
> > > >+{
> [snip]
> > > Would this be difficult to maintain? We should make sure not miss or overwrite
> > > anything.
> >
> > Yeah, it is less maintainable than a simple memcpy() but I did not
> > find a better alternative. I added a warning above the struct
> > vm_area_struct definition to update this function every time we change
> > that structure. Not sure if there is anything else I can do to help
> > with this.
> >
>
> Bare minimum this could have a BUILD_BUG_ON in below the func for the
> known-covered size. But it would have to be conditional on arch and
> some macros, somewhat nasty.
>
> KASAN or KMSAN (I don't remember which) can be used to find missing
> copies. To that end the target struct could be marked as fully
> uninitialized before copy and have a full read performed from it
> afterwards -- guaranteed to trip over any field which any field not
> explicitly covered (including padding though). I don't know what magic
> macros can be used to do in Linux, I am saying the support to get this
> result is there. I understand most people don't use this, but this
> still should be enough to trip over buggy patches in -next.

If my previous suggestion does not fly I'll start digging into KASAN
to see how we can use it. Thanks for the tip.

>
> Finally, the struct could have macros delimiting copy/non-copy
> sections (with macros expanding to field names), for illustrative
> purposes:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 332cee285662..25063a3970c8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -677,6 +677,7 @@ struct vma_numab_state {
>   * getting a stable reference.
>   */
>  struct vm_area_struct {
> +#define vma_start_copy0 vm_rcu
>         /* The first cache line has the info for VMA tree walking. */
>
>         union {
> @@ -731,6 +732,7 @@ struct vm_area_struct {
>         /* Unstable RCU readers are allowed to read this. */
>         struct vma_lock *vm_lock;
>  #endif
> +#define vma_end_copy1 vm_lock
>
>         /*
>          * For areas with an address space and backing store,
>
> then you would do everything with a series of calls

I'm not sure... I think my proposed approach with offsetof() is a bit
cleaner than adding macros to denote copy sections. WDYT?

>
> however, the __randomize_layout annotation whacks that idea (maybe it
> can be retired?)
>
> --
> Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  5:47         ` Suren Baghdasaryan
@ 2025-01-15  5:51           ` Mateusz Guzik
  2025-01-15  6:41             ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Mateusz Guzik @ 2025-01-15  5:51 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Wei Yang, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 6:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 14, 2025 at 8:00 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Wed, Jan 15, 2025 at 4:15 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> > > >
> > > > On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
> > > >
> > > > >diff --git a/kernel/fork.c b/kernel/fork.c
> > > > >index 9d9275783cf8..151b40627c14 100644
> > > > >--- a/kernel/fork.c
> > > > >+++ b/kernel/fork.c
> > > > >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > > >       return vma;
> > > > > }
> > > > >
> > > > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > > > >+                            struct vm_area_struct *dest)
> > > > >+{
> > [snip]
> > > > Would this be difficult to maintain? We should make sure not miss or overwrite
> > > > anything.
> > >
> > > Yeah, it is less maintainable than a simple memcpy() but I did not
> > > find a better alternative. I added a warning above the struct
> > > vm_area_struct definition to update this function every time we change
> > > that structure. Not sure if there is anything else I can do to help
> > > with this.
> > >
> >
> > Bare minimum this could have a BUILD_BUG_ON in below the func for the
> > known-covered size. But it would have to be conditional on arch and
> > some macros, somewhat nasty.
> >
> > KASAN or KMSAN (I don't remember which) can be used to find missing
> > copies. To that end the target struct could be marked as fully
> > uninitialized before copy and have a full read performed from it
> > afterwards -- guaranteed to trip over any field which any field not
> > explicitly covered (including padding though). I don't know what magic
> > macros can be used to do in Linux, I am saying the support to get this
> > result is there. I understand most people don't use this, but this
> > still should be enough to trip over buggy patches in -next.
>
> If my previous suggestion does not fly I'll start digging into KASAN
> to see how we can use it. Thanks for the tip.
>
> >
> > Finally, the struct could have macros delimiting copy/non-copy
> > sections (with macros expanding to field names), for illustrative
> > purposes:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 332cee285662..25063a3970c8 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -677,6 +677,7 @@ struct vma_numab_state {
> >   * getting a stable reference.
> >   */
> >  struct vm_area_struct {
> > +#define vma_start_copy0 vm_rcu
> >         /* The first cache line has the info for VMA tree walking. */
> >
> >         union {
> > @@ -731,6 +732,7 @@ struct vm_area_struct {
> >         /* Unstable RCU readers are allowed to read this. */
> >         struct vma_lock *vm_lock;
> >  #endif
> > +#define vma_end_copy1 vm_lock
> >
> >         /*
> >          * For areas with an address space and backing store,
> >
> > then you would do everything with a series of calls
>
> I'm not sure... My proposed approach with offsetof() I think is a bit
> cleaner than adding macros to denote copy sections. WDYT?
>

another non-copy field may show up down the road and then the person
adding it is going to be a sad panda. That won't happen if the "infra"
is there.

But I concede this is not a big deal and I'm not going to bikeshed about it.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  5:51           ` Mateusz Guzik
@ 2025-01-15  6:41             ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15  6:41 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Wei Yang, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 9:52 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 6:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jan 14, 2025 at 8:00 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Wed, Jan 15, 2025 at 4:15 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
> > > > >
> > > > > >diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > >index 9d9275783cf8..151b40627c14 100644
> > > > > >--- a/kernel/fork.c
> > > > > >+++ b/kernel/fork.c
> > > > > >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > > > > >       return vma;
> > > > > > }
> > > > > >
> > > > > >+static void vm_area_init_from(const struct vm_area_struct *src,
> > > > > >+                            struct vm_area_struct *dest)
> > > > > >+{
> > > [snip]
> > > > > Would this be difficult to maintain? We should make sure not miss or overwrite
> > > > > anything.
> > > >
> > > > Yeah, it is less maintainable than a simple memcpy() but I did not
> > > > find a better alternative. I added a warning above the struct
> > > > vm_area_struct definition to update this function every time we change
> > > > that structure. Not sure if there is anything else I can do to help
> > > > with this.
> > > >
> > >
> > > Bare minimum this could have a BUILD_BUG_ON in below the func for the
> > > known-covered size. But it would have to be conditional on arch and
> > > some macros, somewhat nasty.
> > >
> > > KASAN or KMSAN (I don't remember which) can be used to find missing
> > > copies. To that end the target struct could be marked as fully
> > > uninitialized before copy and have a full read performed from it
> > > afterwards -- guaranteed to trip over any field which any field not
> > > explicitly covered (including padding though). I don't know what magic
> > > macros can be used to do in Linux, I am saying the support to get this
> > > result is there. I understand most people don't use this, but this
> > > still should be enough to trip over buggy patches in -next.
> >
> > If my previous suggestion does not fly I'll start digging into KASAN
> > to see how we can use it. Thanks for the tip.
> >
> > >
> > > Finally, the struct could have macros delimiting copy/non-copy
> > > sections (with macros expanding to field names), for illustrative
> > > purposes:
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 332cee285662..25063a3970c8 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -677,6 +677,7 @@ struct vma_numab_state {
> > >   * getting a stable reference.
> > >   */
> > >  struct vm_area_struct {
> > > +#define vma_start_copy0 vm_rcu
> > >         /* The first cache line has the info for VMA tree walking. */
> > >
> > >         union {
> > > @@ -731,6 +732,7 @@ struct vm_area_struct {
> > >         /* Unstable RCU readers are allowed to read this. */
> > >         struct vma_lock *vm_lock;
> > >  #endif
> > > +#define vma_end_copy1 vm_lock
> > >
> > >         /*
> > >          * For areas with an address space and backing store,
> > >
> > > then you would do everything with a series of calls
> >
> > I'm not sure... My proposed approach with offsetof() I think is a bit
> > cleaner than adding macros to denote copy sections. WDYT?
> >
>
> another non-copy field may show up down the road and then the person
> adding it is going to be a sad panda. wont happen if the "infra" is
> there.
>
> but I concede this is not a big deal and i'm not going to bikeshed about it.

Yeah, I can't think of a perfect solution. I think we should pick a
sane one and if requirements change we can change the implementation
of vm_area_init_from() accordingly.

>
> --
> Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  3:15     ` Suren Baghdasaryan
  2025-01-15  3:58       ` Liam R. Howlett
  2025-01-15  3:59       ` Mateusz Guzik
@ 2025-01-15  7:58       ` Vlastimil Babka
  2025-01-15 15:10         ` Suren Baghdasaryan
  2025-01-15 12:17       ` Wei Yang
  3 siblings, 1 reply; 140+ messages in thread
From: Vlastimil Babka @ 2025-01-15  7:58 UTC (permalink / raw)
  To: Suren Baghdasaryan, Wei Yang, willy
  Cc: akpm, peterz, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
	oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On 1/15/25 04:15, Suren Baghdasaryan wrote:
> On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
>>
>> >diff --git a/kernel/fork.c b/kernel/fork.c
>> >index 9d9275783cf8..151b40627c14 100644
>> >--- a/kernel/fork.c
>> >+++ b/kernel/fork.c
>> >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>> >       return vma;
>> > }
>> >
>> >+static void vm_area_init_from(const struct vm_area_struct *src,
>> >+                            struct vm_area_struct *dest)
>> >+{
>> >+      dest->vm_mm = src->vm_mm;
>> >+      dest->vm_ops = src->vm_ops;
>> >+      dest->vm_start = src->vm_start;
>> >+      dest->vm_end = src->vm_end;
>> >+      dest->anon_vma = src->anon_vma;
>> >+      dest->vm_pgoff = src->vm_pgoff;
>> >+      dest->vm_file = src->vm_file;
>> >+      dest->vm_private_data = src->vm_private_data;
>> >+      vm_flags_init(dest, src->vm_flags);
>> >+      memcpy(&dest->vm_page_prot, &src->vm_page_prot,
>> >+             sizeof(dest->vm_page_prot));
>> >+      /*
>> >+       * src->shared.rb may be modified concurrently when called from
>> >+       * dup_mmap(), but the clone will reinitialize it.
>> >+       */
>> >+      data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
>> >+      memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
>> >+             sizeof(dest->vm_userfaultfd_ctx));
>> >+#ifdef CONFIG_ANON_VMA_NAME
>> >+      dest->anon_name = src->anon_name;
>> >+#endif
>> >+#ifdef CONFIG_SWAP
>> >+      memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
>> >+             sizeof(dest->swap_readahead_info));
>> >+#endif
>> >+#ifndef CONFIG_MMU
>> >+      dest->vm_region = src->vm_region;
>> >+#endif
>> >+#ifdef CONFIG_NUMA
>> >+      dest->vm_policy = src->vm_policy;
>> >+#endif
>> >+}
>>
>> Would this be difficult to maintain? We should make sure not miss or overwrite
>> anything.
> 
> Yeah, it is less maintainable than a simple memcpy() but I did not
> find a better alternative.

Willy knows one but refuses to share it :(

> I added a warning above the struct
> vm_area_struct definition to update this function every time we change
> that structure. Not sure if there is anything else I can do to help
> with this.
> 
>>
>> --
>> Wei Yang
>> Help you, Help me


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-11 17:11         ` Suren Baghdasaryan
  2025-01-11 23:44           ` Hillf Danton
@ 2025-01-15  9:39           ` Peter Zijlstra
  2025-01-16 10:52             ` Hillf Danton
  1 sibling, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2025-01-15  9:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Hillf Danton, akpm, willy, hannes, linux-mm, linux-kernel,
	kernel-team

On Sat, Jan 11, 2025 at 09:11:52AM -0800, Suren Baghdasaryan wrote:
> On Sat, Jan 11, 2025 at 4:13 AM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Sat, 11 Jan 2025 01:59:41 -0800 Suren Baghdasaryan <surenb@google.com>
> > > On Fri, Jan 10, 2025 at 10:32 PM Hillf Danton <hdanton@sina.com> wrote:
> > > > On Fri, 10 Jan 2025 20:25:57 -0800 Suren Baghdasaryan <surenb@google.com>
> > > > > -bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > > > +bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
> > > > > +                                  int limit)
> > > > >  {
> > > > >       int old = refcount_read(r);
> > > > >
> > > > >       do {
> > > > >               if (!old)
> > > > >                       break;
> > > > > +
> > > > > +             if (statically_true(limit == INT_MAX))
> > > > > +                     continue;
> > > > > +
> > > > > +             if (i > limit - old) {
> > > > > +                     if (oldp)
> > > > > +                             *oldp = old;
> > > > > +                     return false;
> > > > > +             }
> > > > >       } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > >
> > > > The acquire version should be used, see atomic_long_try_cmpxchg_acquire()
> > > > in kernel/locking/rwsem.c.
> > >
> > > This is how __refcount_add_not_zero() is already implemented and I'm
> > > only adding support for a limit. If you think it's implemented wrong
> > > then IMHO it should be fixed separately.
> > >
> > Two different things - refcount has nothing to do with locking at the
> > first place, while what you are adding to the mm directory is something
> > that replaces rwsem, so from the locking POV you have to mark the
> > boundaries of the locking section.
> 
> I see your point. I think it's a strong argument to use atomic
> directly instead of refcount for this locking. I'll try that and see
> how it looks. Thanks for the feedback!

Sigh; don't let hillf confuse you. *IF* you need an acquire it will be
in the part where you wait for readers to go away. But even there, think
about what you're serializing against. Readers don't typically modify
things.

And modifications are fully serialized by mmap_sem^H^H^Hlock.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-11 20:14     ` Suren Baghdasaryan
                         ` (3 preceding siblings ...)
  2025-01-13  1:47       ` Wei Yang
@ 2025-01-15 10:48       ` Peter Zijlstra
  2025-01-15 11:13         ` Peter Zijlstra
  4 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2025-01-15 10:48 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Mateusz Guzik, akpm, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:

> > Replacing down_read_trylock() with the new routine loses an acquire
> > fence. That alone is not a problem, but see below.
> 
> Hmm. I think this acquire fence is actually necessary. We don't want
> the later vm_lock_seq check to be reordered and happen before we take
> the refcount. Otherwise this might happen:
> 
> reader             writer
> if (vm_lock_seq == mm_lock_seq) // check got reordered
>         return false;
>                        vm_refcnt += VMA_LOCK_OFFSET
>                        vm_lock_seq == mm_lock_seq
>                        vm_refcnt -= VMA_LOCK_OFFSET
> if (!__refcount_inc_not_zero_limited())
>         return false;
> 
> Both reader's checks will pass and the reader would read-lock a vma
> that was write-locked.

Hmm, you're right. That acquire does matter here.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 12/17] mm: move lesser used vma_area_struct members into the last cacheline
  2025-01-11  4:25 ` [PATCH v9 12/17] mm: move lesser used vma_area_struct members into the last cacheline Suren Baghdasaryan
  2025-01-13 16:15   ` Lorenzo Stoakes
@ 2025-01-15 10:50   ` Peter Zijlstra
  2025-01-15 16:39     ` Suren Baghdasaryan
  1 sibling, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2025-01-15 10:50 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Fri, Jan 10, 2025 at 08:25:59PM -0800, Suren Baghdasaryan wrote:
> Move several vma_area_struct members which are rarely or never used
> during page fault handling into the last cacheline to better pack
> vm_area_struct. As a result vm_area_struct will fit into 3 as opposed
> to 4 cachelines. New typical vm_area_struct layout:
> 
> struct vm_area_struct {
>     union {
>         struct {
>             long unsigned int vm_start;              /*     0     8 */
>             long unsigned int vm_end;                /*     8     8 */
>         };                                           /*     0    16 */
>         freeptr_t          vm_freeptr;               /*     0     8 */
>     };                                               /*     0    16 */
>     struct mm_struct *         vm_mm;                /*    16     8 */
>     pgprot_t                   vm_page_prot;         /*    24     8 */
>     union {
>         const vm_flags_t   vm_flags;                 /*    32     8 */
>         vm_flags_t         __vm_flags;               /*    32     8 */
>     };                                               /*    32     8 */
>     unsigned int               vm_lock_seq;          /*    40     4 */

Does it not make sense to move this seq field near the refcnt?

>     /* XXX 4 bytes hole, try to pack */
> 
>     struct list_head           anon_vma_chain;       /*    48    16 */
>     /* --- cacheline 1 boundary (64 bytes) --- */
>     struct anon_vma *          anon_vma;             /*    64     8 */
>     const struct vm_operations_struct  * vm_ops;     /*    72     8 */
>     long unsigned int          vm_pgoff;             /*    80     8 */
>     struct file *              vm_file;              /*    88     8 */
>     void *                     vm_private_data;      /*    96     8 */
>     atomic_long_t              swap_readahead_info;  /*   104     8 */
>     struct mempolicy *         vm_policy;            /*   112     8 */
>     struct vma_numab_state *   numab_state;          /*   120     8 */
>     /* --- cacheline 2 boundary (128 bytes) --- */
>     refcount_t          vm_refcnt (__aligned__(64)); /*   128     4 */
> 
>     /* XXX 4 bytes hole, try to pack */
> 
>     struct {
>         struct rb_node     rb (__aligned__(8));      /*   136    24 */
>         long unsigned int  rb_subtree_last;          /*   160     8 */
>     } __attribute__((__aligned__(8))) shared;        /*   136    32 */
>     struct anon_vma_name *     anon_name;            /*   168     8 */
>     struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */
> 
>     /* size: 192, cachelines: 3, members: 18 */
>     /* sum members: 176, holes: 2, sum holes: 8 */
>     /* padding: 8 */
>     /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
> } __attribute__((__aligned__(64)));



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-15 10:48       ` Peter Zijlstra
@ 2025-01-15 11:13         ` Peter Zijlstra
  2025-01-15 15:00           ` Suren Baghdasaryan
  2025-01-15 16:00           ` [PATCH] refcount: Strengthen inc_not_zero() Peter Zijlstra
  0 siblings, 2 replies; 140+ messages in thread
From: Peter Zijlstra @ 2025-01-15 11:13 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Mateusz Guzik, akpm, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 11:48:41AM +0100, Peter Zijlstra wrote:
> On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> 
> > > Replacing down_read_trylock() with the new routine loses an acquire
> > > fence. That alone is not a problem, but see below.
> > 
> > Hmm. I think this acquire fence is actually necessary. We don't want
> > the later vm_lock_seq check to be reordered and happen before we take
> > the refcount. Otherwise this might happen:
> > 
> > reader             writer
> > if (vm_lock_seq == mm_lock_seq) // check got reordered
> >         return false;
> >                        vm_refcnt += VMA_LOCK_OFFSET
> >                        vm_lock_seq == mm_lock_seq
> >                        vm_refcnt -= VMA_LOCK_OFFSET
> > if (!__refcount_inc_not_zero_limited())
> >         return false;
> > 
> > Both reader's checks will pass and the reader would read-lock a vma
> > that was write-locked.
> 
> Hmm, you're right. That acquire does matter here.

Notably, it means refcount_t is entirely unsuitable for anything
SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
conditions after the refcount succeeds.

And this is probably fine, but let me ponder this all a little more.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-14 15:54           ` Suren Baghdasaryan
@ 2025-01-15 11:34             ` Lorenzo Stoakes
  2025-01-15 15:14               ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Stoakes @ 2025-01-15 11:34 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Andrew Morton, peterz, willy, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 07:54:48AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 14, 2025 at 6:59 AM 'Liam R. Howlett' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > * Andrew Morton <akpm@linux-foundation.org> [250113 23:09]:
> > > On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > >
> > > > >
> > > > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > > > >
> > > > > If the series is considered super desirable and if people are confident
> > > > > that we can address any remaining glitches during two months of -rc
> > > > > then sure, we could push the envelope a bit.  But I don't believe this
> > > > > is the case so I'm thinking let's give ourselves another cycle to get
> > > > > this all sorted out?
> > > >
> > > > I didn't think this series was in panic mode with one real issue that
> > > > is not hard to address (memory ordering in
> > > > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > > > be missing the big picture. In any case, if it makes people nervous I
> > > > have no objections to your plan.
> > >
> > > Well, I'm soliciting opinions here.  What do others think?
> > >
> > > And do you see much urgency with these changes?
> > >
> >
> > I think it's in good shape, but more time for this change is probably
> > the right approach.
> >
> > I don't think it's had enough testing time with the changes since v7.
> > The series has had significant changes, with the side effect of
> > invalidating some of the test time.
> >
> > I really like what it does, but if Suren doesn't need it upstream for
> > some reason then I'd say we leave it to soak longer.
> >
> > If he does need it upstream, we can deal with any fallout and fixes - it
> > will have minimum long term effects as it's not an LTS.
>
> Thanks for voicing your opinion, folks! There is no real urgency and
> no objections from me to wait until the next cycle.
> I'll be posting v10 shortly purely for reviews while this is fresh on
> people's mind, and with the understanding that it won't be picked up
> by Andrew.
> Thanks,
> Suren.

(From my side :) Thanks, and this is definitely no reflection on quality;
your responsiveness has been amazing. It is just a reflection of the
complexity of this change.

>
> >
> > Thanks,
> > Liam
> >
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-15  3:12     ` Suren Baghdasaryan
@ 2025-01-15 12:05       ` Wei Yang
  2025-01-15 15:01         ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Wei Yang @ 2025-01-15 12:05 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Wei Yang, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 07:12:20PM -0800, Suren Baghdasaryan wrote:
>On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>> >       struct vm_area_struct *vma;
>> >
>> >       rcu_read_lock();
>> >-retry:
>> >       vma = mas_walk(&mas);
>> >       if (!vma)
>> >               goto inval;
>> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>> >       if (!vma_start_read(vma))
>> >               goto inval;
>> >
>> >-      /* Check if the VMA got isolated after we found it */
>> >-      if (is_vma_detached(vma)) {
>> >-              vma_end_read(vma);
>> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
>> >-              /* The area was replaced with another one */
>> >-              goto retry;
>> >-      }
>>
>> We have a little behavior change here.
>>
>> Originally, if we found an detached vma, we may retry. But now, we would go to
>> the slow path directly.
>
>Hmm. Good point. I think the easiest way to keep the same
>functionality is to make vma_start_read() return vm_area_struct* on
>success, NULL on locking failure and EAGAIN if vma was detached
>(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
>the case of EAGAIN.
>

Looks good to me.

>>
>> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
>> to see the percentage of this case. If it shows this is a too rare
>> case to impact performance, we can ignore it.
>>
>> Also the event VMA_LOCK_MISS recording is removed, but the definition is
>> there. We may record it in the vma_start_read() when oldcnt is 0.
>>
>> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
>> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
>> don't have an overall success/failure statistic in vmstat.
>
>Are you referring to the fact that we do not increment
>VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the

Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.

>page fault (in which we increment VMA_LOCK_RETRY instead)?
>

I don't follow this.

>>
>> >       /*
>> >        * At this point, we have a stable reference to a VMA: The VMA is
>> >        * locked and we know it hasn't already been isolated.
>>
>> --
>> Wei Yang
>> Help you, Help me

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  3:15     ` Suren Baghdasaryan
                         ` (2 preceding siblings ...)
  2025-01-15  7:58       ` Vlastimil Babka
@ 2025-01-15 12:17       ` Wei Yang
  2025-01-15 21:46         ` Suren Baghdasaryan
  3 siblings, 1 reply; 140+ messages in thread
From: Wei Yang @ 2025-01-15 12:17 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Wei Yang, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 07:15:05PM -0800, Suren Baghdasaryan wrote:
>On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
>>
>> >diff --git a/kernel/fork.c b/kernel/fork.c
>> >index 9d9275783cf8..151b40627c14 100644
>> >--- a/kernel/fork.c
>> >+++ b/kernel/fork.c
>> >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>> >       return vma;
>> > }
>> >
>> >+static void vm_area_init_from(const struct vm_area_struct *src,
>> >+                            struct vm_area_struct *dest)
>> >+{
>> >+      dest->vm_mm = src->vm_mm;
>> >+      dest->vm_ops = src->vm_ops;
>> >+      dest->vm_start = src->vm_start;
>> >+      dest->vm_end = src->vm_end;
>> >+      dest->anon_vma = src->anon_vma;
>> >+      dest->vm_pgoff = src->vm_pgoff;
>> >+      dest->vm_file = src->vm_file;
>> >+      dest->vm_private_data = src->vm_private_data;
>> >+      vm_flags_init(dest, src->vm_flags);
>> >+      memcpy(&dest->vm_page_prot, &src->vm_page_prot,
>> >+             sizeof(dest->vm_page_prot));
>> >+      /*
>> >+       * src->shared.rb may be modified concurrently when called from
>> >+       * dup_mmap(), but the clone will reinitialize it.
>> >+       */
>> >+      data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
>> >+      memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
>> >+             sizeof(dest->vm_userfaultfd_ctx));
>> >+#ifdef CONFIG_ANON_VMA_NAME
>> >+      dest->anon_name = src->anon_name;
>> >+#endif
>> >+#ifdef CONFIG_SWAP
>> >+      memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
>> >+             sizeof(dest->swap_readahead_info));
>> >+#endif
>> >+#ifndef CONFIG_MMU
>> >+      dest->vm_region = src->vm_region;
>> >+#endif
>> >+#ifdef CONFIG_NUMA
>> >+      dest->vm_policy = src->vm_policy;
>> >+#endif
>> >+}
>>
>> Would this be difficult to maintain? We should make sure not miss or overwrite
>> anything.
>
>Yeah, it is less maintainable than a simple memcpy() but I did not
>find a better alternative. I added a warning above the struct
>vm_area_struct definition to update this function every time we change
>that structure. Not sure if there is anything else I can do to help
>with this.
>

For !PER_VMA_LOCK, maybe we can use memcpy() as usual.

For PER_VMA_LOCK, I just came up with the same idea as you :-)

>>
>> --
>> Wei Yang
>> Help you, Help me

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-15 11:13         ` Peter Zijlstra
@ 2025-01-15 15:00           ` Suren Baghdasaryan
  2025-01-15 15:35             ` Peter Zijlstra
  2025-01-15 16:00           ` [PATCH] refcount: Strengthen inc_not_zero() Peter Zijlstra
  1 sibling, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mateusz Guzik, akpm, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 3:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 15, 2025 at 11:48:41AM +0100, Peter Zijlstra wrote:
> > On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> >
> > > > Replacing down_read_trylock() with the new routine loses an acquire
> > > > fence. That alone is not a problem, but see below.
> > >
> > > Hmm. I think this acquire fence is actually necessary. We don't want
> > > the later vm_lock_seq check to be reordered and happen before we take
> > > the refcount. Otherwise this might happen:
> > >
> > > reader             writer
> > > if (vm_lock_seq == mm_lock_seq) // check got reordered
> > >         return false;
> > >                        vm_refcnt += VMA_LOCK_OFFSET
> > >                        vm_lock_seq == mm_lock_seq
> > >                        vm_refcnt -= VMA_LOCK_OFFSET
> > > if (!__refcount_inc_not_zero_limited())
> > >         return false;
> > >
> > > Both reader's checks will pass and the reader would read-lock a vma
> > > that was write-locked.
> >
> > Hmm, you're right. That acquire does matter here.
>
> Notably, it means refcount_t is entirely unsuitable for anything
> SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> conditions after the refcount succeeds.

Thanks for reviewing, Peter!
Yes, I'm changing the code to use atomic_t instead of refcount_t and
it comes out quite nicely, I think. I had to add two small helper
functions:
vm_refcount_inc() - similar to refcount_add_not_zero() but with an
acquire fence.
vm_refcnt_sub() - similar to refcount_sub_and_test(). I could use
atomic_sub_and_test() but that would add an unnecessary acquire fence
in the pagefault path, so I'm using refcount_sub_and_test() logic
instead.
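
For illustration, a rough sketch of what such helpers could look like on
top of an atomic_t vm_refcnt; the detached/write-locked checks and the
VMA_LOCK_OFFSET handling below are assumptions based on this discussion,
not the code that will actually be posted:

/* Acquire on success so the later vm_lock_seq check cannot be
 * reordered before the increment. */
static inline bool vm_refcount_inc(struct vm_area_struct *vma)
{
        int old = atomic_read(&vma->vm_refcnt);

        do {
                /* 0 means detached, VMA_LOCK_OFFSET means write-locked. */
                if (!old || old >= VMA_LOCK_OFFSET)
                        return false;
        } while (!atomic_try_cmpxchg_acquire(&vma->vm_refcnt, &old, old + 1));

        return true;
}

/* Mirrors refcount_sub_and_test(): release on the decrement, acquire
 * only when we are the one dropping the count to zero. */
static inline bool vm_refcnt_sub(struct vm_area_struct *vma, int nr)
{
        int old = atomic_fetch_sub_release(nr, &vma->vm_refcnt);

        if (old == nr) {
                smp_acquire__after_ctrl_dep();
                return true;
        }
        return false;
}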

For SLAB_TYPESAFE_BY_RCU I think we are ok with the
__vma_enter_locked()/__vma_exit_locked() transition in
vma_mark_detached() before freeing the vma, and we would not need
secondary validation. In __vma_enter_locked(), vm_refcnt gets
VMA_LOCK_OFFSET set, which prevents readers from taking the refcount.
In __vma_exit_locked(), vm_refcnt transitions to 0, so again that
prevents readers from taking the refcount. IOW, the readers won't get
to the secondary validation and will fail early on
__refcount_inc_not_zero_limited(). I think this transition correctly
serves the purpose of waiting for current temporary readers to exit
and preventing new readers from read-locking and using the vma.

>
> And this is probably fine, but let me ponder this all a little more.

Thanks for taking the time!

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-15 12:05       ` Wei Yang
@ 2025-01-15 15:01         ` Suren Baghdasaryan
  2025-01-16  1:37           ` Wei Yang
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15 15:01 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 4:05 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, Jan 14, 2025 at 07:12:20PM -0800, Suren Baghdasaryan wrote:
> >On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >> >       struct vm_area_struct *vma;
> >> >
> >> >       rcu_read_lock();
> >> >-retry:
> >> >       vma = mas_walk(&mas);
> >> >       if (!vma)
> >> >               goto inval;
> >> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >> >       if (!vma_start_read(vma))
> >> >               goto inval;
> >> >
> >> >-      /* Check if the VMA got isolated after we found it */
> >> >-      if (is_vma_detached(vma)) {
> >> >-              vma_end_read(vma);
> >> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
> >> >-              /* The area was replaced with another one */
> >> >-              goto retry;
> >> >-      }
> >>
> >> We have a little behavior change here.
> >>
> >> Originally, if we found an detached vma, we may retry. But now, we would go to
> >> the slow path directly.
> >
> >Hmm. Good point. I think the easiest way to keep the same
> >functionality is to make vma_start_read() return vm_area_struct* on
> >success, NULL on locking failure and EAGAIN if vma was detached
> >(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
> >the case of EAGAIN.
> >
>
> Looks good to me.
>
> >>
> >> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
> >> to see the percentage of this case. If it shows this is a too rare
> >> case to impact performance, we can ignore it.
> >>
> >> Also the event VMA_LOCK_MISS recording is removed, but the definition is
> >> there. We may record it in the vma_start_read() when oldcnt is 0.
> >>
> >> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
> >> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
> >> don't have an overall success/failure statistic in vmstat.
> >
> >Are you referring to the fact that we do not increment
> >VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
>
> Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.
>
> >page fault (in which we increment VMA_LOCK_RETRY instead)?
> >
>
> I don't follow this.

Sorry, I meant to say "in which case we increment VMA_LOCK_RETRY
instead". IOW, when we successfully lock the vma but have to retry the
pagefault, we increment VMA_LOCK_RETRY without incrementing
VMA_LOCK_SUCCESS.

>
> >>
> >> >       /*
> >> >        * At this point, we have a stable reference to a VMA: The VMA is
> >> >        * locked and we know it hasn't already been isolated.
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
>
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15  7:58       ` Vlastimil Babka
@ 2025-01-15 15:10         ` Suren Baghdasaryan
  2025-02-13 22:56           ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15 15:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Wei Yang, willy, akpm, peterz, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, hannes, mjguzik, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Tue, Jan 14, 2025 at 11:58 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/25 04:15, Suren Baghdasaryan wrote:
> > On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
> >>
> >> >diff --git a/kernel/fork.c b/kernel/fork.c
> >> >index 9d9275783cf8..151b40627c14 100644
> >> >--- a/kernel/fork.c
> >> >+++ b/kernel/fork.c
> >> >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >> >       return vma;
> >> > }
> >> >
> >> >+static void vm_area_init_from(const struct vm_area_struct *src,
> >> >+                            struct vm_area_struct *dest)
> >> >+{
> >> >+      dest->vm_mm = src->vm_mm;
> >> >+      dest->vm_ops = src->vm_ops;
> >> >+      dest->vm_start = src->vm_start;
> >> >+      dest->vm_end = src->vm_end;
> >> >+      dest->anon_vma = src->anon_vma;
> >> >+      dest->vm_pgoff = src->vm_pgoff;
> >> >+      dest->vm_file = src->vm_file;
> >> >+      dest->vm_private_data = src->vm_private_data;
> >> >+      vm_flags_init(dest, src->vm_flags);
> >> >+      memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> >> >+             sizeof(dest->vm_page_prot));
> >> >+      /*
> >> >+       * src->shared.rb may be modified concurrently when called from
> >> >+       * dup_mmap(), but the clone will reinitialize it.
> >> >+       */
> >> >+      data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> >> >+      memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> >> >+             sizeof(dest->vm_userfaultfd_ctx));
> >> >+#ifdef CONFIG_ANON_VMA_NAME
> >> >+      dest->anon_name = src->anon_name;
> >> >+#endif
> >> >+#ifdef CONFIG_SWAP
> >> >+      memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> >> >+             sizeof(dest->swap_readahead_info));
> >> >+#endif
> >> >+#ifndef CONFIG_MMU
> >> >+      dest->vm_region = src->vm_region;
> >> >+#endif
> >> >+#ifdef CONFIG_NUMA
> >> >+      dest->vm_policy = src->vm_policy;
> >> >+#endif
> >> >+}
> >>
> >> Would this be difficult to maintain? We should make sure not miss or overwrite
> >> anything.
> >
> > Yeah, it is less maintainable than a simple memcpy() but I did not
> > find a better alternative.
>
> Willy knows one but refuses to share it :(

Ah, that reminds me why I dropped this approach :) But to be honest,
back then we also had vma_clear() and that added to the ugliness. Now
I could simply do this without all those macros:

static inline void vma_copy(struct vm_area_struct *new,
			    struct vm_area_struct *orig)
{
        /* Copy the vma while preserving vma->vm_lock */
        data_race(memcpy(new, orig, offsetof(struct vm_area_struct, vm_lock)));
        data_race(memcpy((char *)new + offsetofend(struct vm_area_struct, vm_lock),
                 (char *)orig + offsetofend(struct vm_area_struct, vm_lock),
                 sizeof(struct vm_area_struct) -
                 offsetofend(struct vm_area_struct, vm_lock)));
}

Would that be better than the current approach?

>
> > I added a warning above the struct
> > vm_area_struct definition to update this function every time we change
> > that structure. Not sure if there is anything else I can do to help
> > with this.
> >
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-15 11:34             ` Lorenzo Stoakes
@ 2025-01-15 15:14               ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15 15:14 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Liam R. Howlett, Andrew Morton, peterz, willy, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 3:34 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Jan 14, 2025 at 07:54:48AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 14, 2025 at 6:59 AM 'Liam R. Howlett' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > * Andrew Morton <akpm@linux-foundation.org> [250113 23:09]:
> > > > On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > > >
> > > > > >
> > > > > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > > > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > > > > >
> > > > > > If the series is considered super desirable and if people are confident
> > > > > > that we can address any remaining glitches during two months of -rc
> > > > > > then sure, we could push the envelope a bit.  But I don't believe this
> > > > > > is the case so I'm thinking let's give ourselves another cycle to get
> > > > > > this all sorted out?
> > > > >
> > > > > I didn't think this series was in panic mode with one real issue that
> > > > > is not hard to address (memory ordering in
> > > > > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > > > > be missing the big picture. In any case, if it makes people nervous I
> > > > > have no objections to your plan.
> > > >
> > > > Well, I'm soliciting opinions here.  What do others think?
> > > >
> > > > And do you see much urgency with these changes?
> > > >
> > >
> > > I think it's in good shape, but more time for this change is probably
> > > the right approach.
> > >
> > > I don't think it's had enough testing time with the changes since v7.
> > > The series has had significant changes, with the side effect of
> > > invalidating some of the test time.
> > >
> > > I really like what it does, but if Suren doesn't need it upstream for
> > > some reason then I'd say we leave it to soak longer.
> > >
> > > If he does need it upstream, we can deal with any fallout and fixes - it
> > > will have minimum long term effects as it's not an LTS.
> >
> > Thanks for voicing your opinion, folks! There is no real urgency and
> > no objections from me to wait until the next cycle.
> > I'll be posting v10 shortly purely for reviews while this is fresh on
> > people's mind, and with the understanding that it won't be picked up
> > by Andrew.
> > Thanks,
> > Suren.
>
> (From my side :) Thanks, and definitely no reflection on quality and your
> responsiveness has been amazing, just a reflection of the complexity of
> this change.

No worries, I understand and accept the reasoning.
And thanks for sugar-coating the pill; it made it easier to swallow :)

>
> >
> > >
> > > Thanks,
> > > Liam
> > >
> > >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-15 15:00           ` Suren Baghdasaryan
@ 2025-01-15 15:35             ` Peter Zijlstra
  2025-01-15 15:38               ` Peter Zijlstra
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2025-01-15 15:35 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Mateusz Guzik, akpm, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 07:00:37AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 15, 2025 at 3:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Jan 15, 2025 at 11:48:41AM +0100, Peter Zijlstra wrote:
> > > On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> > >
> > > > > Replacing down_read_trylock() with the new routine loses an acquire
> > > > > fence. That alone is not a problem, but see below.
> > > >
> > > > Hmm. I think this acquire fence is actually necessary. We don't want
> > > > the later vm_lock_seq check to be reordered and happen before we take
> > > > the refcount. Otherwise this might happen:
> > > >
> > > > reader             writer
> > > > if (vm_lock_seq == mm_lock_seq) // check got reordered
> > > >         return false;
> > > >                        vm_refcnt += VMA_LOCK_OFFSET
> > > >                        vm_lock_seq == mm_lock_seq
> > > >                        vm_refcnt -= VMA_LOCK_OFFSET
> > > > if (!__refcount_inc_not_zero_limited())
> > > >         return false;
> > > >
> > > > Both reader's checks will pass and the reader would read-lock a vma
> > > > that was write-locked.
> > >
> > > Hmm, you're right. That acquire does matter here.
> >
> > Notably, it means refcount_t is entirely unsuitable for anything
> > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > conditions after the refcount succeeds.
> 
> Thanks for reviewing, Peter!
> Yes, I'm changing the code to use atomic_t instead of refcount_t and
> it comes out quite nicely I think. I had to add two small helper
> functions:
> vm_refcount_inc() - similar to refcount_add_not_zero() but with an
> acquired fence.
> vm_refcnt_sub() - similar to refcount_sub_and_test(). I could use
> atomic_sub_and_test() but that would add unnecessary acquire fence in
> the pagefault path, so I'm using refcount_sub_and_test() logic
> instead.

Right.

> For SLAB_TYPESAFE_BY_RCU I think we are ok with the
> __vma_enter_locked()/__vma_exit_locked() transition in the
> vma_mark_detached() before freeing the vma and would not need
> secondary validation. In __vma_enter_locked(), vm_refcount gets
> VMA_LOCK_OFFSET set, which prevents readers from taking the refcount.
> In __vma_exit_locked() vm_refcnt transitions to 0, so again that
> prevents readers from taking the refcount. IOW, the readers won't get
> to the secondary validation and will fail early on
> __refcount_inc_not_zero_limited(). I think this transition correctly
> serves the purpose of waiting for current temporary readers to exit
> and preventing new readers from read-locking and using the vma.

Consider:

    CPU0				CPU1

    rcu_read_lock();
    vma = vma_lookup(mm, vaddr);

    ... cpu goes sleep for a *long time* ...

    					__vma_exit_locked();
					vma_area_free()
					..
					vma = vma_area_alloc();
					vma_mark_attached();

    ... comes back once vma is re-used ...

    vma_start_read()
      vm_refcount_inc(); // success!!

At which point we need to validate vma is for mm and covers vaddr, which
is what patch 15 does, no?



Also, I seem to have forgotten some braces back in 2008 :-)

---
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 10a971c2bde3..c1356b52f8ea 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -115,9 +115,10 @@ enum _slab_flag_bits {
  *   rcu_read_lock();
  *   obj = lockless_lookup(key);
  *   if (obj) {
- *     if (!try_get_ref(obj)) // might fail for free objects
+ *     if (!try_get_ref(obj)) { // might fail for free objects
  *       rcu_read_unlock();
  *       goto begin;
+ *     }
  *
  *     if (obj->key != key) { // not the object we expected
  *       put_ref(obj);

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-15 15:35             ` Peter Zijlstra
@ 2025-01-15 15:38               ` Peter Zijlstra
  2025-01-15 16:22                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2025-01-15 15:38 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Mateusz Guzik, akpm, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 04:35:07PM +0100, Peter Zijlstra wrote:

> Consider:
> 
>     CPU0				CPU1
> 
>     rcu_read_lock();
>     vma = vma_lookup(mm, vaddr);
> 
>     ... cpu goes sleep for a *long time* ...
> 
>     					__vma_exit_locked();
> 					vma_area_free()
> 					..
> 					vma = vma_area_alloc();
> 					vma_mark_attached();
> 
>     ... comes back once vma is re-used ...
> 
>     vma_start_read()
>       vm_refcount_inc(); // success!!
> 
> At which point we need to validate vma is for mm and covers vaddr, which
> is what patch 15 does, no?

Also, critically, we want these reads to happen *after* the refcount
increment.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH] refcount: Strengthen inc_not_zero()
  2025-01-15 11:13         ` Peter Zijlstra
  2025-01-15 15:00           ` Suren Baghdasaryan
@ 2025-01-15 16:00           ` Peter Zijlstra
  2025-01-16 15:12             ` Suren Baghdasaryan
                               ` (2 more replies)
  1 sibling, 3 replies; 140+ messages in thread
From: Peter Zijlstra @ 2025-01-15 16:00 UTC (permalink / raw)
  To: Suren Baghdasaryan, will, boqun.feng, mark.rutland
  Cc: Mateusz Guzik, akpm, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:

> Notably, it means refcount_t is entirely unsuitable for anything
> SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> conditions after the refcount succeeds.
> 
> And this is probably fine, but let me ponder this all a little more.

Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
fix this; these things are hard enough as they are.

Will, others, wdyt?

---
Subject: refcount: Strengthen inc_not_zero()

For speculative lookups where a successful inc_not_zero() pins the
object, but where we still need to double check that the object acquired
is indeed the one we set out to acquire, this validation needs to happen
*after* the increment.

Notably SLAB_TYPESAFE_BY_RCU is one such example.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/refcount.h | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..340e7ffa445e 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -69,9 +69,10 @@
  * its the lock acquire, for RCU/lockless data structures its the dependent
  * load.
  *
- * Do note that inc_not_zero() provides a control dependency which will order
- * future stores against the inc, this ensures we'll never modify the object
- * if we did not in fact acquire a reference.
+ * Do note that inc_not_zero() does provide acquire order, which will order
+ * future load and stores against the inc, this ensures all subsequent accesses
+ * are from this object and not anything previously occupying this memory --
+ * consider SLAB_TYPESAFE_BY_RCU.
  *
  * The decrements will provide release order, such that all the prior loads and
  * stores will be issued before, it also provides a control dependency, which
@@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
 	do {
 		if (!old)
 			break;
-	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
+	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
 
 	if (oldp)
 		*oldp = old;
@@ -225,9 +226,9 @@ static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp
  * Similar to atomic_inc_not_zero(), but will saturate at REFCOUNT_SATURATED
  * and WARN.
  *
- * Provides no memory ordering, it is assumed the caller has guaranteed the
- * object memory to be stable (RCU, etc.). It does provide a control dependency
- * and thereby orders future stores. See the comment on top.
+ * Provides acquire ordering, such that subsequent accesses are after the
+ * increment. This is important for the cases where secondary validation is
+ * required, eg. SLAB_TYPESAFE_BY_RCU.
  *
  * Return: true if the increment was successful, false otherwise
  */

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-15 15:38               ` Peter Zijlstra
@ 2025-01-15 16:22                 ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mateusz Guzik, akpm, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 7:38 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 15, 2025 at 04:35:07PM +0100, Peter Zijlstra wrote:
>
> > Consider:
> >
> >     CPU0                              CPU1
> >
> >     rcu_read_lock();
> >     vma = vma_lookup(mm, vaddr);
> >
> >     ... cpu goes sleep for a *long time* ...
> >
> >                                       __vma_exit_locked();
> >                                       vma_area_free()
> >                                       ..
> >                                       vma = vma_area_alloc();
> >                                       vma_mark_attached();
> >
> >     ... comes back once vma is re-used ...
> >
> >     vma_start_read()
> >       vm_refcount_inc(); // success!!
> >
> > At which point we need to validate vma is for mm and covers vaddr, which
> > is what patch 15 does, no?

Correct. Sorry, I thought by "secondary validation" you only meant the
vm_lock_seq check in vma_start_read(). Now I understand your point.
Yes, if the vma we found gets reused before we read-lock it, then the
checks for mm and address range should catch a possibly incorrect vma.
If these checks fail, we retry. If they succeed, we have the correct
vma even if it was recycled since we found it.
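
For reference, the shape of those checks as I picture them in
lock_vma_under_rcu() (illustrative only -- the actual code is what
patch 15 adds):

	vma = mas_walk(&mas);
	if (!vma)
		goto inval;
	if (!vma_start_read(vma))
		goto inval;
	/* These loads are ordered after the acquiring increment, so they
	 * see the recycled object's current values, not stale ones. */
	if (unlikely(vma->vm_mm != mm ||
		     address < vma->vm_start || address >= vma->vm_end)) {
		vma_end_read(vma);
		goto inval;
	}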

>
> Also, critically, we want these reads to happen *after* the refcount
> increment.

Yes, and I think the acquire fence in the
refcount_add_not_zero_limited() replacement should guarantee that
ordering.

>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 12/17] mm: move lesser used vma_area_struct members into the last cacheline
  2025-01-15 10:50   ` Peter Zijlstra
@ 2025-01-15 16:39     ` Suren Baghdasaryan
  2025-02-13 22:59       ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15 16:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 2:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:59PM -0800, Suren Baghdasaryan wrote:
> > Move several vma_area_struct members which are rarely or never used
> > during page fault handling into the last cacheline to better pack
> > vm_area_struct. As a result vm_area_struct will fit into 3 as opposed
> > to 4 cachelines. New typical vm_area_struct layout:
> >
> > struct vm_area_struct {
> >     union {
> >         struct {
> >             long unsigned int vm_start;              /*     0     8 */
> >             long unsigned int vm_end;                /*     8     8 */
> >         };                                           /*     0    16 */
> >         freeptr_t          vm_freeptr;               /*     0     8 */
> >     };                                               /*     0    16 */
> >     struct mm_struct *         vm_mm;                /*    16     8 */
> >     pgprot_t                   vm_page_prot;         /*    24     8 */
> >     union {
> >         const vm_flags_t   vm_flags;                 /*    32     8 */
> >         vm_flags_t         __vm_flags;               /*    32     8 */
> >     };                                               /*    32     8 */
> >     unsigned int               vm_lock_seq;          /*    40     4 */
>
> Does it not make sense to move this seq field near the refcnt?

In an earlier version, when vm_lock was not a refcount yet, I tried
that, and moving vm_lock_seq introduced a regression in the pft test. We
have that early vm_lock_seq check at the beginning of vma_start_read(),
and if it fails we bail out early without locking. I think that might
be the reason why keeping vm_lock_seq in the first cacheline is
beneficial. But I'll try moving it again now that we have vm_refcnt
instead of the lock and see if pft still shows any regression.
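
For context, the early check I mean looks roughly like this (a sketch
using the field names from the layout above; it ignores the limited
increment and the atomic_t conversion discussed elsewhere in the thread):

static inline bool vma_start_read(struct vm_area_struct *vma)
{
	/* vm_lock_seq sits in the first cacheline, which the fault path
	 * has already touched, so a write-locked vma is rejected without
	 * pulling in the vm_refcnt cacheline. */
	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	if (!refcount_inc_not_zero(&vma->vm_refcnt))
		return false;

	/* Re-check once the reference is held; drop it if the vma got
	 * write-locked in the meantime. */
	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
		vma_end_read(vma);
		return false;
	}
	return true;
}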

>
> >     /* XXX 4 bytes hole, try to pack */
> >
> >     struct list_head           anon_vma_chain;       /*    48    16 */
> >     /* --- cacheline 1 boundary (64 bytes) --- */
> >     struct anon_vma *          anon_vma;             /*    64     8 */
> >     const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> >     long unsigned int          vm_pgoff;             /*    80     8 */
> >     struct file *              vm_file;              /*    88     8 */
> >     void *                     vm_private_data;      /*    96     8 */
> >     atomic_long_t              swap_readahead_info;  /*   104     8 */
> >     struct mempolicy *         vm_policy;            /*   112     8 */
> >     struct vma_numab_state *   numab_state;          /*   120     8 */
> >     /* --- cacheline 2 boundary (128 bytes) --- */
> >     refcount_t          vm_refcnt (__aligned__(64)); /*   128     4 */
> >
> >     /* XXX 4 bytes hole, try to pack */
> >
> >     struct {
> >         struct rb_node     rb (__aligned__(8));      /*   136    24 */
> >         long unsigned int  rb_subtree_last;          /*   160     8 */
> >     } __attribute__((__aligned__(8))) shared;        /*   136    32 */
> >     struct anon_vma_name *     anon_name;            /*   168     8 */
> >     struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */
> >
> >     /* size: 192, cachelines: 3, members: 18 */
> >     /* sum members: 176, holes: 2, sum holes: 8 */
> >     /* padding: 8 */
> >     /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
> > } __attribute__((__aligned__(64)));
>
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15 12:17       ` Wei Yang
@ 2025-01-15 21:46         ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-15 21:46 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 4:17 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, Jan 14, 2025 at 07:15:05PM -0800, Suren Baghdasaryan wrote:
> >On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
> >>
> >> >diff --git a/kernel/fork.c b/kernel/fork.c
> >> >index 9d9275783cf8..151b40627c14 100644
> >> >--- a/kernel/fork.c
> >> >+++ b/kernel/fork.c
> >> >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >> >       return vma;
> >> > }
> >> >
> >> >+static void vm_area_init_from(const struct vm_area_struct *src,
> >> >+                            struct vm_area_struct *dest)
> >> >+{
> >> >+      dest->vm_mm = src->vm_mm;
> >> >+      dest->vm_ops = src->vm_ops;
> >> >+      dest->vm_start = src->vm_start;
> >> >+      dest->vm_end = src->vm_end;
> >> >+      dest->anon_vma = src->anon_vma;
> >> >+      dest->vm_pgoff = src->vm_pgoff;
> >> >+      dest->vm_file = src->vm_file;
> >> >+      dest->vm_private_data = src->vm_private_data;
> >> >+      vm_flags_init(dest, src->vm_flags);
> >> >+      memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> >> >+             sizeof(dest->vm_page_prot));
> >> >+      /*
> >> >+       * src->shared.rb may be modified concurrently when called from
> >> >+       * dup_mmap(), but the clone will reinitialize it.
> >> >+       */
> >> >+      data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> >> >+      memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> >> >+             sizeof(dest->vm_userfaultfd_ctx));
> >> >+#ifdef CONFIG_ANON_VMA_NAME
> >> >+      dest->anon_name = src->anon_name;
> >> >+#endif
> >> >+#ifdef CONFIG_SWAP
> >> >+      memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> >> >+             sizeof(dest->swap_readahead_info));
> >> >+#endif
> >> >+#ifndef CONFIG_MMU
> >> >+      dest->vm_region = src->vm_region;
> >> >+#endif
> >> >+#ifdef CONFIG_NUMA
> >> >+      dest->vm_policy = src->vm_policy;
> >> >+#endif
> >> >+}
> >>
> >> Would this be difficult to maintain? We should make sure not miss or overwrite
> >> anything.
> >
> >Yeah, it is less maintainable than a simple memcpy() but I did not
> >find a better alternative. I added a warning above the struct
> >vm_area_struct definition to update this function every time we change
> >that structure. Not sure if there is anything else I can do to help
> >with this.
> >
>
> For !PER_VMA_LOCK, maybe we can use memcpy() as usual.
>
> For PER_VMA_LOCK, I just come up the same idea with you:-)

Missed this comment. Yeah, in one of the previous versions I had
different !PER_VMA_LOCK and PER_VMA_LOCK versions of vma_copy(), but
David raised the question of whether it is worth having two versions.
From a performance POV there is no reason for that, and it unnecessarily
complicates the code. So I dropped that in favor of having a single
version.

>
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
>
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-15 15:01         ` Suren Baghdasaryan
@ 2025-01-16  1:37           ` Wei Yang
  2025-01-16  1:41             ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Wei Yang @ 2025-01-16  1:37 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Wei Yang, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 07:01:56AM -0800, Suren Baghdasaryan wrote:
>On Wed, Jan 15, 2025 at 4:05 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Tue, Jan 14, 2025 at 07:12:20PM -0800, Suren Baghdasaryan wrote:
>> >On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>> >>
>> >> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>> >> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>> >> >       struct vm_area_struct *vma;
>> >> >
>> >> >       rcu_read_lock();
>> >> >-retry:
>> >> >       vma = mas_walk(&mas);
>> >> >       if (!vma)
>> >> >               goto inval;
>> >> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>> >> >       if (!vma_start_read(vma))
>> >> >               goto inval;
>> >> >
>> >> >-      /* Check if the VMA got isolated after we found it */
>> >> >-      if (is_vma_detached(vma)) {
>> >> >-              vma_end_read(vma);
>> >> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
>> >> >-              /* The area was replaced with another one */
>> >> >-              goto retry;
>> >> >-      }
>> >>
>> >> We have a little behavior change here.
>> >>
>> >> Originally, if we found an detached vma, we may retry. But now, we would go to
>> >> the slow path directly.
>> >
>> >Hmm. Good point. I think the easiest way to keep the same
>> >functionality is to make vma_start_read() return vm_area_struct* on
>> >success, NULL on locking failure and EAGAIN if vma was detached
>> >(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
>> >the case of EAGAIN.
>> >
>>
>> Looks good to me.
>>
>> >>
>> >> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
>> >> to see the percentage of this case. If it shows this is a too rare
>> >> case to impact performance, we can ignore it.
>> >>
>> >> Also the event VMA_LOCK_MISS recording is removed, but the definition is
>> >> there. We may record it in the vma_start_read() when oldcnt is 0.
>> >>
>> >> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
>> >> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
>> >> don't have an overall success/failure statistic in vmstat.
>> >
>> >Are you referring to the fact that we do not increment
>> >VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
>>
>> Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.
>>
>> >page fault (in which we increment VMA_LOCK_RETRY instead)?
>> >
>>
>> I don't follow this.
>
>Sorry, I meant to say "in which case we increment VMA_LOCK_RETRY
>instead". IOW, when we successfully lock the vma but have to retry the
>pagefault, we increment VMA_LOCK_RETRY without incrementing
>VMA_LOCK_SUCCESS.
>

Yes, this is exactly what confuses me about what VMA_LOCK_SUCCESS represents.

>>
>> >>
>> >> >       /*
>> >> >        * At this point, we have a stable reference to a VMA: The VMA is
>> >> >        * locked and we know it hasn't already been isolated.
>> >>
>> >> --
>> >> Wei Yang
>> >> Help you, Help me
>>
>> --
>> Wei Yang
>> Help you, Help me

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-16  1:37           ` Wei Yang
@ 2025-01-16  1:41             ` Suren Baghdasaryan
  2025-01-16  9:10               ` Wei Yang
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-16  1:41 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 5:37 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 07:01:56AM -0800, Suren Baghdasaryan wrote:
> >On Wed, Jan 15, 2025 at 4:05 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Tue, Jan 14, 2025 at 07:12:20PM -0800, Suren Baghdasaryan wrote:
> >> >On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >> >>
> >> >> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >> >> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >> >> >       struct vm_area_struct *vma;
> >> >> >
> >> >> >       rcu_read_lock();
> >> >> >-retry:
> >> >> >       vma = mas_walk(&mas);
> >> >> >       if (!vma)
> >> >> >               goto inval;
> >> >> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >> >> >       if (!vma_start_read(vma))
> >> >> >               goto inval;
> >> >> >
> >> >> >-      /* Check if the VMA got isolated after we found it */
> >> >> >-      if (is_vma_detached(vma)) {
> >> >> >-              vma_end_read(vma);
> >> >> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
> >> >> >-              /* The area was replaced with another one */
> >> >> >-              goto retry;
> >> >> >-      }
> >> >>
> >> >> We have a little behavior change here.
> >> >>
> >> >> Originally, if we found an detached vma, we may retry. But now, we would go to
> >> >> the slow path directly.
> >> >
> >> >Hmm. Good point. I think the easiest way to keep the same
> >> >functionality is to make vma_start_read() return vm_area_struct* on
> >> >success, NULL on locking failure and EAGAIN if vma was detached
> >> >(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
> >> >the case of EAGAIN.
> >> >
> >>
> >> Looks good to me.
> >>
> >> >>
> >> >> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
> >> >> to see the percentage of this case. If it shows this is a too rare
> >> >> case to impact performance, we can ignore it.
> >> >>
> >> >> Also the event VMA_LOCK_MISS recording is removed, but the definition is
> >> >> there. We may record it in the vma_start_read() when oldcnt is 0.
> >> >>
> >> >> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
> >> >> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
> >> >> don't have an overall success/failure statistic in vmstat.
> >> >
> >> >Are you referring to the fact that we do not increment
> >> >VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
> >>
> >> Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.
> >>
> >> >page fault (in which we increment VMA_LOCK_RETRY instead)?
> >> >
> >>
> >> I don't follow this.
> >
> >Sorry, I meant to say "in which case we increment VMA_LOCK_RETRY
> >instead". IOW, when we successfully lock the vma but have to retry the
> >pagefault, we increment VMA_LOCK_RETRY without incrementing
> >VMA_LOCK_SUCCESS.
> >
>
> Yes, this makes me confused about what VMA_LOCK_SUCCESS represents.

I'll need to look into the history of why we account it this way but
this is out of scope for this patchset.

>
> >>
> >> >>
> >> >> >       /*
> >> >> >        * At this point, we have a stable reference to a VMA: The VMA is
> >> >> >        * locked and we know it hasn't already been isolated.
> >> >>
> >> >> --
> >> >> Wei Yang
> >> >> Help you, Help me
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
>
> --
> Wei Yang
> Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
  2025-01-16  1:41             ` Suren Baghdasaryan
@ 2025-01-16  9:10               ` Wei Yang
  0 siblings, 0 replies; 140+ messages in thread
From: Wei Yang @ 2025-01-16  9:10 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Wei Yang, akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 05:41:27PM -0800, Suren Baghdasaryan wrote:
[...]
>> >> >the case of EAGAIN.
>> >> >
>> >>
>> >> Looks good to me.
>> >>
>> >> >>
>> >> >> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
>> >> >> to see the percentage of this case. If it shows this is a too rare
>> >> >> case to impact performance, we can ignore it.
>> >> >>
>> >> >> Also the event VMA_LOCK_MISS recording is removed, but the definition is
>> >> >> there. We may record it in the vma_start_read() when oldcnt is 0.
>> >> >>
>> >> >> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
>> >> >> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
>> >> >> don't have an overall success/failure statistic in vmstat.
>> >> >
>> >> >Are you referring to the fact that we do not increment
>> >> >VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
>> >>
>> >> Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.
>> >>
>> >> >page fault (in which we increment VMA_LOCK_RETRY instead)?
>> >> >
>> >>
>> >> I don't follow this.
>> >
>> >Sorry, I meant to say "in which case we increment VMA_LOCK_RETRY
>> >instead". IOW, when we successfully lock the vma but have to retry the
>> >pagefault, we increment VMA_LOCK_RETRY without incrementing
>> >VMA_LOCK_SUCCESS.
>> >
>>
>> Yes, this makes me confused about what VMA_LOCK_SUCCESS represents.
>
>I'll need to look into the history of why we account it this way but
>this is out of scope for this patchset.
>

Agree.


-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited
  2025-01-15  9:39           ` Peter Zijlstra
@ 2025-01-16 10:52             ` Hillf Danton
  0 siblings, 0 replies; 140+ messages in thread
From: Hillf Danton @ 2025-01-16 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, akpm, willy, hannes, linux-mm, linux-kernel,
	kernel-team

On Wed, 15 Jan 2025 10:39:55 +0100 Peter Zijlstra <peterz@infradead.org>
> 
> Sigh; don't let hillf confuse you. *IF* you need an acquire it will be
> in the part where you wait for readers to go away. But even there, think
> about what you're serializing against. Readers don't typically modify
> things.
> 
> And modifications are fully serialized by mmap_sem^H^H^Hlock.
> 
^H^H^H

- * Do note that inc_not_zero() provides a control dependency which will order
- * future stores against the inc, this ensures we'll never modify the object
- * if we did not in fact acquire a reference.
+ * Do note that inc_not_zero() does provide acquire order, which will order
+ * future load and stores against the inc, this ensures all subsequent accesses
+ * are from this object and not anything previously occupying this memory --
+ * consider SLAB_TYPESAFE_BY_RCU.
  *
  * The decrements will provide release order, such that all the prior loads and
  * stores will be issued before, it also provides a control dependency, which
@@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
 	do {
 		if (!old)
 			break;
-	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
+	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH] refcount: Strengthen inc_not_zero()
  2025-01-15 16:00           ` [PATCH] refcount: Strengthen inc_not_zero() Peter Zijlstra
@ 2025-01-16 15:12             ` Suren Baghdasaryan
  2025-01-17 15:41             ` Will Deacon
  2025-01-17 16:13             ` Matthew Wilcox
  2 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-16 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: will, boqun.feng, mark.rutland, Mateusz Guzik, akpm, willy,
	liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko, vbabka,
	hannes, oliver.sang, mgorman, david, peterx, oleg, dave, paulmck,
	brauner, dhowells, hdanton, hughd, lokeshgidra, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Wed, Jan 15, 2025 at 8:00 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
>
> > Notably, it means refcount_t is entirely unsuitable for anything
> > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > conditions after the refcount succeeds.
> >
> > And this is probably fine, but let me ponder this all a little more.
>
> Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> fix this, these things are hard enough as they are.
>
> Will, others, wdyt?

I'll wait for the verdict on this patch before proceeding with my
series. It obviously simplifies my job. Thanks Peter!

>
> ---
> Subject: refcount: Strengthen inc_not_zero()
>
> For speculative lookups where a successful inc_not_zero() pins the
> object, but where we still need to double check if the object acquired
> is indeed the one we set out to aquire, needs this validation to happen
> *after* the increment.
>
> Notably SLAB_TYPESAFE_BY_RCU is one such an example.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/refcount.h | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> index 35f039ecb272..340e7ffa445e 100644
> --- a/include/linux/refcount.h
> +++ b/include/linux/refcount.h
> @@ -69,9 +69,10 @@
>   * its the lock acquire, for RCU/lockless data structures its the dependent
>   * load.
>   *
> - * Do note that inc_not_zero() provides a control dependency which will order
> - * future stores against the inc, this ensures we'll never modify the object
> - * if we did not in fact acquire a reference.
> + * Do note that inc_not_zero() does provide acquire order, which will order
> + * future load and stores against the inc, this ensures all subsequent accesses
> + * are from this object and not anything previously occupying this memory --
> + * consider SLAB_TYPESAFE_BY_RCU.
>   *
>   * The decrements will provide release order, such that all the prior loads and
>   * stores will be issued before, it also provides a control dependency, which
> @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
>         do {
>                 if (!old)
>                         break;
> -       } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> +       } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
>
>         if (oldp)
>                 *oldp = old;
> @@ -225,9 +226,9 @@ static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp
>   * Similar to atomic_inc_not_zero(), but will saturate at REFCOUNT_SATURATED
>   * and WARN.
>   *
> - * Provides no memory ordering, it is assumed the caller has guaranteed the
> - * object memory to be stable (RCU, etc.). It does provide a control dependency
> - * and thereby orders future stores. See the comment on top.
> + * Provides acquire ordering, such that subsequent accesses are after the
> + * increment. This is important for the cases where secondary validation is
> + * required, eg. SLAB_TYPESAFE_BY_RCU.
>   *
>   * Return: true if the increment was successful, false otherwise
>   */

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH] refcount: Strengthen inc_not_zero()
  2025-01-15 16:00           ` [PATCH] refcount: Strengthen inc_not_zero() Peter Zijlstra
  2025-01-16 15:12             ` Suren Baghdasaryan
@ 2025-01-17 15:41             ` Will Deacon
  2025-01-27 14:09               ` Will Deacon
  2025-01-17 16:13             ` Matthew Wilcox
  2 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2025-01-17 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, boqun.feng, mark.rutland, Mateusz Guzik, akpm,
	willy, liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko,
	vbabka, hannes, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> 
> > Notably, it means refcount_t is entirely unsuitable for anything
> > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > conditions after the refcount succeeds.
> > 
> > And this is probably fine, but let me ponder this all a little more.
> 
> Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> fix this, these things are hard enough as they are.
> 
> Will, others, wdyt?

We should also update the Documentation (atomic_t.txt and
refcount-vs-atomic.rst) if we strengthen this.

> ---
> Subject: refcount: Strengthen inc_not_zero()
> 
> For speculative lookups where a successful inc_not_zero() pins the
> object, but where we still need to double check if the object acquired
> is indeed the one we set out to aquire, needs this validation to happen
> *after* the increment.
> 
> Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/refcount.h | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> index 35f039ecb272..340e7ffa445e 100644
> --- a/include/linux/refcount.h
> +++ b/include/linux/refcount.h
> @@ -69,9 +69,10 @@
>   * its the lock acquire, for RCU/lockless data structures its the dependent
>   * load.
>   *
> - * Do note that inc_not_zero() provides a control dependency which will order
> - * future stores against the inc, this ensures we'll never modify the object
> - * if we did not in fact acquire a reference.
> + * Do note that inc_not_zero() does provide acquire order, which will order
> + * future load and stores against the inc, this ensures all subsequent accesses
> + * are from this object and not anything previously occupying this memory --
> + * consider SLAB_TYPESAFE_BY_RCU.
>   *
>   * The decrements will provide release order, such that all the prior loads and
>   * stores will be issued before, it also provides a control dependency, which
> @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
>  	do {
>  		if (!old)
>  			break;
> -	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> +	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));

Hmm, do the later memory accesses need to be ordered against the store
part of the increment or just the read? If it's the former, then I don't
think that _acquire is sufficient -- accesses can still get in-between
the read and write parts of the RmW.

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH] refcount: Strengthen inc_not_zero()
  2025-01-15 16:00           ` [PATCH] refcount: Strengthen inc_not_zero() Peter Zijlstra
  2025-01-16 15:12             ` Suren Baghdasaryan
  2025-01-17 15:41             ` Will Deacon
@ 2025-01-17 16:13             ` Matthew Wilcox
  2 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2025-01-17 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, will, boqun.feng, mark.rutland, Mateusz Guzik,
	akpm, liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko,
	vbabka, hannes, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> Subject: refcount: Strengthen inc_not_zero()
> 
> For speculative lookups where a successful inc_not_zero() pins the
> object, but where we still need to double check if the object acquired
> is indeed the one we set out to aquire, needs this validation to happen
> *after* the increment.
> 
> Notably SLAB_TYPESAFE_BY_RCU is one such an example.

While you're looking at inc_not_zero(), have you already thought about
doing something like this? I.e. failing rather than saturating, since
all users of this already have to check for failure. It looks like
two comparisons per call rather than three.

diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..3ef7d316e870 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -142,16 +142,13 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
 	int old = refcount_read(r);
 
 	do {
-		if (!old)
+		if (old <= 0 || old + i < 0)
 			break;
 	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
 
 	if (oldp)
 		*oldp = old;
 
-	if (unlikely(old < 0 || old + i < 0))
-		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
-
 	return old;
 }
 

$ ./scripts/bloat-o-meter before.o after.o 
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-91 (-91)
Function                                     old     new   delta
io_wq_for_each_worker.isra                   162     158      -4
io_worker_handle_work                       1403    1387     -16
io_wq_activate_free_worker                   187     158     -29
io_queue_worker_create                       367     325     -42
Total: Before=10068, After=9977, chg -0.90%

(that's io_uring/io-wq.o as an example user of refcount_inc_not_zero())


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH] refcount: Strengthen inc_not_zero()
  2025-01-17 15:41             ` Will Deacon
@ 2025-01-27 14:09               ` Will Deacon
  2025-01-27 19:21                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2025-01-27 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, boqun.feng, mark.rutland, Mateusz Guzik, akpm,
	willy, liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko,
	vbabka, hannes, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > 
> > > Notably, it means refcount_t is entirely unsuitable for anything
> > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > conditions after the refcount succeeds.
> > > 
> > > And this is probably fine, but let me ponder this all a little more.
> > 
> > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > fix this, these things are hard enough as they are.
> > 
> > Will, others, wdyt?
> 
> We should also update the Documentation (atomic_t.txt and
> refcount-vs-atomic.rst) if we strengthen this.
> 
> > ---
> > Subject: refcount: Strengthen inc_not_zero()
> > 
> > For speculative lookups where a successful inc_not_zero() pins the
> > object, but where we still need to double check if the object acquired
> > is indeed the one we set out to aquire, needs this validation to happen
> > *after* the increment.
> > 
> > Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  include/linux/refcount.h | 15 ++++++++-------
> >  1 file changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > index 35f039ecb272..340e7ffa445e 100644
> > --- a/include/linux/refcount.h
> > +++ b/include/linux/refcount.h
> > @@ -69,9 +69,10 @@
> >   * its the lock acquire, for RCU/lockless data structures its the dependent
> >   * load.
> >   *
> > - * Do note that inc_not_zero() provides a control dependency which will order
> > - * future stores against the inc, this ensures we'll never modify the object
> > - * if we did not in fact acquire a reference.
> > + * Do note that inc_not_zero() does provide acquire order, which will order
> > + * future load and stores against the inc, this ensures all subsequent accesses
> > + * are from this object and not anything previously occupying this memory --
> > + * consider SLAB_TYPESAFE_BY_RCU.
> >   *
> >   * The decrements will provide release order, such that all the prior loads and
> >   * stores will be issued before, it also provides a control dependency, which
> > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> >  	do {
> >  		if (!old)
> >  			break;
> > -	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > +	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> 
> Hmm, do the later memory accesses need to be ordered against the store
> part of the increment or just the read? If it's the former, then I don't
> think that _acquire is sufficient -- accesses can still get in-between
> the read and write parts of the RmW.

I dug some more into this at the end of last week. For the
SLAB_TYPESAFE_BY_RCU case, where we're racing inc_not_zero() with
dec_and_test(), I think using _acquire() above is correct, as the
later references can only move up into the critical section in the case
that we successfully obtained a reference.

However, if we're going to make the barriers implicit in the refcount
operations here then I think we also need to do something on the producer
side for when the object is re-initialised after being destroyed and
allocated again. I think that would necessitate release ordering for
refcount_set() so that whatever allows the consumer to validate the
object (e.g. sequence number) is published *before* the refcount.

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH] refcount: Strengthen inc_not_zero()
  2025-01-27 14:09               ` Will Deacon
@ 2025-01-27 19:21                 ` Suren Baghdasaryan
  2025-01-28 23:51                   ` Suren Baghdasaryan
  0 siblings, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-27 19:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, boqun.feng, mark.rutland, Mateusz Guzik, akpm,
	willy, liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko,
	vbabka, hannes, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Mon, Jan 27, 2025 at 6:09 AM Will Deacon <will@kernel.org> wrote:
>
> On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> > On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > >
> > > > Notably, it means refcount_t is entirely unsuitable for anything
> > > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > > conditions after the refcount succeeds.
> > > >
> > > > And this is probably fine, but let me ponder this all a little more.
> > >
> > > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > > fix this, these things are hard enough as they are.
> > >
> > > Will, others, wdyt?
> >
> > We should also update the Documentation (atomic_t.txt and
> > refcount-vs-atomic.rst) if we strengthen this.
> >
> > > ---
> > > Subject: refcount: Strengthen inc_not_zero()
> > >
> > > For speculative lookups where a successful inc_not_zero() pins the
> > > object, but where we still need to double check if the object acquired
> > > is indeed the one we set out to aquire, needs this validation to happen
> > > *after* the increment.
> > >
> > > Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  include/linux/refcount.h | 15 ++++++++-------
> > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > index 35f039ecb272..340e7ffa445e 100644
> > > --- a/include/linux/refcount.h
> > > +++ b/include/linux/refcount.h
> > > @@ -69,9 +69,10 @@
> > >   * its the lock acquire, for RCU/lockless data structures its the dependent
> > >   * load.
> > >   *
> > > - * Do note that inc_not_zero() provides a control dependency which will order
> > > - * future stores against the inc, this ensures we'll never modify the object
> > > - * if we did not in fact acquire a reference.
> > > + * Do note that inc_not_zero() does provide acquire order, which will order
> > > + * future load and stores against the inc, this ensures all subsequent accesses
> > > + * are from this object and not anything previously occupying this memory --
> > > + * consider SLAB_TYPESAFE_BY_RCU.
> > >   *
> > >   * The decrements will provide release order, such that all the prior loads and
> > >   * stores will be issued before, it also provides a control dependency, which
> > > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > >     do {
> > >             if (!old)
> > >                     break;
> > > -   } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > +   } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> >
> > Hmm, do the later memory accesses need to be ordered against the store
> > part of the increment or just the read? If it's the former, then I don't
> > think that _acquire is sufficient -- accesses can still get in-between
> > the read and write parts of the RmW.
>
> I dug some more into this at the end of last week. For the
> SLAB_TYPESAFE_BY_RCU where we're racing inc_not_zero() with
> dec_and_test(), then I think using _acquire() above is correct as the
> later references can only move up into the critical section in the case
> that we successfully obtained a reference.
>
> However, if we're going to make the barriers implicit in the refcount
> operations here then I think we also need to do something on the producer
> side for when the object is re-initialised after being destroyed and
> allocated again. I think that would necessitate release ordering for
> refcount_set() so that whatever allows the consumer to validate the
> object (e.g. sequence number) is published *before* the refcount.

Thanks Will!
I would like to expand on your answer to provide an example of the
race that would happen without release ordering in the producer. To
save the reader's time I provide a simplified flow and reasoning first.
More detailed code for what I consider a typical SLAB_TYPESAFE_BY_RCU
refcounting example is added at the end of my reply (Addendum).
The simplified flow looks like this:

consumer:
    obj = lookup(collection, key);
    if (!refcount_inc_not_zero(&obj->ref))
        return;
    smp_rmb(); /* Peter's new acquire fence */
    if (READ_ONCE(obj->key) != key) {
        put_ref(obj);
        return;
    }
    use(obj->value);

producer:
    old_key = obj->key;
    remove(collection, old_key);
    if (!refcount_dec_and_test(&obj->ref))
        return;
    obj->key = KEY_INVALID;
    free(obj);
    ...
    obj = malloc(); /* obj is reused */
    obj->key = new_key;
    obj->value = new_value;
    smp_wmb(); /* Will's proposed release fence */
    refcount_set(&obj->ref, 1);
    insert(collection, key, obj);

Let's consider the case when new_key == old_key. We'll call both of them
"key". Without Will's proposed fence the following reordering is
possible:

consumer:
    obj = lookup(collection, key);

                 producer:
                     key = obj->key
                     remove(collection, key);
                     if (!refcount_dec_and_test(&obj->ref))
                         return;
                     obj->key = KEY_INVALID;
                     free(obj);
                     obj = malloc(); /* obj is reused */
                     refcount_set(&obj->ref, 1);
                     obj->key = key; /* same key */

    if (!refcount_inc_not_zero(&obj->ref))
        return;
    smp_rmb();
    if (READ_ONCE(obj->key) != key) {
        put_ref(obj);
        return;
    }
    use(obj->value);

                     obj->value = new_value; /* reordered store */
                     add(collection, key, obj);

So, the consumer finds the old object, successfully takes a refcount
and validates the key. The check succeeds because the object is
allocated and has the same key, which is fine. However, the consumer
then proceeds to use a stale obj->value. Will's proposed release
ordering would prevent that.

The example in https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
omits all these ordering issues for SLAB_TYPESAFE_BY_RCU.
I think it would be better to introduce two new functions,
refcount_add_not_zero_acquire() and refcount_set_release(), and clearly
document that they should be used when a freed object can be recycled
and reused, as in the SLAB_TYPESAFE_BY_RCU case. The documentation for
refcount_set_release() should also clarify that once it's called, the
object can be accessed by consumers even if it has not yet been added
into the collection used for object lookup (like in the example above).
The SLAB_TYPESAFE_BY_RCU comment at
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
can then explicitly use these new functions in the example code, further
clarifying their purpose and proper use.
WDYT?
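
To make the above concrete, here is a minimal sketch of the two helpers
I have in mind -- essentially the existing primitives with the
underlying atomic switched to its _release/_acquire variant (an
illustration only, not a final implementation):

/* Publish all prior initializing stores before the refcount becomes
 * non-zero. */
static inline void refcount_set_release(refcount_t *r, int n)
{
	atomic_set_release(&r->refs, n);
}

/* Like refcount_inc_not_zero(), but with acquire ordering on success so
 * that the identity check can safely follow the increment. */
static inline __must_check bool refcount_inc_not_zero_acquire(refcount_t *r)
{
	int old = refcount_read(r);

	do {
		if (!old)
			return false;
	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + 1));

	if (unlikely(old < 0 || old + 1 < 0))
		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);

	return true;
}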

ADDENDUM.
Detailed code for typical use of refcounting with SLAB_TYPESAFE_BY_RCU:

struct object {
    refcount_t ref;
    u64 key;
    u64 value;
};

void init(struct object *obj, u64 key, u64 value)
{
    obj->key = key;
    obj->value = value;
    smp_wmb(); /* Will's proposed release fence */
    refcount_set(&obj->ref, 1);
}

bool get_ref(struct object *obj, u64 key)
{
    if (!refcount_inc_not_zero(&obj->ref))
        return false;
    smp_rmb(); /* Peter's new acquire fence */
    if (READ_ONCE(obj->key) != key) {
        put_ref(obj);
        return false;
    }
    return true;
}

void put_ref(struct object *obj)
{
    if (!refcount_dec_and_test(&obj->ref))
        return;
    obj->key = KEY_INVALID;
    free(obj);
}

consumer:
    obj = lookup(collection, key);
    if (!get_ref(obj, key))
        return;
    use(obj->value);

producer:
    remove(collection, old_obj->key);
    put_ref(old_obj);
    new_obj = malloc();
    init(new_obj, new_key, new_value);
    insert(collection, new_key, new_obj);

With SLAB_TYPESAFE_BY_RCU, old_obj in the producer can be reused and
end up equal to new_obj.
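
With the proposed refcount_set_release() and
refcount_inc_not_zero_acquire(), the explicit fences above would be
folded into the refcount operations; a sketch of how init() and
get_ref() would then look:

void init(struct object *obj, u64 key, u64 value)
{
    obj->key = key;
    obj->value = value;
    /* release ordering replaces the explicit smp_wmb() */
    refcount_set_release(&obj->ref, 1);
}

bool get_ref(struct object *obj, u64 key)
{
    /* acquire ordering replaces the explicit smp_rmb() */
    if (!refcount_inc_not_zero_acquire(&obj->ref))
        return false;
    if (READ_ONCE(obj->key) != key) {
        put_ref(obj);
        return false;
    }
    return true;
}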


>
> Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
                   ` (18 preceding siblings ...)
  2025-01-13 12:14 ` Lorenzo Stoakes
@ 2025-01-28  5:26 ` Shivank Garg
  2025-01-28  5:50   ` Suren Baghdasaryan
  19 siblings, 1 reply; 140+ messages in thread
From: Shivank Garg @ 2025-01-28  5:26 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: peterz, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

Hi Suren,

I tested the patch series on an AMD EPYC Zen 5 system
(2-socket, 64 cores per socket with SMT enabled, 4 NUMA nodes)
using mmtests' PFT (config-workload-pft-threads) benchmark on
mm-unstable.

On 1/11/2025 9:55 AM, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> This patchset:
> 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned to minimize
> cacheline sharing;
> 2. changes vm_area_struct initialization to mark new vma as detached until
> it is inserted into vma tree;
> 3. replaces vm_lock and vma->detached flag with a reference counter;
> 4. regroups vm_area_struct members to fit them into 3 cachelines;
> 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> reuse and to minimize call_rcu() calls.
> 
> Pagefault microbenchmarks show performance improvement:
> Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*

Results:
                             mm-unstable-vanilla   mm-unstable-v9-pervma-lock
Hmean     faults/cpu-1    2018530.3023 (   0.00%)  2051456.8039 (   1.63%)
Hmean     faults/cpu-4     718869.0234 (   0.00%)   720985.6986 (   0.29%)
Hmean     faults/cpu-7     377965.1187 (   0.00%)   368305.2802 (  -2.56%)
Hmean     faults/cpu-12    215502.5334 (   0.00%)   212764.0061 (  -1.27%)
Hmean     faults/cpu-21    149946.1873 (   0.00%)   150688.2173 (   0.49%)
Hmean     faults/cpu-30    142379.7863 (   0.00%)   143677.5012 (   0.91%)
Hmean     faults/cpu-48    139625.5003 (   0.00%)   156217.1383 *  11.88%*
Hmean     faults/cpu-79    119093.6132 (   0.00%)   140501.1787 *  17.98%*
Hmean     faults/cpu-110   102446.6879 (   0.00%)   114128.7155 (  11.40%)
Hmean     faults/cpu-128    96640.4022 (   0.00%)   109474.8875 (  13.28%)
Hmean     faults/sec-1    2018197.4666 (   0.00%)  2051119.1375 (   1.63%)
Hmean     faults/sec-4    2853494.9619 (   0.00%)  2865639.8350 (   0.43%)
Hmean     faults/sec-7    2631049.4283 (   0.00%)  2564037.1696 (  -2.55%)
Hmean     faults/sec-12   2570378.4204 (   0.00%)  2540353.2525 (  -1.17%)
Hmean     faults/sec-21   3018640.3933 (   0.00%)  3014396.9773 (  -0.14%)
Hmean     faults/sec-30   4150723.9209 (   0.00%)  4167550.4070 (   0.41%)
Hmean     faults/sec-48   6459327.6559 (   0.00%)  7217660.4385 *  11.74%*
Hmean     faults/sec-79   8977397.1421 (   0.00%) 10695351.6214 *  19.14%*
Hmean     faults/sec-110 10590055.2262 (   0.00%) 12309035.9250 (  16.23%)
Hmean     faults/sec-128 11448246.6485 (   0.00%) 13554648.3823 (  18.40%)

Please add my:
Tested-by: Shivank Garg <shivankg@amd.com>

I would be happy to test future versions if needed.

Thanks,
Shivank



> 
> Changes since v8 [4]:
> - Change subject for the cover letter, per Vlastimil Babka
> - Added Reviewed-by and Acked-by, per Vlastimil Babka
> - Added static check for no-limit case in __refcount_add_not_zero_limited,
> per David Laight
> - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> per Hillf Danton and Vlastimil Babka
> - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> under us, per Vlastimil Babka
> - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> per Vlastimil Babka
> - Changed __vma_enter_locked() parameter to centralize refcount logic,
> per Vlastimil Babka
> - Amended description in vm_lock replacement patch explaining the effects
> of the patch on vm_area_struct size, per Vlastimil Babka
> - Added vm_area_struct member regrouping patch [5] into the series
> - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> - Added a comment for vm_area_struct to update vm_area_init_from() when
> adding new members, per Vlastimil Babka
> - Updated a comment about unstable src->shared.rb when copying a vma in
> vm_area_init_from(), per Vlastimil Babka
> 
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> 
> Patchset applies over mm-unstable after reverting v8
> (current SHA range: 235b5129cb7b - 9e6b24c58985)
> 
> Suren Baghdasaryan (17):
>   mm: introduce vma_start_read_locked{_nested} helpers
>   mm: move per-vma lock into vm_area_struct
>   mm: mark vma as detached until it's added into vma tree
>   mm: introduce vma_iter_store_attached() to use with attached vmas
>   mm: mark vmas detached upon exit
>   types: move struct rcuwait into types.h
>   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
>   mm: move mmap_init_lock() out of the header file
>   mm: uninline the main body of vma_start_write()
>   refcount: introduce __refcount_{add|inc}_not_zero_limited
>   mm: replace vm_lock and detached flag with a reference count
>   mm: move lesser used vma_area_struct members into the last cacheline
>   mm/debug: print vm_refcnt state when dumping the vma
>   mm: remove extra vma_numab_state_init() call
>   mm: prepare lock_vma_under_rcu() for vma reuse possibility
>   mm: make vma cache SLAB_TYPESAFE_BY_RCU
>   docs/mm: document latest changes to vm_lock
> 
>  Documentation/mm/process_addrs.rst |  44 ++++----
>  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
>  include/linux/mm_types.h           |  75 +++++++-------
>  include/linux/mmap_lock.h          |   6 --
>  include/linux/rcuwait.h            |  13 +--
>  include/linux/refcount.h           |  24 ++++-
>  include/linux/slab.h               |   6 --
>  include/linux/types.h              |  12 +++
>  kernel/fork.c                      | 129 +++++++++++-------------
>  mm/debug.c                         |  12 +++
>  mm/init-mm.c                       |   1 +
>  mm/memory.c                        |  97 ++++++++++++++++--
>  mm/mmap.c                          |   3 +-
>  mm/userfaultfd.c                   |  32 +++---
>  mm/vma.c                           |  23 ++---
>  mm/vma.h                           |  15 ++-
>  tools/testing/vma/linux/atomic.h   |   5 +
>  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
>  18 files changed, 465 insertions(+), 281 deletions(-)
> 


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
  2025-01-28  5:26 ` Shivank Garg
@ 2025-01-28  5:50   ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-28  5:50 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, vbabka, hannes, mjguzik, oliver.sang,
	mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
	hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
	souravpanda, pasha.tatashin, klarasmodin, richard.weiyang, corbet,
	linux-doc, linux-mm, linux-kernel, kernel-team

On Mon, Jan 27, 2025 at 9:27 PM Shivank Garg <shivankg@amd.com> wrote:
>
> Hi Suren,
>
> I tested the patch-series on AMD EPYC Zen 5 system
> (2-socket, 64-core per socket with SMT Enabled, 4 NUMA nodes)
> using mmtests's PFT (config-workload-pft-threads) benchmark on
> mm-unstable.
>
> On 1/11/2025 9:55 AM, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduces, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regressions is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > Splitting single logical structure into multiple ones leads to more
> > complicated management, extra pointer dereferences and overall less
> > maintainable code. When that split-away part is a lock, it complicates
> > things even further. With no performance benefits, there are no reasons
> > for this split. Merging the vm_lock back into vm_area_struct also allows
> > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > This patchset:
> > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > boundary and changing the cache to be cacheline-aligned to minimize
> > cacheline sharing;
> > 2. changes vm_area_struct initialization to mark new vma as detached until
> > it is inserted into vma tree;
> > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > 4. regroups vm_area_struct members to fit them into 3 cachelines;
> > 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > reuse and to minimize call_rcu() calls.
> >
> > Pagefault microbenchmarks show performance improvement:
> > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
>
> Results:
>                              mm-unstable-vanilla   mm-unstable-v9-pervma-lock
> Hmean     faults/cpu-1    2018530.3023 (   0.00%)  2051456.8039 (   1.63%)
> Hmean     faults/cpu-4     718869.0234 (   0.00%)   720985.6986 (   0.29%)
> Hmean     faults/cpu-7     377965.1187 (   0.00%)   368305.2802 (  -2.56%)
> Hmean     faults/cpu-12    215502.5334 (   0.00%)   212764.0061 (  -1.27%)
> Hmean     faults/cpu-21    149946.1873 (   0.00%)   150688.2173 (   0.49%)
> Hmean     faults/cpu-30    142379.7863 (   0.00%)   143677.5012 (   0.91%)
> Hmean     faults/cpu-48    139625.5003 (   0.00%)   156217.1383 *  11.88%*
> Hmean     faults/cpu-79    119093.6132 (   0.00%)   140501.1787 *  17.98%*
> Hmean     faults/cpu-110   102446.6879 (   0.00%)   114128.7155 (  11.40%)
> Hmean     faults/cpu-128    96640.4022 (   0.00%)   109474.8875 (  13.28%)
> Hmean     faults/sec-1    2018197.4666 (   0.00%)  2051119.1375 (   1.63%)
> Hmean     faults/sec-4    2853494.9619 (   0.00%)  2865639.8350 (   0.43%)
> Hmean     faults/sec-7    2631049.4283 (   0.00%)  2564037.1696 (  -2.55%)
> Hmean     faults/sec-12   2570378.4204 (   0.00%)  2540353.2525 (  -1.17%)
> Hmean     faults/sec-21   3018640.3933 (   0.00%)  3014396.9773 (  -0.14%)
> Hmean     faults/sec-30   4150723.9209 (   0.00%)  4167550.4070 (   0.41%)
> Hmean     faults/sec-48   6459327.6559 (   0.00%)  7217660.4385 *  11.74%*
> Hmean     faults/sec-79   8977397.1421 (   0.00%) 10695351.6214 *  19.14%*
> Hmean     faults/sec-110 10590055.2262 (   0.00%) 12309035.9250 (  16.23%)
> Hmean     faults/sec-128 11448246.6485 (   0.00%) 13554648.3823 (  18.40%)

Hi Shivank,
Thank you for providing more test results! This looks quite good!

>
> Please add my:
> Tested-by: Shivank Garg <shivankg@amd.com>
>
> I would be happy to test future versions if needed.

That would be very much appreciated! I'll CC you on the next version.
Thanks,
Suren.

>
> Thanks,
> Shivank
>
>
>
> >
> > Changes since v8 [4]:
> > - Change subject for the cover letter, per Vlastimil Babka
> > - Added Reviewed-by and Acked-by, per Vlastimil Babka
> > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > per David Laight
> > - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> > per Hillf Danton and Vlastimil Babka
> > - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> > under us, per Vlastimil Babka
> > - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> > per Vlastimil Babka
> > - Changed __vma_enter_locked() parameter to centralize refcount logic,
> > per Vlastimil Babka
> > - Amended description in vm_lock replacement patch explaining the effects
> > of the patch on vm_area_struct size, per Vlastimil Babka
> > - Added vm_area_struct member regrouping patch [5] into the series
> > - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> > - Added a comment for vm_area_struct to update vm_area_init_from() when
> > adding new members, per Vlastimil Babka
> > - Updated a comment about unstable src->shared.rb when copying a vma in
> > vm_area_init_from(), per Vlastimil Babka
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> > [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> >
> > Patchset applies over mm-unstable after reverting v8
> > (current SHA range: 235b5129cb7b - 9e6b24c58985)
> >
> > Suren Baghdasaryan (17):
> >   mm: introduce vma_start_read_locked{_nested} helpers
> >   mm: move per-vma lock into vm_area_struct
> >   mm: mark vma as detached until it's added into vma tree
> >   mm: introduce vma_iter_store_attached() to use with attached vmas
> >   mm: mark vmas detached upon exit
> >   types: move struct rcuwait into types.h
> >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> >   mm: move mmap_init_lock() out of the header file
> >   mm: uninline the main body of vma_start_write()
> >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> >   mm: replace vm_lock and detached flag with a reference count
> >   mm: move lesser used vma_area_struct members into the last cacheline
> >   mm/debug: print vm_refcnt state when dumping the vma
> >   mm: remove extra vma_numab_state_init() call
> >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> >   docs/mm: document latest changes to vm_lock
> >
> >  Documentation/mm/process_addrs.rst |  44 ++++----
> >  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
> >  include/linux/mm_types.h           |  75 +++++++-------
> >  include/linux/mmap_lock.h          |   6 --
> >  include/linux/rcuwait.h            |  13 +--
> >  include/linux/refcount.h           |  24 ++++-
> >  include/linux/slab.h               |   6 --
> >  include/linux/types.h              |  12 +++
> >  kernel/fork.c                      | 129 +++++++++++-------------
> >  mm/debug.c                         |  12 +++
> >  mm/init-mm.c                       |   1 +
> >  mm/memory.c                        |  97 ++++++++++++++++--
> >  mm/mmap.c                          |   3 +-
> >  mm/userfaultfd.c                   |  32 +++---
> >  mm/vma.c                           |  23 ++---
> >  mm/vma.h                           |  15 ++-
> >  tools/testing/vma/linux/atomic.h   |   5 +
> >  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
> >  18 files changed, 465 insertions(+), 281 deletions(-)
> >
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH] refcount: Strengthen inc_not_zero()
  2025-01-27 19:21                 ` Suren Baghdasaryan
@ 2025-01-28 23:51                   ` Suren Baghdasaryan
  2025-02-06  2:52                     ` [PATCH 1/1] refcount: provide ops for cases when object's memory can be reused Suren Baghdasaryan
  2025-02-06  3:03                     ` [PATCH] refcount: Strengthen inc_not_zero() Suren Baghdasaryan
  0 siblings, 2 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-01-28 23:51 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, boqun.feng, mark.rutland, Mateusz Guzik, akpm,
	willy, liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko,
	vbabka, hannes, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Mon, Jan 27, 2025 at 11:21 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jan 27, 2025 at 6:09 AM Will Deacon <will@kernel.org> wrote:
> >
> > On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> > > On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > > > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > > >
> > > > > Notably, it means refcount_t is entirely unsuitable for anything
> > > > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > > > conditions after the refcount succeeds.
> > > > >
> > > > > And this is probably fine, but let me ponder this all a little more.
> > > >
> > > > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > > > fix this, these things are hard enough as they are.
> > > >
> > > > Will, others, wdyt?
> > >
> > > We should also update the Documentation (atomic_t.txt and
> > > refcount-vs-atomic.rst) if we strengthen this.
> > >
> > > > ---
> > > > Subject: refcount: Strengthen inc_not_zero()
> > > >
> > > > For speculative lookups where a successful inc_not_zero() pins the
> > > > object, but where we still need to double check if the object acquired
> > > > is indeed the one we set out to aquire, needs this validation to happen
> > > > *after* the increment.
> > > >
> > > > Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> > > >
> > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > ---
> > > >  include/linux/refcount.h | 15 ++++++++-------
> > > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > > index 35f039ecb272..340e7ffa445e 100644
> > > > --- a/include/linux/refcount.h
> > > > +++ b/include/linux/refcount.h
> > > > @@ -69,9 +69,10 @@
> > > >   * its the lock acquire, for RCU/lockless data structures its the dependent
> > > >   * load.
> > > >   *
> > > > - * Do note that inc_not_zero() provides a control dependency which will order
> > > > - * future stores against the inc, this ensures we'll never modify the object
> > > > - * if we did not in fact acquire a reference.
> > > > + * Do note that inc_not_zero() does provide acquire order, which will order
> > > > + * future load and stores against the inc, this ensures all subsequent accesses
> > > > + * are from this object and not anything previously occupying this memory --
> > > > + * consider SLAB_TYPESAFE_BY_RCU.
> > > >   *
> > > >   * The decrements will provide release order, such that all the prior loads and
> > > >   * stores will be issued before, it also provides a control dependency, which
> > > > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > >     do {
> > > >             if (!old)
> > > >                     break;
> > > > -   } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > > +   } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> > >
> > > Hmm, do the later memory accesses need to be ordered against the store
> > > part of the increment or just the read? If it's the former, then I don't
> > > think that _acquire is sufficient -- accesses can still get in-between
> > > the read and write parts of the RmW.
> >
> > I dug some more into this at the end of last week. For the
> > SLAB_TYPESAFE_BY_RCU where we're racing inc_not_zero() with
> > dec_and_test(), then I think using _acquire() above is correct as the
> > later references can only move up into the critical section in the case
> > that we successfully obtained a reference.
> >
> > However, if we're going to make the barriers implicit in the refcount
> > operations here then I think we also need to do something on the producer
> > side for when the object is re-initialised after being destroyed and
> > allocated again. I think that would necessitate release ordering for
> > refcount_set() so that whatever allows the consumer to validate the
> > object (e.g. sequence number) is published *before* the refcount.
>
> Thanks Will!
> I would like to expand on your answer to provide an example of the
> race that would happen without release ordering in the producer. To
> save reader's time I provide a simplified flow and reasoning first.
> More detailed code of what I'm considering a typical
> SLAB_TYPESAFE_BY_RCU refcounting example is added at the end of my
> reply (Addendum).
> Simplified flow looks like this:
>
> consumer:
>     obj = lookup(collection, key);
>     if (!refcount_inc_not_zero(&obj->ref))
>         return;
>     smp_rmb(); /* Peter's new acquire fence */
>     if (READ_ONCE(obj->key) != key) {
>         put_ref(obj);
>         return;
>     }
>     use(obj->value);
>
> producer:
>     old_key = obj->key;
>     remove(collection, old_key);
>     if (!refcount_dec_and_test(&obj->ref))
>         return;
>     obj->key = KEY_INVALID;
>     free(objj);
>     ...
>     obj = malloc(); /* obj is reused */
>     obj->key = new_key;
>     obj->value = new_value;
>     smp_wmb(); /* Will's proposed release fence */
>     refcount_set(obj->ref, 1);
>     insert(collection, key, obj);
>
> Let's consider a case when new_key == old_key. Will call both of them
> "key". Without WIll's proposed fence the following reordering is
> possible:
>
> consumer:
>     obj = lookup(collection, key);
>
>                  producer:
>                      key = obj->key
>                      remove(collection, key);
>                      if (!refcount_dec_and_test(&obj->ref))
>                          return;
>                      obj->key = KEY_INVALID;
>                      free(objj);
>                      obj = malloc(); /* obj is reused */
>                      refcount_set(obj->ref, 1);
>                      obj->key = key; /* same key */
>
>     if (!refcount_inc_not_zero(&obj->ref))
>         return;
>     smp_rmb();
>     if (READ_ONCE(obj->key) != key) {
>         put_ref(obj);
>         return;
>     }
>     use(obj->value);
>
>                      obj->value = new_value; /* reordered store */
>                      add(collection, key, obj);
>
> So, the consumer finds the old object, successfully takes a refcount
> and validates the key. It succeeds because the object is allocated and
> has the same key, which is fine. However it proceeds to use stale
> obj->value. Will's proposed release ordering would prevent that.
>
> The example in https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> omits all these ordering issues for SLAB_TYPESAFE_BY_RCU.
> I think it would be better to introduce two new functions:
> refcount_add_not_zero_acquire() and refcount_set_release(), clearly
> document that they should be used when a freed object can be recycled
> and reused, like in SLAB_TYPESAFE_BY_RCU case. refcount_set_release()
> should also clarify that once it's called, the object can be accessed
> by consumers even if it was not added yet into the collection used for
> object lookup (like in the example above). SLAB_TYPESAFE_BY_RCU
> comment at https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> then can explicitly use these new functions in the example code,
> further clarifying their purpose and proper use.
> WDYT?

Hi Peter,
Should I take a stab at preparing a patch to add the two new
refcounting functions suggested above, along with updates to the
documentation and comments?
If you disagree with that approach or need more time to decide, I'll wait.
Please let me know.
Thanks,
Suren.


>
> ADDENDUM.
> Detailed code for typical use of refcounting with SLAB_TYPESAFE_BY_RCU:
>
> struct object {
>     refcount_t ref;
>     u64 key;
>     u64 value;
> };
>
> void init(struct object *obj, u64 key, u64 value)
> {
>     obj->key = key;
>     obj->value = value;
>     smp_wmb(); /* Will's proposed release fence */
>     refcount_set(obj->ref, 1);
> }
>
> bool get_ref(struct object *obj, u64 key)
> {
>     if (!refcount_inc_not_zero(&obj->ref))
>         return false;
>     smp_rmb(); /* Peter's new acquire fence */
>     if (READ_ONCE(obj->key) != key) {
>         put_ref(obj);
>         return false;
>     }
>     return true;
> }
>
> void put_ref(struct object *obj)
> {
>     if (!refcount_dec_and_test(&obj->ref))
>         return;
>     obj->key = KEY_INVALID;
>     free(obj);
> }
>
> consumer:
>     obj = lookup(collection, key);
>     if (!get_ref(obj, key)
>         return;
>     use(obj->value);
>
> producer:
>     remove(collection, old_obj->key);
>     put_ref(old_obj);
>     new_obj = malloc();
>     init(new_obj, new_key, new_value);
>     insert(collection, new_key, new_obj);
>
> With SLAB_TYPESAFE_BY_RCU old_obj in the producer can be reused and be
> equal to new_obj.
>
>
> >
> > Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 1/1] refcount: provide ops for cases when object's memory can be reused
  2025-01-28 23:51                   ` Suren Baghdasaryan
@ 2025-02-06  2:52                     ` Suren Baghdasaryan
  2025-02-06 10:41                       ` Vlastimil Babka
  2025-02-06  3:03                     ` [PATCH] refcount: Strengthen inc_not_zero() Suren Baghdasaryan
  1 sibling, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-02-06  2:52 UTC (permalink / raw)
  To: peterz
  Cc: will, paulmck, boqun.feng, mark.rutland, mjguzik, akpm, willy,
	liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko, vbabka,
	hannes, oliver.sang, mgorman, david, peterx, oleg, dave, brauner,
	dhowells, hdanton, hughd, lokeshgidra, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team, Suren Baghdasaryan

For speculative lookups where a successful inc_not_zero() pins the
object, but where we still need to double check whether the object
acquired is indeed the one we set out to acquire (identity check), this
validation needs to happen *after* the increment.
Similarly, when a new object is initialized and its memory might have
been previously occupied by another object, all stores to initialize the
object should happen *before* refcount initialization.

Notably, SLAB_TYPESAFE_BY_RCU is one such example where this ordering
is required for reference counting.

Add refcount_{add|inc}_not_zero_acquire() to guarantee the proper ordering
between acquiring a reference count on an object and performing the
identity check for that object.
Add refcount_set_release() to guarantee proper ordering between stores
initializing object attributes and the store initializing the refcount.
refcount_set_release() should be called after all other object attributes
are initialized. Once refcount_set_release() is called, the object should
be considered visible to other tasks even if it has not yet been added
into an object collection normally used to discover it. This is because
other tasks might have discovered the object previously occupying the
same memory and, after memory reuse, they can succeed in taking a
refcount on the new object and start using it.

Object reuse example to consider:

consumer:
    obj = lookup(collection, key);
    if (!refcount_inc_not_zero_acquire(&obj->ref))
        return;
    if (READ_ONCE(obj->key) != key) { /* identity check */
        put_ref(obj);
        return;
    }
    use(obj->value);

                 producer:
                     remove(collection, obj->key);
                     if (!refcount_dec_and_test(&obj->ref))
                         return;
                     obj->key = KEY_INVALID;
                     free(obj);
                     obj = malloc(); /* obj is reused */
                     obj->key = new_key;
                     obj->value = new_value;
                     refcount_set_release(&obj->ref, 1);
                     add(collection, new_key, obj);

refcount_{add|inc}_not_zero_acquire() is required to prevent the following
reordering when refcount_inc_not_zero() is used instead:

consumer:
    obj = lookup(collection, key);
    if (READ_ONCE(obj->key) != key) { /* reordered identity check */
        put_ref(obj);
        return;
    }
                 producer:
                     remove(collection, obj->key);
                     if (!refcount_dec_and_test(&obj->ref))
                         return;
                     obj->key = KEY_INVALID;
                     free(obj);
                     obj = malloc(); /* obj is reused */
                     obj->key = new_key;
                     obj->value = new_value;
                     refcount_set_release(&obj->ref, 1);
                     add(collection, new_key, obj);

    if (!refcount_inc_not_zero(&obj->ref))
        return;
    use(obj->value); /* USING WRONG OBJECT */

refcount_set_release() is required to prevent the following reordering
when refcount_set() is used instead:

consumer:
    obj = lookup(collection, key);

                 producer:
                     remove(collection, obj->key);
                     if (!refcount_dec_and_test(&obj->ref))
                         return;
                     obj->key = KEY_INVALID;
                     free(obj);
                     obj = malloc(); /* obj is reused */
                     obj->key = new_key; /* new_key == old_key */
                     refcount_set(&obj->ref, 1);

    if (!refcount_inc_not_zero_acquire(&obj->ref))
        return;
    if (READ_ONCE(obj->key) != key) { /* pass since new_key == old_key */
        put_ref(obj);
        return;
    }
    use(obj->value); /* USING STALE obj->value */

                     obj->value = new_value; /* reordered store */
                     add(collection, key, obj);

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
---
 Documentation/RCU/whatisRCU.rst               |  10 ++
 Documentation/core-api/refcount-vs-atomic.rst |  37 +++++-
 include/linux/refcount.h                      | 106 ++++++++++++++++++
 include/linux/slab.h                          |   9 ++
 4 files changed, 156 insertions(+), 6 deletions(-)

diff --git a/Documentation/RCU/whatisRCU.rst b/Documentation/RCU/whatisRCU.rst
index 1ef5784c1b84..53faeed7c190 100644
--- a/Documentation/RCU/whatisRCU.rst
+++ b/Documentation/RCU/whatisRCU.rst
@@ -971,6 +971,16 @@ unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
 initialized after each and every call to kmem_cache_alloc(), which renders
 reference-free spinlock acquisition completely unsafe.  Therefore, when
 using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
+If using refcount_t, the specialized refcount_{add|inc}_not_zero_acquire()
+and refcount_set_release() APIs should be used to ensure correct operation
+ordering when verifying object identity and when initializing newly
+allocated objects. Acquire fence in refcount_{add|inc}_not_zero_acquire()
+ensures that identity checks happen *after* reference count is taken.
+refcount_set_release() should be called after a newly allocated object is
+fully initialized and release fence ensures that new values are visible
+*before* refcount can be successfully taken by other users. Once
+refcount_set_release() is called, the object should be considered visible
+by other tasks.
 (Those willing to initialize their locks in a kmem_cache constructor
 may also use locking, including cache-friendly sequence locking.)
 
diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst
index 79a009ce11df..9551a7bbfd38 100644
--- a/Documentation/core-api/refcount-vs-atomic.rst
+++ b/Documentation/core-api/refcount-vs-atomic.rst
@@ -86,7 +86,19 @@ Memory ordering guarantee changes:
  * none (both fully unordered)
 
 
-case 2) - increment-based ops that return no value
+case 2) - non-"Read/Modify/Write" (RMW) ops with release ordering
+------------------------------------------------------------------
+
+Function changes:
+
+ * atomic_set_release() --> refcount_set_release()
+
+Memory ordering guarantee changes:
+
+ * none (both provide RELEASE ordering)
+
+
+case 3) - increment-based ops that return no value
 --------------------------------------------------
 
 Function changes:
@@ -98,7 +110,7 @@ Memory ordering guarantee changes:
 
  * none (both fully unordered)
 
-case 3) - decrement-based RMW ops that return no value
+case 4) - decrement-based RMW ops that return no value
 ------------------------------------------------------
 
 Function changes:
@@ -110,7 +122,7 @@ Memory ordering guarantee changes:
  * fully unordered --> RELEASE ordering
 
 
-case 4) - increment-based RMW ops that return a value
+case 5) - increment-based RMW ops that return a value
 -----------------------------------------------------
 
 Function changes:
@@ -126,7 +138,20 @@ Memory ordering guarantees changes:
    result of obtaining pointer to the object!
 
 
-case 5) - generic dec/sub decrement-based RMW ops that return a value
+case 6) - increment-based RMW ops with acquire ordering that return a value
+----------------------------------------------------------------------------
+
+Function changes:
+
+ * atomic_inc_not_zero() --> refcount_inc_not_zero_acquire()
+ * no atomic counterpart --> refcount_add_not_zero_acquire()
+
+Memory ordering guarantees changes:
+
+ * fully ordered --> ACQUIRE ordering on success
+
+
+case 7) - generic dec/sub decrement-based RMW ops that return a value
 ---------------------------------------------------------------------
 
 Function changes:
@@ -139,7 +164,7 @@ Memory ordering guarantees changes:
  * fully ordered --> RELEASE ordering + ACQUIRE ordering on success
 
 
-case 6) other decrement-based RMW ops that return a value
+case 8) other decrement-based RMW ops that return a value
 ---------------------------------------------------------
 
 Function changes:
@@ -154,7 +179,7 @@ Memory ordering guarantees changes:
 .. note:: atomic_add_unless() only provides full order on success.
 
 
-case 7) - lock-based RMW
+case 9) - lock-based RMW
 ------------------------
 
 Function changes:
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..4589d2e7bfea 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -87,6 +87,15 @@
  * The decrements dec_and_test() and sub_and_test() also provide acquire
  * ordering on success.
  *
+ * refcount_{add|inc}_not_zero_acquire() and refcount_set_release() provide
+ * acquire and release ordering for cases when the memory occupied by the
+ * object might be reused to store another object. This is important for the
+ * cases where secondary validation is required to detect such reuse, e.g.
+ * SLAB_TYPESAFE_BY_RCU. The secondary validation checks have to happen after
+ * the refcount is taken, hence acquire order is necessary. Similarly, when the
+ * object is initialized, all stores to its attributes should be visible before
+ * the refcount is set, otherwise a stale attribute value might be used by
+ * another task which succeeds in taking a refcount to the new object.
  */
 
 #ifndef _LINUX_REFCOUNT_H
@@ -125,6 +134,31 @@ static inline void refcount_set(refcount_t *r, int n)
 	atomic_set(&r->refs, n);
 }
 
+/**
+ * refcount_set_release - set a refcount's value with release ordering
+ * @r: the refcount
+ * @n: value to which the refcount will be set
+ *
+ * This function should be used when memory occupied by the object might be
+ * reused to store another object -- consider SLAB_TYPESAFE_BY_RCU.
+ *
+ * Provides release memory ordering which will order previous memory operations
+ * against this store. This ensures all updates to this object are visible
+ * once the refcount is set and stale values from the object previously
+ * occupying this memory are overwritten with new ones.
+ *
+ * This function should be called only after new object is fully initialized.
+ * After this call the object should be considered visible to other tasks even
+ * if it was not yet added into an object collection normally used to discover
+ * it. This is because other tasks might have discovered the object previously
+ * occupying the same memory and after memory reuse they can succeed in taking
+ * refcount to the new object and start using it.
+ */
+static inline void refcount_set_release(refcount_t *r, int n)
+{
+	atomic_set_release(&r->refs, n);
+}
+
 /**
  * refcount_read - get a refcount's value
  * @r: the refcount
@@ -178,6 +212,52 @@ static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
 	return __refcount_add_not_zero(i, r, NULL);
 }
 
+static inline __must_check __signed_wrap
+bool __refcount_add_not_zero_acquire(int i, refcount_t *r, int *oldp)
+{
+	int old = refcount_read(r);
+
+	do {
+		if (!old)
+			break;
+	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
+
+	if (oldp)
+		*oldp = old;
+
+	if (unlikely(old < 0 || old + i < 0))
+		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
+
+	return old;
+}
+
+/**
+ * refcount_add_not_zero_acquire - add a value to a refcount with acquire ordering unless it is 0
+ *
+ * @i: the value to add to the refcount
+ * @r: the refcount
+ *
+ * Will saturate at REFCOUNT_SATURATED and WARN.
+ *
+ * This function should be used when memory occupied by the object might be
+ * reused to store another object -- consider SLAB_TYPESAFE_BY_RCU.
+ *
+ * Provides acquire memory ordering on success, it is assumed the caller has
+ * guaranteed the object memory to be stable (RCU, etc.). It does provide a
+ * control dependency and thereby orders future stores. See the comment on top.
+ *
+ * Use of this function is not recommended for the normal reference counting
+ * use case in which references are taken and released one at a time.  In these
+ * cases, refcount_inc_not_zero_acquire() should instead be used to increment a
+ * reference count.
+ *
+ * Return: false if the passed refcount is 0, true otherwise
+ */
+static inline __must_check bool refcount_add_not_zero_acquire(int i, refcount_t *r)
+{
+	return __refcount_add_not_zero_acquire(i, r, NULL);
+}
+
 static inline __signed_wrap
 void __refcount_add(int i, refcount_t *r, int *oldp)
 {
@@ -236,6 +316,32 @@ static inline __must_check bool refcount_inc_not_zero(refcount_t *r)
 	return __refcount_inc_not_zero(r, NULL);
 }
 
+static inline __must_check bool __refcount_inc_not_zero_acquire(refcount_t *r, int *oldp)
+{
+	return __refcount_add_not_zero_acquire(1, r, oldp);
+}
+
+/**
+ * refcount_inc_not_zero_acquire - increment a refcount with acquire ordering unless it is 0
+ * @r: the refcount to increment
+ *
+ * Similar to refcount_inc_not_zero(), but provides acquire memory ordering on
+ * success.
+ *
+ * This function should be used when memory occupied by the object might be
+ * reused to store another object -- consider SLAB_TYPESAFE_BY_RCU.
+ *
+ * Provides acquire memory ordering on success, it is assumed the caller has
+ * guaranteed the object memory to be stable (RCU, etc.). It does provide a
+ * control dependency and thereby orders future stores. See the comment on top.
+ *
+ * Return: true if the increment was successful, false otherwise
+ */
+static inline __must_check bool refcount_inc_not_zero_acquire(refcount_t *r)
+{
+	return __refcount_inc_not_zero_acquire(r, NULL);
+}
+
 static inline void __refcount_inc(refcount_t *r, int *oldp)
 {
 	__refcount_add(1, r, oldp);
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 09eedaecf120..ad902a2d692b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -136,6 +136,15 @@ enum _slab_flag_bits {
  * rcu_read_lock before reading the address, then rcu_read_unlock after
  * taking the spinlock within the structure expected at that address.
  *
+ * Note that object identity check has to be done *after* acquiring a
+ * reference, therefore user has to ensure proper ordering for loads.
+ * Similarly, when initializing objects allocated with SLAB_TYPESAFE_BY_RCU,
+ * the newly allocated object has to be fully initialized *before* its
+ * refcount gets initialized and proper ordering for stores is required.
+ * refcount_{add|inc}_not_zero_acquire() and refcount_set_release() are
+ * designed with the proper fences required for reference counting objects
+ * allocated with SLAB_TYPESAFE_BY_RCU.
+ *
  * Note that it is not possible to acquire a lock within a structure
  * allocated with SLAB_TYPESAFE_BY_RCU without first acquiring a reference
  * as described above.  The reason is that SLAB_TYPESAFE_BY_RCU pages

base-commit: 92514ef226f511f2ca1fb1b8752966097518edc0
-- 
2.48.1.362.g079036d154-goog
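
For illustration, a minimal end-to-end sketch of how the new helpers are
meant to be used with a SLAB_TYPESAFE_BY_RCU cache. The obj_cache,
obj_lookup() and field names below are made up for the example; the
refcount, RCU and slab calls are the real APIs:

struct object {
	refcount_t ref;
	u64 key;
	u64 value;
};

/* Hypothetical cache, created elsewhere with:
 * obj_cache = kmem_cache_create("obj", sizeof(struct object), 0,
 *				 SLAB_TYPESAFE_BY_RCU, NULL);
 */
static struct kmem_cache *obj_cache;

/* Hypothetical lookup in some RCU-safe structure (hash table, etc.). */
static struct object *obj_lookup(u64 key);

static void obj_put(struct object *obj)
{
	if (refcount_dec_and_test(&obj->ref))
		kmem_cache_free(obj_cache, obj); /* memory may be reused at once */
}

/* Producer: fully initialize the object, then publish the refcount last. */
static struct object *obj_create(u64 key, u64 value)
{
	struct object *obj = kmem_cache_alloc(obj_cache, GFP_KERNEL);

	if (!obj)
		return NULL;
	obj->key = key;
	obj->value = value;
	refcount_set_release(&obj->ref, 1); /* visible to consumers from here */
	return obj;
}

/* Consumer: speculative lookup under RCU, identity check done only
 * after the increment has succeeded. */
static struct object *obj_get(u64 key)
{
	struct object *obj;

	rcu_read_lock();
	obj = obj_lookup(key);
	if (obj && refcount_inc_not_zero_acquire(&obj->ref)) {
		if (READ_ONCE(obj->key) != key) { /* the memory was reused */
			obj_put(obj);
			obj = NULL;
		}
	} else {
		obj = NULL;
	}
	rcu_read_unlock();
	return obj;
}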


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH] refcount: Strengthen inc_not_zero()
  2025-01-28 23:51                   ` Suren Baghdasaryan
  2025-02-06  2:52                     ` [PATCH 1/1] refcount: provide ops for cases when object's memory can be reused Suren Baghdasaryan
@ 2025-02-06  3:03                     ` Suren Baghdasaryan
  2025-02-13 23:04                       ` Suren Baghdasaryan
  1 sibling, 1 reply; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-02-06  3:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, boqun.feng, mark.rutland, Mateusz Guzik, akpm,
	willy, liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko,
	vbabka, hannes, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Tue, Jan 28, 2025 at 3:51 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jan 27, 2025 at 11:21 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Jan 27, 2025 at 6:09 AM Will Deacon <will@kernel.org> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> > > > On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > > > > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > > > >
> > > > > > Notably, it means refcount_t is entirely unsuitable for anything
> > > > > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > > > > conditions after the refcount succeeds.
> > > > > >
> > > > > > And this is probably fine, but let me ponder this all a little more.
> > > > >
> > > > > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > > > > fix this, these things are hard enough as they are.
> > > > >
> > > > > Will, others, wdyt?
> > > >
> > > > We should also update the Documentation (atomic_t.txt and
> > > > refcount-vs-atomic.rst) if we strengthen this.
> > > >
> > > > > ---
> > > > > Subject: refcount: Strengthen inc_not_zero()
> > > > >
> > > > > For speculative lookups, where a successful inc_not_zero() pins the
> > > > > object but we still need to double check that the object acquired
> > > > > is indeed the one we set out to acquire, this validation needs to
> > > > > happen *after* the increment.
> > > > >
> > > > > Notably SLAB_TYPESAFE_BY_RCU is one such example.
> > > > >
> > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > ---
> > > > >  include/linux/refcount.h | 15 ++++++++-------
> > > > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > > > index 35f039ecb272..340e7ffa445e 100644
> > > > > --- a/include/linux/refcount.h
> > > > > +++ b/include/linux/refcount.h
> > > > > @@ -69,9 +69,10 @@
> > > > >   * its the lock acquire, for RCU/lockless data structures its the dependent
> > > > >   * load.
> > > > >   *
> > > > > - * Do note that inc_not_zero() provides a control dependency which will order
> > > > > - * future stores against the inc, this ensures we'll never modify the object
> > > > > - * if we did not in fact acquire a reference.
> > > > > + * Do note that inc_not_zero() does provide acquire order, which will order
> > > > > + * future load and stores against the inc, this ensures all subsequent accesses
> > > > > + * are from this object and not anything previously occupying this memory --
> > > > > + * consider SLAB_TYPESAFE_BY_RCU.
> > > > >   *
> > > > >   * The decrements will provide release order, such that all the prior loads and
> > > > >   * stores will be issued before, it also provides a control dependency, which
> > > > > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > > >     do {
> > > > >             if (!old)
> > > > >                     break;
> > > > > -   } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > > > +   } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> > > >
> > > > Hmm, do the later memory accesses need to be ordered against the store
> > > > part of the increment or just the read? If it's the former, then I don't
> > > > think that _acquire is sufficient -- accesses can still get in-between
> > > > the read and write parts of the RmW.
> > >
> > > I dug some more into this at the end of last week. For the
> > > SLAB_TYPESAFE_BY_RCU where we're racing inc_not_zero() with
> > > dec_and_test(), then I think using _acquire() above is correct as the
> > > later references can only move up into the critical section in the case
> > > that we successfully obtained a reference.
> > >
> > > However, if we're going to make the barriers implicit in the refcount
> > > operations here then I think we also need to do something on the producer
> > > side for when the object is re-initialised after being destroyed and
> > > allocated again. I think that would necessitate release ordering for
> > > refcount_set() so that whatever allows the consumer to validate the
> > > object (e.g. sequence number) is published *before* the refcount.
> >
> > Thanks Will!
> > I would like to expand on your answer to provide an example of the
> > race that would happen without release ordering in the producer. To
> > save the reader's time I provide a simplified flow and reasoning first.
> > More detailed code of what I'm considering a typical
> > SLAB_TYPESAFE_BY_RCU refcounting example is added at the end of my
> > reply (Addendum).
> > Simplified flow looks like this:
> >
> > consumer:
> >     obj = lookup(collection, key);
> >     if (!refcount_inc_not_zero(&obj->ref))
> >         return;
> >     smp_rmb(); /* Peter's new acquire fence */
> >     if (READ_ONCE(obj->key) != key) {
> >         put_ref(obj);
> >         return;
> >     }
> >     use(obj->value);
> >
> > producer:
> >     old_key = obj->key;
> >     remove(collection, old_key);
> >     if (!refcount_dec_and_test(&obj->ref))
> >         return;
> >     obj->key = KEY_INVALID;
> >     free(obj);
> >     ...
> >     obj = malloc(); /* obj is reused */
> >     obj->key = new_key;
> >     obj->value = new_value;
> >     smp_wmb(); /* Will's proposed release fence */
> >     refcount_set(obj->ref, 1);
> >     insert(collection, new_key, obj);
> >
> > Let's consider a case when new_key == old_key. We'll call both of them
> > "key". Without Will's proposed fence the following reordering is
> > possible:
> >
> > consumer:
> >     obj = lookup(collection, key);
> >
> >                  producer:
> >                      key = obj->key
> >                      remove(collection, key);
> >                      if (!refcount_dec_and_test(&obj->ref))
> >                          return;
> >                      obj->key = KEY_INVALID;
> >                      free(obj);
> >                      obj = malloc(); /* obj is reused */
> >                      refcount_set(obj->ref, 1);
> >                      obj->key = key; /* same key */
> >
> >     if (!refcount_inc_not_zero(&obj->ref))
> >         return;
> >     smp_rmb();
> >     if (READ_ONCE(obj->key) != key) {
> >         put_ref(obj);
> >         return;
> >     }
> >     use(obj->value);
> >
> >                      obj->value = new_value; /* reordered store */
> >                      add(collection, key, obj);
> >
> > So, the consumer finds the old object, successfully takes a refcount
> > and validates the key. It succeeds because the object is allocated and
> > has the same key, which is fine. However it proceeds to use stale
> > obj->value. Will's proposed release ordering would prevent that.
> >
> > The example in https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> > omits all these ordering issues for SLAB_TYPESAFE_BY_RCU.
> > I think it would be better to introduce two new functions:
> > refcount_add_not_zero_acquire() and refcount_set_release(), clearly
> > document that they should be used when a freed object can be recycled
> > and reused, like in SLAB_TYPESAFE_BY_RCU case. refcount_set_release()
> > should also clarify that once it's called, the object can be accessed
> > by consumers even if it was not added yet into the collection used for
> > object lookup (like in the example above). SLAB_TYPESAFE_BY_RCU
> > comment at https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> > then can explicitly use these new functions in the example code,
> > further clarifying their purpose and proper use.
> > WDYT?
>
> Hi Peter,
> Should I take a stab at preparing a patch to add the two new
> refcounting functions suggested above with updates to the
> documentation and comments?
> If you disagree with that or need more time to decide then I'll wait.
> Please let me know.

Not sure if "--in-reply-to" worked but I just posted a patch adding
new refcounting APIs for SLAB_TYPESAFE_BY_RCU here:
https://lore.kernel.org/all/20250206025201.979573-1-surenb@google.com/
Since Peter seems to be busy I discussed these ordering requirements
for SLAB_TYPESAFE_BY_RCU with Paul McKenney and he was leaning towards
having separate functions with the additional fences for this case.
That's what I provided in my patch.
Another possible option is to add acquire ordering in the
__refcount_add_not_zero() as Peter suggested and add
refcount_set_release() function.
Thanks,
Suren.


> Thanks,
> Suren.
>
>
> >
> > ADDENDUM.
> > Detailed code for typical use of refcounting with SLAB_TYPESAFE_BY_RCU:
> >
> > struct object {
> >     refcount_t ref;
> >     u64 key;
> >     u64 value;
> > };
> >
> > void init(struct object *obj, u64 key, u64 value)
> > {
> >     obj->key = key;
> >     obj->value = value;
> >     smp_wmb(); /* Will's proposed release fence */
> >     refcount_set(&obj->ref, 1);
> > }
> >
> > bool get_ref(struct object *obj, u64 key)
> > {
> >     if (!refcount_inc_not_zero(&obj->ref))
> >         return false;
> >     smp_rmb(); /* Peter's new acquire fence */
> >     if (READ_ONCE(obj->key) != key) {
> >         put_ref(obj);
> >         return false;
> >     }
> >     return true;
> > }
> >
> > void put_ref(struct object *obj)
> > {
> >     if (!refcount_dec_and_test(&obj->ref))
> >         return;
> >     obj->key = KEY_INVALID;
> >     free(obj);
> > }
> >
> > consumer:
> >     obj = lookup(collection, key);
> >     if (!get_ref(obj, key))
> >         return;
> >     use(obj->value);
> >
> > producer:
> >     remove(collection, old_obj->key);
> >     put_ref(old_obj);
> >     new_obj = malloc();
> >     init(new_obj, new_key, new_value);
> >     insert(collection, new_key, new_obj);
> >
> > With SLAB_TYPESAFE_BY_RCU old_obj in the producer can be reused and be
> > equal to new_obj.
> >
> >
> > >
> > > Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 1/1] refcount: provide ops for cases when object's memory can be reused
  2025-02-06  2:52                     ` [PATCH 1/1] refcount: provide ops for cases when object's memory can be reused Suren Baghdasaryan
@ 2025-02-06 10:41                       ` Vlastimil Babka
  0 siblings, 0 replies; 140+ messages in thread
From: Vlastimil Babka @ 2025-02-06 10:41 UTC (permalink / raw)
  To: Suren Baghdasaryan, peterz
  Cc: will, paulmck, boqun.feng, mark.rutland, mjguzik, akpm, willy,
	liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko, hannes,
	oliver.sang, mgorman, david, peterx, oleg, dave, brauner,
	dhowells, hdanton, hughd, lokeshgidra, minchan, jannh,
	shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On 2/6/25 03:52, Suren Baghdasaryan wrote:
> For speculative lookups, where a successful inc_not_zero() pins the
> object but we still need to double check that the object acquired
> is indeed the one we set out to acquire (identity check), this
> validation needs to happen *after* the increment.
> Similarly, when a new object is initialized and its memory might have
> been previously occupied by another object, all stores to initialize the
> object should happen *before* refcount initialization.
> 
> Notably SLAB_TYPESAFE_BY_RCU is one such example where this ordering
> is required for reference counting.
> 
> Add refcount_{add|inc}_not_zero_acquire() to guarantee the proper ordering
> between acquiring a reference count on an object and performing the
> identity check for that object.
> Add refcount_set_release() to guarantee proper ordering between stores
> initializing object attributes and the store initializing the refcount.
> refcount_set_release() should be done after all other object attributes
> are initialized. Once refcount_set_release() is called, the object should
> be considered visible to other tasks even if it was not yet added into an
> object collection normally used to discover it. This is because other
> tasks might have discovered the object previously occupying the same
> memory and after memory reuse they can succeed in taking refcount for the
> new object and start using it.
> 
> Object reuse example to consider:
> 
> consumer:
>     obj = lookup(collection, key);
>     if (!refcount_inc_not_zero_acquire(&obj->ref))
>         return;
>     if (READ_ONCE(obj->key) != key) { /* identity check */
>         put_ref(obj);
>         return;
>     }
>     use(obj->value);
> 
>                  producer:
>                      remove(collection, obj->key);
>                      if (!refcount_dec_and_test(&obj->ref))
>                          return;
>                      obj->key = KEY_INVALID;
>                      free(obj);
>                      obj = malloc(); /* obj is reused */
>                      obj->key = new_key;
>                      obj->value = new_value;
>                      refcount_set_release(obj->ref, 1);
>                      add(collection, new_key, obj);
> 
> refcount_{add|inc}_not_zero_acquire() is required to prevent the following
> reordering when refcount_inc_not_zero() is used instead:
> 
> consumer:
>     obj = lookup(collection, key);
>     if (READ_ONCE(obj->key) != key) { /* reordered identity check */
>         put_ref(obj);
>         return;
>     }
>                  producer:
>                      remove(collection, obj->key);
>                      if (!refcount_dec_and_test(&obj->ref))
>                          return;
>                      obj->key = KEY_INVALID;
>                      free(obj);
>                      obj = malloc(); /* obj is reused */
>                      obj->key = new_key;
>                      obj->value = new_value;
>                      refcount_set_release(obj->ref, 1);
>                      add(collection, new_key, obj);
> 
>     if (!refcount_inc_not_zero(&obj->ref))
>         return;
>     use(obj->value); /* USING WRONG OBJECT */
> 
> refcount_set_release() is required to prevent the following reordering
> when refcount_set() is used instead:
> 
> consumer:
>     obj = lookup(collection, key);
> 
>                  producer:
>                      remove(collection, obj->key);
>                      if (!refcount_dec_and_test(&obj->ref))
>                          return;
>                      obj->key = KEY_INVALID;
>                      free(obj);
>                      obj = malloc(); /* obj is reused */
>                      obj->key = new_key; /* new_key == old_key */
>                      refcount_set(obj->ref, 1);
> 
>     if (!refcount_inc_not_zero_acquire(&obj->ref))
>         return;
>     if (READ_ONCE(obj->key) != key) { /* pass since new_key == old_key */
>         put_ref(obj);
>         return;
>     }
>     use(obj->value); /* USING STALE obj->value */
> 
>                      obj->value = new_value; /* reordered store */
>                      add(collection, key, obj);
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz> #slab


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU
  2025-01-15 15:10         ` Suren Baghdasaryan
@ 2025-02-13 22:56           ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-02-13 22:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Wei Yang, willy, akpm, peterz, liam.howlett, lorenzo.stoakes,
	david.laight.linux, mhocko, hannes, mjguzik, oliver.sang, mgorman,
	david, peterx, oleg, dave, paulmck, brauner, dhowells, hdanton,
	hughd, lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
	linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 7:10 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 14, 2025 at 11:58 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 1/15/25 04:15, Suren Baghdasaryan wrote:
> > > On Tue, Jan 14, 2025 at 6:27 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> > >>
> > >> On Fri, Jan 10, 2025 at 08:26:03PM -0800, Suren Baghdasaryan wrote:
> > >>
> > >> >diff --git a/kernel/fork.c b/kernel/fork.c
> > >> >index 9d9275783cf8..151b40627c14 100644
> > >> >--- a/kernel/fork.c
> > >> >+++ b/kernel/fork.c
> > >> >@@ -449,6 +449,42 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > >> >       return vma;
> > >> > }
> > >> >
> > >> >+static void vm_area_init_from(const struct vm_area_struct *src,
> > >> >+                            struct vm_area_struct *dest)
> > >> >+{
> > >> >+      dest->vm_mm = src->vm_mm;
> > >> >+      dest->vm_ops = src->vm_ops;
> > >> >+      dest->vm_start = src->vm_start;
> > >> >+      dest->vm_end = src->vm_end;
> > >> >+      dest->anon_vma = src->anon_vma;
> > >> >+      dest->vm_pgoff = src->vm_pgoff;
> > >> >+      dest->vm_file = src->vm_file;
> > >> >+      dest->vm_private_data = src->vm_private_data;
> > >> >+      vm_flags_init(dest, src->vm_flags);
> > >> >+      memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > >> >+             sizeof(dest->vm_page_prot));
> > >> >+      /*
> > >> >+       * src->shared.rb may be modified concurrently when called from
> > >> >+       * dup_mmap(), but the clone will reinitialize it.
> > >> >+       */
> > >> >+      data_race(memcpy(&dest->shared, &src->shared, sizeof(dest->shared)));
> > >> >+      memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > >> >+             sizeof(dest->vm_userfaultfd_ctx));
> > >> >+#ifdef CONFIG_ANON_VMA_NAME
> > >> >+      dest->anon_name = src->anon_name;
> > >> >+#endif
> > >> >+#ifdef CONFIG_SWAP
> > >> >+      memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > >> >+             sizeof(dest->swap_readahead_info));
> > >> >+#endif
> > >> >+#ifndef CONFIG_MMU
> > >> >+      dest->vm_region = src->vm_region;
> > >> >+#endif
> > >> >+#ifdef CONFIG_NUMA
> > >> >+      dest->vm_policy = src->vm_policy;
> > >> >+#endif
> > >> >+}
> > >>
> > >> Would this be difficult to maintain? We should make sure not to miss or overwrite
> > >> anything.
> > >
> > > Yeah, it is less maintainable than a simple memcpy() but I did not
> > > find a better alternative.
> >
> > Willy knows one but refuses to share it :(
>
> Ah, that reminds me why I dropped this approach :) But to be honest,
> back then we also had vma_clear() and that added to the ugliness. Now
> I could simply do this without all those macros:
>
> static inline void vma_copy(struct vm_area_struct *new,
>                             struct vm_area_struct *orig)
> {
>         /* Copy the vma while preserving vma->vm_lock */
>         data_race(memcpy(new, orig, offsetof(struct vm_area_struct, vm_lock)));
>         data_race(memcpy((void *)new + offsetofend(struct vm_area_struct, vm_lock),
>                 (void *)orig + offsetofend(struct vm_area_struct, vm_lock),
>                 sizeof(struct vm_area_struct) -
>                 offsetofend(struct vm_area_struct, vm_lock)));
> }
>
> Would that be better than the current approach?

I discussed proposed alternatives with Willy and he prefers the
current field-by-field copy approach. I also tried using
kmsan_check_memory() to check for uninitialized memory in the
vm_area_struct but unfortunately KMSAN stumbles on the holes in this
structure and there are 4 of them (I attached pahole output at the end
of this email). I tried unpoisoning holes but that gets very ugly very
fast. So, I posted v10
(https://lore.kernel.org/all/20250213224655.1680278-18-surenb@google.com/)
without changing this part.

struct vm_area_struct {
        union {
                struct {
                        unsigned long vm_start;          /*     0     8 */
                        unsigned long vm_end;            /*     8     8 */
                };                                       /*     0    16 */
                freeptr_t          vm_freeptr;           /*     0     8 */
        };                                               /*     0    16 */
        struct mm_struct *         vm_mm;                /*    16     8 */
        pgprot_t                   vm_page_prot;         /*    24     8 */
        union {
                const vm_flags_t   vm_flags;             /*    32     8 */
                vm_flags_t         __vm_flags;           /*    32     8 */
        };                                               /*    32     8 */
        unsigned int               vm_lock_seq;          /*    40     4 */

        /* XXX 4 bytes hole, try to pack */

        struct list_head           anon_vma_chain;       /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct anon_vma *          anon_vma;             /*    64     8 */
        const struct vm_operations_struct  * vm_ops;     /*    72     8 */
        unsigned long              vm_pgoff;             /*    80     8 */
        struct file *              vm_file;              /*    88     8 */
        void *                     vm_private_data;      /*    96     8 */
        atomic_long_t              swap_readahead_info;  /*   104     8 */
        struct mempolicy *         vm_policy;            /*   112     8 */

        /* XXX 8 bytes hole, try to pack */

        /* --- cacheline 2 boundary (128 bytes) --- */
        refcount_t                 vm_refcnt __attribute__((__aligned__(64))); /*   128     4 */

        /* XXX 4 bytes hole, try to pack */

        struct {
                struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
                unsigned long      rb_subtree_last;      /*   160     8 */
        } shared;                                        /*   136    32 */
        struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */

        /* size: 192, cachelines: 3, members: 16 */
        /* sum members: 152, holes: 3, sum holes: 16 */
        /* padding: 24 */
        /* forced alignments: 1, forced holes: 1, sum forced holes: 8 */
};
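
Schematically, the discarded KMSAN experiment mentioned above would have
looked something like the following (the src/dest naming is assumed;
kmsan_check_memory() is the checker named in the text):

        /* after the field-by-field copy, ask KMSAN whether any byte of the
         * destination vma is still uninitialized */
        vm_area_init_from(src, dest);
        kmsan_check_memory(dest, sizeof(*dest));
        /* in practice this fires on the padding holes visible in the pahole
         * output above, which is why the check was dropped */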

>
> >
> > > I added a warning above the struct
> > > vm_area_struct definition to update this function every time we change
> > > that structure. Not sure if there is anything else I can do to help
> > > with this.
> > >
> > >>
> > >> --
> > >> Wei Yang
> > >> Help you, Help me
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v9 12/17] mm: move lesser used vma_area_struct members into the last cacheline
  2025-01-15 16:39     ` Suren Baghdasaryan
@ 2025-02-13 22:59       ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-02-13 22:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, willy, liam.howlett, lorenzo.stoakes, david.laight.linux,
	mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
	peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
	lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
	pasha.tatashin, klarasmodin, richard.weiyang, corbet, linux-doc,
	linux-mm, linux-kernel, kernel-team

On Wed, Jan 15, 2025 at 8:39 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Jan 15, 2025 at 2:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:25:59PM -0800, Suren Baghdasaryan wrote:
> > > Move several vma_area_struct members which are rarely or never used
> > > during page fault handling into the last cacheline to better pack
> > > vm_area_struct. As a result vm_area_struct will fit into 3 as opposed
> > > to 4 cachelines. New typical vm_area_struct layout:
> > >
> > > struct vm_area_struct {
> > >     union {
> > >         struct {
> > >             long unsigned int vm_start;              /*     0     8 */
> > >             long unsigned int vm_end;                /*     8     8 */
> > >         };                                           /*     0    16 */
> > >         freeptr_t          vm_freeptr;               /*     0     8 */
> > >     };                                               /*     0    16 */
> > >     struct mm_struct *         vm_mm;                /*    16     8 */
> > >     pgprot_t                   vm_page_prot;         /*    24     8 */
> > >     union {
> > >         const vm_flags_t   vm_flags;                 /*    32     8 */
> > >         vm_flags_t         __vm_flags;               /*    32     8 */
> > >     };                                               /*    32     8 */
> > >     unsigned int               vm_lock_seq;          /*    40     4 */
> >
> > Does it not make sense to move this seq field near the refcnt?
>
> In an earlier version, when vm_lock was not a refcount yet, I tried
> that and moving vm_lock_seq introduced a regression in the pft test. We
> have that early vm_lock_seq check at the beginning of vma_start_read()
> and if it fails we bail out early without locking. I think that might
> be the reason why keeping vm_lock_seq in the first cacheline is
> beneficial. But I'll try moving it again now that we have vm_refcnt
> instead of the lock and see if pft still shows any regression.

I confirmed that moving vm_lock_seq next to vm_refcnt regresses
pagefault performance:

Hmean     faults/cpu-1    508634.6876 (   0.00%)   508548.5498 *  -0.02%*
Hmean     faults/cpu-4    474767.2684 (   0.00%)   475620.7653 *   0.18%*
Hmean     faults/cpu-7    451356.6844 (   0.00%)   446738.2381 *  -1.02%*
Hmean     faults/cpu-12   360114.9092 (   0.00%)   337121.8189 *  -6.38%*
Hmean     faults/cpu-21   227567.8237 (   0.00%)   205277.2029 *  -9.80%*
Hmean     faults/cpu-30   163383.6765 (   0.00%)   152765.1451 *  -6.50%*
Hmean     faults/cpu-48   118048.2568 (   0.00%)   109959.2027 *  -6.85%*
Hmean     faults/cpu-56   103189.6761 (   0.00%)    92989.3749 *  -9.89%*
Hmean     faults/sec-1    508228.4512 (   0.00%)   508129.1963 *  -0.02%*
Hmean     faults/sec-4   1854868.9033 (   0.00%)  1862443.6146 *   0.41%*
Hmean     faults/sec-7   3088881.6158 (   0.00%)  3050403.1664 *  -1.25%*
Hmean     faults/sec-12  4222540.9948 (   0.00%)  3951163.9557 *  -6.43%*
Hmean     faults/sec-21  4555777.5386 (   0.00%)  4130470.6021 *  -9.34%*
Hmean     faults/sec-30  4336721.3467 (   0.00%)  4150477.5095 *  -4.29%*
Hmean     faults/sec-48  5163921.7465 (   0.00%)  4857286.2166 *  -5.94%*
Hmean     faults/sec-56  5413622.8890 (   0.00%)  4936484.0021 *  -8.81%*

So, I kept it unchanged in v10
(https://lore.kernel.org/all/20250213224655.1680278-14-surenb@google.com/)
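
For context, the "early vm_lock_seq check" being referred to is, roughly, the
lockless comparison at the top of vma_start_read() in the pre-series code; the
sketch below is paraphrased and elides the locking slow path, so treat the
exact field names as approximate:

static inline bool vma_start_read(struct vm_area_struct *vma)
{
        /*
         * Lockless early bail-out: if this vma was already write-locked in
         * the current mmap write cycle, don't bother taking the per-vma
         * lock. Only vm_lock_seq (first cacheline) is touched here.
         */
        if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
                return false;

        /* ... otherwise fall through to the per-vma lock / refcount ... */
        return true;
}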

>
> >
> > >     /* XXX 4 bytes hole, try to pack */
> > >
> > >     struct list_head           anon_vma_chain;       /*    48    16 */
> > >     /* --- cacheline 1 boundary (64 bytes) --- */
> > >     struct anon_vma *          anon_vma;             /*    64     8 */
> > >     const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> > >     long unsigned int          vm_pgoff;             /*    80     8 */
> > >     struct file *              vm_file;              /*    88     8 */
> > >     void *                     vm_private_data;      /*    96     8 */
> > >     atomic_long_t              swap_readahead_info;  /*   104     8 */
> > >     struct mempolicy *         vm_policy;            /*   112     8 */
> > >     struct vma_numab_state *   numab_state;          /*   120     8 */
> > >     /* --- cacheline 2 boundary (128 bytes) --- */
> > >     refcount_t          vm_refcnt (__aligned__(64)); /*   128     4 */
> > >
> > >     /* XXX 4 bytes hole, try to pack */
> > >
> > >     struct {
> > >         struct rb_node     rb (__aligned__(8));      /*   136    24 */
> > >         long unsigned int  rb_subtree_last;          /*   160     8 */
> > >     } __attribute__((__aligned__(8))) shared;        /*   136    32 */
> > >     struct anon_vma_name *     anon_name;            /*   168     8 */
> > >     struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */
> > >
> > >     /* size: 192, cachelines: 3, members: 18 */
> > >     /* sum members: 176, holes: 2, sum holes: 8 */
> > >     /* padding: 8 */
> > >     /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
> > > } __attribute__((__aligned__(64)));
> >
> >

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH] refcount: Strengthen inc_not_zero()
  2025-02-06  3:03                     ` [PATCH] refcount: Strengthen inc_not_zero() Suren Baghdasaryan
@ 2025-02-13 23:04                       ` Suren Baghdasaryan
  0 siblings, 0 replies; 140+ messages in thread
From: Suren Baghdasaryan @ 2025-02-13 23:04 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, boqun.feng, mark.rutland, Mateusz Guzik, akpm,
	willy, liam.howlett, lorenzo.stoakes, david.laight.linux, mhocko,
	vbabka, hannes, oliver.sang, mgorman, david, peterx, oleg, dave,
	paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
	jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
	richard.weiyang, corbet, linux-doc, linux-mm, linux-kernel,
	kernel-team

On Wed, Feb 5, 2025 at 7:03 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 28, 2025 at 3:51 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Jan 27, 2025 at 11:21 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, Jan 27, 2025 at 6:09 AM Will Deacon <will@kernel.org> wrote:
> > > >
> > > > On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> > > > > On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > > > > > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > > > > >
> > > > > > > Notably, it means refcount_t is entirely unsuitable for anything
> > > > > > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > > > > > conditions after the refcount succeeds.
> > > > > > >
> > > > > > > And this is probably fine, but let me ponder this all a little more.
> > > > > >
> > > > > > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > > > > > fix this, these things are hard enough as they are.
> > > > > >
> > > > > > Will, others, wdyt?
> > > > >
> > > > > We should also update the Documentation (atomic_t.txt and
> > > > > refcount-vs-atomic.rst) if we strengthen this.
> > > > >
> > > > > > ---
> > > > > > Subject: refcount: Strengthen inc_not_zero()
> > > > > >
> > > > > > For speculative lookups, where a successful inc_not_zero() pins the
> > > > > > object but we still need to double check that the object acquired
> > > > > > is indeed the one we set out to acquire, this validation needs to
> > > > > > happen *after* the increment.
> > > > > >
> > > > > > Notably SLAB_TYPESAFE_BY_RCU is one such example.
> > > > > >
> > > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > > ---
> > > > > >  include/linux/refcount.h | 15 ++++++++-------
> > > > > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > > > > index 35f039ecb272..340e7ffa445e 100644
> > > > > > --- a/include/linux/refcount.h
> > > > > > +++ b/include/linux/refcount.h
> > > > > > @@ -69,9 +69,10 @@
> > > > > >   * its the lock acquire, for RCU/lockless data structures its the dependent
> > > > > >   * load.
> > > > > >   *
> > > > > > - * Do note that inc_not_zero() provides a control dependency which will order
> > > > > > - * future stores against the inc, this ensures we'll never modify the object
> > > > > > - * if we did not in fact acquire a reference.
> > > > > > + * Do note that inc_not_zero() does provide acquire order, which will order
> > > > > > + * future load and stores against the inc, this ensures all subsequent accesses
> > > > > > + * are from this object and not anything previously occupying this memory --
> > > > > > + * consider SLAB_TYPESAFE_BY_RCU.
> > > > > >   *
> > > > > >   * The decrements will provide release order, such that all the prior loads and
> > > > > >   * stores will be issued before, it also provides a control dependency, which
> > > > > > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > > > >     do {
> > > > > >             if (!old)
> > > > > >                     break;
> > > > > > -   } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > > > > +   } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> > > > >
> > > > > Hmm, do the later memory accesses need to be ordered against the store
> > > > > part of the increment or just the read? If it's the former, then I don't
> > > > > think that _acquire is sufficient -- accesses can still get in-between
> > > > > the read and write parts of the RmW.
> > > >
> > > > I dug some more into this at the end of last week. For the
> > > > SLAB_TYPESAFE_BY_RCU where we're racing inc_not_zero() with
> > > > dec_and_test(), then I think using _acquire() above is correct as the
> > > > later references can only move up into the critical section in the case
> > > > that we successfully obtained a reference.
> > > >
> > > > However, if we're going to make the barriers implicit in the refcount
> > > > operations here then I think we also need to do something on the producer
> > > > side for when the object is re-initialised after being destroyed and
> > > > allocated again. I think that would necessitate release ordering for
> > > > refcount_set() so that whatever allows the consumer to validate the
> > > > object (e.g. sequence number) is published *before* the refcount.
> > >
> > > Thanks Will!
> > > I would like to expand on your answer to provide an example of the
> > > race that would happen without release ordering in the producer. To
> > > save the reader's time I provide a simplified flow and reasoning first.
> > > More detailed code of what I'm considering a typical
> > > SLAB_TYPESAFE_BY_RCU refcounting example is added at the end of my
> > > reply (Addendum).
> > > Simplified flow looks like this:
> > >
> > > consumer:
> > >     obj = lookup(collection, key);
> > >     if (!refcount_inc_not_zero(&obj->ref))
> > >         return;
> > >     smp_rmb(); /* Peter's new acquire fence */
> > >     if (READ_ONCE(obj->key) != key) {
> > >         put_ref(obj);
> > >         return;
> > >     }
> > >     use(obj->value);
> > >
> > > producer:
> > >     old_key = obj->key;
> > >     remove(collection, old_key);
> > >     if (!refcount_dec_and_test(&obj->ref))
> > >         return;
> > >     obj->key = KEY_INVALID;
> > >     free(obj);
> > >     ...
> > >     obj = malloc(); /* obj is reused */
> > >     obj->key = new_key;
> > >     obj->value = new_value;
> > >     smp_wmb(); /* Will's proposed release fence */
> > >     refcount_set(obj->ref, 1);
> > >     insert(collection, new_key, obj);
> > >
> > > Let's consider a case when new_key == old_key. We'll call both of them
> > > "key". Without Will's proposed fence the following reordering is
> > > possible:
> > >
> > > consumer:
> > >     obj = lookup(collection, key);
> > >
> > >                  producer:
> > >                      key = obj->key
> > >                      remove(collection, key);
> > >                      if (!refcount_dec_and_test(&obj->ref))
> > >                          return;
> > >                      obj->key = KEY_INVALID;
> > >                      free(obj);
> > >                      obj = malloc(); /* obj is reused */
> > >                      refcount_set(obj->ref, 1);
> > >                      obj->key = key; /* same key */
> > >
> > >     if (!refcount_inc_not_zero(&obj->ref))
> > >         return;
> > >     smp_rmb();
> > >     if (READ_ONCE(obj->key) != key) {
> > >         put_ref(obj);
> > >         return;
> > >     }
> > >     use(obj->value);
> > >
> > >                      obj->value = new_value; /* reordered store */
> > >                      add(collection, key, obj);
> > >
> > > So, the consumer finds the old object, successfully takes a refcount
> > > and validates the key. It succeeds because the object is allocated and
> > > has the same key, which is fine. However it proceeds to use stale
> > > obj->value. Will's proposed release ordering would prevent that.
> > >
> > > The example in https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> > > omits all these ordering issues for SLAB_TYPESAFE_BY_RCU.
> > > I think it would be better to introduce two new functions:
> > > refcount_add_not_zero_acquire() and refcount_set_release(), clearly
> > > document that they should be used when a freed object can be recycled
> > > and reused, like in SLAB_TYPESAFE_BY_RCU case. refcount_set_release()
> > > should also clarify that once it's called, the object can be accessed
> > > by consumers even if it was not added yet into the collection used for
> > > object lookup (like in the example above). SLAB_TYPESAFE_BY_RCU
> > > comment at https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> > > then can explicitly use these new functions in the example code,
> > > further clarifying their purpose and proper use.
> > > WDYT?
> >
> > Hi Peter,
> > Should I take a stab at preparing a patch to add the two new
> > refcounting functions suggested above with updates to the
> > documentation and comments?
> > If you disagree with that or need more time to decide then I'll wait.
> > Please let me know.
>
> Not sure if "--in-reply-to" worked but I just posted a patch adding
> new refcounting APIs for SLAB_TYPESAFE_BY_RCU here:
> https://lore.kernel.org/all/20250206025201.979573-1-surenb@google.com/

Since I did not get any replies other than Vlastimil's Ack on the
above patch, I went ahead and posted v10 of my patchset [1] and
included the patch above in it [2]. Feedback is highly appreciated!

[1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
[2] https://lore.kernel.org/all/20250213224655.1680278-11-surenb@google.com/
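
For quick reference, this is how the Addendum example quoted below looks once
it is switched to the new APIs from [2]; everything else (struct object,
put_ref(), the consumer and the producer) stays exactly as in the quoted code,
so this is just a sketch of the intended usage:

init(struct object *obj, u64 key, u64 value)
{
    obj->key = key;
    obj->value = value;
    /* release ordering replaces the explicit smp_wmb() */
    refcount_set_release(&obj->ref, 1);
}

bool get_ref(struct object *obj, u64 key)
{
    /* acquire ordering replaces the explicit smp_rmb() */
    if (!refcount_inc_not_zero_acquire(&obj->ref))
        return false;
    if (READ_ONCE(obj->key) != key) {
        put_ref(obj);
        return false;
    }
    return true;
}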


> Since Peter seems to be busy I discussed these ordering requirements
> for SLAB_TYPESAFE_BY_RCU with Paul McKenney and he was leaning towards
> having separate functions with the additional fences for this case.
> That's what I provided in my patch.
> Another possible option is to add acquire ordering in the
> __refcount_add_not_zero() as Peter suggested and add
> refcount_set_release() function.
> Thanks,
> Suren.
>
>
> > Thanks,
> > Suren.
> >
> >
> > >
> > > ADDENDUM.
> > > Detailed code for typical use of refcounting with SLAB_TYPESAFE_BY_RCU:
> > >
> > > struct object {
> > >     refcount_t ref;
> > >     u64 key;
> > >     u64 value;
> > > };
> > >
> > > void init(struct object *obj, u64 key, u64 value)
> > > {
> > >     obj->key = key;
> > >     obj->value = value;
> > >     smp_wmb(); /* Will's proposed release fence */
> > >     refcount_set(&obj->ref, 1);
> > > }
> > >
> > > bool get_ref(struct object *obj, u64 key)
> > > {
> > >     if (!refcount_inc_not_zero(&obj->ref))
> > >         return false;
> > >     smp_rmb(); /* Peter's new acquire fence */
> > >     if (READ_ONCE(obj->key) != key) {
> > >         put_ref(obj);
> > >         return false;
> > >     }
> > >     return true;
> > > }
> > >
> > > void put_ref(struct object *obj)
> > > {
> > >     if (!refcount_dec_and_test(&obj->ref))
> > >         return;
> > >     obj->key = KEY_INVALID;
> > >     free(obj);
> > > }
> > >
> > > consumer:
> > >     obj = lookup(collection, key);
> > >     if (!get_ref(obj, key))
> > >         return;
> > >     use(obj->value);
> > >
> > > producer:
> > >     remove(collection, old_obj->key);
> > >     put_ref(old_obj);
> > >     new_obj = malloc();
> > >     init(new_obj, new_key, new_value);
> > >     insert(collection, new_key, new_obj);
> > >
> > > With SLAB_TYPESAFE_BY_RCU old_obj in the producer can be reused and be
> > > equal to new_obj.
> > >
> > >
> > > >
> > > > Will


^ permalink raw reply	[flat|nested] 140+ messages in thread

end of thread, other threads:[~2025-02-13 23:04 UTC | newest]

Thread overview: 140+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-11  4:25 [PATCH v9 00/17] reimplement per-vma lock as a refcount Suren Baghdasaryan
2025-01-11  4:25 ` [PATCH v9 01/17] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2025-01-11  4:25 ` [PATCH v9 02/17] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
2025-01-11  4:25 ` [PATCH v9 03/17] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
2025-01-11  4:25 ` [PATCH v9 04/17] mm: introduce vma_iter_store_attached() to use with attached vmas Suren Baghdasaryan
2025-01-13 11:58   ` Lorenzo Stoakes
2025-01-13 16:31     ` Suren Baghdasaryan
2025-01-13 16:44       ` Lorenzo Stoakes
2025-01-13 16:47       ` Lorenzo Stoakes
2025-01-13 19:09         ` Suren Baghdasaryan
2025-01-14 11:38           ` Lorenzo Stoakes
2025-01-11  4:25 ` [PATCH v9 05/17] mm: mark vmas detached upon exit Suren Baghdasaryan
2025-01-13 12:05   ` Lorenzo Stoakes
2025-01-13 17:02     ` Suren Baghdasaryan
2025-01-13 17:13       ` Lorenzo Stoakes
2025-01-13 19:11         ` Suren Baghdasaryan
2025-01-13 20:32           ` Vlastimil Babka
2025-01-13 20:42             ` Suren Baghdasaryan
2025-01-14 11:36               ` Lorenzo Stoakes
2025-01-11  4:25 ` [PATCH v9 06/17] types: move struct rcuwait into types.h Suren Baghdasaryan
2025-01-13 14:46   ` Lorenzo Stoakes
2025-01-11  4:25 ` [PATCH v9 07/17] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
2025-01-13 15:25   ` Lorenzo Stoakes
2025-01-13 17:53     ` Suren Baghdasaryan
2025-01-14 11:48       ` Lorenzo Stoakes
2025-01-11  4:25 ` [PATCH v9 08/17] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
2025-01-13 15:27   ` Lorenzo Stoakes
2025-01-13 17:53     ` Suren Baghdasaryan
2025-01-11  4:25 ` [PATCH v9 09/17] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
2025-01-13 15:52   ` Lorenzo Stoakes
2025-01-11  4:25 ` [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
2025-01-11  6:31   ` Hillf Danton
2025-01-11  9:59     ` Suren Baghdasaryan
2025-01-11 10:00       ` Suren Baghdasaryan
2025-01-11 12:13       ` Hillf Danton
2025-01-11 17:11         ` Suren Baghdasaryan
2025-01-11 23:44           ` Hillf Danton
2025-01-12  0:31             ` Suren Baghdasaryan
2025-01-15  9:39           ` Peter Zijlstra
2025-01-16 10:52             ` Hillf Danton
2025-01-11 12:39   ` David Laight
2025-01-11 17:07     ` Matthew Wilcox
2025-01-11 18:30     ` Paul E. McKenney
2025-01-11 22:19       ` David Laight
2025-01-11 22:50         ` [PATCH v9 10/17] refcount: introduce __refcount_{add|inc}_not_zero_limited - clang 17.0.1 bug David Laight
2025-01-12 11:37           ` David Laight
2025-01-12 17:56             ` Paul E. McKenney
2025-01-11  4:25 ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
2025-01-11 11:24   ` Mateusz Guzik
2025-01-11 20:14     ` Suren Baghdasaryan
2025-01-11 20:16       ` Suren Baghdasaryan
2025-01-11 20:31       ` Mateusz Guzik
2025-01-11 20:58         ` Suren Baghdasaryan
2025-01-11 20:38       ` Vlastimil Babka
2025-01-13  1:47       ` Wei Yang
2025-01-13  2:25         ` Wei Yang
2025-01-13 21:14           ` Suren Baghdasaryan
2025-01-13 21:08         ` Suren Baghdasaryan
2025-01-15 10:48       ` Peter Zijlstra
2025-01-15 11:13         ` Peter Zijlstra
2025-01-15 15:00           ` Suren Baghdasaryan
2025-01-15 15:35             ` Peter Zijlstra
2025-01-15 15:38               ` Peter Zijlstra
2025-01-15 16:22                 ` Suren Baghdasaryan
2025-01-15 16:00           ` [PATCH] refcount: Strengthen inc_not_zero() Peter Zijlstra
2025-01-16 15:12             ` Suren Baghdasaryan
2025-01-17 15:41             ` Will Deacon
2025-01-27 14:09               ` Will Deacon
2025-01-27 19:21                 ` Suren Baghdasaryan
2025-01-28 23:51                   ` Suren Baghdasaryan
2025-02-06  2:52                     ` [PATCH 1/1] refcount: provide ops for cases when object's memory can be reused Suren Baghdasaryan
2025-02-06 10:41                       ` Vlastimil Babka
2025-02-06  3:03                     ` [PATCH] refcount: Strengthen inc_not_zero() Suren Baghdasaryan
2025-02-13 23:04                       ` Suren Baghdasaryan
2025-01-17 16:13             ` Matthew Wilcox
2025-01-12  2:59   ` [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count Wei Yang
2025-01-12 17:35     ` Suren Baghdasaryan
2025-01-13  0:59       ` Wei Yang
2025-01-13  2:37   ` Wei Yang
2025-01-13 21:16     ` Suren Baghdasaryan
2025-01-13  9:36   ` Wei Yang
2025-01-13 21:18     ` Suren Baghdasaryan
2025-01-15  2:58   ` Wei Yang
2025-01-15  3:12     ` Suren Baghdasaryan
2025-01-15 12:05       ` Wei Yang
2025-01-15 15:01         ` Suren Baghdasaryan
2025-01-16  1:37           ` Wei Yang
2025-01-16  1:41             ` Suren Baghdasaryan
2025-01-16  9:10               ` Wei Yang
2025-01-11  4:25 ` [PATCH v9 12/17] mm: move lesser used vma_area_struct members into the last cacheline Suren Baghdasaryan
2025-01-13 16:15   ` Lorenzo Stoakes
2025-01-15 10:50   ` Peter Zijlstra
2025-01-15 16:39     ` Suren Baghdasaryan
2025-02-13 22:59       ` Suren Baghdasaryan
2025-01-11  4:26 ` [PATCH v9 13/17] mm/debug: print vm_refcnt state when dumping the vma Suren Baghdasaryan
2025-01-13 16:21   ` Lorenzo Stoakes
2025-01-13 16:35     ` Liam R. Howlett
2025-01-13 17:57       ` Suren Baghdasaryan
2025-01-14 11:41         ` Lorenzo Stoakes
2025-01-11  4:26 ` [PATCH v9 14/17] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
2025-01-13 16:28   ` Lorenzo Stoakes
2025-01-13 17:56     ` Suren Baghdasaryan
2025-01-14 11:45       ` Lorenzo Stoakes
2025-01-11  4:26 ` [PATCH v9 15/17] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
2025-01-11  4:26 ` [PATCH v9 16/17] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2025-01-15  2:27   ` Wei Yang
2025-01-15  3:15     ` Suren Baghdasaryan
2025-01-15  3:58       ` Liam R. Howlett
2025-01-15  5:41         ` Suren Baghdasaryan
2025-01-15  3:59       ` Mateusz Guzik
2025-01-15  5:47         ` Suren Baghdasaryan
2025-01-15  5:51           ` Mateusz Guzik
2025-01-15  6:41             ` Suren Baghdasaryan
2025-01-15  7:58       ` Vlastimil Babka
2025-01-15 15:10         ` Suren Baghdasaryan
2025-02-13 22:56           ` Suren Baghdasaryan
2025-01-15 12:17       ` Wei Yang
2025-01-15 21:46         ` Suren Baghdasaryan
2025-01-11  4:26 ` [PATCH v9 17/17] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
2025-01-13 16:33   ` Lorenzo Stoakes
2025-01-13 17:56     ` Suren Baghdasaryan
2025-01-11  4:52 ` [PATCH v9 00/17] reimplement per-vma lock as a refcount Matthew Wilcox
2025-01-11  9:45   ` Suren Baghdasaryan
2025-01-13 12:14 ` Lorenzo Stoakes
2025-01-13 16:58   ` Suren Baghdasaryan
2025-01-13 17:11     ` Lorenzo Stoakes
2025-01-13 19:00       ` Suren Baghdasaryan
2025-01-14 11:35         ` Lorenzo Stoakes
2025-01-14  1:49   ` Andrew Morton
2025-01-14  2:53     ` Suren Baghdasaryan
2025-01-14  4:09       ` Andrew Morton
2025-01-14  9:09         ` Vlastimil Babka
2025-01-14 10:27           ` Hillf Danton
2025-01-14  9:47         ` Lorenzo Stoakes
2025-01-14 14:59         ` Liam R. Howlett
2025-01-14 15:54           ` Suren Baghdasaryan
2025-01-15 11:34             ` Lorenzo Stoakes
2025-01-15 15:14               ` Suren Baghdasaryan
2025-01-28  5:26 ` Shivank Garg
2025-01-28  5:50   ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).