* [PATCH v6 00/16] move per-vma lock into vm_area_struct
@ 2024-12-16 19:24 Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 01/16] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
` (16 more replies)
0 siblings, 17 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of a performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to the rather old Broadwell microarchitecture,
and even there it can be mitigated by disabling adjacent-cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When the split-away part is a lock, it complicates
things even further. With no performance benefit, there is no reason to
keep the split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
This patchset:
1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned to minimize
cacheline sharing;
2. changes vm_area_struct initialization to mark new vma as detached until
it is inserted into vma tree;
3. replaces vm_lock and vma->detached flag with a reference counter;
4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
reuse and to minimize call_rcu() calls.
Pagefault microbenchmarks show a performance improvement (left column is
the baseline, right column is the patched kernel):
Hmean faults/cpu-1 507926.5547 ( 0.00%) 506519.3692 * -0.28%*
Hmean faults/cpu-4 479119.7051 ( 0.00%) 481333.6802 * 0.46%*
Hmean faults/cpu-7 452880.2961 ( 0.00%) 455845.6211 * 0.65%*
Hmean faults/cpu-12 347639.1021 ( 0.00%) 352004.2254 * 1.26%*
Hmean faults/cpu-21 200061.2238 ( 0.00%) 229597.0317 * 14.76%*
Hmean faults/cpu-30 145251.2001 ( 0.00%) 164202.5067 * 13.05%*
Hmean faults/cpu-48 106848.4434 ( 0.00%) 120641.5504 * 12.91%*
Hmean faults/cpu-56 92472.3835 ( 0.00%) 103464.7916 * 11.89%*
Hmean faults/sec-1 507566.1468 ( 0.00%) 506139.0811 * -0.28%*
Hmean faults/sec-4 1880478.2402 ( 0.00%) 1886795.6329 * 0.34%*
Hmean faults/sec-7 3106394.3438 ( 0.00%) 3140550.7485 * 1.10%*
Hmean faults/sec-12 4061358.4795 ( 0.00%) 4112477.0206 * 1.26%*
Hmean faults/sec-21 3988619.1169 ( 0.00%) 4577747.1436 * 14.77%*
Hmean faults/sec-30 3909839.5449 ( 0.00%) 4311052.2787 * 10.26%*
Hmean faults/sec-48 4761108.4691 ( 0.00%) 5283790.5026 * 10.98%*
Hmean faults/sec-56 4885561.4590 ( 0.00%) 5415839.4045 * 10.85%*
Changes since v5 [4]:
- Added Reviewed-by, per Vlastimil Babka;
- Added replacement of vm_lock and vma->detached flag with vm_refcnt,
per Peter Zijlstra and Matthew Wilcox;
- Marked vmas detached during exit_mmap;
- Ensured vmas are in detached state before they are freed;
- Changed SLAB_TYPESAFE_BY_RCU patch to not require a ctor, leading to
much simpler code;
- Removed unnecessary patch [5]
- Updated documentation to reflect changes to vm_lock;
Patchset applies over mm-unstable after reverting v5 of this patchset [4]
(currently 687e99a5faa5-905ab222508a)
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
[4] https://lore.kernel.org/all/20241206225204.4008261-1-surenb@google.com/
[5] https://lore.kernel.org/all/20241206225204.4008261-6-surenb@google.com/
Suren Baghdasaryan (16):
mm: introduce vma_start_read_locked{_nested} helpers
mm: move per-vma lock into vm_area_struct
mm: mark vma as detached until it's added into vma tree
mm/nommu: fix the last places where vma is not locked before being
attached
types: move struct rcuwait into types.h
mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
mm: move mmap_init_lock() out of the header file
mm: uninline the main body of vma_start_write()
refcount: introduce __refcount_{add|inc}_not_zero_limited
mm: replace vm_lock and detached flag with a reference count
mm: enforce vma to be in detached state before freeing
mm: remove extra vma_numab_state_init() call
mm: introduce vma_ensure_detached()
mm: prepare lock_vma_under_rcu() for vma reuse possibility
mm: make vma cache SLAB_TYPESAFE_BY_RCU
docs/mm: document latest changes to vm_lock
Documentation/mm/process_addrs.rst | 44 ++++----
include/linux/mm.h | 162 +++++++++++++++++++++++------
include/linux/mm_types.h | 37 ++++---
include/linux/mmap_lock.h | 6 --
include/linux/rcuwait.h | 13 +--
include/linux/refcount.h | 20 +++-
include/linux/slab.h | 6 --
include/linux/types.h | 12 +++
kernel/fork.c | 88 ++++------------
mm/init-mm.c | 1 +
mm/memory.c | 75 +++++++++++--
mm/mmap.c | 8 +-
mm/nommu.c | 2 +
mm/userfaultfd.c | 31 +++---
mm/vma.c | 15 ++-
mm/vma.h | 4 +-
tools/testing/vma/linux/atomic.h | 5 +
tools/testing/vma/vma_internal.h | 96 +++++++++--------
18 files changed, 378 insertions(+), 247 deletions(-)
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 01/16] mm: introduce vma_start_read_locked{_nested} helpers
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 02/16] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
` (15 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Introduce helper functions which can be used to read-lock a VMA when
holding mmap_lock for read. Replace direct accesses to vma->vm_lock with
these new helpers.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/mm.h | 24 ++++++++++++++++++++++++
mm/userfaultfd.c | 22 +++++-----------------
2 files changed, 29 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1352147a2648..3815a43ba504 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -735,6 +735,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
return true;
}
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+{
+ mmap_assert_locked(vma->vm_mm);
+ down_read_nested(&vma->vm_lock->lock, subclass);
+}
+
+/*
+ * Use only while holding mmap read lock which guarantees that locking will not
+ * fail (nobody can concurrently write-lock the vma). vma_start_read() should
+ * not be used in such cases because it might fail due to mm_lock_seq overflow.
+ * This functionality is used to obtain vma read lock and drop the mmap read lock.
+ */
+static inline void vma_start_read_locked(struct vm_area_struct *vma)
+{
+ mmap_assert_locked(vma->vm_mm);
+ down_read(&vma->vm_lock->lock);
+}
+
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 8e16dc290ddf..bc9a66ec6a6e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -84,16 +84,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
mmap_read_lock(mm);
vma = find_vma_and_prepare_anon(mm, address);
- if (!IS_ERR(vma)) {
- /*
- * We cannot use vma_start_read() as it may fail due to
- * false locked (see comment in vma_start_read()). We
- * can avoid that by directly locking vm_lock under
- * mmap_lock, which guarantees that nobody can lock the
- * vma for write (vma_start_write()) under us.
- */
- down_read(&vma->vm_lock->lock);
- }
+ if (!IS_ERR(vma))
+ vma_start_read_locked(vma);
mmap_read_unlock(mm);
return vma;
@@ -1491,14 +1483,10 @@ static int uffd_move_lock(struct mm_struct *mm,
mmap_read_lock(mm);
err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
if (!err) {
- /*
- * See comment in uffd_lock_vma() as to why not using
- * vma_start_read() here.
- */
- down_read(&(*dst_vmap)->vm_lock->lock);
+ vma_start_read_locked(*dst_vmap);
if (*dst_vmap != *src_vmap)
- down_read_nested(&(*src_vmap)->vm_lock->lock,
- SINGLE_DEPTH_NESTING);
+ vma_start_read_locked_nested(*src_vmap,
+ SINGLE_DEPTH_NESTING);
}
mmap_read_unlock(mm);
return err;
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 02/16] mm: move per-vma lock into vm_area_struct
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 01/16] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 03/16] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
` (14 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of a performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to the rather old Broadwell microarchitecture,
and even there it can be mitigated by disabling adjacent-cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When the split-away part is a lock, it complicates
things even further. With no performance benefit, there is no reason to
keep the split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset. Move
vm_lock back into vm_area_struct, aligning it at the cacheline boundary
and changing the cache to be cacheline-aligned as well. With a kernel
compiled using defconfig, this causes VMA memory consumption to grow from
160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
slabinfo before:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 160 51 2 : ...
slabinfo after moving vm_lock:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vm_area_struct ... 256 32 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs. Note that the size of this structure
depends on the kernel configuration, and the original size is typically
higher than 160 bytes, so these calculations are close to the worst-case
scenario. A more realistic vm_area_struct size before this change is:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 176 46 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
which is 3.9MB per 100000 VMAs. This memory consumption growth can be
addressed later by optimizing the vm_lock.
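As a back-of-the-envelope check (not part of the measurements), the page
counts above follow from the slabinfo numbers, assuming 4KB pages:
  defconfig before: ceil(1000/51) slabs * 2 pages = 40 pages (vm_area_struct)
                    + ceil(1000/102) slabs * 1 page = 10 pages (vma_lock)
                    = 50 pages per 1000 VMAs
  after:            ceil(1000/32) slabs * 2 pages = 64 pages per 1000 VMAs
  delta:            14 pages * 4KB * 100 ~= 5.5MB per 100000 VMAs
  realistic before: ceil(1000/46) * 2 + 10 = 54 pages; delta 10 pages,
                    i.e. ~3.9MB per 100000 VMAs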
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/mm.h | 28 ++++++++++--------
include/linux/mm_types.h | 6 ++--
kernel/fork.c | 49 ++++----------------------------
tools/testing/vma/vma_internal.h | 33 +++++----------------
4 files changed, 32 insertions(+), 84 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3815a43ba504..e1768a9395c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -697,6 +697,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vm_area_struct *vma)
+{
+ init_rwsem(&vma->vm_lock.lock);
+ vma->vm_lock_seq = UINT_MAX;
+}
+
/*
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
@@ -714,7 +720,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
return false;
- if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+ if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
return false;
/*
@@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* This pairs with RELEASE semantics in vma_end_write_all().
*/
if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->vm_lock->lock);
+ up_read(&vma->vm_lock.lock);
return false;
}
return true;
@@ -744,7 +750,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
{
mmap_assert_locked(vma->vm_mm);
- down_read_nested(&vma->vm_lock->lock, subclass);
+ down_read_nested(&vma->vm_lock.lock, subclass);
}
/*
@@ -756,13 +762,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
static inline void vma_start_read_locked(struct vm_area_struct *vma)
{
mmap_assert_locked(vma->vm_mm);
- down_read(&vma->vm_lock->lock);
+ down_read(&vma->vm_lock.lock);
}
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
- up_read(&vma->vm_lock->lock);
+ up_read(&vma->vm_lock.lock);
rcu_read_unlock();
}
@@ -791,7 +797,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
if (__is_vma_write_locked(vma, &mm_lock_seq))
return;
- down_write(&vma->vm_lock->lock);
+ down_write(&vma->vm_lock.lock);
/*
* We should use WRITE_ONCE() here because we can have concurrent reads
* from the early lockless pessimistic check in vma_start_read().
@@ -799,7 +805,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
* we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
*/
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- up_write(&vma->vm_lock->lock);
+ up_write(&vma->vm_lock.lock);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -811,7 +817,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_assert_locked(struct vm_area_struct *vma)
{
- if (!rwsem_is_locked(&vma->vm_lock->lock))
+ if (!rwsem_is_locked(&vma->vm_lock.lock))
vma_assert_write_locked(vma);
}
@@ -844,6 +850,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
#else /* CONFIG_PER_VMA_LOCK */
+static inline void vma_lock_init(struct vm_area_struct *vma) {}
static inline bool vma_start_read(struct vm_area_struct *vma)
{ return false; }
static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -878,10 +885,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
extern const struct vm_operations_struct vma_dummy_vm_ops;
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
memset(vma, 0, sizeof(*vma));
@@ -890,6 +893,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma_mark_detached(vma, false);
vma_numab_state_init(vma);
+ vma_lock_init(vma);
}
/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 266f53b2bb49..825f6328f9e5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -700,8 +700,6 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
- /* Unstable RCU readers are allowed to read this. */
- struct vma_lock *vm_lock;
#endif
/*
@@ -754,6 +752,10 @@ struct vm_area_struct {
struct vma_numab_state *numab_state; /* NUMA Balancing state */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+ /* Unstable RCU readers are allowed to read this. */
+ struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
} __randomize_layout;
#ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index 8dc670fe90d4..eb3e35d65e95 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
- vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
- if (!vma->vm_lock)
- return false;
-
- init_rwsem(&vma->vm_lock->lock);
- vma->vm_lock_seq = UINT_MAX;
-
- return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
- kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
struct vm_area_struct *vma;
@@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
return NULL;
vma_init(vma, mm);
- if (!vma_lock_alloc(vma)) {
- kmem_cache_free(vm_area_cachep, vma);
- return NULL;
- }
return vma;
}
@@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
* will be reinitialized.
*/
data_race(memcpy(new, orig, sizeof(*new)));
- if (!vma_lock_alloc(new)) {
- kmem_cache_free(vm_area_cachep, new);
- return NULL;
- }
+ vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
@@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
{
vma_numab_state_free(vma);
free_anon_vma_name(vma);
- vma_lock_free(vma);
kmem_cache_free(vm_area_cachep, vma);
}
@@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
vm_rcu);
/* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
__vm_area_free(vma);
}
#endif
@@ -3189,11 +3152,9 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
-
- vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
- vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+ vm_area_cachep = KMEM_CACHE(vm_area_struct,
+ SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+ SLAB_ACCOUNT);
mmap_init();
nsproxy_cache_init();
}
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index b973b3e41c83..568c18d24d53 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -270,10 +270,10 @@ struct vm_area_struct {
/*
* Can only be written (using WRITE_ONCE()) while holding both:
* - mmap_lock (in write mode)
- * - vm_lock->lock (in write mode)
+ * - vm_lock.lock (in write mode)
* Can be read reliably while holding one of:
* - mmap_lock (in read or write mode)
- * - vm_lock->lock (in read or write mode)
+ * - vm_lock.lock (in read or write mode)
* Can be read unreliably (using READ_ONCE()) for pessimistic bailout
* while holding nothing (except RCU to keep the VMA struct allocated).
*
@@ -282,7 +282,7 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
- struct vma_lock *vm_lock;
+ struct vma_lock vm_lock;
#endif
/*
@@ -459,17 +459,10 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
return mas_find(&vmi->mas, ULONG_MAX);
}
-static inline bool vma_lock_alloc(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma)
{
- vma->vm_lock = calloc(1, sizeof(struct vma_lock));
-
- if (!vma->vm_lock)
- return false;
-
- init_rwsem(&vma->vm_lock->lock);
+ init_rwsem(&vma->vm_lock.lock);
vma->vm_lock_seq = UINT_MAX;
-
- return true;
}
static inline void vma_assert_write_locked(struct vm_area_struct *);
@@ -492,6 +485,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma_mark_detached(vma, false);
+ vma_lock_init(vma);
}
static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -502,10 +496,6 @@ static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
return NULL;
vma_init(vma, mm);
- if (!vma_lock_alloc(vma)) {
- free(vma);
- return NULL;
- }
return vma;
}
@@ -518,10 +508,7 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return NULL;
memcpy(new, orig, sizeof(*new));
- if (!vma_lock_alloc(new)) {
- free(new);
- return NULL;
- }
+ vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
return new;
@@ -691,14 +678,8 @@ static inline void mpol_put(struct mempolicy *)
{
}
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
- free(vma->vm_lock);
-}
-
static inline void __vm_area_free(struct vm_area_struct *vma)
{
- vma_lock_free(vma);
free(vma);
}
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 03/16] mm: mark vma as detached until it's added into vma tree
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 01/16] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 02/16] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 04/16] mm/nommu: fix the last places where vma is not locked before being attached Suren Baghdasaryan
` (13 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
The current implementation does not set the detached flag when a VMA is
first allocated. This does not represent the real state of the VMA, which
is detached until it is added into the mm's VMA tree. Fix this by marking
new VMAs as detached and clearing the detached flag only after the VMA is
added into a tree.
Introduce vma_mark_attached() to make the API more readable and to
simplify a possible future cleanup, when vma->vm_mm might be used to
indicate a detached vma and vma_mark_attached() will need an additional
mm parameter.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/mm.h | 27 ++++++++++++++++++++-------
kernel/fork.c | 4 ++++
mm/memory.c | 2 +-
mm/vma.c | 6 +++---
mm/vma.h | 2 ++
tools/testing/vma/vma_internal.h | 17 ++++++++++++-----
6 files changed, 42 insertions(+), 16 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1768a9395c9..689f5a1e2181 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,12 +821,21 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
vma_assert_write_locked(vma);
}
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+ vma->detached = false;
+}
+
+static inline void vma_mark_detached(struct vm_area_struct *vma)
{
/* When detaching vma should be write-locked */
- if (detached)
- vma_assert_write_locked(vma);
- vma->detached = detached;
+ vma_assert_write_locked(vma);
+ vma->detached = true;
+}
+
+static inline bool is_vma_detached(struct vm_area_struct *vma)
+{
+ return vma->detached;
}
static inline void release_fault_lock(struct vm_fault *vmf)
@@ -857,8 +866,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
{ mmap_assert_write_locked(vma->vm_mm); }
-static inline void vma_mark_detached(struct vm_area_struct *vma,
- bool detached) {}
+static inline void vma_mark_attached(struct vm_area_struct *vma) {}
+static inline void vma_mark_detached(struct vm_area_struct *vma) {}
static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
unsigned long address)
@@ -891,7 +900,10 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma_mark_detached(vma, false);
+#ifdef CONFIG_PER_VMA_LOCK
+ /* vma is not locked, can't use vma_mark_detached() */
+ vma->detached = true;
+#endif
vma_numab_state_init(vma);
vma_lock_init(vma);
}
@@ -1086,6 +1098,7 @@ static inline int vma_iter_bulk_store(struct vma_iterator *vmi,
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
+ vma_mark_attached(vma);
return 0;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index eb3e35d65e95..57dc5b935f79 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,6 +465,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
data_race(memcpy(new, orig, sizeof(*new)));
vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
+#ifdef CONFIG_PER_VMA_LOCK
+ /* vma is not locked, can't use vma_mark_detached() */
+ new->detached = true;
+#endif
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
diff --git a/mm/memory.c b/mm/memory.c
index 2d97a17dd3ba..cc7159aef918 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6350,7 +6350,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
goto inval;
/* Check if the VMA got isolated after we found it */
- if (vma->detached) {
+ if (is_vma_detached(vma)) {
vma_end_read(vma);
count_vm_vma_lock_event(VMA_LOCK_MISS);
/* The area was replaced with another one */
diff --git a/mm/vma.c b/mm/vma.c
index 6fa240e5b0c5..fbd7254517d6 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -327,7 +327,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
if (vp->remove) {
again:
- vma_mark_detached(vp->remove, true);
+ vma_mark_detached(vp->remove);
if (vp->file) {
uprobe_munmap(vp->remove, vp->remove->vm_start,
vp->remove->vm_end);
@@ -1222,7 +1222,7 @@ static void reattach_vmas(struct ma_state *mas_detach)
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- vma_mark_detached(vma, false);
+ vma_mark_attached(vma);
__mt_destroy(mas_detach->tree);
}
@@ -1297,7 +1297,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (error)
goto munmap_gather_failed;
- vma_mark_detached(next, true);
+ vma_mark_detached(next);
nrpages = vma_pages(next);
vms->nr_pages += nrpages;
diff --git a/mm/vma.h b/mm/vma.h
index 61ed044b6145..24636a2b0acf 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -157,6 +157,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
+ vma_mark_attached(vma);
return 0;
}
@@ -389,6 +390,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
mas_store_prealloc(&vmi->mas, vma);
+ vma_mark_attached(vma);
}
static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 568c18d24d53..0cdc5f8c3d60 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -465,13 +465,17 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
vma->vm_lock_seq = UINT_MAX;
}
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+ vma->detached = false;
+}
+
static inline void vma_assert_write_locked(struct vm_area_struct *);
-static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+static inline void vma_mark_detached(struct vm_area_struct *vma)
{
/* When detaching vma should be write-locked */
- if (detached)
- vma_assert_write_locked(vma);
- vma->detached = detached;
+ vma_assert_write_locked(vma);
+ vma->detached = true;
}
extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -484,7 +488,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma_mark_detached(vma, false);
+ /* vma is not locked, can't use vma_mark_detached() */
+ vma->detached = true;
vma_lock_init(vma);
}
@@ -510,6 +515,8 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
memcpy(new, orig, sizeof(*new));
vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
+ /* vma is not locked, can't use vma_mark_detached() */
+ new->detached = true;
return new;
}
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 04/16] mm/nommu: fix the last places where vma is not locked before being attached
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (2 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 03/16] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 05/16] types: move struct rcuwait into types.h Suren Baghdasaryan
` (12 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
The nommu configuration has two places where a vma gets attached to the
vma tree without being write-locked. Add the missing locking to ensure a
vma is always write-locked before it is attached.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/nommu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/nommu.c b/mm/nommu.c
index 9cb6e99215e2..248392ef4048 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1189,6 +1189,7 @@ unsigned long do_mmap(struct file *file,
goto error_just_free;
setup_vma_to_mm(vma, current->mm);
+ vma_start_write(vma);
current->mm->map_count++;
/* add the VMA to the tree */
vma_iter_store(&vmi, vma);
@@ -1356,6 +1357,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
setup_vma_to_mm(vma, mm);
setup_vma_to_mm(new, mm);
+ vma_start_write(new);
vma_iter_store(vmi, new);
mm->map_count++;
return 0;
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 05/16] types: move struct rcuwait into types.h
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (3 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 04/16] mm/nommu: fix the last places where vma is not locked before being attached Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 06/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
` (11 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Move the rcuwait struct definition into types.h so that rcuwait can be
used without including rcuwait.h, which pulls in other headers. Without
this change mm_types.h can't use rcuwait due to the following circular
dependency:
mm_types.h -> rcuwait.h -> signal.h -> mm_types.h
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/rcuwait.h | 13 +------------
include/linux/types.h | 12 ++++++++++++
2 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 27343424225c..9ad134a04b41 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -4,18 +4,7 @@
#include <linux/rcupdate.h>
#include <linux/sched/signal.h>
-
-/*
- * rcuwait provides a way of blocking and waking up a single
- * task in an rcu-safe manner.
- *
- * The only time @task is non-nil is when a user is blocked (or
- * checking if it needs to) on a condition, and reset as soon as we
- * know that the condition has succeeded and are awoken.
- */
-struct rcuwait {
- struct task_struct __rcu *task;
-};
+#include <linux/types.h>
#define __RCUWAIT_INITIALIZER(name) \
{ .task = NULL, }
diff --git a/include/linux/types.h b/include/linux/types.h
index 2d7b9ae8714c..f1356a9a5730 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -248,5 +248,17 @@ typedef void (*swap_func_t)(void *a, void *b, int size);
typedef int (*cmp_r_func_t)(const void *a, const void *b, const void *priv);
typedef int (*cmp_func_t)(const void *a, const void *b);
+/*
+ * rcuwait provides a way of blocking and waking up a single
+ * task in an rcu-safe manner.
+ *
+ * The only time @task is non-nil is when a user is blocked (or
+ * checking if it needs to) on a condition, and reset as soon as we
+ * know that the condition has succeeded and are awoken.
+ */
+struct rcuwait {
+ struct task_struct __rcu *task;
+};
+
#endif /* __ASSEMBLY__ */
#endif /* _LINUX_TYPES_H */
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 06/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (4 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 05/16] types: move struct rcuwait into types.h Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-17 11:31 ` Lokesh Gidra
2024-12-16 19:24 ` [PATCH v6 07/16] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
` (10 subsequent siblings)
16 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
With the upcoming replacement of vm_lock with vm_refcnt, we need to handle
the possibility of vma_start_read_locked/vma_start_read_locked_nested
failing due to refcount overflow. Prepare for that possibility by changing
these APIs and adjusting their users.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
---
include/linux/mm.h | 6 ++++--
mm/userfaultfd.c | 17 ++++++++++++-----
2 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 689f5a1e2181..0ecd321c50b7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -747,10 +747,11 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* not be used in such cases because it might fail due to mm_lock_seq overflow.
* This functionality is used to obtain vma read lock and drop the mmap read lock.
*/
-static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
{
mmap_assert_locked(vma->vm_mm);
down_read_nested(&vma->vm_lock.lock, subclass);
+ return true;
}
/*
@@ -759,10 +760,11 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
* not be used in such cases because it might fail due to mm_lock_seq overflow.
* This functionality is used to obtain vma read lock and drop the mmap read lock.
*/
-static inline void vma_start_read_locked(struct vm_area_struct *vma)
+static inline bool vma_start_read_locked(struct vm_area_struct *vma)
{
mmap_assert_locked(vma->vm_mm);
down_read(&vma->vm_lock.lock);
+ return true;
}
static inline void vma_end_read(struct vm_area_struct *vma)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc9a66ec6a6e..79e8ae676f75 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -85,7 +85,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
mmap_read_lock(mm);
vma = find_vma_and_prepare_anon(mm, address);
if (!IS_ERR(vma))
- vma_start_read_locked(vma);
+ if (!vma_start_read_locked(vma))
+ vma = ERR_PTR(-EAGAIN);
mmap_read_unlock(mm);
return vma;
@@ -1483,10 +1484,16 @@ static int uffd_move_lock(struct mm_struct *mm,
mmap_read_lock(mm);
err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
if (!err) {
- vma_start_read_locked(*dst_vmap);
- if (*dst_vmap != *src_vmap)
- vma_start_read_locked_nested(*src_vmap,
- SINGLE_DEPTH_NESTING);
+ if (!vma_start_read_locked(*dst_vmap)) {
+ if (*dst_vmap != *src_vmap) {
+ if (!vma_start_read_locked_nested(*src_vmap,
+ SINGLE_DEPTH_NESTING)) {
+ vma_end_read(*dst_vmap);
+ err = -EAGAIN;
+ }
+ }
+ } else
+ err = -EAGAIN;
}
mmap_read_unlock(mm);
return err;
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 07/16] mm: move mmap_init_lock() out of the header file
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (5 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 06/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 08/16] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
` (9 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
mmap_init_lock() is used only from mm_init() in fork.c, so it does not
have to reside in the header file. This move lets us avoid including
additional headers in mmap_lock.h later, when mmap_init_lock() will need
to initialize an rcuwait object.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mmap_lock.h | 6 ------
kernel/fork.c | 6 ++++++
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 45a21faa3ff6..4706c6769902 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -122,12 +122,6 @@ static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int
#endif /* CONFIG_PER_VMA_LOCK */
-static inline void mmap_init_lock(struct mm_struct *mm)
-{
- init_rwsem(&mm->mmap_lock);
- mm_lock_seqcount_init(mm);
-}
-
static inline void mmap_write_lock(struct mm_struct *mm)
{
__mmap_lock_trace_start_locking(mm, true);
diff --git a/kernel/fork.c b/kernel/fork.c
index 57dc5b935f79..8cb19c23e892 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,6 +1224,12 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
#endif
}
+static inline void mmap_init_lock(struct mm_struct *mm)
+{
+ init_rwsem(&mm->mmap_lock);
+ mm_lock_seqcount_init(mm);
+}
+
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 08/16] mm: uninline the main body of vma_start_write()
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (6 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 07/16] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 09/16] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
` (8 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
vma_start_write() is used in many places and will grow in size very soon.
It is not used in performance-critical paths, so uninlining it should
limit future code size growth.
No functional changes.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 12 +++---------
mm/memory.c | 14 ++++++++++++++
2 files changed, 17 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0ecd321c50b7..ccb8f2afeca8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -787,6 +787,8 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
return (vma->vm_lock_seq == *mm_lock_seq);
}
+void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq);
+
/*
* Begin writing to a VMA.
* Exclude concurrent readers under the per-VMA lock until the currently
@@ -799,15 +801,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
if (__is_vma_write_locked(vma, &mm_lock_seq))
return;
- down_write(&vma->vm_lock.lock);
- /*
- * We should use WRITE_ONCE() here because we can have concurrent reads
- * from the early lockless pessimistic check in vma_start_read().
- * We don't really care about the correctness of that early check, but
- * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
- */
- WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- up_write(&vma->vm_lock.lock);
+ __vma_start_write(vma, mm_lock_seq);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index cc7159aef918..c6356ea703d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6329,6 +6329,20 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
#endif
#ifdef CONFIG_PER_VMA_LOCK
+void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
+{
+ down_write(&vma->vm_lock.lock);
+ /*
+ * We should use WRITE_ONCE() here because we can have concurrent reads
+ * from the early lockless pessimistic check in vma_start_read().
+ * We don't really care about the correctness of that early check, but
+ * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
+ */
+ WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
+ up_write(&vma->vm_lock.lock);
+}
+EXPORT_SYMBOL_GPL(__vma_start_write);
+
/*
* Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
* stable and not isolated. If the VMA is not found or is being modified the
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 09/16] refcount: introduce __refcount_{add|inc}_not_zero_limited
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (7 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 08/16] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
` (7 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Introduce functions to increase a refcount but with a top limit above
which they will refuse to increase it. Setting the limit to 0 indicates
no limit.
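As an illustration, a caller of the new helper would look roughly like the
sketch below; example_get_ref() and EXAMPLE_LIMIT are made-up names for
this example and are not part of the patch:

static bool example_get_ref(refcount_t *r)
{
	int old;

	/* Fails when the count is 0 or would exceed EXAMPLE_LIMIT. */
	return __refcount_inc_not_zero_limited(r, &old, EXAMPLE_LIMIT);
}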
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/refcount.h | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..e51a49179307 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -137,13 +137,19 @@ static inline unsigned int refcount_read(const refcount_t *r)
}
static inline __must_check __signed_wrap
-bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
+bool __refcount_add_not_zero_limited(int i, refcount_t *r, int *oldp,
+ int limit)
{
int old = refcount_read(r);
do {
if (!old)
break;
+ if (limit && old + i > limit) {
+ if (oldp)
+ *oldp = old;
+ return false;
+ }
} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
if (oldp)
@@ -155,6 +161,12 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
return old;
}
+static inline __must_check __signed_wrap
+bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
+{
+ return __refcount_add_not_zero_limited(i, r, oldp, 0);
+}
+
/**
* refcount_add_not_zero - add a value to a refcount unless it is 0
* @i: the value to add to the refcount
@@ -213,6 +225,12 @@ static inline void refcount_add(int i, refcount_t *r)
__refcount_add(i, r, NULL);
}
+static inline __must_check bool __refcount_inc_not_zero_limited(refcount_t *r,
+ int *oldp, int limit)
+{
+ return __refcount_add_not_zero_limited(1, r, oldp, limit);
+}
+
static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp)
{
return __refcount_add_not_zero(1, r, oldp);
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (8 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 09/16] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 20:42 ` Peter Zijlstra
` (2 more replies)
2024-12-16 19:24 ` [PATCH v6 11/16] mm: enforce vma to be in detached state before freeing Suren Baghdasaryan
` (6 subsequent siblings)
16 siblings, 3 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space in each vm_area_struct. However, vma_lock has
two important properties that allow replacing rw_semaphore with a
simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock,
because writers first take mmap_lock in write mode.
Because of these properties, full rw_semaphore functionality is not
needed and we can replace rw_semaphore and the vma->detached flag with
a refcount (vm_refcnt).
When the vma is in detached state, vm_refcnt is 0 and only a call to
vma_mark_attached() can take it out of this state. Note that unlike
before, we now enforce that both vma_mark_attached() and
vma_mark_detached() are done only after the vma has been write-locked.
vma_mark_attached() changes vm_refcnt to 1 to indicate that the vma has
been attached to the vma tree. When a reader takes the read lock, it
increments vm_refcnt, unless the top usable bit of vm_refcnt (0x40000000)
is set, indicating the presence of a writer. When a writer takes the
write lock, it both increments vm_refcnt and sets the top usable bit to
indicate its presence. If there are readers, the writer will wait using
the newly introduced mm->vma_writer_wait. Since all writers take
mmap_lock in write mode first, there can be only one writer at a time.
The last reader to release the lock will signal the writer to wake up.
The refcount might overflow if there are many competing readers, in which
case read-locking will fail. Readers are expected to handle such failures.
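As an informal summary of the scheme (using the constants from the patch
below; the value breakdown is shown for illustration only):

  vm_refcnt == 0x00000000      detached, no references
  vm_refcnt == 0x00000001      attached, unlocked
  vm_refcnt == 0x00000001 + N  attached, read-locked by N readers
  vm_refcnt &  0x40000000      writer present; new readers fail
                               vma_start_read() and fall back to mmap_lock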
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 95 ++++++++++++++++++++++++--------
include/linux/mm_types.h | 23 ++++----
kernel/fork.c | 9 +--
mm/init-mm.c | 1 +
mm/memory.c | 33 +++++++----
tools/testing/vma/linux/atomic.h | 5 ++
tools/testing/vma/vma_internal.h | 57 ++++++++++---------
7 files changed, 147 insertions(+), 76 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ccb8f2afeca8..d9edabc385b3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -32,6 +32,7 @@
#include <linux/memremap.h>
#include <linux/slab.h>
#include <linux/cacheinfo.h>
+#include <linux/rcuwait.h>
struct mempolicy;
struct anon_vma;
@@ -699,10 +700,27 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
#ifdef CONFIG_PER_VMA_LOCK
static inline void vma_lock_init(struct vm_area_struct *vma)
{
- init_rwsem(&vma->vm_lock.lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ static struct lock_class_key lockdep_key;
+
+ lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
+#endif
+ refcount_set(&vma->vm_refcnt, VMA_STATE_DETACHED);
vma->vm_lock_seq = UINT_MAX;
}
+static inline void vma_refcount_put(struct vm_area_struct *vma)
+{
+ int refcnt;
+
+ if (!__refcount_dec_and_test(&vma->vm_refcnt, &refcnt)) {
+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+
+ if (refcnt & VMA_STATE_LOCKED)
+ rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
+ }
+}
+
/*
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
@@ -710,6 +728,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
*/
static inline bool vma_start_read(struct vm_area_struct *vma)
{
+ int oldcnt;
+
/*
* Check before locking. A race might cause false locked result.
* We can use READ_ONCE() for the mm_lock_seq here, and don't need
@@ -720,13 +740,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
return false;
- if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
+
+ rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
+ /* Limit at VMA_STATE_LOCKED - 2 to leave one count for a writer */
+ if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+ VMA_STATE_LOCKED - 2))) {
+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
return false;
+ }
+ lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
/*
- * Overflow might produce false locked result.
+ * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
* False unlocked result is impossible because we modify and check
- * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
+ * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
* modification invalidates all existing locks.
*
* We must use ACQUIRE semantics for the mm_lock_seq so that if we are
@@ -734,10 +761,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* after it has been unlocked.
* This pairs with RELEASE semantics in vma_end_write_all().
*/
- if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->vm_lock.lock);
+ if (oldcnt & VMA_STATE_LOCKED ||
+ unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
+ vma_refcount_put(vma);
return false;
}
+
return true;
}
@@ -749,8 +778,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
*/
static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
{
+ int oldcnt;
+
mmap_assert_locked(vma->vm_mm);
- down_read_nested(&vma->vm_lock.lock, subclass);
+ rwsem_acquire_read(&vma->vmlock_dep_map, subclass, 0, _RET_IP_);
+ /* Limit at VMA_STATE_LOCKED - 2 to leave one count for a writer */
+ if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+ VMA_STATE_LOCKED - 2))) {
+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+ return false;
+ }
+ lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
return true;
}
@@ -762,15 +800,13 @@ static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int
*/
static inline bool vma_start_read_locked(struct vm_area_struct *vma)
{
- mmap_assert_locked(vma->vm_mm);
- down_read(&vma->vm_lock.lock);
- return true;
+ return vma_start_read_locked_nested(vma, 0);
}
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
- up_read(&vma->vm_lock.lock);
+ vma_refcount_put(vma);
rcu_read_unlock();
}
@@ -813,25 +849,42 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_assert_locked(struct vm_area_struct *vma)
{
- if (!rwsem_is_locked(&vma->vm_lock.lock))
+ if (refcount_read(&vma->vm_refcnt) <= VMA_STATE_ATTACHED)
vma_assert_write_locked(vma);
}
-static inline void vma_mark_attached(struct vm_area_struct *vma)
+/*
+ * WARNING: to avoid racing with vma_mark_attached(), should be called either
+ * under mmap_write_lock or when the object has been isolated under
+ * mmap_write_lock, ensuring no competing writers.
+ */
+static inline bool is_vma_detached(struct vm_area_struct *vma)
{
- vma->detached = false;
+ return refcount_read(&vma->vm_refcnt) == VMA_STATE_DETACHED;
}
-static inline void vma_mark_detached(struct vm_area_struct *vma)
+static inline void vma_mark_attached(struct vm_area_struct *vma)
{
- /* When detaching vma should be write-locked */
vma_assert_write_locked(vma);
- vma->detached = true;
+
+ if (is_vma_detached(vma))
+ refcount_set(&vma->vm_refcnt, VMA_STATE_ATTACHED);
}
-static inline bool is_vma_detached(struct vm_area_struct *vma)
+static inline void vma_mark_detached(struct vm_area_struct *vma)
{
- return vma->detached;
+ vma_assert_write_locked(vma);
+
+ if (is_vma_detached(vma))
+ return;
+
+ /* We are the only writer, so no need to use vma_refcount_put(). */
+ if (!refcount_dec_and_test(&vma->vm_refcnt)) {
+ /*
+ * Reader must have temporarily raised vm_refcnt but it will
+ * drop it without using the vma since vma is write-locked.
+ */
+ }
}
static inline void release_fault_lock(struct vm_fault *vmf)
@@ -896,10 +949,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
- /* vma is not locked, can't use vma_mark_detached() */
- vma->detached = true;
-#endif
vma_numab_state_init(vma);
vma_lock_init(vma);
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 825f6328f9e5..803f718c007c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -19,6 +19,7 @@
#include <linux/workqueue.h>
#include <linux/seqlock.h>
#include <linux/percpu_counter.h>
+#include <linux/types.h>
#include <asm/mmu.h>
@@ -599,9 +600,9 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
}
#endif
-struct vma_lock {
- struct rw_semaphore lock;
-};
+#define VMA_STATE_DETACHED 0x0
+#define VMA_STATE_ATTACHED 0x1
+#define VMA_STATE_LOCKED 0x40000000
struct vma_numab_state {
/*
@@ -679,19 +680,13 @@ struct vm_area_struct {
};
#ifdef CONFIG_PER_VMA_LOCK
- /*
- * Flag to indicate areas detached from the mm->mm_mt tree.
- * Unstable RCU readers are allowed to read this.
- */
- bool detached;
-
/*
* Can only be written (using WRITE_ONCE()) while holding both:
* - mmap_lock (in write mode)
- * - vm_lock->lock (in write mode)
+ * - vm_refcnt VMA_STATE_LOCKED is set
* Can be read reliably while holding one of:
* - mmap_lock (in read or write mode)
- * - vm_lock->lock (in read or write mode)
+ * - vm_refcnt VMA_STATE_LOCKED is set or vm_refcnt > VMA_STATE_ATTACHED
* Can be read unreliably (using READ_ONCE()) for pessimistic bailout
* while holding nothing (except RCU to keep the VMA struct allocated).
*
@@ -754,7 +749,10 @@ struct vm_area_struct {
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
#ifdef CONFIG_PER_VMA_LOCK
/* Unstable RCU readers are allowed to read this. */
- struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+ refcount_t vm_refcnt ____cacheline_aligned_in_smp;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map vmlock_dep_map;
+#endif
#endif
} __randomize_layout;
@@ -889,6 +887,7 @@ struct mm_struct {
* by mmlist_lock
*/
#ifdef CONFIG_PER_VMA_LOCK
+ struct rcuwait vma_writer_wait;
/*
* This field has lock-like semantics, meaning it is sometimes
* accessed with ACQUIRE/RELEASE semantics.
diff --git a/kernel/fork.c b/kernel/fork.c
index 8cb19c23e892..283909d082cb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,10 +465,6 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
data_race(memcpy(new, orig, sizeof(*new)));
vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
- /* vma is not locked, can't use vma_mark_detached() */
- new->detached = true;
-#endif
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
@@ -488,8 +484,6 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
vm_rcu);
- /* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
__vm_area_free(vma);
}
#endif
@@ -1228,6 +1222,9 @@ static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
mm_lock_seqcount_init(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+ rcuwait_init(&mm->vma_writer_wait);
+#endif
}
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 6af3ad675930..4600e7605cab 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -40,6 +40,7 @@ struct mm_struct init_mm = {
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
#ifdef CONFIG_PER_VMA_LOCK
+ .vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
.mm_lock_seq = SEQCNT_ZERO(init_mm.mm_lock_seq),
#endif
.user_ns = &init_user_ns,
diff --git a/mm/memory.c b/mm/memory.c
index c6356ea703d8..cff132003e24 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6331,7 +6331,25 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
#ifdef CONFIG_PER_VMA_LOCK
void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
{
- down_write(&vma->vm_lock.lock);
+ bool detached;
+
+ /*
+ * If vma is detached then only vma_mark_attached() can raise the
+ * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
+ */
+ if (!refcount_inc_not_zero(&vma->vm_refcnt)) {
+ WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
+ return;
+ }
+
+ rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
+ /* vma is attached, set the writer present bit */
+ refcount_add(VMA_STATE_LOCKED, &vma->vm_refcnt);
+ /* wait until state is VMA_STATE_ATTACHED + (VMA_STATE_LOCKED + 1) */
+ rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
+ refcount_read(&vma->vm_refcnt) == VMA_STATE_ATTACHED + (VMA_STATE_LOCKED + 1),
+ TASK_UNINTERRUPTIBLE);
+ lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
/*
* We should use WRITE_ONCE() here because we can have concurrent reads
* from the early lockless pessimistic check in vma_start_read().
@@ -6339,7 +6357,10 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
* we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
*/
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- up_write(&vma->vm_lock.lock);
+ detached = refcount_sub_and_test(VMA_STATE_LOCKED + 1,
+ &vma->vm_refcnt);
+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+ VM_BUG_ON_VMA(detached, vma); /* vma should remain attached */
}
EXPORT_SYMBOL_GPL(__vma_start_write);
@@ -6355,7 +6376,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
struct vm_area_struct *vma;
rcu_read_lock();
-retry:
vma = mas_walk(&mas);
if (!vma)
goto inval;
@@ -6363,13 +6383,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_start_read(vma))
goto inval;
- /* Check if the VMA got isolated after we found it */
- if (is_vma_detached(vma)) {
- vma_end_read(vma);
- count_vm_vma_lock_event(VMA_LOCK_MISS);
- /* The area was replaced with another one */
- goto retry;
- }
/*
* At this point, we have a stable reference to a VMA: The VMA is
* locked and we know it hasn't already been isolated.
diff --git a/tools/testing/vma/linux/atomic.h b/tools/testing/vma/linux/atomic.h
index e01f66f98982..2e2021553196 100644
--- a/tools/testing/vma/linux/atomic.h
+++ b/tools/testing/vma/linux/atomic.h
@@ -9,4 +9,9 @@
#define atomic_set(x, y) do {} while (0)
#define U8_MAX UCHAR_MAX
+#ifndef atomic_cmpxchg_relaxed
+#define atomic_cmpxchg_relaxed uatomic_cmpxchg
+#define atomic_cmpxchg_release uatomic_cmpxchg
+#endif /* atomic_cmpxchg_relaxed */
+
#endif /* _LINUX_ATOMIC_H */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 0cdc5f8c3d60..b55556b16060 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -25,7 +25,7 @@
#include <linux/maple_tree.h>
#include <linux/mm.h>
#include <linux/rbtree.h>
-#include <linux/rwsem.h>
+#include <linux/refcount.h>
extern unsigned long stack_guard_gap;
#ifdef CONFIG_MMU
@@ -132,10 +132,6 @@ typedef __bitwise unsigned int vm_fault_t;
*/
#define pr_warn_once pr_err
-typedef struct refcount_struct {
- atomic_t refs;
-} refcount_t;
-
struct kref {
refcount_t refcount;
};
@@ -228,15 +224,14 @@ struct mm_struct {
unsigned long def_flags;
};
-struct vma_lock {
- struct rw_semaphore lock;
-};
-
-
struct file {
struct address_space *f_mapping;
};
+#define VMA_STATE_DETACHED 0x0
+#define VMA_STATE_ATTACHED 0x1
+#define VMA_STATE_LOCKED 0x40000000
+
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
@@ -264,16 +259,13 @@ struct vm_area_struct {
};
#ifdef CONFIG_PER_VMA_LOCK
- /* Flag to indicate areas detached from the mm->mm_mt tree */
- bool detached;
-
/*
* Can only be written (using WRITE_ONCE()) while holding both:
* - mmap_lock (in write mode)
- * - vm_lock.lock (in write mode)
+ * - vm_refcnt VMA_STATE_LOCKED is set
* Can be read reliably while holding one of:
* - mmap_lock (in read or write mode)
- * - vm_lock.lock (in read or write mode)
+ * - vm_refcnt VMA_STATE_LOCKED is set or vm_refcnt > VMA_STATE_ATTACHED
* Can be read unreliably (using READ_ONCE()) for pessimistic bailout
* while holding nothing (except RCU to keep the VMA struct allocated).
*
@@ -282,7 +274,6 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
- struct vma_lock vm_lock;
#endif
/*
@@ -335,6 +326,10 @@ struct vm_area_struct {
struct vma_numab_state *numab_state; /* NUMA Balancing state */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+ /* Unstable RCU readers are allowed to read this. */
+ refcount_t vm_refcnt;
+#endif
} __randomize_layout;
struct vm_fault {};
@@ -461,21 +456,37 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
static inline void vma_lock_init(struct vm_area_struct *vma)
{
- init_rwsem(&vma->vm_lock.lock);
+ refcount_set(&vma->vm_refcnt, VMA_STATE_DETACHED);
vma->vm_lock_seq = UINT_MAX;
}
-static inline void vma_mark_attached(struct vm_area_struct *vma)
+static inline bool is_vma_detached(struct vm_area_struct *vma)
{
- vma->detached = false;
+ return refcount_read(&vma->vm_refcnt) == VMA_STATE_DETACHED;
}
static inline void vma_assert_write_locked(struct vm_area_struct *);
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+ vma_assert_write_locked(vma);
+
+ if (is_vma_detached(vma))
+ refcount_set(&vma->vm_refcnt, VMA_STATE_ATTACHED);
+}
+
static inline void vma_mark_detached(struct vm_area_struct *vma)
{
- /* When detaching vma should be write-locked */
vma_assert_write_locked(vma);
- vma->detached = true;
+
+ if (is_vma_detached(vma))
+ return;
+
+ if (!refcount_dec_and_test(&vma->vm_refcnt)) {
+ /*
+ * Reader must have temporarily raised vm_refcnt but it will
+ * drop it without using the vma since vma is write-locked.
+ */
+ }
}
extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -488,8 +499,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- /* vma is not locked, can't use vma_mark_detached() */
- vma->detached = true;
vma_lock_init(vma);
}
@@ -515,8 +524,6 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
memcpy(new, orig, sizeof(*new));
vma_lock_init(new);
INIT_LIST_HEAD(&new->anon_vma_chain);
- /* vma is not locked, can't use vma_mark_detached() */
- new->detached = true;
return new;
}
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 11/16] mm: enforce vma to be in detached state before freeing
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (9 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 21:16 ` Peter Zijlstra
2024-12-16 19:24 ` [PATCH v6 12/16] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
` (5 subsequent siblings)
16 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
exit_mmap() frees vmas without detaching them. This will become a problem
when we introduce vma reuse. Ensure that vmas are always detached before
being freed.
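Any path that frees a possibly-attached vma therefore has to write-lock and
detach it first. A minimal sketch of the rule, using a hypothetical wrapper
(the remove_vma() hunk below applies the same pattern):

/* Hypothetical wrapper illustrating the detach-before-free rule. */
static void free_unreachable_vma(struct vm_area_struct *vma)
{
	if (!is_vma_detached(vma)) {
		vma_start_write(vma);	/* exclude concurrent lockless readers */
		vma_mark_detached(vma);	/* clear the attached state */
	}
	__vm_area_free(vma);		/* now passes the new detached check */
}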
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
kernel/fork.c | 4 ++++
mm/vma.c | 10 ++++++++--
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 283909d082cb..f1ddfc7b3b48 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -473,6 +473,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
void __vm_area_free(struct vm_area_struct *vma)
{
+#ifdef CONFIG_PER_VMA_LOCK
+ /* The vma should be detached while being destroyed. */
+ VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
+#endif
vma_numab_state_free(vma);
free_anon_vma_name(vma);
kmem_cache_free(vm_area_cachep, vma);
diff --git a/mm/vma.c b/mm/vma.c
index fbd7254517d6..0436a7d21e01 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -413,9 +413,15 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
if (vma->vm_file)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
- if (unreachable)
+ if (unreachable) {
+#ifdef CONFIG_PER_VMA_LOCK
+ if (!is_vma_detached(vma)) {
+ vma_start_write(vma);
+ vma_mark_detached(vma);
+ }
+#endif
__vm_area_free(vma);
- else
+ } else
vm_area_free(vma);
}
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 12/16] mm: remove extra vma_numab_state_init() call
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (10 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 11/16] mm: enforce vma to be in detached state before freeing Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 13/16] mm: introduce vma_ensure_detached() Suren Baghdasaryan
` (4 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
vma_init() already memsets the whole vm_area_struct to 0, so there is
no need for an additional vma_numab_state_init() call.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d9edabc385b3..b73cf64233a4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -949,7 +949,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &vma_dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma_numab_state_init(vma);
vma_lock_init(vma);
}
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 13/16] mm: introduce vma_ensure_detached()
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (11 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 12/16] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-17 10:26 ` Peter Zijlstra
2024-12-16 19:24 ` [PATCH v6 14/16] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
` (3 subsequent siblings)
16 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
vma_start_read() can temporarily raise vm_refcnt of a write-locked and
detached vma:
  // vm_refcnt==1 (attached)
  vma_start_write()
  vma->vm_lock_seq = mm->mm_lock_seq
                                        vma_start_read()
                                          vm_refcnt++; // vm_refcnt==2
  vma_mark_detached()
    vm_refcnt--; // vm_refcnt==1
  // vma is detached but vm_refcnt!=0 temporarily
                                        if (vma->vm_lock_seq == mm->mm_lock_seq)
                                          vma_refcount_put()
                                            vm_refcnt--; // vm_refcnt==0
This is currently not a problem when freeing the vma because an RCU grace
period should pass before kmem_cache_free(vma) gets called, and by that time
vma_start_read() should be done and vm_refcnt back to 0. However, once we
introduce the possibility of vma reuse before the RCU grace period is over,
this will become a problem (a reused vma might be in a non-detached state).
Introduce vma_ensure_detached() for the writer to wait until readers exit
vma_start_read().
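For illustration, the intended call site is the freeing path; simplified from
the vm_area_free() rework later in this series:

void vm_area_free(struct vm_area_struct *vma)
{
#ifdef CONFIG_PER_VMA_LOCK
	/* Wait for temporary readers to drop vm_refcnt before freeing. */
	vma_ensure_detached(vma);
#endif
	vma_numab_state_free(vma);
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
}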
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 9 ++++++
mm/memory.c | 55 +++++++++++++++++++++++---------
tools/testing/vma/vma_internal.h | 8 +++++
3 files changed, 57 insertions(+), 15 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b73cf64233a4..361f26dedab1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -863,6 +863,15 @@ static inline bool is_vma_detached(struct vm_area_struct *vma)
return refcount_read(&vma->vm_refcnt) == VMA_STATE_DETACHED;
}
+/*
+ * WARNING: to avoid racing with vma_mark_attached(), should be called either
+ * under mmap_write_lock or when the object has been isolated under
+ * mmap_write_lock, ensuring no competing writers.
+ * Should be called after marking vma as detached to wait for possible
+ * readers which temporarily raised vm_refcnt to drop it back and exit.
+ */
+void vma_ensure_detached(struct vm_area_struct *vma);
+
static inline void vma_mark_attached(struct vm_area_struct *vma)
{
vma_assert_write_locked(vma);
diff --git a/mm/memory.c b/mm/memory.c
index cff132003e24..534e279f98c1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6329,18 +6329,10 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
#endif
#ifdef CONFIG_PER_VMA_LOCK
-void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
+static inline bool __vma_enter_locked(struct vm_area_struct *vma)
{
- bool detached;
-
- /*
- * If vma is detached then only vma_mark_attached() can raise the
- * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
- */
- if (!refcount_inc_not_zero(&vma->vm_refcnt)) {
- WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- return;
- }
+ if (!refcount_inc_not_zero(&vma->vm_refcnt))
+ return false;
rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
/* vma is attached, set the writer present bit */
@@ -6350,6 +6342,22 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
refcount_read(&vma->vm_refcnt) == VMA_STATE_ATTACHED + (VMA_STATE_LOCKED + 1),
TASK_UNINTERRUPTIBLE);
lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
+
+ return true;
+}
+
+static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *is_detached)
+{
+ *is_detached = refcount_sub_and_test(VMA_STATE_LOCKED + 1,
+ &vma->vm_refcnt);
+ rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+}
+
+void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
+{
+ bool locked;
+
+ locked = __vma_enter_locked(vma);
/*
* We should use WRITE_ONCE() here because we can have concurrent reads
* from the early lockless pessimistic check in vma_start_read().
@@ -6357,13 +6365,30 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
* we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
*/
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
- detached = refcount_sub_and_test(VMA_STATE_LOCKED + 1,
- &vma->vm_refcnt);
- rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
- VM_BUG_ON_VMA(detached, vma); /* vma should remain attached */
+ if (locked) {
+ bool detached;
+
+ __vma_exit_locked(vma, &detached);
+ /* vma was originally attached and should remain so */
+ VM_BUG_ON_VMA(detached, vma);
+ }
}
EXPORT_SYMBOL_GPL(__vma_start_write);
+void vma_ensure_detached(struct vm_area_struct *vma)
+{
+ if (is_vma_detached(vma))
+ return;
+
+ if (__vma_enter_locked(vma)) {
+ bool detached;
+
+ /* Wait for temporary readers to drop the vm_refcnt */
+ __vma_exit_locked(vma, &detached);
+ VM_BUG_ON_VMA(!detached, vma);
+ }
+}
+
/*
* Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
* stable and not isolated. If the VMA is not found or is being modified the
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index b55556b16060..ac0a59906fea 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -465,6 +465,14 @@ static inline bool is_vma_detached(struct vm_area_struct *vma)
return refcount_read(&vma->vm_refcnt) == VMA_STATE_DETACHED;
}
+static inline void vma_ensure_detached(struct vm_area_struct *vma)
+{
+ if (is_vma_detached(vma))
+ return;
+
+ refcount_set(&vma->vm_refcnt, VMA_STATE_DETACHED);
+}
+
static inline void vma_assert_write_locked(struct vm_area_struct *);
static inline void vma_mark_attached(struct vm_area_struct *vma)
{
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 14/16] mm: prepare lock_vma_under_rcu() for vma reuse possibility
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (12 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 13/16] mm: introduce vma_ensure_detached() Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
` (2 subsequent siblings)
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Once we make the vma cache SLAB_TYPESAFE_BY_RCU, it will be possible for a vma
to be reused and attached to another mm after lock_vma_under_rcu() locks the
vma. lock_vma_under_rcu() should ensure that vma_start_read() is using the
original mm, and after locking the vma it should check that vma->vm_mm has not
changed from under us.
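For context, a rough sketch of how a fault handler is expected to consume
lock_vma_under_rcu() with this change, falling back to mmap_lock when the
per-vma attempt does not complete (signal and error handling elided, the
function name is illustrative):

static vm_fault_t fault_with_vma_lock(struct mm_struct *mm, unsigned long addr,
				      unsigned int flags, struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	vm_fault_t fault;

	/* Fast path: per-vma lock; vma->vm_mm is revalidated after locking. */
	vma = lock_vma_under_rcu(mm, addr);
	if (vma) {
		fault = handle_mm_fault(vma, addr,
					flags | FAULT_FLAG_VMA_LOCK, regs);
		if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
			vma_end_read(vma);
		if (!(fault & VM_FAULT_RETRY))
			return fault;
	}

	/* Slow path: retry under mmap_lock. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, addr);
	if (!vma) {
		mmap_read_unlock(mm);
		return VM_FAULT_SIGSEGV;
	}
	fault = handle_mm_fault(vma, addr, flags, regs);
	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
		mmap_read_unlock(mm);
	return fault;
}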
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 10 ++++++----
mm/memory.c | 7 ++++---
2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 361f26dedab1..bfd01ae07660 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -725,8 +725,10 @@ static inline void vma_refcount_put(struct vm_area_struct *vma)
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
* using mmap_lock. The function should never yield false unlocked result.
+ * False locked result is possible if mm_lock_seq overflows or if vma gets
+ * reused and attached to a different mm before we lock it.
*/
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
{
int oldcnt;
@@ -737,7 +739,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* we don't rely on for anything - the mm_lock_seq read against which we
* need ordering is below.
*/
- if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+ if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
return false;
@@ -762,7 +764,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
* This pairs with RELEASE semantics in vma_end_write_all().
*/
if (oldcnt & VMA_STATE_LOCKED ||
- unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
+ unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
vma_refcount_put(vma);
return false;
}
@@ -918,7 +920,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
#else /* CONFIG_PER_VMA_LOCK */
static inline void vma_lock_init(struct vm_area_struct *vma) {}
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
{ return false; }
static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index 534e279f98c1..2131d9769bb4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6405,7 +6405,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma)
goto inval;
- if (!vma_start_read(vma))
+ if (!vma_start_read(mm, vma))
goto inval;
/*
@@ -6415,8 +6415,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
* fields are accessible for RCU readers.
*/
- /* Check since vm_start/vm_end might change before we lock the VMA */
- if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+ /* Check if the vma we locked is the right one. */
+ if (unlikely(vma->vm_mm != mm ||
+ address < vma->vm_start || address >= vma->vm_end))
goto inval_end_read;
rcu_read_unlock();
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (13 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 14/16] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 16/16] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
2024-12-16 19:39 ` [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
To enable SLAB_TYPESAFE_BY_RCU for the vma cache we need to ensure that
object reuse before the RCU grace period is over will be detected by
lock_vma_under_rcu(). The current checks are sufficient as long as the vma
is detached before it is freed. Implement this guarantee by calling
vma_ensure_detached() before the vma is freed and make vm_area_cachep
SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and will
minimize the number of call_rcu() calls.
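The reuse-detection pattern this relies on looks roughly like the following,
condensed from lock_vma_under_rcu() as it stands after the previous patches
(the wrapper name is illustrative):

static struct vm_area_struct *lookup_vma_rcu(struct mm_struct *mm,
					     unsigned long addr)
{
	MA_STATE(mas, &mm->mm_mt, addr, addr);
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = mas_walk(&mas);
	if (!vma || !vma_start_read(mm, vma)) {
		rcu_read_unlock();
		return NULL;
	}
	/*
	 * With SLAB_TYPESAFE_BY_RCU the object may have been freed and
	 * reused, but it is still a vm_area_struct, so checking its
	 * identity after locking it is sufficient.
	 */
	if (unlikely(vma->vm_mm != mm ||
		     addr < vma->vm_start || addr >= vma->vm_end)) {
		vma_end_read(vma);
		rcu_read_unlock();
		return NULL;
	}
	rcu_read_unlock();
	return vma;
}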
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 2 --
include/linux/mm_types.h | 10 +++++++---
include/linux/slab.h | 6 ------
kernel/fork.c | 34 ++++++++++----------------------
mm/mmap.c | 8 +++++++-
mm/vma.c | 15 +++-----------
mm/vma.h | 2 +-
tools/testing/vma/vma_internal.h | 7 +------
8 files changed, 29 insertions(+), 55 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bfd01ae07660..da773302af70 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -258,8 +258,6 @@ void setup_initial_init_mm(void *start_code, void *end_code,
struct vm_area_struct *vm_area_alloc(struct mm_struct *);
struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
void vm_area_free(struct vm_area_struct *);
-/* Use only if VMA has no other users */
-void __vm_area_free(struct vm_area_struct *vma);
#ifndef CONFIG_MMU
extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 803f718c007c..a720f7383dd8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -544,6 +544,12 @@ static inline void *folio_get_private(struct folio *folio)
typedef unsigned long vm_flags_t;
+/*
+ * freeptr_t represents a SLUB freelist pointer, which might be encoded
+ * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
+ */
+typedef struct { unsigned long v; } freeptr_t;
+
/*
* A region containing a mapping of a non-memory backed file under NOMMU
* conditions. These are held in a global tree and are pinned by the VMAs that
@@ -658,9 +664,7 @@ struct vm_area_struct {
unsigned long vm_start;
unsigned long vm_end;
};
-#ifdef CONFIG_PER_VMA_LOCK
- struct rcu_head vm_rcu; /* Used for deferred freeing. */
-#endif
+ freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
};
/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 10a971c2bde3..681b685b6c4e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -234,12 +234,6 @@ enum _slab_flag_bits {
#define SLAB_NO_OBJ_EXT __SLAB_FLAG_UNUSED
#endif
-/*
- * freeptr_t represents a SLUB freelist pointer, which might be encoded
- * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
- */
-typedef struct { unsigned long v; } freeptr_t;
-
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
diff --git a/kernel/fork.c b/kernel/fork.c
index f1ddfc7b3b48..7affb9245f64 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -471,36 +471,16 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return new;
}
-void __vm_area_free(struct vm_area_struct *vma)
+void vm_area_free(struct vm_area_struct *vma)
{
#ifdef CONFIG_PER_VMA_LOCK
- /* The vma should be detached while being destroyed. */
- VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
+ vma_ensure_detached(vma);
#endif
vma_numab_state_free(vma);
free_anon_vma_name(vma);
kmem_cache_free(vm_area_cachep, vma);
}
-#ifdef CONFIG_PER_VMA_LOCK
-static void vm_area_free_rcu_cb(struct rcu_head *head)
-{
- struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
- vm_rcu);
-
- __vm_area_free(vma);
-}
-#endif
-
-void vm_area_free(struct vm_area_struct *vma)
-{
-#ifdef CONFIG_PER_VMA_LOCK
- call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
-#else
- __vm_area_free(vma);
-#endif
-}
-
static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -3147,6 +3127,11 @@ void __init mm_cache_init(void)
void __init proc_caches_init(void)
{
+ struct kmem_cache_args args = {
+ .use_freeptr_offset = true,
+ .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+ };
+
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3163,8 +3148,9 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
- vm_area_cachep = KMEM_CACHE(vm_area_struct,
- SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+ vm_area_cachep = kmem_cache_create("vm_area_struct",
+ sizeof(struct vm_area_struct), &args,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
SLAB_ACCOUNT);
mmap_init();
nsproxy_cache_init();
diff --git a/mm/mmap.c b/mm/mmap.c
index df9154b15ef9..c848f6d645e9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1284,7 +1284,13 @@ void exit_mmap(struct mm_struct *mm)
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- remove_vma(vma, /* unreachable = */ true);
+#ifdef CONFIG_PER_VMA_LOCK
+ if (!is_vma_detached(vma)) {
+ vma_start_write(vma);
+ vma_mark_detached(vma);
+ }
+#endif
+ remove_vma(vma);
count++;
cond_resched();
vma = vma_next(&vmi);
diff --git a/mm/vma.c b/mm/vma.c
index 0436a7d21e01..1b46b92b2d4d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -406,23 +406,14 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg,
/*
* Close a vm structure and free it.
*/
-void remove_vma(struct vm_area_struct *vma, bool unreachable)
+void remove_vma(struct vm_area_struct *vma)
{
might_sleep();
vma_close(vma);
if (vma->vm_file)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
- if (unreachable) {
-#ifdef CONFIG_PER_VMA_LOCK
- if (!is_vma_detached(vma)) {
- vma_start_write(vma);
- vma_mark_detached(vma);
- }
-#endif
- __vm_area_free(vma);
- } else
- vm_area_free(vma);
+ vm_area_free(vma);
}
/*
@@ -1206,7 +1197,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
/* Remove and clean up vmas */
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- remove_vma(vma, /* unreachable = */ false);
+ remove_vma(vma);
vm_unacct_memory(vms->nr_accounted);
validate_mm(mm);
diff --git a/mm/vma.h b/mm/vma.h
index 24636a2b0acf..3e6c14a748c2 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -170,7 +170,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
-void remove_vma(struct vm_area_struct *vma, bool unreachable);
+void remove_vma(struct vm_area_struct *vma);
void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct vm_area_struct *next);
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index ac0a59906fea..3342cad87ece 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -700,14 +700,9 @@ static inline void mpol_put(struct mempolicy *)
{
}
-static inline void __vm_area_free(struct vm_area_struct *vma)
-{
- free(vma);
-}
-
static inline void vm_area_free(struct vm_area_struct *vma)
{
- __vm_area_free(vma);
+ free(vma);
}
static inline void lru_add_drain(void)
--
2.47.1.613.gc27f4b7a9f-goog
* [PATCH v6 16/16] docs/mm: document latest changes to vm_lock
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (14 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
@ 2024-12-16 19:24 ` Suren Baghdasaryan
2024-12-16 19:39 ` [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
16 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:24 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team, surenb
Change the documentation to reflect that vm_lock has been integrated into the
vma and replaced with vm_refcnt.
Document the newly introduced vma_start_read_locked{_nested} functions.
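For reference, the expected usage of the new helpers is roughly the following,
modelled on the userfaultfd callers earlier in the series (the wrapper name is
illustrative):

/* Take the per-vma read lock while mmap_lock is already held for reading. */
static struct vm_area_struct *lock_vma_with_mmap_lock(struct mm_struct *mm,
						      unsigned long addr)
{
	struct vm_area_struct *vma;

	mmap_read_lock(mm);
	vma = vma_lookup(mm, addr);
	if (vma && !vma_start_read_locked(vma))
		vma = NULL;	/* could not take the vma lock, caller falls back */
	mmap_read_unlock(mm);

	/* On success the vma stays locked; release it with vma_end_read(). */
	return vma;
}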
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
Documentation/mm/process_addrs.rst | 44 ++++++++++++++++++------------
1 file changed, 26 insertions(+), 18 deletions(-)
diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 81417fa2ed20..f573de936b5d 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -716,9 +716,14 @@ calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
-VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
-their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
-via :c:func:`!vma_end_read`.
+In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
+and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
+fail due to lock contention but the caller should still check their return values
+in case they fail for other reasons.
+
+VMA read locks increment :c:member:`!vma.vm_refcnt` reference counter for their
+duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
+:c:func:`!vma_end_read`.
VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
@@ -726,9 +731,9 @@ acquired. An mmap write lock **must** be held for the duration of the VMA write
lock, releasing or downgrading the mmap write lock also releases the VMA write
lock so there is no :c:func:`!vma_end_write` function.
-Note that a semaphore write lock is not held across a VMA lock. Rather, a
-sequence number is used for serialisation, and the write semaphore is only
-acquired at the point of write lock to update this.
+Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
+modified so that readers can detect the presence of a writer. The reference counter is
+restored once the vma sequence number used for serialisation is updated.
This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.
@@ -738,7 +743,7 @@ Implementation details
The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
-read/write semaphore and sequence numbers belonging to the containing
+reference counter and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.
Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
@@ -779,28 +784,31 @@ release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.
-Each time a VMA read lock is acquired, we acquire a read lock on the
-:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
-the sequence count of the VMA does not match that of the mm.
+Each time a VMA read lock is acquired, we increment :c:member:`!vma.vm_refcnt`
+reference counter and check that the sequence count of the VMA does not match
+that of the mm.
-If it does, the read lock fails. If it does not, we hold the lock, excluding
-writers, but permitting other readers, who will also obtain this lock under RCU.
+If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
+If it does not, we keep the reference counter raised, excluding writers, but
+permitting other readers, who can also obtain this lock under RCU.
Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.
-On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
-read/write semaphore, before setting the VMA's sequence number under this lock,
-also simultaneously holding the mmap write lock.
+On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
+modified by readers and wait for all readers to drop their reference count.
+Once there are no readers, VMA's sequence number is set to match that of the
+mm. During this entire operation mmap write lock is held.
This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.
-After setting the VMA's sequence number, the lock is released, avoiding
-complexity with a long-term held write lock.
+After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
+indicating a writer is cleared. From this point on, VMA's sequence number will
+indicate VMA's write-locked state until mmap write lock is dropped or downgraded.
-This clever combination of a read/write semaphore and sequence count allows for
+This clever combination of a reference counter and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.
--
2.47.1.613.gc27f4b7a9f-goog
* Re: [PATCH v6 00/16] move per-vma lock into vm_area_struct
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
` (15 preceding siblings ...)
2024-12-16 19:24 ` [PATCH v6 16/16] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
@ 2024-12-16 19:39 ` Suren Baghdasaryan
2024-12-17 18:42 ` Andrew Morton
16 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 19:39 UTC (permalink / raw)
To: akpm
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 11:24 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> This patchset:
> 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned to minimize
> cacheline sharing;
> 2. changes vm_area_struct initialization to mark new vma as detached until
> it is inserted into vma tree;
> 3. replaces vm_lock and vma->detached flag with a reference counter;
> 4. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> reuse and to minimize call_rcu() calls.
>
> Pagefault microbenchmarks show performance improvement:
> Hmean faults/cpu-1 507926.5547 ( 0.00%) 506519.3692 * -0.28%*
> Hmean faults/cpu-4 479119.7051 ( 0.00%) 481333.6802 * 0.46%*
> Hmean faults/cpu-7 452880.2961 ( 0.00%) 455845.6211 * 0.65%*
> Hmean faults/cpu-12 347639.1021 ( 0.00%) 352004.2254 * 1.26%*
> Hmean faults/cpu-21 200061.2238 ( 0.00%) 229597.0317 * 14.76%*
> Hmean faults/cpu-30 145251.2001 ( 0.00%) 164202.5067 * 13.05%*
> Hmean faults/cpu-48 106848.4434 ( 0.00%) 120641.5504 * 12.91%*
> Hmean faults/cpu-56 92472.3835 ( 0.00%) 103464.7916 * 11.89%*
> Hmean faults/sec-1 507566.1468 ( 0.00%) 506139.0811 * -0.28%*
> Hmean faults/sec-4 1880478.2402 ( 0.00%) 1886795.6329 * 0.34%*
> Hmean faults/sec-7 3106394.3438 ( 0.00%) 3140550.7485 * 1.10%*
> Hmean faults/sec-12 4061358.4795 ( 0.00%) 4112477.0206 * 1.26%*
> Hmean faults/sec-21 3988619.1169 ( 0.00%) 4577747.1436 * 14.77%*
> Hmean faults/sec-30 3909839.5449 ( 0.00%) 4311052.2787 * 10.26%*
> Hmean faults/sec-48 4761108.4691 ( 0.00%) 5283790.5026 * 10.98%*
> Hmean faults/sec-56 4885561.4590 ( 0.00%) 5415839.4045 * 10.85%*
>
> Changes since v5 [4]
> - Added Reviewed-by, per Vlastimil Babka;
> - Added replacement of vm_lock and vma->detached flag with vm_refcnt,
> per Peter Zijlstra and Matthew Wilcox
> - Marked vmas detached during exit_mmap;
> - Ensureed vmas are in detached state before they are freed;
> - Changed SLAB_TYPESAFE_BY_RCU patch to not require ctor, leading to a
> much simpler code;
> - Removed unnecessary patch [5]
> - Updated documentation to reflect changes to vm_lock;
>
> Patchset applies over mm-unstable after reverting v5 of this patchset [4]
> (currently 687e99a5faa5-905ab222508a)
^^^
Please be aware of this if trying to apply to a branch. mm-unstable
contains an older version of this patchset which needs to be reverted
before this one can be applied.
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> [4] https://lore.kernel.org/all/20241206225204.4008261-1-surenb@google.com/
> [5] https://lore.kernel.org/all/20241206225204.4008261-6-surenb@google.com/
>
> Suren Baghdasaryan (16):
> mm: introduce vma_start_read_locked{_nested} helpers
> mm: move per-vma lock into vm_area_struct
> mm: mark vma as detached until it's added into vma tree
> mm/nommu: fix the last places where vma is not locked before being
> attached
> types: move struct rcuwait into types.h
> mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> mm: move mmap_init_lock() out of the header file
> mm: uninline the main body of vma_start_write()
> refcount: introduce __refcount_{add|inc}_not_zero_limited
> mm: replace vm_lock and detached flag with a reference count
> mm: enforce vma to be in detached state before freeing
> mm: remove extra vma_numab_state_init() call
> mm: introduce vma_ensure_detached()
> mm: prepare lock_vma_under_rcu() for vma reuse possibility
> mm: make vma cache SLAB_TYPESAFE_BY_RCU
> docs/mm: document latest changes to vm_lock
>
> Documentation/mm/process_addrs.rst | 44 ++++----
> include/linux/mm.h | 162 +++++++++++++++++++++++------
> include/linux/mm_types.h | 37 ++++---
> include/linux/mmap_lock.h | 6 --
> include/linux/rcuwait.h | 13 +--
> include/linux/refcount.h | 20 +++-
> include/linux/slab.h | 6 --
> include/linux/types.h | 12 +++
> kernel/fork.c | 88 ++++------------
> mm/init-mm.c | 1 +
> mm/memory.c | 75 +++++++++++--
> mm/mmap.c | 8 +-
> mm/nommu.c | 2 +
> mm/userfaultfd.c | 31 +++---
> mm/vma.c | 15 ++-
> mm/vma.h | 4 +-
> tools/testing/vma/linux/atomic.h | 5 +
> tools/testing/vma/vma_internal.h | 96 +++++++++--------
> 18 files changed, 378 insertions(+), 247 deletions(-)
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-16 19:24 ` [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
@ 2024-12-16 20:42 ` Peter Zijlstra
2024-12-16 20:53 ` Suren Baghdasaryan
2024-12-16 21:15 ` Peter Zijlstra
2024-12-16 21:37 ` Peter Zijlstra
2 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-16 20:42 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 11:24:13AM -0800, Suren Baghdasaryan wrote:
> @@ -734,10 +761,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * after it has been unlocked.
> * This pairs with RELEASE semantics in vma_end_write_all().
> */
> + if (oldcnt & VMA_STATE_LOCKED ||
> + unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
You likely want that unlikely to cover both conditions :-)
> + vma_refcount_put(vma);
> return false;
> }
> +
> return true;
> }
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-16 20:42 ` Peter Zijlstra
@ 2024-12-16 20:53 ` Suren Baghdasaryan
0 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 20:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 12:42 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 16, 2024 at 11:24:13AM -0800, Suren Baghdasaryan wrote:
> > @@ -734,10 +761,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * after it has been unlocked.
> > * This pairs with RELEASE semantics in vma_end_write_all().
> > */
> > + if (oldcnt & VMA_STATE_LOCKED ||
> > + unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
>
> You likely want that unlikely to cover both conditions :-)
True. VMA_STATE_LOCKED is set only while the writer is updating the
vm_lock_seq and that's a narrow window. I'll make that change in the
next revision. Thanks!
>
> > + vma_refcount_put(vma);
> > return false;
> > }
> > +
> > return true;
> > }
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-16 19:24 ` [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
2024-12-16 20:42 ` Peter Zijlstra
@ 2024-12-16 21:15 ` Peter Zijlstra
2024-12-16 21:53 ` Suren Baghdasaryan
2024-12-16 21:37 ` Peter Zijlstra
2 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-16 21:15 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 11:24:13AM -0800, Suren Baghdasaryan wrote:
FWIW, I find the whole VMA_STATE_{AT,DE}TACHED thing awkward. And
perhaps s/VMA_STATE_LOCKED/VMA_LOCK_OFFSET/ ?
Also, perhaps:
#define VMA_REF_LIMIT (VMA_LOCK_OFFSET - 2)
> @@ -699,10 +700,27 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
> #ifdef CONFIG_PER_VMA_LOCK
> static inline void vma_lock_init(struct vm_area_struct *vma)
> {
> - init_rwsem(&vma->vm_lock.lock);
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + static struct lock_class_key lockdep_key;
> +
> + lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
> +#endif
> + refcount_set(&vma->vm_refcnt, VMA_STATE_DETACHED);
> vma->vm_lock_seq = UINT_MAX;
Depending on how you do the actual allocation (GFP_ZERO) you might want
to avoid that vm_refcnt store entirely.
Perhaps instead write: VM_WARN_ON(refcount_read(&vma->vm_refcnt));
> @@ -813,25 +849,42 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>
> static inline void vma_assert_locked(struct vm_area_struct *vma)
> {
> - if (!rwsem_is_locked(&vma->vm_lock.lock))
> + if (refcount_read(&vma->vm_refcnt) <= VMA_STATE_ATTACHED)
if (is_vma_detached(vma))
> vma_assert_write_locked(vma);
> }
>
> -static inline void vma_mark_attached(struct vm_area_struct *vma)
> +/*
> + * WARNING: to avoid racing with vma_mark_attached(), should be called either
> + * under mmap_write_lock or when the object has been isolated under
> + * mmap_write_lock, ensuring no competing writers.
> + */
> +static inline bool is_vma_detached(struct vm_area_struct *vma)
> {
> - vma->detached = false;
> + return refcount_read(&vma->vm_refcnt) == VMA_STATE_DETACHED;
return !refcount_read(&vma->vm_refcnt);
> }
>
> -static inline void vma_mark_detached(struct vm_area_struct *vma)
> +static inline void vma_mark_attached(struct vm_area_struct *vma)
> {
> - /* When detaching vma should be write-locked */
> vma_assert_write_locked(vma);
> - vma->detached = true;
> +
> + if (is_vma_detached(vma))
> + refcount_set(&vma->vm_refcnt, VMA_STATE_ATTACHED);
Urgh, so it would be really good to not call this at all when it's not 0.
I've not tried to untangle the mess, but this is really awkward. Surely
you don't add it to the mas multiple times either.
Also:
refcount_set(&vma->vm_refcnt, 1);
is so much clearer.
That is, should this not live in vma_iter_store*(), right before
mas_store_gfp() ?
Also, ISTR having to set vm_lock_seq right before it?
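Presumably something along these lines, a hypothetical sketch of the suggested
placement (vm_lock_seq having been set by vma_start_write() beforehand):

static inline void vma_iter_store(struct vma_iterator *vmi,
				  struct vm_area_struct *vma)
{
	vma_assert_write_locked(vma);
	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
	refcount_set(&vma->vm_refcnt, 1);	/* attach right before the store */
	mas_store_prealloc(&vmi->mas, vma);
}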
> }
>
> -static inline bool is_vma_detached(struct vm_area_struct *vma)
> +static inline void vma_mark_detached(struct vm_area_struct *vma)
> {
> - return vma->detached;
> + vma_assert_write_locked(vma);
> +
> + if (is_vma_detached(vma))
> + return;
Again, this just reads like confusion :/ Surely you don't have the same
with mas_detach?
> +
> + /* We are the only writer, so no need to use vma_refcount_put(). */
> + if (!refcount_dec_and_test(&vma->vm_refcnt)) {
> + /*
> + * Reader must have temporarily raised vm_refcnt but it will
> + * drop it without using the vma since vma is write-locked.
> + */
> + }
> }
* Re: [PATCH v6 11/16] mm: enforce vma to be in detached state before freeing
2024-12-16 19:24 ` [PATCH v6 11/16] mm: enforce vma to be in detached state before freeing Suren Baghdasaryan
@ 2024-12-16 21:16 ` Peter Zijlstra
2024-12-16 21:18 ` Peter Zijlstra
0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-16 21:16 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 11:24:14AM -0800, Suren Baghdasaryan wrote:
> exit_mmap() frees vmas without detaching them. This will become a problem
> when we introduce vma reuse. Ensure that vmas are always detached before
> being freed.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> kernel/fork.c | 4 ++++
> mm/vma.c | 10 ++++++++--
> 2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 283909d082cb..f1ddfc7b3b48 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -473,6 +473,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>
> void __vm_area_free(struct vm_area_struct *vma)
> {
> +#ifdef CONFIG_PER_VMA_LOCK
> + /* The vma should be detached while being destroyed. */
> + VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
> +#endif
> vma_numab_state_free(vma);
> free_anon_vma_name(vma);
> kmem_cache_free(vm_area_cachep, vma);
> diff --git a/mm/vma.c b/mm/vma.c
> index fbd7254517d6..0436a7d21e01 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -413,9 +413,15 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> if (vma->vm_file)
> fput(vma->vm_file);
> mpol_put(vma_policy(vma));
> - if (unreachable)
> + if (unreachable) {
> +#ifdef CONFIG_PER_VMA_LOCK
> + if (!is_vma_detached(vma)) {
> + vma_start_write(vma);
> + vma_mark_detached(vma);
> + }
> +#endif
> __vm_area_free(vma);
Again, can't you race with lockless RCU lookups?
* Re: [PATCH v6 11/16] mm: enforce vma to be in detached state before freeing
2024-12-16 21:16 ` Peter Zijlstra
@ 2024-12-16 21:18 ` Peter Zijlstra
2024-12-16 21:57 ` Suren Baghdasaryan
0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-16 21:18 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 10:16:35PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 16, 2024 at 11:24:14AM -0800, Suren Baghdasaryan wrote:
> > exit_mmap() frees vmas without detaching them. This will become a problem
> > when we introduce vma reuse. Ensure that vmas are always detached before
> > being freed.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > kernel/fork.c | 4 ++++
> > mm/vma.c | 10 ++++++++--
> > 2 files changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 283909d082cb..f1ddfc7b3b48 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -473,6 +473,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >
> > void __vm_area_free(struct vm_area_struct *vma)
> > {
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + /* The vma should be detached while being destroyed. */
> > + VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
> > +#endif
> > vma_numab_state_free(vma);
> > free_anon_vma_name(vma);
> > kmem_cache_free(vm_area_cachep, vma);
> > diff --git a/mm/vma.c b/mm/vma.c
> > index fbd7254517d6..0436a7d21e01 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -413,9 +413,15 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > if (vma->vm_file)
> > fput(vma->vm_file);
> > mpol_put(vma_policy(vma));
> > - if (unreachable)
> > + if (unreachable) {
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + if (!is_vma_detached(vma)) {
> > + vma_start_write(vma);
> > + vma_mark_detached(vma);
> > + }
> > +#endif
> > __vm_area_free(vma);
>
> Again, can't you race with lockless RCU lookups?
Ah, no, removing the vma requires holding mmap_lock for writing and having
the vma locked, which would ensure preceding RCU readers are complete
(per the LOCK_OFFSET waiter thing) and new RCU readers are rejected by
the vma sequence check.
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-16 19:24 ` [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
2024-12-16 20:42 ` Peter Zijlstra
2024-12-16 21:15 ` Peter Zijlstra
@ 2024-12-16 21:37 ` Peter Zijlstra
2024-12-16 21:44 ` Suren Baghdasaryan
2 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-16 21:37 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 11:24:13AM -0800, Suren Baghdasaryan wrote:
> +static inline void vma_refcount_put(struct vm_area_struct *vma)
> +{
> + int refcnt;
> +
> + if (!__refcount_dec_and_test(&vma->vm_refcnt, &refcnt)) {
> + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> +
> + if (refcnt & VMA_STATE_LOCKED)
> + rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> + }
> +}
> +
> /*
> * Try to read-lock a vma. The function is allowed to occasionally yield false
> * locked result to avoid performance overhead, in which case we fall back to
> @@ -710,6 +728,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> */
> static inline bool vma_start_read(struct vm_area_struct *vma)
> {
> + int oldcnt;
> +
> /*
> * Check before locking. A race might cause false locked result.
> * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> @@ -720,13 +740,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> return false;
>
> +
> + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> + /* Limit at VMA_STATE_LOCKED - 2 to leave one count for a writer */
> + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> + VMA_STATE_LOCKED - 2))) {
> + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> return false;
> + }
> + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
>
> /*
> + * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> * False unlocked result is impossible because we modify and check
> + * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> * modification invalidates all existing locks.
> *
> * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> @@ -734,10 +761,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * after it has been unlocked.
> * This pairs with RELEASE semantics in vma_end_write_all().
> */
> + if (oldcnt & VMA_STATE_LOCKED ||
> + unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> + vma_refcount_put(vma);
Suppose we have detach race with a concurrent RCU lookup like:
vma = mas_lookup();
vma_start_write();
mas_detach();
vma_start_read()
rwsem_acquire_read()
inc // success
vma_mark_detach();
dec_and_test // assumes 1->0
// is actually 2->1
if (vm_lock_seq == vma->vm_mm_mm_lock_seq) // true
vma_refcount_put
dec_and_test() // 1->0
*NO* rwsem_release()
> return false;
> }
> +
> return true;
> }
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-16 21:37 ` Peter Zijlstra
@ 2024-12-16 21:44 ` Suren Baghdasaryan
2024-12-17 10:30 ` Peter Zijlstra
0 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 21:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 1:38 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 16, 2024 at 11:24:13AM -0800, Suren Baghdasaryan wrote:
> > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > +{
> > + int refcnt;
> > +
> > + if (!__refcount_dec_and_test(&vma->vm_refcnt, &refcnt)) {
> > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > +
> > + if (refcnt & VMA_STATE_LOCKED)
> > + rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> > + }
> > +}
> > +
> > /*
> > * Try to read-lock a vma. The function is allowed to occasionally yield false
> > * locked result to avoid performance overhead, in which case we fall back to
> > @@ -710,6 +728,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > */
> > static inline bool vma_start_read(struct vm_area_struct *vma)
> > {
> > + int oldcnt;
> > +
> > /*
> > * Check before locking. A race might cause false locked result.
> > * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > @@ -720,13 +740,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > return false;
> >
> > +
> > + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > + /* Limit at VMA_STATE_LOCKED - 2 to leave one count for a writer */
> > + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > + VMA_STATE_LOCKED - 2))) {
> > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > return false;
> > + }
> > + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> >
> > /*
> > + * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > * False unlocked result is impossible because we modify and check
> > + * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> > * modification invalidates all existing locks.
> > *
> > * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > @@ -734,10 +761,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * after it has been unlocked.
> > * This pairs with RELEASE semantics in vma_end_write_all().
> > */
> > + if (oldcnt & VMA_STATE_LOCKED ||
> > + unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > + vma_refcount_put(vma);
>
> Suppose we have detach race with a concurrent RCU lookup like:
>
> vma = mas_lookup();
>
> vma_start_write();
> mas_detach();
> vma_start_read()
> rwsem_acquire_read()
> inc // success
> vma_mark_detach();
> dec_and_test // assumes 1->0
> // is actually 2->1
>
> if (vm_lock_seq == vma->vm_mm_mm_lock_seq) // true
> vma_refcount_put
> dec_and_test() // 1->0
> *NO* rwsem_release()
>
Yes, this is possible. I think that's not a problem until we start
reusing the vmas and I deal with this race later in this patchset.
I think what you described here is the same race I mention in the
description of this patch:
https://lore.kernel.org/all/20241216192419.2970941-14-surenb@google.com/
I introduce vma_ensure_detached() in that patch to handle this case
and ensure that vmas are detached before they are returned into the
slab cache for reuse. Does that make sense?
>
>
> > return false;
> > }
> > +
> > return true;
> > }
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-16 21:15 ` Peter Zijlstra
@ 2024-12-16 21:53 ` Suren Baghdasaryan
2024-12-16 22:00 ` Peter Zijlstra
0 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 21:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 1:15 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 16, 2024 at 11:24:13AM -0800, Suren Baghdasaryan wrote:
>
> FWIW, I find the whole VMA_STATE_{AT,DE}TACHED thing awkward. And
I'm bad with naming things, so any better suggestions are welcome.
Are you suggesting to drop the VMA_STATE_{AT,DE}TACHED nomenclature and
use 0/1 values directly?
> perhaps s/VMA_STATE_LOCKED/VMA_LOCK_OFFSET/ ?
Sounds good. I'll change it to VMA_LOCK_OFFSET.
>
> Also, perhaps:
>
> #define VMA_REF_LIMIT (VMA_LOCK_OFFSET - 2)
Ack.
>
> > @@ -699,10 +700,27 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
> > #ifdef CONFIG_PER_VMA_LOCK
> > static inline void vma_lock_init(struct vm_area_struct *vma)
> > {
> > - init_rwsem(&vma->vm_lock.lock);
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + static struct lock_class_key lockdep_key;
> > +
> > + lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
> > +#endif
> > + refcount_set(&vma->vm_refcnt, VMA_STATE_DETACHED);
> > vma->vm_lock_seq = UINT_MAX;
>
> Depending on how you do the actual allocation (GFP_ZERO) you might want
> to avoid that vm_refcount store entirely.
I think we could initialize it to 0 in the slab cache constructor, and
when a vma is freed we already ensure it's 0. So even when reused it
will be in the correct 0 state.
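Something like this, perhaps (rough sketch of the ctor idea, not the actual
patch; the ctor name and the cache flags shown are illustrative):

static void vm_area_ctor(void *data)
{
        struct vm_area_struct *vma = data;

        /*
         * Runs once when the slab page is created; the freeing path already
         * guarantees vm_refcnt is back to 0, so reused objects stay correct.
         */
        refcount_set(&vma->vm_refcnt, 0);
}

        /* at cache creation time: */
        vm_area_cachep = kmem_cache_create("vm_area_struct",
                                           sizeof(struct vm_area_struct), 0,
                                           SLAB_PANIC | SLAB_ACCOUNT,
                                           vm_area_ctor);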
>
> Perhaps instead write: VM_WARN_ON(refcount_read(&vma->vm_refcnt));
Yes, with the above approach that should work.
>
> > @@ -813,25 +849,42 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >
> > static inline void vma_assert_locked(struct vm_area_struct *vma)
> > {
> > - if (!rwsem_is_locked(&vma->vm_lock.lock))
> > + if (refcount_read(&vma->vm_refcnt) <= VMA_STATE_ATTACHED)
> if (is_vma_detached(vma))
>
> > vma_assert_write_locked(vma);
> > }
> >
> > -static inline void vma_mark_attached(struct vm_area_struct *vma)
> > +/*
> > + * WARNING: to avoid racing with vma_mark_attached(), should be called either
> > + * under mmap_write_lock or when the object has been isolated under
> > + * mmap_write_lock, ensuring no competing writers.
> > + */
> > +static inline bool is_vma_detached(struct vm_area_struct *vma)
> > {
> > - vma->detached = false;
> > + return refcount_read(&vma->vm_refcnt) == VMA_STATE_DETACHED;
> return !refcount_read(&vma->vm_refcnt);
> > }
> >
> > -static inline void vma_mark_detached(struct vm_area_struct *vma)
> > +static inline void vma_mark_attached(struct vm_area_struct *vma)
> > {
> > - /* When detaching vma should be write-locked */
> > vma_assert_write_locked(vma);
> > - vma->detached = true;
> > +
> > + if (is_vma_detached(vma))
> > + refcount_set(&vma->vm_refcnt, VMA_STATE_ATTACHED);
>
> Urgh, so it would be really good to not call this at all when it's not 0.
> I've not tried to untangle the mess, but this is really awkward. Surely
> you don't add it to the mas multiple times either.
The issue is that when we merge/split/shrink/grow vmas, we skip
marking them detached while modifying them. Therefore, from
vma_mark_attached() POV it will look like we are attaching an already
attached vma. I can try to clean that up if this is really a concern.
>
> Also:
>
> refcount_set(&vma->vm_refcnt, 1);
>
> is so much clearer.
Ok, IIUC you are in favour of dropping VMA_STATE_ATTACHED/VMA_STATE_DETACHED.
>
> That is, should this not live in vma_iter_store*(), right before
> mas_store_gfp() ?
Currently it's done right *after* mas_store_gfp() but I was debating
with myself if it indeed should be *before* insertion into the tree...
>
> Also, ISTR having to set vm_lock_seq right before it?
Yes, vma_mark_attached() requires the vma to be write-locked beforehand,
hence the above vma_assert_write_locked(). But oftentimes the locking does
not happen right before vma_mark_attached(), because some other
modification functions also require the vma to be write-locked.
>
> > }
> >
> > -static inline bool is_vma_detached(struct vm_area_struct *vma)
> > +static inline void vma_mark_detached(struct vm_area_struct *vma)
> > {
> > - return vma->detached;
> > + vma_assert_write_locked(vma);
> > +
> > + if (is_vma_detached(vma))
> > + return;
>
> Again, this just reads like confusion :/ Surely you don't have the same
> with mas_detach?
I'll double-check if we ever double-mark vma as detached.
Thanks for the review!
>
> > +
> > + /* We are the only writer, so no need to use vma_refcount_put(). */
> > + if (!refcount_dec_and_test(&vma->vm_refcnt)) {
> > + /*
> > + * Reader must have temporarily raised vm_refcnt but it will
> > + * drop it without using the vma since vma is write-locked.
> > + */
> > + }
> > }
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 11/16] mm: enforce vma to be in detached state before freeing
2024-12-16 21:18 ` Peter Zijlstra
@ 2024-12-16 21:57 ` Suren Baghdasaryan
0 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-16 21:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 1:18 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 16, 2024 at 10:16:35PM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 16, 2024 at 11:24:14AM -0800, Suren Baghdasaryan wrote:
> > > exit_mmap() frees vmas without detaching them. This will become a problem
> > > when we introduce vma reuse. Ensure that vmas are always detached before
> > > being freed.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > > kernel/fork.c | 4 ++++
> > > mm/vma.c | 10 ++++++++--
> > > 2 files changed, 12 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 283909d082cb..f1ddfc7b3b48 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -473,6 +473,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > >
> > > void __vm_area_free(struct vm_area_struct *vma)
> > > {
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > + /* The vma should be detached while being destroyed. */
> > > + VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
> > > +#endif
> > > vma_numab_state_free(vma);
> > > free_anon_vma_name(vma);
> > > kmem_cache_free(vm_area_cachep, vma);
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index fbd7254517d6..0436a7d21e01 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -413,9 +413,15 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > > if (vma->vm_file)
> > > fput(vma->vm_file);
> > > mpol_put(vma_policy(vma));
> > > - if (unreachable)
> > > + if (unreachable) {
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > + if (!is_vma_detached(vma)) {
> > > + vma_start_write(vma);
> > > + vma_mark_detached(vma);
> > > + }
> > > +#endif
> > > __vm_area_free(vma);
> >
> > Again, can't you race with lockless RCU lookups?
>
> Ah, no, removing vma requires holding mmap_lock for writing and having
> the vma locked, which would ensure preceding RCU readers are complete
> (per the LOCK_OFFSET waiter thing) and new RCU readers are rejected for
> the vma sequence thing.
Correct. Once vma is detached it can't be found by new readers.
Possible existing readers are purged later in this patchset by calling
vma_ensure_detached() from vm_area_free(). I don't do that in this
patch because those existing temporary readers do not pose issues up
until we start reusing the vmas.
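In code terms, the later patch would presumably end up doing something along
these lines (sketch; vma_ensure_detached() is the helper named above, the
rest of the freeing path is elided):

void vm_area_free(struct vm_area_struct *vma)
{
        /* Wait out any reader that raced with the detach before freeing. */
        vma_ensure_detached(vma);
        __vm_area_free(vma);
}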
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-16 21:53 ` Suren Baghdasaryan
@ 2024-12-16 22:00 ` Peter Zijlstra
0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-16 22:00 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 01:53:06PM -0800, Suren Baghdasaryan wrote:
> > That is, should this not live in vma_iter_store*(), right before
> > mas_store_gfp() ?
>
> Currently it's done right *after* mas_store_gfp() but I was debating
> with myself if it indeed should be *before* insertion into the tree...
The moment it goes into the tree it becomes visible to RCU lookups, it's
a bit weird to have them with !refcnt at that point, but I don't suppose
it harms.
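For illustration, folding the attach into the store helper as suggested
could look roughly like this (sketch; the helper name is invented, and
whether mas_store_prealloc() or mas_store_gfp() is used depends on the
call site):

static void vma_iter_store_new(struct vma_iterator *vmi,
                               struct vm_area_struct *vma)
{
        /* vma_start_write() must already have set vm_lock_seq. */
        vma_assert_write_locked(vma);
        /* Take the "attached" reference before the vma becomes reachable. */
        vma_mark_attached(vma);
        /* Publish it; RCU readers then never see it with a zero refcount. */
        mas_store_prealloc(&vmi->mas, vma);
}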
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 13/16] mm: introduce vma_ensure_detached()
2024-12-16 19:24 ` [PATCH v6 13/16] mm: introduce vma_ensure_detached() Suren Baghdasaryan
@ 2024-12-17 10:26 ` Peter Zijlstra
2024-12-17 15:58 ` Suren Baghdasaryan
0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-17 10:26 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 11:24:16AM -0800, Suren Baghdasaryan wrote:
> vma_start_read() can temporarily raise vm_refcnt of a write-locked and
> detached vma:
>
> // vm_refcnt==1 (attached)
> vma_start_write()
> vma->vm_lock_seq = mm->mm_lock_seq
>
> vma_start_read()
> vm_refcnt++; // vm_refcnt==2
>
> vma_mark_detached()
> vm_refcnt--; // vm_refcnt==1
>
> // vma is detached but vm_refcnt!=0 temporarily
>
> if (vma->vm_lock_seq == mm->mm_lock_seq)
> vma_refcount_put()
> vm_refcnt--; // vm_refcnt==0
>
> This is currently not a problem when freeing the vma because RCU grace
> period should pass before kmem_cache_free(vma) gets called and by that
> time vma_start_read() should be done and vm_refcnt is 0. However once
> we introduce possibility of vma reuse before RCU grace period is over,
> this will become a problem (reused vma might be in non-detached state).
> Introduce vma_ensure_detached() for the writer to wait for readers until
> they exit vma_start_read().
So aside from the lockdep problem (which I think is fixable), the normal
way to fix the above is to make dec_and_test() do the kmem_cache_free().
Then the last user does the free and everything just works.
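In other words, something along these lines (rough sketch of the suggestion,
not the posted code; the lockdep annotation is left out here, see the other
sub-thread, and the real thing would still route the numab / anon_vma_name
cleanup through __vm_area_free()):

static inline void vma_refcount_put(struct vm_area_struct *vma)
{
        int oldcnt;

        if (__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
                /* Last user frees; reuse is safe under SLAB_TYPESAFE_BY_RCU. */
                kmem_cache_free(vm_area_cachep, vma);
                return;
        }

        if (oldcnt & VMA_STATE_LOCKED)
                rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
}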
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-16 21:44 ` Suren Baghdasaryan
@ 2024-12-17 10:30 ` Peter Zijlstra
2024-12-17 16:27 ` Suren Baghdasaryan
0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-17 10:30 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 01:44:45PM -0800, Suren Baghdasaryan wrote:
> On Mon, Dec 16, 2024 at 1:38 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Dec 16, 2024 at 11:24:13AM -0800, Suren Baghdasaryan wrote:
> > > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > +{
> > > + int refcnt;
> > > +
> > > + if (!__refcount_dec_and_test(&vma->vm_refcnt, &refcnt)) {
> > > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > +
> > > + if (refcnt & VMA_STATE_LOCKED)
> > > + rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> > > + }
> > > +}
> > > +
> > > /*
> > > * Try to read-lock a vma. The function is allowed to occasionally yield false
> > > * locked result to avoid performance overhead, in which case we fall back to
> > > @@ -710,6 +728,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > > */
> > > static inline bool vma_start_read(struct vm_area_struct *vma)
> > > {
> > > + int oldcnt;
> > > +
> > > /*
> > > * Check before locking. A race might cause false locked result.
> > > * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > > @@ -720,13 +740,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > > return false;
> > >
> > > +
> > > + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > > + /* Limit at VMA_STATE_LOCKED - 2 to leave one count for a writer */
> > > + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > > + VMA_STATE_LOCKED - 2))) {
> > > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > return false;
> > > + }
> > > + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > >
> > > /*
> > > + * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > > * False unlocked result is impossible because we modify and check
> > > + * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> > > * modification invalidates all existing locks.
> > > *
> > > * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > > @@ -734,10 +761,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > * after it has been unlocked.
> > > * This pairs with RELEASE semantics in vma_end_write_all().
> > > */
> > > + if (oldcnt & VMA_STATE_LOCKED ||
> > > + unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > > + vma_refcount_put(vma);
> >
> > Suppose we have detach race with a concurrent RCU lookup like:
> >
> > vma = mas_lookup();
> >
> > vma_start_write();
> > mas_detach();
> > vma_start_read()
> > rwsem_acquire_read()
> > inc // success
> > vma_mark_detach();
> > dec_and_test // assumes 1->0
> > // is actually 2->1
> >
> > if (vm_lock_seq == vma->vm_mm_mm_lock_seq) // true
> > vma_refcount_put
> > dec_and_test() // 1->0
> > *NO* rwsem_release()
> >
>
> Yes, this is possible. I think that's not a problem until we start
> reusing the vmas and I deal with this race later in this patchset.
> I think what you described here is the same race I mention in the
> description of this patch:
> https://lore.kernel.org/all/20241216192419.2970941-14-surenb@google.com/
> I introduce vma_ensure_detached() in that patch to handle this case
> and ensure that vmas are detached before they are returned into the
> slab cache for reuse. Does that make sense?
So I just replied there, and no, I don't think it makes sense. Just put
the kmem_cache_free() in vma_refcount_put(), to be done on 0.
Anyway, my point was more about the weird entanglement of lockdep and
the refcount. Just pull the lockdep annotation out of _put() and put it
explicitly in the vma_start_read() error paths and vma_end_read().
Additionally, having vma_end_write() would allow you to put a lockdep
annotation in vma_{start,end}_write() -- which was I think the original
reason I proposed it a while back, that and having improved clarity when
reading the code, since explicitly marking the end of a section is
helpful.
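I.e., roughly (sketch):

static inline void vma_end_read(struct vm_area_struct *vma)
{
        rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
        vma_refcount_put(vma);
}

with the vma_start_read() failure paths doing the same rwsem_release()
right before their vma_refcount_put().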
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 06/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
2024-12-16 19:24 ` [PATCH v6 06/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
@ 2024-12-17 11:31 ` Lokesh Gidra
2024-12-17 15:51 ` Suren Baghdasaryan
0 siblings, 1 reply; 74+ messages in thread
From: Lokesh Gidra @ 2024-12-17 11:31 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, Dec 16, 2024 at 11:24 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> With upcoming replacement of vm_lock with vm_refcnt, we need to handle a
> possibility of vma_start_read_locked/vma_start_read_locked_nested failing
> due to refcount overflow. Prepare for such possibility by changing these
> APIs and adjusting their users.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> ---
> include/linux/mm.h | 6 ++++--
> mm/userfaultfd.c | 17 ++++++++++++-----
> 2 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 689f5a1e2181..0ecd321c50b7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -747,10 +747,11 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> * not be used in such cases because it might fail due to mm_lock_seq overflow.
> * This functionality is used to obtain vma read lock and drop the mmap read lock.
> */
> -static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> +static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> {
> mmap_assert_locked(vma->vm_mm);
> down_read_nested(&vma->vm_lock.lock, subclass);
> + return true;
> }
>
> /*
> @@ -759,10 +760,11 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
> * not be used in such cases because it might fail due to mm_lock_seq overflow.
> * This functionality is used to obtain vma read lock and drop the mmap read lock.
> */
> -static inline void vma_start_read_locked(struct vm_area_struct *vma)
> +static inline bool vma_start_read_locked(struct vm_area_struct *vma)
> {
> mmap_assert_locked(vma->vm_mm);
> down_read(&vma->vm_lock.lock);
> + return true;
> }
>
> static inline void vma_end_read(struct vm_area_struct *vma)
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc9a66ec6a6e..79e8ae676f75 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -85,7 +85,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
> mmap_read_lock(mm);
> vma = find_vma_and_prepare_anon(mm, address);
> if (!IS_ERR(vma))
> - vma_start_read_locked(vma);
> + if (!vma_start_read_locked(vma))
> + vma = ERR_PTR(-EAGAIN);
>
> mmap_read_unlock(mm);
> return vma;
> @@ -1483,10 +1484,16 @@ static int uffd_move_lock(struct mm_struct *mm,
> mmap_read_lock(mm);
> err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
> if (!err) {
> - vma_start_read_locked(*dst_vmap);
> - if (*dst_vmap != *src_vmap)
> - vma_start_read_locked_nested(*src_vmap,
> - SINGLE_DEPTH_NESTING);
> + if (!vma_start_read_locked(*dst_vmap)) {
I think you mistakenly reversed the condition. This block should be
executed if we manage to lock dst_vma successfully.
> + if (*dst_vmap != *src_vmap) {
> + if (!vma_start_read_locked_nested(*src_vmap,
> + SINGLE_DEPTH_NESTING)) {
> + vma_end_read(*dst_vmap);
> + err = -EAGAIN;
> + }
> + }
> + } else
> + err = -EAGAIN;
> }
> mmap_read_unlock(mm);
> return err;
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
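For reference, the presumably intended sequence, with the outer condition
un-reversed, would read roughly like this (sketch based on the hunk above):

        if (vma_start_read_locked(*dst_vmap)) {
                if (*dst_vmap != *src_vmap) {
                        if (!vma_start_read_locked_nested(*src_vmap,
                                        SINGLE_DEPTH_NESTING)) {
                                vma_end_read(*dst_vmap);
                                err = -EAGAIN;
                        }
                }
        } else
                err = -EAGAIN;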
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 06/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
2024-12-17 11:31 ` Lokesh Gidra
@ 2024-12-17 15:51 ` Suren Baghdasaryan
0 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-17 15:51 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, peterz, willy, liam.howlett, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Dec 17, 2024 at 3:31 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Mon, Dec 16, 2024 at 11:24 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > With upcoming replacement of vm_lock with vm_refcnt, we need to handle a
> > possibility of vma_start_read_locked/vma_start_read_locked_nested failing
> > due to refcount overflow. Prepare for such possibility by changing these
> > APIs and adjusting their users.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > ---
> > include/linux/mm.h | 6 ++++--
> > mm/userfaultfd.c | 17 ++++++++++++-----
> > 2 files changed, 16 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 689f5a1e2181..0ecd321c50b7 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -747,10 +747,11 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > * not be used in such cases because it might fail due to mm_lock_seq overflow.
> > * This functionality is used to obtain vma read lock and drop the mmap read lock.
> > */
> > -static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> > +static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> > {
> > mmap_assert_locked(vma->vm_mm);
> > down_read_nested(&vma->vm_lock.lock, subclass);
> > + return true;
> > }
> >
> > /*
> > @@ -759,10 +760,11 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
> > * not be used in such cases because it might fail due to mm_lock_seq overflow.
> > * This functionality is used to obtain vma read lock and drop the mmap read lock.
> > */
> > -static inline void vma_start_read_locked(struct vm_area_struct *vma)
> > +static inline bool vma_start_read_locked(struct vm_area_struct *vma)
> > {
> > mmap_assert_locked(vma->vm_mm);
> > down_read(&vma->vm_lock.lock);
> > + return true;
> > }
> >
> > static inline void vma_end_read(struct vm_area_struct *vma)
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc9a66ec6a6e..79e8ae676f75 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -85,7 +85,8 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
> > mmap_read_lock(mm);
> > vma = find_vma_and_prepare_anon(mm, address);
> > if (!IS_ERR(vma))
> > - vma_start_read_locked(vma);
> > + if (!vma_start_read_locked(vma))
> > + vma = ERR_PTR(-EAGAIN);
> >
> > mmap_read_unlock(mm);
> > return vma;
> > @@ -1483,10 +1484,16 @@ static int uffd_move_lock(struct mm_struct *mm,
> > mmap_read_lock(mm);
> > err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
> > if (!err) {
> > - vma_start_read_locked(*dst_vmap);
> > - if (*dst_vmap != *src_vmap)
> > - vma_start_read_locked_nested(*src_vmap,
> > - SINGLE_DEPTH_NESTING);
> > + if (!vma_start_read_locked(*dst_vmap)) {
>
> I think you mistakenly reversed the condition. This block should be
> executed if we manage to lock dst_vma successfully.
Oops. Sorry, you are right. The above condition should have been
reversed. I'll fix it in the next revision.
Thanks!
> > + if (*dst_vmap != *src_vmap) {
> > + if (!vma_start_read_locked_nested(*src_vmap,
> > + SINGLE_DEPTH_NESTING)) {
> > + vma_end_read(*dst_vmap);
> > + err = -EAGAIN;
> > + }
> > + }
> > + } else
> > + err = -EAGAIN;
> > }
> > mmap_read_unlock(mm);
> > return err;
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 13/16] mm: introduce vma_ensure_detached()
2024-12-17 10:26 ` Peter Zijlstra
@ 2024-12-17 15:58 ` Suren Baghdasaryan
0 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-17 15:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Dec 17, 2024 at 2:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 16, 2024 at 11:24:16AM -0800, Suren Baghdasaryan wrote:
> > vma_start_read() can temporarily raise vm_refcnt of a write-locked and
> > detached vma:
> >
> > // vm_refcnt==1 (attached)
> > vma_start_write()
> > vma->vm_lock_seq = mm->mm_lock_seq
> >
> > vma_start_read()
> > vm_refcnt++; // vm_refcnt==2
> >
> > vma_mark_detached()
> > vm_refcnt--; // vm_refcnt==1
> >
> > // vma is detached but vm_refcnt!=0 temporarily
> >
> > if (vma->vm_lock_seq == mm->mm_lock_seq)
> > vma_refcount_put()
> > vm_refcnt--; // vm_refcnt==0
> >
> > This is currently not a problem when freeing the vma because RCU grace
> > period should pass before kmem_cache_free(vma) gets called and by that
> > time vma_start_read() should be done and vm_refcnt is 0. However once
> > we introduce possibility of vma reuse before RCU grace period is over,
> > this will become a problem (reused vma might be in non-detached state).
> > Introduce vma_ensure_detached() for the writer to wait for readers until
> > they exit vma_start_read().
>
> So aside from the lockdep problem (which I think is fixable), the normal
> way to fix the above is to make dec_and_test() do the kmem_cache_free().
>
> Then the last user does the free and everything just works.
I see your point. Let me reply in the other patch where you have more
comments about this.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-17 10:30 ` Peter Zijlstra
@ 2024-12-17 16:27 ` Suren Baghdasaryan
2024-12-18 9:41 ` Peter Zijlstra
0 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-17 16:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Dec 17, 2024 at 2:30 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 16, 2024 at 01:44:45PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Dec 16, 2024 at 1:38 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Mon, Dec 16, 2024 at 11:24:13AM -0800, Suren Baghdasaryan wrote:
> > > > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > > +{
> > > > + int refcnt;
> > > > +
> > > > + if (!__refcount_dec_and_test(&vma->vm_refcnt, &refcnt)) {
> > > > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > > +
> > > > + if (refcnt & VMA_STATE_LOCKED)
> > > > + rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> > > > + }
> > > > +}
> > > > +
> > > > /*
> > > > * Try to read-lock a vma. The function is allowed to occasionally yield false
> > > > * locked result to avoid performance overhead, in which case we fall back to
> > > > @@ -710,6 +728,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > > > */
> > > > static inline bool vma_start_read(struct vm_area_struct *vma)
> > > > {
> > > > + int oldcnt;
> > > > +
> > > > /*
> > > > * Check before locking. A race might cause false locked result.
> > > > * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > > > @@ -720,13 +740,20 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > > if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > > > return false;
> > > >
> > > > +
> > > > + rwsem_acquire_read(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > > > + /* Limit at VMA_STATE_LOCKED - 2 to leave one count for a writer */
> > > > + if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > > > + VMA_STATE_LOCKED - 2))) {
> > > > + rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > > return false;
> > > > + }
> > > > + lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > > >
> > > > /*
> > > > + * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > > > * False unlocked result is impossible because we modify and check
> > > > + * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> > > > * modification invalidates all existing locks.
> > > > *
> > > > * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > > > @@ -734,10 +761,12 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > > > * after it has been unlocked.
> > > > * This pairs with RELEASE semantics in vma_end_write_all().
> > > > */
> > > > + if (oldcnt & VMA_STATE_LOCKED ||
> > > > + unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > > > + vma_refcount_put(vma);
> > >
> > > Suppose we have detach race with a concurrent RCU lookup like:
> > >
> > > vma = mas_lookup();
> > >
> > > vma_start_write();
> > > mas_detach();
> > > vma_start_read()
> > > rwsem_acquire_read()
> > > inc // success
> > > vma_mark_detach();
> > > dec_and_test // assumes 1->0
> > > // is actually 2->1
> > >
> > > if (vm_lock_seq == vma->vm_mm_mm_lock_seq) // true
> > > vma_refcount_put
> > > dec_and_test() // 1->0
> > > *NO* rwsem_release()
> > >
> >
> > Yes, this is possible. I think that's not a problem until we start
> > reusing the vmas and I deal with this race later in this patchset.
> > I think what you described here is the same race I mention in the
> > description of this patch:
> > https://lore.kernel.org/all/20241216192419.2970941-14-surenb@google.com/
> > I introduce vma_ensure_detached() in that patch to handle this case
> > and ensure that vmas are detached before they are returned into the
> > slab cache for reuse. Does that make sense?
>
> So I just replied there, and no, I don't think it makes sense. Just put
> the kmem_cache_free() in vma_refcount_put(), to be done on 0.
That's very appealing indeed and makes things much simpler. The
problem I see with that is the case when we detach a vma from the tree
to isolate it, then do some cleanup and only then free it. That's done
in vms_gather_munmap_vmas() here:
https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
even might reattach detached vmas back:
https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
detached state is not final and we can't destroy the object that
reached this state. We could change states to: 0=unused (we can free
the object), 1=detached, 2=attached, etc. but then vma_start_read()
should do something like refcount_inc_more_than_one() instead of
refcount_inc_not_zero(). Would you be ok with such an approach?
>
> Anyway, my point was more about the weird entanglement of lockdep and
> the refcount. Just pull the lockdep annotation out of _put() and put it
> explicitly in the vma_start_read() error paths and vma_end_read().
Ok, I think that's easy.
>
> Additionally, having vma_end_write() would allow you to put a lockdep
> annotation in vma_{start,end}_write() -- which was I think the original
> reason I proposed it a while back, that and having improved clarity when
> reading the code, since explicitly marking the end of a section is
> helpful.
The vma->vmlock_dep_map is tracking vma->vm_refcnt, not the
vma->vm_lock_seq (similar to how today vma->vm_lock has its lockdep
tracking that rw_semaphore). If I implement vma_end_write() then it
will simply be something like:
void vma_end_write(vma)
{
vma_assert_write_locked(vma);
vma->vm_lock_seq = UINT_MAX;
}
so, vmlock_dep_map would not be involved.
If you want to track vma->vm_lock_seq with a separate lockdep, that
would be more complicated. Specifically for vma_end_write_all() that
would require us to call rwsem_release() on all locked vmas, however
we currently do not track individual locked vmas. vma_end_write_all()
allows us not to worry about tracking them, knowing that once we do
mmap_write_unlock() they all will get unlocked with one increment of
mm->mm_lock_seq. If your suggestion is to replace vma_end_write_all()
with vma_end_write() and unlock vmas individually across the mm code,
that would be a sizable effort. If that is indeed your ultimate goal,
I can do that as a separate project: introduce vma_end_write(),
gradually add them in required places (not yet sure how complex that
would be), then retire vma_end_write_all() and add a lockdep for
vma->vm_lock_seq.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 00/16] move per-vma lock into vm_area_struct
2024-12-16 19:39 ` [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
@ 2024-12-17 18:42 ` Andrew Morton
2024-12-17 18:49 ` Suren Baghdasaryan
0 siblings, 1 reply; 74+ messages in thread
From: Andrew Morton @ 2024-12-17 18:42 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Mon, 16 Dec 2024 11:39:16 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> > Patchset applies over mm-unstable after reverting v5 of this patchset [4]
> > (currently 687e99a5faa5-905ab222508a)
>
> ^^^
> Please be aware of this if trying to apply to a branch. mm-unstable
> contains an older version of this patchset which needs to be reverted
> before this one can be applied.
I quietly updated mm-unstable to v6. I understand that a v7 is expected.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 00/16] move per-vma lock into vm_area_struct
2024-12-17 18:42 ` Andrew Morton
@ 2024-12-17 18:49 ` Suren Baghdasaryan
0 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-17 18:49 UTC (permalink / raw)
To: Andrew Morton
Cc: peterz, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Dec 17, 2024 at 10:42 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Mon, 16 Dec 2024 11:39:16 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > Patchset applies over mm-unstable after reverting v5 of this patchset [4]
> > > (currently 687e99a5faa5-905ab222508a)
> >
> > ^^^
> > Please be aware of this if trying to apply to a branch. mm-unstable
> > contains an older version of this patchset which needs to be reverted
> > before this one can be applied.
>
> I quietly updated mm-unstable to v6. I understand that a v7 is expected.
Thanks! Yes, I'll post v7 once our discussion with Peter on
refcounting is concluded.
Could you please fixup the issue that Lokesh found in
https://lore.kernel.org/all/20241216192419.2970941-7-surenb@google.com/
?
Instead of
+ if (!vma_start_read_locked(*dst_vmap)) {
it should be:
+ if (vma_start_read_locked(*dst_vmap)) {
That's the only critical issue found in v6 so far.
Thanks!
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-17 16:27 ` Suren Baghdasaryan
@ 2024-12-18 9:41 ` Peter Zijlstra
2024-12-18 10:06 ` Peter Zijlstra
2024-12-18 15:42 ` Suren Baghdasaryan
0 siblings, 2 replies; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-18 9:41 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:
> > So I just replied there, and no, I don't think it makes sense. Just put
> > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
>
> That's very appealing indeed and makes things much simpler. The
> problem I see with that is the case when we detach a vma from the tree
> to isolate it, then do some cleanup and only then free it. That's done
> in vms_gather_munmap_vmas() here:
> https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> even might reattach detached vmas back:
> https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> detached state is not final and we can't destroy the object that
> reached this state.
Urgh, so that's the munmap() path, but arguably when that fails, the
map stays in place.
I think this means you're marking detached too soon; you should only
mark detached once you reach the point of no return.
That said, once you've reached the point of no return; and are about to
go remove the page-tables, you very much want to ensure a lack of
concurrency.
So perhaps waiting for out-standing readers at this point isn't crazy.
Also, I'm having a very hard time reading this maple tree stuff :/
Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
second tree, it does not in fact unlink them from the mm yet.
AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
mm -- and that being able to fail is mind boggling and I suppose is what
gives rise to much of this insanity :/
Anyway, I would expect remove_vma() to be the one that marks it detached
(it's already unreachable through vma_lookup() at this point) and there
you should wait for concurrent readers to bugger off.
> We could change states to: 0=unused (we can free
> the object), 1=detached, 2=attached, etc. but then vma_start_read()
> should do something like refcount_inc_more_than_one() instead of
> refcount_inc_not_zero(). Would you be ok with such an approach?
Urgh, I would strongly suggest ditching refcount_t if we go this route.
The thing is, refcount_t should remain a 'simple', straightforward
interface and not allow people to do the wrong thing. It's not meant to
be the kitchen sink -- we have atomic_t for that.
Anyway, the more common scheme at that point is using -1 for 'free', I
think folio->_mapcount uses that even. For that see:
atomic_add_negative*().
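Something like so, perhaps (very rough sketch: it assumes vm_refcnt becomes
a plain atomic_t, ignores the VMA_STATE_LOCKED writer-wait bit and the
sequence check, and the names are made up):

/* -1 == unused (may be freed/reused), 0 == detached, >= 1 == attached + readers */
#define VMA_REF_UNUSED  (-1)

static inline bool vma_get_read_ref(struct vm_area_struct *vma)
{
        int old = atomic_read(&vma->vm_refcnt);

        do {
                /* Only take a reader reference on an attached vma. */
                if (old < 1)
                        return false;
        } while (!atomic_try_cmpxchg_acquire(&vma->vm_refcnt, &old, old + 1));

        return true;
}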
> > Additionally, having vma_end_write() would allow you to put a lockdep
> > annotation in vma_{start,end}_write() -- which was I think the original
> > reason I proposed it a while back, that and having improved clarity when
> > reading the code, since explicitly marking the end of a section is
> > helpful.
>
> The vma->vmlock_dep_map is tracking vma->vm_refcnt, not the
> vma->vm_lock_seq (similar to how today vma->vm_lock has its lockdep
> tracking that rw_semaphore). If I implement vma_end_write() then it
> will simply be something like:
>
> void vma_end_write(vma)
> {
> vma_assert_write_locked(vma);
> vma->vm_lock_seq = UINT_MAX;
> }
>
> so, vmlock_dep_map would not be involved.
That's just weird; why would you not track vma_{start,end}_write() with
the exclusive side of the 'rwsem' dep_map ?
> If you want to track vma->vm_lock_seq with a separate lockdep, that
> would be more complicated. Specifically for vma_end_write_all() that
> would require us to call rwsem_release() on all locked vmas, however
> we currently do not track individual locked vmas. vma_end_write_all()
> allows us not to worry about tracking them, knowing that once we do
> mmap_write_unlock() they all will get unlocked with one increment of
> mm->mm_lock_seq. If your suggestion is to replace vma_end_write_all()
> with vma_end_write() and unlock vmas individually across the mm code,
> that would be a sizable effort. If that is indeed your ultimate goal,
> I can do that as a separate project: introduce vma_end_write(),
> gradually add them in required places (not yet sure how complex that
> would be), then retire vma_end_write_all() and add a lockdep for
> vma->vm_lock_seq.
Yeah, so ultimately I think it would be clearer if you explicitly mark
the point where the vma modification is 'done'. But I don't suppose we
have to do that here.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 9:41 ` Peter Zijlstra
@ 2024-12-18 10:06 ` Peter Zijlstra
2024-12-18 15:37 ` Liam R. Howlett
2024-12-18 15:42 ` Suren Baghdasaryan
1 sibling, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-18 10:06 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 10:41:04AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:
>
> > > So I just replied there, and no, I don't think it makes sense. Just put
> > > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
> >
> > That's very appealing indeed and makes things much simpler. The
> > problem I see with that is the case when we detach a vma from the tree
> > to isolate it, then do some cleanup and only then free it. That's done
> > in vms_gather_munmap_vmas() here:
> > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> > even might reattach detached vmas back:
> > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> > detached state is not final and we can't destroy the object that
> > reached this state.
>
> Urgh, so that's the munmap() path, but arguably when that fails, the
> map stays in place.
>
> I think this means you're marking detached too soon; you should only
> mark detached once you reach the point of no return.
>
> That said, once you've reached the point of no return; and are about to
> go remove the page-tables, you very much want to ensure a lack of
> concurrency.
>
> So perhaps waiting for out-standing readers at this point isn't crazy.
>
> Also, I'm having a very hard time reading this maple tree stuff :/
> Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
> second tree, it does not in fact unlink them from the mm yet.
>
> AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
> mm -- and that being able to fail is mind boggling and I suppose is what
> gives rise to much of this insanity :/
>
> Anyway, I would expect remove_vma() to be the one that marks it detached
> (it's already unreachable through vma_lookup() at this point) and there
> you should wait for concurrent readers to bugger off.
Also, I think vma_start_write() in that gather loop is too early, you're
not actually going to change the VMA yet -- with obvious exception of
the split cases.
That too should probably come after you've passed all the fail/unwind
spots.
Something like so perhaps? (yeah, I know, I wrecked a bunch)
diff --git a/mm/vma.c b/mm/vma.c
index 8e31b7e25aeb..45d43adcbb36 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct vm_area_struct *vma;
struct mm_struct *mm;
+ mas_for_each(mas_detach, vma, ULONG_MAX) {
+ vma_start_write(next);
+ vma_mark_detached(next, true);
+ }
+
mm = current->mm;
mm->map_count -= vms->vma_count;
mm->locked_vm -= vms->locked_vm;
@@ -1219,9 +1224,6 @@ static void reattach_vmas(struct ma_state *mas_detach)
struct vm_area_struct *vma;
mas_set(mas_detach, 0);
- mas_for_each(mas_detach, vma, ULONG_MAX)
- vma_mark_detached(vma, false);
-
__mt_destroy(mas_detach->tree);
}
@@ -1289,13 +1291,11 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (error)
goto end_split_failed;
}
- vma_start_write(next);
mas_set(mas_detach, vms->vma_count++);
error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
goto munmap_gather_failed;
- vma_mark_detached(next, true);
nrpages = vma_pages(next);
vms->nr_pages += nrpages;
@@ -1431,14 +1431,17 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct vma_munmap_struct vms;
int error;
+ error = mas_preallocate(vmi->mas);
+ if (error)
+ goto gather_failed;
+
init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
error = vms_gather_munmap_vmas(&vms, &mas_detach);
if (error)
goto gather_failed;
error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
- if (error)
- goto clear_tree_failed;
+ VM_WARN_ON(error);
/* Point of no return */
vms_complete_munmap_vmas(&vms, &mas_detach);
^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 10:06 ` Peter Zijlstra
@ 2024-12-18 15:37 ` Liam R. Howlett
2024-12-18 15:50 ` Suren Baghdasaryan
` (2 more replies)
0 siblings, 3 replies; 74+ messages in thread
From: Liam R. Howlett @ 2024-12-18 15:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Suren Baghdasaryan, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Peter Zijlstra <peterz@infradead.org> [241218 05:06]:
> On Wed, Dec 18, 2024 at 10:41:04AM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:
> >
> > > > So I just replied there, and no, I don't think it makes sense. Just put
> > > > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
> > >
> > > That's very appealing indeed and makes things much simpler. The
> > > problem I see with that is the case when we detach a vma from the tree
> > > to isolate it, then do some cleanup and only then free it. That's done
> > > in vms_gather_munmap_vmas() here:
> > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> > > even might reattach detached vmas back:
> > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> > > detached state is not final and we can't destroy the object that
> > > reached this state.
> >
> > Urgh, so that's the munmap() path, but arguably when that fails, the
> > map stays in place.
> >
> > I think this means you're marking detached too soon; you should only
> > mark detached once you reach the point of no return.
> >
> > That said, once you've reached the point of no return; and are about to
> > go remove the page-tables, you very much want to ensure a lack of
> > concurrency.
> >
> > So perhaps waiting for out-standing readers at this point isn't crazy.
> >
> > Also, I'm having a very hard time reading this maple tree stuff :/
> > Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
> > second tree, it does not in fact unlink them from the mm yet.
Yes, that's correct. I tried to make this clear with a gather/complete
naming like other areas of the mm. I hope that helped.
Also, the comments for the function state that's what's going on:
* vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
* for removal at a later date. Handles splitting first and last if necessary
* and marking the vmas as isolated.
... might be worth updating with new information.
> >
> > AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
> > mm -- and that being able to fail is mind boggling and I suppose is what
> > gives rise to much of this insanity :/
This is also correct. The maple tree is a b-tree variant that has
internal nodes. When you write to it, including writing NULLs, the ranges
are tracked and the write may need to allocate. This is a cost of
supporting rcu lookups; we will use the same or less memory in the end,
but we must maintain a consistent view of the ranges.
But to put this into perspective, we get 16 nodes per 4k page, most
writes will use 1 or 3 of these from a kmem_cache, so we are talking
about a very unlikely possibility. Except when syzbot decides to fail
random allocations.
We could preallocate for the write, but this section of the code is
GFP_KERNEL, so we don't. Preallocation is an option to simplify the
failure path though... which is what you did below.
> >
> > Anyway, I would expect remove_vma() to be the one that marks it detached
> > (it's already unreachable through vma_lookup() at this point) and there
> > you should wait for concurrent readers to bugger off.
>
> Also, I think vma_start_write() in that gather loop is too early, you're
> not actually going to change the VMA yet -- with obvious exception of
> the split cases.
The split needs to start the write on the vma to avoid anyone reading it
while it's being altered.
>
> That too should probably come after you've passed all the fail/unwind
> spots.
Do you mean the split? I'd like to move the split later as well..
tracking that is a pain and may need an extra vma for when one vma is
split twice before removing the middle part.
Actually, I think we need to allocate two (or at least one) vmas in this
case and just pass one through to unmap (written only to the mas_detach
tree?). It would be nice to find a way to NOT need to do that even.. I
had tried to use a vma on the stack years ago, which didn't work out.
>
> Something like so perhaps? (yeah, I know, I wrecked a bunch)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 8e31b7e25aeb..45d43adcbb36 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> struct vm_area_struct *vma;
> struct mm_struct *mm;
>
mas_set(mas_detach, 0);
> + mas_for_each(mas_detach, vma, ULONG_MAX) {
> + vma_start_write(next);
> + vma_mark_detached(next, true);
> + }
> +
> mm = current->mm;
> mm->map_count -= vms->vma_count;
> mm->locked_vm -= vms->locked_vm;
> @@ -1219,9 +1224,6 @@ static void reattach_vmas(struct ma_state *mas_detach)
> struct vm_area_struct *vma;
>
> mas_set(mas_detach, 0);
Drop the mas_set here.
> - mas_for_each(mas_detach, vma, ULONG_MAX)
> - vma_mark_detached(vma, false);
> -
> __mt_destroy(mas_detach->tree);
> }
>
> @@ -1289,13 +1291,11 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> if (error)
> goto end_split_failed;
> }
> - vma_start_write(next);
> mas_set(mas_detach, vms->vma_count++);
> error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> if (error)
> goto munmap_gather_failed;
>
> - vma_mark_detached(next, true);
> nrpages = vma_pages(next);
>
> vms->nr_pages += nrpages;
> @@ -1431,14 +1431,17 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct vma_munmap_struct vms;
> int error;
>
The preallocation needs to know the range being stored to know what's
going to happen.
vma_iter_config(vmi, start, end);
> + error = mas_preallocate(vmi->mas);
We haven't had a need to have a vma iterator preallocate for storing a
null, but we can add one for this.
> + if (error)
> + goto gather_failed;
> +
> init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
> error = vms_gather_munmap_vmas(&vms, &mas_detach);
> if (error)
> goto gather_failed;
>
Drop this stuff.
> error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
> - if (error)
> - goto clear_tree_failed;
> + VM_WARN_ON(error);
Do this instead
vma_iter_config(vmi, start, end);
vma_iter_clear(vmi);
>
> /* Point of no return */
> vms_complete_munmap_vmas(&vms, &mas_detach);
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 9:41 ` Peter Zijlstra
2024-12-18 10:06 ` Peter Zijlstra
@ 2024-12-18 15:42 ` Suren Baghdasaryan
1 sibling, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 15:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, willy, liam.howlett, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 1:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:
>
> > > So I just replied there, and no, I don't think it makes sense. Just put
> > > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
> >
> > That's very appealing indeed and makes things much simpler. The
> > problem I see with that is the case when we detach a vma from the tree
> > to isolate it, then do some cleanup and only then free it. That's done
> > in vms_gather_munmap_vmas() here:
> > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> > even might reattach detached vmas back:
> > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> > detached state is not final and we can't destroy the object that
> > reached this state.
>
> Urgh, so that's the munmap() path, but arguably when that fails, the
> map stays in place.
>
> I think this means you're marking detached too soon; you should only
> mark detached once you reach the point of no return.
>
> That said, once you've reached the point of no return; and are about to
> go remove the page-tables, you very much want to ensure a lack of
> concurrency.
>
> So perhaps waiting for out-standing readers at this point isn't crazy.
>
> Also, I'm having a very hard time reading this maple tree stuff :/
> Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
> second tree, it does not in fact unlink them from the mm yet.
Yes, I think you are correct.
>
> AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
> mm -- and that being able to fail is mind boggling and I suppose is what
> gives rise to much of this insanity :/
>
> Anyway, I would expect remove_vma() to be the one that marks it detached
> (it's already unreachable through vma_lookup() at this point) and there
> you should wait for concurrent readers to bugger off.
There is an issue with that. Note that vms_complete_munmap_vmas(),
which calls remove_vma(), might drop the mmap write lock first, so
detaching without a write lock would break the current rules.
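For context, the rule referred to here is that detaching asserts the vma
is write-locked; roughly, from memory of the pre-series helper and possibly
not exact:

    static inline void vma_mark_detached(struct vm_area_struct *vma,
                                         bool detached)
    {
        /* When detaching, the vma should be write-locked. */
        if (detached)
            vma_assert_write_locked(vma);
        vma->detached = detached;
    }

so marking a vma detached after the write lock has been dropped or
downgraded would trip that assertion.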
>
> > We could change states to: 0=unused (we can free
> > the object), 1=detached, 2=attached, etc. but then vma_start_read()
> > should do something like refcount_inc_more_than_one() instead of
> > refcount_inc_not_zero(). Would you be ok with such an approach?
>
> Urgh, I would strongly suggest ditching refcount_t if we go this route.
> The thing is; refcount_t should remain a 'simple' straight forward
> interface and not allow people to do the wrong thing. Its not meant to
> be the kitchen sink -- we have atomic_t for that.
Ack. If we go this route I'll use atomics directly.
>
> Anyway, the more common scheme at that point is using -1 for 'free', I
> think folio->_mapcount uses that even. For that see:
> atomic_add_negative*().
Thanks for the reference.
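To make the "-1 means free" convention concrete, here is a rough
illustration using plain atomics. The state encoding and the names are
invented for the example and are not what the series ends up with:
-1 = unused/freeable, 0 = detached, 1 = attached, >1 = attached with readers.

    /* Reader side: only pin an attached object, i.e. count >= 1. */
    static bool example_get(atomic_t *refcnt)
    {
        int old = atomic_read(refcnt);

        do {
            if (old < 1)
                return false;
        } while (!atomic_try_cmpxchg(refcnt, &old, old + 1));

        return true;
    }

    /*
     * Put side: returns true when the count went negative (-1), meaning
     * no users are left and the object may be freed; this is the
     * folio->_mapcount style check via atomic_add_negative().
     */
    static bool example_put(atomic_t *refcnt)
    {
        return atomic_add_negative(-1, refcnt);
    }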
>
> > > Additionally, having vma_end_write() would allow you to put a lockdep
> > > annotation in vma_{start,end}_write() -- which was I think the original
> > > reason I proposed it a while back, that and having improved clarity when
> > > reading the code, since explicitly marking the end of a section is
> > > helpful.
> >
> > The vma->vmlock_dep_map is tracking vma->vm_refcnt, not the
> > vma->vm_lock_seq (similar to how today vma->vm_lock has its lockdep
> > tracking that rw_semaphore). If I implement vma_end_write() then it
> > will simply be something like:
> >
> > void vma_end_write(vma)
> > {
> >         vma_assert_write_locked(vma);
> >         vma->vm_lock_seq = UINT_MAX;
> > }
> >
> > so, vmlock_dep_map would not be involved.
>
> That's just weird; why would you not track vma_{start,end}_write() with
> the exclusive side of the 'rwsem' dep_map ?
>
> > If you want to track vma->vm_lock_seq with a separate lockdep, that
> > would be more complicated. Specifically for vma_end_write_all() that
> > would require us to call rwsem_release() on all locked vmas, however
> > we currently do not track individual locked vmas. vma_end_write_all()
> > allows us not to worry about tracking them, knowing that once we do
> > mmap_write_unlock() they all will get unlocked with one increment of
> > mm->mm_lock_seq. If your suggestion is to replace vma_end_write_all()
> > with vma_end_write() and unlock vmas individually across the mm code,
> > that would be a sizable effort. If that is indeed your ultimate goal,
> > I can do that as a separate project: introduce vma_end_write(),
> > gradually add them in required places (not yet sure how complex that
> > would be), then retire vma_end_write_all() and add a lockdep for
> > vma->vm_lock_seq.
>
> Yeah, so ultimately I think it would be clearer if you explicitly mark
> the point where the vma modification is 'done'. But I don't suppose we
> have to do that here.
Ack.
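As an aside on the lockdep point above, tracking the exclusive side on the
same dep_map would conceptually look like the sketch below. This is an
illustration only; the series does not add it, and both helper names are
hypothetical:

    static inline void example_vma_start_write_lockdep(struct vm_area_struct *vma)
    {
        /* Exclusive acquire on the dep_map the read side already uses. */
        rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
    }

    static inline void example_vma_end_write_lockdep(struct vm_area_struct *vma)
    {
        rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
    }

The difficulty discussed above is that vma_end_write_all() has no per-vma
release point, which is why this remains a possible follow-up rather than
part of this change.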
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 15:37 ` Liam R. Howlett
@ 2024-12-18 15:50 ` Suren Baghdasaryan
2024-12-18 16:18 ` Peter Zijlstra
2024-12-18 15:57 ` Suren Baghdasaryan
2024-12-18 16:13 ` Peter Zijlstra
2 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 15:50 UTC (permalink / raw)
To: Liam R. Howlett, Peter Zijlstra, Suren Baghdasaryan, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 7:37 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Peter Zijlstra <peterz@infradead.org> [241218 05:06]:
> > On Wed, Dec 18, 2024 at 10:41:04AM +0100, Peter Zijlstra wrote:
> > > On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:
> > >
> > > > > So I just replied there, and no, I don't think it makes sense. Just put
> > > > > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
> > > >
> > > > That's very appealing indeed and makes things much simpler. The
> > > > problem I see with that is the case when we detach a vma from the tree
> > > > to isolate it, then do some cleanup and only then free it. That's done
> > > > in vms_gather_munmap_vmas() here:
> > > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> > > > even might reattach detached vmas back:
> > > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> > > > detached state is not final and we can't destroy the object that
> > > > reached this state.
> > >
> > > Urgh, so that's the munmap() path, but arguably when that fails, the
> > > map stays in place.
> > >
> > > I think this means you're marking detached too soon; you should only
> > > mark detached once you reach the point of no return.
> > >
> > > That said, once you've reached the point of no return; and are about to
> > > go remove the page-tables, you very much want to ensure a lack of
> > > concurrency.
> > >
> > > So perhaps waiting for out-standing readers at this point isn't crazy.
> > >
> > > Also, I'm having a very hard time reading this maple tree stuff :/
> > > Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
> > > second tree, it does not in fact unlink them from the mm yet.
>
> Yes, that's correct. I tried to make this clear with a gather/complete
> naming like other areas of the mm. I hope that helped.
>
> Also, the comments for the function state that's what's going on:
>
> * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
> * for removal at a later date. Handles splitting first and last if necessary
> * and marking the vmas as isolated.
>
> ... might be worth updating with new information.
>
> > >
> > > AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
> > > mm -- and that being able to fail is mind boggling and I suppose is what
> > > gives rise to much of this insanity :/
>
> This is also correct. The maple tree is a b-tree variant that has
> internal nodes. When you write to it, including nulls, they are tracked
> and may need to allocate. This is a cost for rcu lookups; we will use
> the same or less memory in the end but must maintain a consistent view
> of the ranges.
>
> But to put this into perspective, we get 16 nodes per 4k page, most
> writes will use 1 or 3 of these from a kmem_cache, so we are talking
> about a very unlikely possibility. Except when syzbot decides to fail
> random allocations.
>
> We could preallocate for the write, but this section of the code is
> GFP_KERNEL, so we don't. Preallocation is an option to simplify the
> failure path though... which is what you did below.
>
> > >
> > > Anyway, I would expect remove_vma() to be the one that marks it detached
> > > (it's already unreachable through vma_lookup() at this point) and there
> > > you should wait for concurrent readers to bugger off.
> >
> > Also, I think vma_start_write() in that gather loop is too early, you're
> > not actually going to change the VMA yet -- with obvious exception of
> > the split cases.
>
> The split needs to start the write on the vma to avoid anyone reading it
> while it's being altered.
I think vma_start_write() should be done inside
vms_gather_munmap_vmas() for __mmap_prepare() to work correctly:
__mmap_prepare
vms_gather_munmap_vmas
vms_clean_up_area // clears PTEs
...
__mmap_complete
vms_complete_munmap_vmas
If we do not write-lock the vmas inside vms_gather_munmap_vmas(), we
will be clearing PTEs from under a discoverable vma.
There might be other places like this too but I think we can move
vma_mark_detached() like you suggested without moving vma_start_write()
and that should be enough.
>
> >
> > That too should probably come after you've passed all the fail/unwind
> > spots.
>
> Do you mean the split? I'd like to move the split later as well..
> tracking that is a pain and may need an extra vma for when one vma is
> split twice before removing the middle part.
>
> Actually, I think we need to allocate two (or at least one) vmas in this
> case and just pass one through to unmap (written only to the mas_detach
> tree?). It would be nice to find a way to NOT need to do that even.. I
> had tried to use a vma on the stack years ago, which didn't work out.
>
> >
> > Something like so perhaps? (yeah, I know, I wrecked a bunch)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 8e31b7e25aeb..45d43adcbb36 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > struct vm_area_struct *vma;
> > struct mm_struct *mm;
> >
>
> mas_set(mas_detach, 0);
>
> > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > + vma_start_write(next);
> > + vma_mark_detached(next, true);
> > + }
> > +
> > mm = current->mm;
> > mm->map_count -= vms->vma_count;
> > mm->locked_vm -= vms->locked_vm;
> > @@ -1219,9 +1224,6 @@ static void reattach_vmas(struct ma_state *mas_detach)
> > struct vm_area_struct *vma;
> >
>
> > mas_set(mas_detach, 0);
> Drop the mas_set here.
>
> > - mas_for_each(mas_detach, vma, ULONG_MAX)
> > - vma_mark_detached(vma, false);
> > -
> > __mt_destroy(mas_detach->tree);
> > }
> >
> > @@ -1289,13 +1291,11 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > if (error)
> > goto end_split_failed;
> > }
> > - vma_start_write(next);
> > mas_set(mas_detach, vms->vma_count++);
> > error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> > if (error)
> > goto munmap_gather_failed;
> >
> > - vma_mark_detached(next, true);
> > nrpages = vma_pages(next);
> >
> > vms->nr_pages += nrpages;
> > @@ -1431,14 +1431,17 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > struct vma_munmap_struct vms;
> > int error;
> >
>
> The preallocation needs to know the range being stored to know what's
> going to happen.
>
> vma_iter_config(vmi, start, end);
>
> > + error = mas_preallocate(vmi->mas);
>
> We haven't had a need to have a vma iterator preallocate for storing a
> null, but we can add one for this.
>
> > + if (error)
> > + goto gather_failed;
> > +
> > init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
> > error = vms_gather_munmap_vmas(&vms, &mas_detach);
> > if (error)
> > goto gather_failed;
> >
>
> Drop this stuff.
> > error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
> > - if (error)
> > - goto clear_tree_failed;
> > + VM_WARN_ON(error);
>
> Do this instead
> vma_iter_config(vmi, start, end);
> vma_iter_clear(vmi);
>
> >
> > /* Point of no return */
> > vms_complete_munmap_vmas(&vms, &mas_detach);
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 15:37 ` Liam R. Howlett
2024-12-18 15:50 ` Suren Baghdasaryan
@ 2024-12-18 15:57 ` Suren Baghdasaryan
2024-12-18 16:13 ` Peter Zijlstra
2 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 15:57 UTC (permalink / raw)
To: Liam R. Howlett, Peter Zijlstra, Suren Baghdasaryan, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 7:37 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Peter Zijlstra <peterz@infradead.org> [241218 05:06]:
> > On Wed, Dec 18, 2024 at 10:41:04AM +0100, Peter Zijlstra wrote:
> > > On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:
> > >
> > > > > So I just replied there, and no, I don't think it makes sense. Just put
> > > > > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
> > > >
> > > > That's very appealing indeed and makes things much simpler. The
> > > > problem I see with that is the case when we detach a vma from the tree
> > > > to isolate it, then do some cleanup and only then free it. That's done
> > > > in vms_gather_munmap_vmas() here:
> > > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> > > > even might reattach detached vmas back:
> > > > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> > > > detached state is not final and we can't destroy the object that
> > > > reached this state.
> > >
> > > Urgh, so that's the munmap() path, but arguably when that fails, the
> > > map stays in place.
> > >
> > > I think this means you're marking detached too soon; you should only
> > > mark detached once you reach the point of no return.
> > >
> > > That said, once you've reached the point of no return; and are about to
> > > go remove the page-tables, you very much want to ensure a lack of
> > > concurrency.
> > >
> > > So perhaps waiting for out-standing readers at this point isn't crazy.
> > >
> > > Also, I'm having a very hard time reading this maple tree stuff :/
> > > Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
> > > second tree, it does not in fact unlink them from the mm yet.
>
> Yes, that's correct. I tried to make this clear with a gather/complete
> naming like other areas of the mm. I hope that helped.
>
> Also, the comments for the function state that's what's going on:
>
> * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
> * for removal at a later date. Handles splitting first and last if necessary
> * and marking the vmas as isolated.
>
> ... might be worth updating with new information.
>
> > >
> > > AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
> > > mm -- and that being able to fail is mind boggling and I suppose is what
> > > gives rise to much of this insanity :/
>
> This is also correct. The maple tree is a b-tree variant that has
> internal nodes. When you write to it, including nulls, they are tracked
> and may need to allocate. This is a cost for rcu lookups; we will use
> the same or less memory in the end but must maintain a consistent view
> of the ranges.
>
> But to put this into perspective, we get 16 nodes per 4k page, most
> writes will use 1 or 3 of these from a kmem_cache, so we are talking
> about a very unlikely possibility. Except when syzbot decides to fail
> random allocations.
>
> We could preallocate for the write, but this section of the code is
> GFP_KERNEL, so we don't. Preallocation is an option to simplify the
> failure path though... which is what you did below.
>
> > >
> > > Anyway, I would expect remove_vma() to be the one that marks it detached
> > > (it's already unreachable through vma_lookup() at this point) and there
> > > you should wait for concurrent readers to bugger off.
> >
> > Also, I think vma_start_write() in that gather loop is too early, you're
> > not actually going to change the VMA yet -- with obvious exception of
> > the split cases.
>
> The split needs to start the write on the vma to avoid anyone reading it
> while it's being altered.
>
> >
> > That too should probably come after you've passed all the fail/unwind
> > spots.
>
> Do you mean the split? I'd like to move the split later as well..
> tracking that is a pain and may need an extra vma for when one vma is
> split twice before removing the middle part.
>
> Actually, I think we need to allocate two (or at least one) vmas in this
> case and just pass one through to unmap (written only to the mas_detach
> tree?). It would be nice to find a way to NOT need to do that even.. I
> had tried to use a vma on the stack years ago, which didn't work out.
>
> >
> > Something like so perhaps? (yeah, I know, I wrecked a bunch)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 8e31b7e25aeb..45d43adcbb36 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > struct vm_area_struct *vma;
> > struct mm_struct *mm;
> >
>
> mas_set(mas_detach, 0);
>
> > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > + vma_start_write(next);
> > + vma_mark_detached(next, true);
> > + }
> > +
> > mm = current->mm;
> > mm->map_count -= vms->vma_count;
> > mm->locked_vm -= vms->locked_vm;
> > @@ -1219,9 +1224,6 @@ static void reattach_vmas(struct ma_state *mas_detach)
> > struct vm_area_struct *vma;
> >
>
> > mas_set(mas_detach, 0);
> Drop the mas_set here.
>
> > - mas_for_each(mas_detach, vma, ULONG_MAX)
> > - vma_mark_detached(vma, false);
> > -
> > __mt_destroy(mas_detach->tree);
> > }
> >
> > @@ -1289,13 +1291,11 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > if (error)
> > goto end_split_failed;
> > }
> > - vma_start_write(next);
> > mas_set(mas_detach, vms->vma_count++);
> > error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> > if (error)
> > goto munmap_gather_failed;
> >
> > - vma_mark_detached(next, true);
> > nrpages = vma_pages(next);
> >
> > vms->nr_pages += nrpages;
> > @@ -1431,14 +1431,17 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > struct vma_munmap_struct vms;
> > int error;
> >
>
> The preallocation needs to know the range being stored to know what's
> going to happen.
>
> vma_iter_config(vmi, start, end);
>
> > + error = mas_preallocate(vmi->mas);
>
> We haven't had a need to have a vma iterator preallocate for storing a
> null, but we can add one for this.
>
> > + if (error)
> > + goto gather_failed;
> > +
> > init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
> > error = vms_gather_munmap_vmas(&vms, &mas_detach);
> > if (error)
> > goto gather_failed;
> >
>
> Drop this stuff.
> > error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
> > - if (error)
> > - goto clear_tree_failed;
> > + VM_WARN_ON(error);
>
> Do this instead
> vma_iter_config(vmi, start, end);
> vma_iter_clear(vmi);
Thanks for the input, Liam. Let me try to make a patch from these
suggestions and see where we end up and what might blow up.
>
> >
> > /* Point of no return */
> > vms_complete_munmap_vmas(&vms, &mas_detach);
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 15:37 ` Liam R. Howlett
2024-12-18 15:50 ` Suren Baghdasaryan
2024-12-18 15:57 ` Suren Baghdasaryan
@ 2024-12-18 16:13 ` Peter Zijlstra
2 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-18 16:13 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, akpm, willy, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 10:37:24AM -0500, Liam R. Howlett wrote:
> This is also correct. The maple tree is a b-tree variant that has
> internal nodes.
Right, I remembered that much :-)
> > Also, I think vma_start_write() in that gather loop is too early, you're
> > not actually going to change the VMA yet -- with obvious exception of
> > the split cases.
>
> The split needs to start the write on the vma to avoid anyone reading it
> while it's being altered.
__split_vma() does vma_start_write() itself, so that should be good
already.
> > That too should probably come after you've passed all the fail/unwind
> > spots.
>
> Do you mean the split?
No, I mean the detach muck :-)
> I'd like to move the split later as well..
> tracking that is a pain and may need an extra vma for when one vma is
> split twice before removing the middle part.
>
> Actually, I think we need to allocate two (or at least one) vmas in this
> case and just pass one through to unmap (written only to the mas_detach
> tree?). It would be nice to find a way to NOT need to do that even.. I
> had tried to use a vma on the stack years ago, which didn't work out.
Urgh yeah, vma on stack sounds like utter pain :-)
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 15:50 ` Suren Baghdasaryan
@ 2024-12-18 16:18 ` Peter Zijlstra
2024-12-18 17:36 ` Suren Baghdasaryan
0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-18 16:18 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 07:50:34AM -0800, Suren Baghdasaryan wrote:
> I think vma_start_write() should be done inside
> vms_gather_munmap_vmas() for __mmap_prepare() to work correctly:
>
> __mmap_prepare
> vms_gather_munmap_vmas
> vms_clean_up_area // clears PTEs
> ...
> __mmap_complete
> vms_complete_munmap_vmas
I'm unsure what exactly you mean; __split_vma() will start_write on the
one that is broken up and the rest won't actually change until
vms_complete_munmap_vmas().
> If we do not write-lock the vmas inside vms_gather_munmap_vmas(), we
> will be clearing PTEs from under a discoverable vma.
You will not. vms_complete_munmap_vmas() will call remove_vma() to
remove PTEs IIRC, and if you do start_write() and detach() before
dropping mmap_lock_write, you should be good.
> There might be other places like this too but I think we can move
> vma_mark_detached() like you suggested without moving vma_start_write()
> and that should be enough.
I really don't see why you can't move vma_start_write() -- note that by
moving that after you've unhooked the vmas from the mm (which you have
by that point) you get the sync point you wanted.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 16:18 ` Peter Zijlstra
@ 2024-12-18 17:36 ` Suren Baghdasaryan
2024-12-18 17:44 ` Peter Zijlstra
0 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 17:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 8:18 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Dec 18, 2024 at 07:50:34AM -0800, Suren Baghdasaryan wrote:
>
> > I think vma_start_write() should be done inside
> > vms_gather_munmap_vmas() for __mmap_prepare() to work correctly:
> >
> > __mmap_prepare
> > vms_gather_munmap_vmas
> > vms_clean_up_area // clears PTEs
> > ...
> > __mmap_complete
> > vms_complete_munmap_vmas
>
> I'm unsure what exactly you mean; __split_vma() will start_write on the
> one that is broken up and the rest won't actually change until
> vms_complete_munmap_vmas().
Ah, sorry, I missed the write-locking in __split_vma(). Looks like
indeed vma_start_write() is not needed in vms_gather_munmap_vmas().
>
> > If we do not write-lock the vmas inside vms_gather_munmap_vmas(), we
> > will be clearing PTEs from under a discoverable vma.
>
> You will not. vms_complete_munmap_vmas() will call remove_vma() to
> remove PTEs IIRC, and if you do start_write() and detach() before
> dropping mmap_lock_write, you should be good.
Ok, I think we will have to move mmap_write_downgrade() inside
vms_complete_munmap_vmas() to be called after remove_vma().
vms_clear_ptes() is using vmas, so we can't move remove_vma() before
mmap_write_downgrade().
>
> > There might be other places like this too but I think we can move
> > vma_mark_detached() like you suggested without moving vma_start_write()
> > and that should be enough.
>
> I really don't see why you can't move vma_start_write() -- note that by
> moving that after you've unhooked the vmas from the mm (which you have
> by that point) you get the sync point you wanted.
>
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 17:36 ` Suren Baghdasaryan
@ 2024-12-18 17:44 ` Peter Zijlstra
2024-12-18 17:58 ` Suren Baghdasaryan
0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-18 17:44 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > remove PTEs IIRC, and if you do start_write() and detach() before
> > dropping mmap_lock_write, you should be good.
>
> Ok, I think we will have to move mmap_write_downgrade() inside
> vms_complete_munmap_vmas() to be called after remove_vma().
> vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> mmap_write_downgrade().
Why ?!
vms_clear_ptes() and remove_vma() are fine where they are -- there is no
concurrency left at this point.
Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
which is *after* the vmas have been unhooked from the mm, you wait for
any concurrent user to go away.
And since they're unhooked, there can't be any new users.
So you're the one and only user left, and code is fine the way it is.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 17:44 ` Peter Zijlstra
@ 2024-12-18 17:58 ` Suren Baghdasaryan
2024-12-18 19:00 ` Liam R. Howlett
2024-12-19 8:53 ` Peter Zijlstra
0 siblings, 2 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 17:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
>
> > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > dropping mmap_lock_write, you should be good.
> >
> > Ok, I think we will have to move mmap_write_downgrade() inside
> > vms_complete_munmap_vmas() to be called after remove_vma().
> > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > mmap_write_downgrade().
>
> Why ?!
>
> vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> concurrency left at this point.
>
> Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> which is *after* the vmas have been unhooked from the mm, you wait for
> any concurrent user to go away.
>
> And since they're unhooked, there can't be any new users.
>
> So you're the one and only user left, and code is fine the way it is.
Ok, let me make sure I understand this part of your proposal. From
your earlier email:
@@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
vma_munmap_struct *vms,
struct vm_area_struct *vma;
struct mm_struct *mm;
+ mas_for_each(mas_detach, vma, ULONG_MAX) {
+ vma_start_write(next);
+ vma_mark_detached(next, true);
+ }
+
mm = current->mm;
mm->map_count -= vms->vma_count;
mm->locked_vm -= vms->locked_vm;
This would mean:
vms_complete_munmap_vmas
vma_start_write
vma_mark_detached
mmap_write_downgrade
vms_clear_ptes
remove_vma
And remove_vma will be just freeing the vmas. Is that correct?
I'm a bit confused because the original thinking was that
vma_mark_detached() would drop the last refcnt and if it's 0 we would
free the vma right there. If that's still what we want to do then I
think the above sequence should look like this:
vms_complete_munmap_vmas
vms_clear_ptes
remove_vma
vma_start_write
vma_mark_detached
mmap_write_downgrade
because vma_start_write+vma_mark_detached should be done under mmap_write_lock.
Please let me know which way you want to move forward.
>
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 17:58 ` Suren Baghdasaryan
@ 2024-12-18 19:00 ` Liam R. Howlett
2024-12-18 19:07 ` Suren Baghdasaryan
2024-12-19 8:53 ` Peter Zijlstra
1 sibling, 1 reply; 74+ messages in thread
From: Liam R. Howlett @ 2024-12-18 19:00 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Peter Zijlstra, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241218 12:58]:
> On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> >
> > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > dropping mmap_lock_write, you should be good.
> > >
> > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > mmap_write_downgrade().
> >
> > Why ?!
> >
> > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > concurrency left at this point.
> >
> > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > which is *after* the vmas have been unhooked from the mm, you wait for
> > any concurrent user to go away.
> >
> > And since they're unhooked, there can't be any new users.
> >
> > So you're the one and only user left, and code is fine the way it is.
>
> Ok, let me make sure I understand this part of your proposal. From
> your earlier email:
>
> @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
> vma_munmap_struct *vms,
> struct vm_area_struct *vma;
> struct mm_struct *mm;
>
> + mas_for_each(mas_detach, vma, ULONG_MAX) {
> + vma_start_write(next);
> + vma_mark_detached(next, true);
> + }
> +
> mm = current->mm;
> mm->map_count -= vms->vma_count;
> mm->locked_vm -= vms->locked_vm;
>
> This would mean:
>
> vms_complete_munmap_vmas
> vma_start_write
> vma_mark_detached
> mmap_write_downgrade
> vms_clear_ptes
> remove_vma
>
> And remove_vma will be just freeing the vmas. Is that correct?
> I'm a bit confused because the original thinking was that
> vma_mark_detached() would drop the last refcnt and if it's 0 we would
> free the vma right there. If that's still what we want to do then I
> think the above sequence should look like this:
>
> vms_complete_munmap_vmas
> vms_clear_ptes
> remove_vma
> vma_start_write
> vma_mark_detached
> mmap_write_downgrade
>
> because vma_start_write+vma_mark_detached should be done under mmap_write_lock.
> Please let me know which way you want to move forward.
>
Are we sure we're not causing issues with the MAP_FIXED path here?
With the above change, we'd be freeing the PTEs before marking the vmas
as detached or vma_start_write().
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 19:00 ` Liam R. Howlett
@ 2024-12-18 19:07 ` Suren Baghdasaryan
2024-12-18 19:29 ` Suren Baghdasaryan
0 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 19:07 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, Peter Zijlstra, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 11:00 AM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241218 12:58]:
> > On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > >
> > > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > > dropping mmap_lock_write, you should be good.
> > > >
> > > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > > mmap_write_downgrade().
> > >
> > > Why ?!
> > >
> > > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > > concurrency left at this point.
> > >
> > > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > > which is *after* the vmas have been unhooked from the mm, you wait for
> > > any concurrent user to go away.
> > >
> > > And since they're unhooked, there can't be any new users.
> > >
> > > So you're the one and only user left, and code is fine the way it is.
> >
> > Ok, let me make sure I understand this part of your proposal. From
> > your earlier email:
> >
> > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
> > vma_munmap_struct *vms,
> > struct vm_area_struct *vma;
> > struct mm_struct *mm;
> >
> > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > + vma_start_write(next);
> > + vma_mark_detached(next, true);
> > + }
> > +
> > mm = current->mm;
> > mm->map_count -= vms->vma_count;
> > mm->locked_vm -= vms->locked_vm;
> >
> > This would mean:
> >
> > vms_complete_munmap_vmas
> > vma_start_write
> > vma_mark_detached
> > mmap_write_downgrade
> > vms_clear_ptes
> > remove_vma
> >
> > And remove_vma will be just freeing the vmas. Is that correct?
> > I'm a bit confused because the original thinking was that
> > vma_mark_detached() would drop the last refcnt and if it's 0 we would
> > free the vma right there. If that's still what we want to do then I
> > think the above sequence should look like this:
> >
> > vms_complete_munmap_vmas
> > vms_clear_ptes
> > remove_vma
> > vma_start_write
> > vma_mark_detached
> > mmap_write_downgrade
> >
> > because vma_start_write+vma_mark_detached should be done under mmap_write_lock.
> > Please let me know which way you want to move forward.
> >
>
> Are we sure we're not causing issues with the MAP_FIXED path here?
>
> With the above change, we'd be freeing the PTEs before marking the vmas
> as detached or vma_start_write().
IIUC when we call vms_complete_munmap_vmas() all vmas inside
mas_detach have been already write-locked, no?
>
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 19:07 ` Suren Baghdasaryan
@ 2024-12-18 19:29 ` Suren Baghdasaryan
2024-12-18 19:38 ` Liam R. Howlett
2024-12-19 8:55 ` Peter Zijlstra
0 siblings, 2 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 19:29 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, Peter Zijlstra, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 11:07 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Dec 18, 2024 at 11:00 AM 'Liam R. Howlett' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [241218 12:58]:
> > > On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > > >
> > > > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > > > dropping mmap_lock_write, you should be good.
> > > > >
> > > > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > > > mmap_write_downgrade().
> > > >
> > > > Why ?!
> > > >
> > > > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > > > concurrency left at this point.
> > > >
> > > > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > > > which is *after* the vmas have been unhooked from the mm, you wait for
> > > > any concurrent user to go away.
> > > >
> > > > And since they're unhooked, there can't be any new users.
> > > >
> > > > So you're the one and only user left, and code is fine the way it is.
> > >
> > > Ok, let me make sure I understand this part of your proposal. From
> > > your earlier email:
> > >
> > > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
> > > vma_munmap_struct *vms,
> > > struct vm_area_struct *vma;
> > > struct mm_struct *mm;
> > >
> > > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > > + vma_start_write(next);
> > > + vma_mark_detached(next, true);
> > > + }
> > > +
> > > mm = current->mm;
> > > mm->map_count -= vms->vma_count;
> > > mm->locked_vm -= vms->locked_vm;
> > >
> > > This would mean:
> > >
> > > vms_complete_munmap_vmas
> > > vma_start_write
> > > vma_mark_detached
> > > mmap_write_downgrade
> > > vms_clear_ptes
> > > remove_vma
> > >
> > > And remove_vma will be just freeing the vmas. Is that correct?
> > > I'm a bit confused because the original thinking was that
> > > vma_mark_detached() would drop the last refcnt and if it's 0 we would
> > > free the vma right there. If that's still what we want to do then I
> > > think the above sequence should look like this:
> > >
> > > vms_complete_munmap_vmas
> > > vms_clear_ptes
> > > remove_vma
> > > vma_start_write
> > > vma_mark_detached
> > > mmap_write_downgrade
> > >
> > > because vma_start_write+vma_mark_detached should be done under mmap_write_lock.
> > > Please let me know which way you want to move forward.
> > >
> >
> > Are we sure we're not causing issues with the MAP_FIXED path here?
> >
> > With the above change, we'd be freeing the PTEs before marking the vmas
> > as detached or vma_start_write().
>
> IIUC when we call vms_complete_munmap_vmas() all vmas inside
> mas_detach have been already write-locked, no?
Yeah, I think we can simply do this:
vms_complete_munmap_vmas
vms_clear_ptes
remove_vma
vma_mark_detached
mmap_write_downgrade
If my assumption is incorrect, assertion inside vma_mark_detached()
should trigger. I tried a quick test and so far nothing exploded.
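A minimal sketch of such a check, runnable from the top of
vms_complete_munmap_vmas() while testing, could look like this (debug-only
illustration, not part of the series; it assumes the gather stage is what
takes the write locks):

    static void assert_detach_tree_write_locked(struct ma_state *mas_detach)
    {
        struct vm_area_struct *vma;

        /* Every vma queued for removal must already be write-locked. */
        mas_set(mas_detach, 0);
        mas_for_each(mas_detach, vma, ULONG_MAX)
            vma_assert_write_locked(vma);
    }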
>
> >
> >
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 19:29 ` Suren Baghdasaryan
@ 2024-12-18 19:38 ` Liam R. Howlett
2024-12-18 20:00 ` Suren Baghdasaryan
2024-12-19 8:55 ` Peter Zijlstra
1 sibling, 1 reply; 74+ messages in thread
From: Liam R. Howlett @ 2024-12-18 19:38 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Peter Zijlstra, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241218 14:29]:
> On Wed, Dec 18, 2024 at 11:07 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Dec 18, 2024 at 11:00 AM 'Liam R. Howlett' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > * Suren Baghdasaryan <surenb@google.com> [241218 12:58]:
> > > > On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > >
> > > > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > >
> > > > > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > > > > dropping mmap_lock_write, you should be good.
> > > > > >
> > > > > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > > > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > > > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > > > > mmap_write_downgrade().
> > > > >
> > > > > Why ?!
> > > > >
> > > > > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > > > > concurrency left at this point.
> > > > >
> > > > > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > > > > which is *after* the vmas have been unhooked from the mm, you wait for
> > > > > any concurrent user to go away.
> > > > >
> > > > > And since they're unhooked, there can't be any new users.
> > > > >
> > > > > So you're the one and only user left, and code is fine the way it is.
> > > >
> > > > Ok, let me make sure I understand this part of your proposal. From
> > > > your earlier email:
> > > >
> > > > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
> > > > vma_munmap_struct *vms,
> > > > struct vm_area_struct *vma;
> > > > struct mm_struct *mm;
> > > >
> > > > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > > > + vma_start_write(next);
> > > > + vma_mark_detached(next, true);
> > > > + }
> > > > +
> > > > mm = current->mm;
> > > > mm->map_count -= vms->vma_count;
> > > > mm->locked_vm -= vms->locked_vm;
> > > >
> > > > This would mean:
> > > >
> > > > vms_complete_munmap_vmas
> > > > vma_start_write
> > > > vma_mark_detached
> > > > mmap_write_downgrade
> > > > vms_clear_ptes
> > > > remove_vma
> > > >
> > > > And remove_vma will be just freeing the vmas. Is that correct?
> > > > I'm a bit confused because the original thinking was that
> > > > vma_mark_detached() would drop the last refcnt and if it's 0 we would
> > > > free the vma right there. If that's still what we want to do then I
> > > > think the above sequence should look like this:
> > > >
> > > > vms_complete_munmap_vmas
> > > > vms_clear_ptes
> > > > remove_vma
> > > > vma_start_write
> > > > vma_mark_detached
> > > > mmap_write_downgrade
> > > >
> > > > because vma_start_write+vma_mark_detached should be done under mmap_write_lock.
> > > > Please let me know which way you want to move forward.
> > > >
> > >
> > > Are we sure we're not causing issues with the MAP_FIXED path here?
> > >
> > > With the above change, we'd be freeing the PTEs before marking the vmas
> > > as detached or vma_start_write().
> >
> > IIUC when we call vms_complete_munmap_vmas() all vmas inside
> > mas_detach have been already write-locked, no?
That's the way it is today - but I thought you were moving the lock to
the complete stage, not adding a new one? (why add a new one otherwise?)
>
> Yeah, I think we can simply do this:
>
> vms_complete_munmap_vmas
> vms_clear_ptes
> remove_vma
> vma_mark_detached
> mmap_write_downgrade
>
> If my assumption is incorrect, assertion inside vma_mark_detached()
> should trigger. I tried a quick test and so far nothing exploded.
>
If they are write locked, then the page faults are not a concern. There
is also the rmap race that Jann found in mmap_region() [1]. This is
probably also fine since you are keeping the write lock in place earlier
on in the gather stage. Note the ptes will already be cleared by the
time vms_complete_munmap_vmas() is called in this case.
[1] https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/
Thanks,
Liam
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 19:38 ` Liam R. Howlett
@ 2024-12-18 20:00 ` Suren Baghdasaryan
2024-12-18 20:38 ` Liam R. Howlett
0 siblings, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 20:00 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, Peter Zijlstra, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 11:38 AM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241218 14:29]:
> > On Wed, Dec 18, 2024 at 11:07 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Dec 18, 2024 at 11:00 AM 'Liam R. Howlett' via kernel-team
> > > <kernel-team@android.com> wrote:
> > > >
> > > > * Suren Baghdasaryan <surenb@google.com> [241218 12:58]:
> > > > > On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > >
> > > > > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > > >
> > > > > > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > > > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > > > > > dropping mmap_lock_write, you should be good.
> > > > > > >
> > > > > > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > > > > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > > > > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > > > > > mmap_write_downgrade().
> > > > > >
> > > > > > Why ?!
> > > > > >
> > > > > > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > > > > > concurrency left at this point.
> > > > > >
> > > > > > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > > > > > which is *after* the vmas have been unhooked from the mm, you wait for
> > > > > > any concurrent user to go away.
> > > > > >
> > > > > > And since they're unhooked, there can't be any new users.
> > > > > >
> > > > > > So you're the one and only user left, and code is fine the way it is.
> > > > >
> > > > > Ok, let me make sure I understand this part of your proposal. From
> > > > > your earlier email:
> > > > >
> > > > > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
> > > > > vma_munmap_struct *vms,
> > > > > struct vm_area_struct *vma;
> > > > > struct mm_struct *mm;
> > > > >
> > > > > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > > > > + vma_start_write(next);
> > > > > + vma_mark_detached(next, true);
> > > > > + }
> > > > > +
> > > > > mm = current->mm;
> > > > > mm->map_count -= vms->vma_count;
> > > > > mm->locked_vm -= vms->locked_vm;
> > > > >
> > > > > This would mean:
> > > > >
> > > > > vms_complete_munmap_vmas
> > > > > vma_start_write
> > > > > vma_mark_detached
> > > > > mmap_write_downgrade
> > > > > vms_clear_ptes
> > > > > remove_vma
> > > > >
> > > > > And remove_vma will be just freeing the vmas. Is that correct?
> > > > > I'm a bit confused because the original thinking was that
> > > > > vma_mark_detached() would drop the last refcnt and if it's 0 we would
> > > > > free the vma right there. If that's still what we want to do then I
> > > > > think the above sequence should look like this:
> > > > >
> > > > > vms_complete_munmap_vmas
> > > > > vms_clear_ptes
> > > > > remove_vma
> > > > > vma_start_write
> > > > > vma_mark_detached
> > > > > mmap_write_downgrade
> > > > >
> > > > > because vma_start_write+vma_mark_detached should be done under mmap_write_lock.
> > > > > Please let me know which way you want to move forward.
> > > > >
> > > >
> > > > Are we sure we're not causing issues with the MAP_FIXED path here?
> > > >
> > > > With the above change, we'd be freeing the PTEs before marking the vmas
> > > > as detached or vma_start_write().
> > >
> > > IIUC when we call vms_complete_munmap_vmas() all vmas inside
> > > mas_detach have been already write-locked, no?
>
> That's the way it is today - but I thought you were moving the lock to
> the complete stage, not adding a new one? (why add a new one otherwise?)
Is my understanding correct that mas_detach is populated by
vms_gather_munmap_vmas() only with vmas that went through
__split_vma() (and were write-locked there)? I don't see any path that
would add any other vma into mas_detach but maybe I'm missing
something?
>
> >
> > Yeah, I think we can simply do this:
> >
> > vms_complete_munmap_vmas
> > vms_clear_ptes
> > remove_vma
> > vma_mark_detached
> > mmap_write_downgrade
> >
> > If my assumption is incorrect, assertion inside vma_mark_detached()
> > should trigger. I tried a quick test and so far nothing exploded.
> >
>
> If they are write locked, then the page faults are not a concern. There
> is also the rmap race that Jann found in mmap_region() [1]. This is
> probably also fine since you are keeping the write lock in place earlier
> on in the gather stage. Note the ptes will already be cleared by the
> time vms_complete_munmap_vmas() is called in this case.
>
> [1] https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/
>
> Thanks,
> Liam
>
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 20:00 ` Suren Baghdasaryan
@ 2024-12-18 20:38 ` Liam R. Howlett
2024-12-18 21:53 ` Suren Baghdasaryan
0 siblings, 1 reply; 74+ messages in thread
From: Liam R. Howlett @ 2024-12-18 20:38 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Peter Zijlstra, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Suren Baghdasaryan <surenb@google.com> [241218 15:01]:
> On Wed, Dec 18, 2024 at 11:38 AM 'Liam R. Howlett' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [241218 14:29]:
> > > On Wed, Dec 18, 2024 at 11:07 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Wed, Dec 18, 2024 at 11:00 AM 'Liam R. Howlett' via kernel-team
> > > > <kernel-team@android.com> wrote:
> > > > >
> > > > > * Suren Baghdasaryan <surenb@google.com> [241218 12:58]:
> > > > > > On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > > >
> > > > > > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > > > >
> > > > > > > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > > > > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > > > > > > dropping mmap_lock_write, you should be good.
> > > > > > > >
> > > > > > > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > > > > > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > > > > > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > > > > > > mmap_write_downgrade().
> > > > > > >
> > > > > > > Why ?!
> > > > > > >
> > > > > > > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > > > > > > concurrency left at this point.
> > > > > > >
> > > > > > > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > > > > > > which is *after* the vmas have been unhooked from the mm, you wait for
> > > > > > > any concurrent user to go away.
> > > > > > >
> > > > > > > And since they're unhooked, there can't be any new users.
> > > > > > >
> > > > > > > So you're the one and only user left, and code is fine the way it is.
> > > > > >
> > > > > > Ok, let me make sure I understand this part of your proposal. From
> > > > > > your earlier email:
> > > > > >
> > > > > > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
> > > > > > vma_munmap_struct *vms,
> > > > > > struct vm_area_struct *vma;
> > > > > > struct mm_struct *mm;
> > > > > >
> > > > > > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > > > > > + vma_start_write(vma);
> > > > > > + vma_mark_detached(vma, true);
> > > > > > + }
> > > > > > +
> > > > > > mm = current->mm;
> > > > > > mm->map_count -= vms->vma_count;
> > > > > > mm->locked_vm -= vms->locked_vm;
> > > > > >
> > > > > > This would mean:
> > > > > >
> > > > > > vms_complete_munmap_vmas
> > > > > > vma_start_write
> > > > > > vma_mark_detached
> > > > > > mmap_write_downgrade
> > > > > > vms_clear_ptes
> > > > > > remove_vma
> > > > > >
> > > > > > And remove_vma will be just freeing the vmas. Is that correct?
> > > > > > I'm a bit confused because the original thinking was that
> > > > > > vma_mark_detached() would drop the last refcnt and if it's 0 we would
> > > > > > free the vma right there. If that's still what we want to do then I
> > > > > > think the above sequence should look like this:
> > > > > >
> > > > > > vms_complete_munmap_vmas
> > > > > > vms_clear_ptes
> > > > > > remove_vma
> > > > > > vma_start_write
> > > > > > vma_mark_detached
> > > > > > mmap_write_downgrade
> > > > > >
> > > > > > because vma_start_write+vma_mark_detached should be done under mmap_write_lock.
> > > > > > Please let me know which way you want to move forward.
> > > > > >
> > > > >
> > > > > Are we sure we're not causing issues with the MAP_FIXED path here?
> > > > >
> > > > > With the above change, we'd be freeing the PTEs before marking the vmas
> > > > > as detached or vma_start_write().
> > > >
> > > > IIUC when we call vms_complete_munmap_vmas() all vmas inside
> > > > mas_detach have been already write-locked, no?
> >
> > That's the way it is today - but I thought you were moving the lock to
> > the complete stage, not adding a new one? (why add a new one otherwise?)
>
> Is my understanding correct that mas_detach is populated by
> vms_gather_munmap_vmas() only with vmas that went through
> __split_vma() (and were write-locked there)? I don't see any path that
> would add any other vma into mas_detach but maybe I'm missing
> something?
No, that is not correct.
vms_gather_munmap_vmas() calls split on the first vma, then adds all
vmas that are within the range of the munmap() call. Potentially
splitting the last vma and adding that in the
"if (next->vm_end > vms->end)" block.
Sometimes this is a single vma that gets split twice, sometimes no
splits happen and entire vmas are unmapped, sometimes it's just one vma
that isn't split.
My observation is the common case is a single vma, but besides that we
see 3, and sometimes 7 at a time, but it could be any number of vmas and
not all of them are split.
There is a loop for_each_vma_range() that does:
vma_start_write(next);
mas_set(mas_detach, vms->mas_count++);
mas_store_gfp(mas_detach, next, GFP_KERNEL);
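Roughly, the shape is the sketch below (error handling and accounting
elided, iterator/field spellings approximate -- the point is just that
every overlapping vma, split or not, gets write-locked before it is
stashed in mas_detach):

	for_each_vma_range(*(vms->vmi), next, vms->end) {
		/* covers every vma overlapping [vms->start, vms->end) */
		vma_start_write(next);
		mas_set(mas_detach, vms->mas_count++);
		mas_store_gfp(mas_detach, next, GFP_KERNEL);
	}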
>
> >
> > >
> > > Yeah, I think we can simply do this:
> > >
> > > vms_complete_munmap_vmas
> > > vms_clear_ptes
> > > remove_vma
> > > vma_mark_detached
> > > mmap_write_downgrade
> > >
> > > If my assumption is incorrect, assertion inside vma_mark_detached()
> > > should trigger. I tried a quick test and so far nothing exploded.
> > >
> >
> > If they are write locked, then the page faults are not a concern. There
> > is also the rmap race that Jann found in mmap_region() [1]. This is
> > probably also fine since you are keeping the write lock in place earlier
> > on in the gather stage. Note the ptes will already be cleared by the
> > time vms_complete_munmap_vmas() is called in this case.
> >
> > [1] https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/
> >
> > Thanks,
> > Liam
> >
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 20:38 ` Liam R. Howlett
@ 2024-12-18 21:53 ` Suren Baghdasaryan
2024-12-18 21:55 ` Suren Baghdasaryan
` (2 more replies)
0 siblings, 3 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 21:53 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, Peter Zijlstra, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 12:38 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241218 15:01]:
> > On Wed, Dec 18, 2024 at 11:38 AM 'Liam R. Howlett' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > * Suren Baghdasaryan <surenb@google.com> [241218 14:29]:
> > > > On Wed, Dec 18, 2024 at 11:07 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Wed, Dec 18, 2024 at 11:00 AM 'Liam R. Howlett' via kernel-team
> > > > > <kernel-team@android.com> wrote:
> > > > > >
> > > > > > * Suren Baghdasaryan <surenb@google.com> [241218 12:58]:
> > > > > > > On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > > > > >
> > > > > > > > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > > > > > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > > > > > > > dropping mmap_lock_write, you should be good.
> > > > > > > > >
> > > > > > > > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > > > > > > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > > > > > > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > > > > > > > mmap_write_downgrade().
> > > > > > > >
> > > > > > > > Why ?!
> > > > > > > >
> > > > > > > > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > > > > > > > concurrency left at this point.
> > > > > > > >
> > > > > > > > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > > > > > > > which is *after* the vmas have been unhooked from the mm, you wait for
> > > > > > > > any concurrent user to go away.
> > > > > > > >
> > > > > > > > And since they're unhooked, there can't be any new users.
> > > > > > > >
> > > > > > > > So you're the one and only user left, and code is fine the way it is.
> > > > > > >
> > > > > > > Ok, let me make sure I understand this part of your proposal. From
> > > > > > > your earlier email:
> > > > > > >
> > > > > > > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
> > > > > > > vma_munmap_struct *vms,
> > > > > > > struct vm_area_struct *vma;
> > > > > > > struct mm_struct *mm;
> > > > > > >
> > > > > > > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > > > > > > + vma_start_write(vma);
> > > > > > > + vma_mark_detached(vma, true);
> > > > > > > + }
> > > > > > > +
> > > > > > > mm = current->mm;
> > > > > > > mm->map_count -= vms->vma_count;
> > > > > > > mm->locked_vm -= vms->locked_vm;
> > > > > > >
> > > > > > > This would mean:
> > > > > > >
> > > > > > > vms_complete_munmap_vmas
> > > > > > > vma_start_write
> > > > > > > vma_mark_detached
> > > > > > > mmap_write_downgrade
> > > > > > > vms_clear_ptes
> > > > > > > remove_vma
> > > > > > >
> > > > > > > And remove_vma will be just freeing the vmas. Is that correct?
> > > > > > > I'm a bit confused because the original thinking was that
> > > > > > > vma_mark_detached() would drop the last refcnt and if it's 0 we would
> > > > > > > free the vma right there. If that's still what we want to do then I
> > > > > > > think the above sequence should look like this:
> > > > > > >
> > > > > > > vms_complete_munmap_vmas
> > > > > > > vms_clear_ptes
> > > > > > > remove_vma
> > > > > > > vma_start_write
> > > > > > > vma_mark_detached
> > > > > > > mmap_write_downgrade
> > > > > > >
> > > > > > > because vma_start_write+vma_mark_detached should be done under mmap_write_lock.
> > > > > > > Please let me know which way you want to move forward.
> > > > > > >
> > > > > >
> > > > > > Are we sure we're not causing issues with the MAP_FIXED path here?
> > > > > >
> > > > > > With the above change, we'd be freeing the PTEs before marking the vmas
> > > > > > as detached or vma_start_write().
> > > > >
> > > > > IIUC when we call vms_complete_munmap_vmas() all vmas inside
> > > > > mas_detach have been already write-locked, no?
> > >
> > > That's the way it is today - but I thought you were moving the lock to
> > > the complete stage, not adding a new one? (why add a new one otherwise?)
> >
> > Is my understanding correct that mas_detach is populated by
> > vms_gather_munmap_vmas() only with vmas that went through
> > __split_vma() (and were write-locked there)? I don't see any path that
> > would add any other vma into mas_detach but maybe I'm missing
> > something?
>
> No, that is not correct.
>
> vms_gather_munmap_vmas() calls split on the first vma, then adds all
> vmas that are within the range of the munmap() call. Potentially
> splitting the last vma and adding that in the
> "if (next->vm_end > vms->end)" block.
>
> Sometimes this is a single vma that gets split twice, sometimes no
> splits happen and entire vmas are unmapped, sometimes it's just one vma
> that isn't split.
>
> My observation is the common case is a single vma, but besides that we
> see 3, and sometimes 7 at a time, but it could be any number of vmas and
> not all of them are split.
>
> There is a loop for_each_vma_range() that does:
>
> vma_start_write(next);
> mas_set(mas_detach, vms->mas_count++);
> mas_store_gfp(mas_detach, next, GFP_KERNEL);
Ah, ok I see now. I completely misunderstood what for_each_vma_range()
was doing.
Then I think vma_start_write() should remain inside
vms_gather_munmap_vmas() and all vmas in mas_detach should be
write-locked, even the ones we are not modifying. Otherwise what would
prevent the race I mentioned before?
__mmap_region
  __mmap_prepare
    vms_gather_munmap_vmas        // adds vmas to be unmapped into mas_detach,
                                  // some locked by __split_vma(), some not locked

                                      lock_vma_under_rcu()
                                        vma = mas_walk       // finds unlocked vma also in mas_detach
                                        vma_start_read(vma)  // succeeds since vma is not locked
                                        // vma->detached, vm_start, vm_end checks pass
                                        // vma is successfully read-locked

    vms_clean_up_area(mas_detach)
      vms_clear_ptes
                                        // steps on a cleared PTE
    __mmap_new_vma
      vma_set_range               // installs new vma in the range
  __mmap_complete
    vms_complete_munmap_vmas      // vmas are write-locked and detached, but it's too late
>
>
> >
> > >
> > > >
> > > > Yeah, I think we can simply do this:
> > > >
> > > > vms_complete_munmap_vmas
> > > > vms_clear_ptes
> > > > remove_vma
> > > > vma_mark_detached
> > > > mmap_write_downgrade
> > > >
> > > > If my assumption is incorrect, assertion inside vma_mark_detached()
> > > > should trigger. I tried a quick test and so far nothing exploded.
> > > >
> > >
> > > If they are write locked, then the page faults are not a concern. There
> > > is also the rmap race that Jann found in mmap_region() [1]. This is
> > > probably also fine since you are keeping the write lock in place earlier
> > > on in the gather stage. Note the ptes will already be cleared by the
> > > time vms_complete_munmap_vmas() is called in this case.
> > >
> > > [1] https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/
> > >
> > > Thanks,
> > > Liam
> > >
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 21:53 ` Suren Baghdasaryan
@ 2024-12-18 21:55 ` Suren Baghdasaryan
2024-12-19 0:35 ` Andrew Morton
2024-12-19 9:13 ` Peter Zijlstra
2 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-18 21:55 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, Peter Zijlstra, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 1:53 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Dec 18, 2024 at 12:38 PM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [241218 15:01]:
> > > On Wed, Dec 18, 2024 at 11:38 AM 'Liam R. Howlett' via kernel-team
> > > <kernel-team@android.com> wrote:
> > > >
> > > > * Suren Baghdasaryan <surenb@google.com> [241218 14:29]:
> > > > > On Wed, Dec 18, 2024 at 11:07 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Wed, Dec 18, 2024 at 11:00 AM 'Liam R. Howlett' via kernel-team
> > > > > > <kernel-team@android.com> wrote:
> > > > > > >
> > > > > > > * Suren Baghdasaryan <surenb@google.com> [241218 12:58]:
> > > > > > > > On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > > > > > >
> > > > > > > > > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > > > > > > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > > > > > > > > dropping mmap_lock_write, you should be good.
> > > > > > > > > >
> > > > > > > > > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > > > > > > > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > > > > > > > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > > > > > > > > mmap_write_downgrade().
> > > > > > > > >
> > > > > > > > > Why ?!
> > > > > > > > >
> > > > > > > > > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > > > > > > > > concurrency left at this point.
> > > > > > > > >
> > > > > > > > > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > > > > > > > > which is *after* the vmas have been unhooked from the mm, you wait for
> > > > > > > > > any concurrent user to go away.
> > > > > > > > >
> > > > > > > > > And since they're unhooked, there can't be any new users.
> > > > > > > > >
> > > > > > > > > So you're the one and only user left, and code is fine the way it is.
> > > > > > > >
> > > > > > > > Ok, let me make sure I understand this part of your proposal. From
> > > > > > > > your earlier email:
> > > > > > > >
> > > > > > > > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
> > > > > > > > vma_munmap_struct *vms,
> > > > > > > > struct vm_area_struct *vma;
> > > > > > > > struct mm_struct *mm;
> > > > > > > >
> > > > > > > > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > > > > > > > + vma_start_write(vma);
> > > > > > > > + vma_mark_detached(vma, true);
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > mm = current->mm;
> > > > > > > > mm->map_count -= vms->vma_count;
> > > > > > > > mm->locked_vm -= vms->locked_vm;
> > > > > > > >
> > > > > > > > This would mean:
> > > > > > > >
> > > > > > > > vms_complete_munmap_vmas
> > > > > > > > vma_start_write
> > > > > > > > vma_mark_detached
> > > > > > > > mmap_write_downgrade
> > > > > > > > vms_clear_ptes
> > > > > > > > remove_vma
> > > > > > > >
> > > > > > > > And remove_vma will be just freeing the vmas. Is that correct?
> > > > > > > > I'm a bit confused because the original thinking was that
> > > > > > > > vma_mark_detached() would drop the last refcnt and if it's 0 we would
> > > > > > > > free the vma right there. If that's still what we want to do then I
> > > > > > > > think the above sequence should look like this:
> > > > > > > >
> > > > > > > > vms_complete_munmap_vmas
> > > > > > > > vms_clear_ptes
> > > > > > > > remove_vma
> > > > > > > > vma_start_write
> > > > > > > > vma_mark_detached
> > > > > > > > mmap_write_downgrade
> > > > > > > >
> > > > > > > > because vma_start_write+vma_mark_detached should be done under mmap_write_lock.
> > > > > > > > Please let me know which way you want to move forward.
> > > > > > > >
> > > > > > >
> > > > > > > Are we sure we're not causing issues with the MAP_FIXED path here?
> > > > > > >
> > > > > > > With the above change, we'd be freeing the PTEs before marking the vmas
> > > > > > > as detached or vma_start_write().
> > > > > >
> > > > > > IIUC when we call vms_complete_munmap_vmas() all vmas inside
> > > > > > mas_detach have been already write-locked, no?
> > > >
> > > > That's the way it is today - but I thought you were moving the lock to
> > > > the complete stage, not adding a new one? (why add a new one otherwise?)
> > >
> > > Is my understanding correct that mas_detach is populated by
> > > vms_gather_munmap_vmas() only with vmas that went through
> > > __split_vma() (and were write-locked there)? I don't see any path that
> > > would add any other vma into mas_detach but maybe I'm missing
> > > something?
> >
> > No, that is not correct.
> >
> > vms_gather_munmap_vmas() calls split on the first vma, then adds all
> > vmas that are within the range of the munmap() call. Potentially
> > splitting the last vma and adding that in the
> > "if (next->vm_end > vms->end)" block.
> >
> > Sometimes this is a single vma that gets split twice, sometimes no
> > splits happen and entire vmas are unmapped, sometimes it's just one vma
> > that isn't split.
> >
> > My observation is the common case is a single vma, but besides that we
> > see 3, and sometimes 7 at a time, but it could be any number of vmas and
> > not all of them are split.
> >
> > There is a loop for_each_vma_range() that does:
> >
> > vma_start_write(next);
> > mas_set(mas_detach, vms->mas_count++);
> > mas_store_gfp(mas_detach, next, GFP_KERNEL);
>
> Ah, ok I see now. I completely misunderstood what for_each_vma_range()
> was doing.
>
> Then I think vma_start_write() should remain inside
> vms_gather_munmap_vmas() and all vmas in mas_detach should be
> write-locked, even the ones we are not modifying. Otherwise what would
> prevent the race I mentioned before?
>
> __mmap_region
> __mmap_prepare
> vms_gather_munmap_vmas // adds vmas to be unmapped into mas_detach,
> // some locked
> by __split_vma(), some not locked
>
> lock_vma_under_rcu()
> vma = mas_walk // finds
> unlocked vma also in mas_detach
> vma_start_read(vma) //
> succeeds since vma is not locked
> // vma->detached, vm_start,
> vm_end checks pass
> // vma is successfully read-locked
>
> vms_clean_up_area(mas_detach)
> vms_clear_ptes
> // steps on a cleared PTE
> __mmap_new_vma
> vma_set_range // installs new vma in the range
> __mmap_complete
> vms_complete_munmap_vmas // vmas are write-locked and detached
> but it's too late
Sorry about the formatting. Without the comments it should look better:
__mmap_region
__mmap_prepare
vms_gather_munmap_vmas
lock_vma_under_rcu()
vma = mas_walk
vma_start_read(vma)
// vma is still valid and attached
vms_clean_up_area(mas_detach)
vms_clear_ptes
// steps on a cleared PTE
__mmap_new_vma
vma_set_range
__mmap_complete
vms_complete_munmap_vmas
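For reference, the reader side that wins this race is
lock_vma_under_rcu(); stripped down (hugetlb handling and the event
counters omitted, so treat it as a sketch rather than the exact code),
its checks are:

	rcu_read_lock();
	vma = mas_walk(&mas);			/* still finds the vma in the tree */
	if (vma && vma_start_read(vma)) {	/* succeeds: nobody write-locked it */
		if (!vma->detached &&
		    address >= vma->vm_start && address < vma->vm_end) {
			rcu_read_unlock();
			return vma;		/* read-locked vma that is about to be unmapped */
		}
		vma_end_read(vma);
	}
	rcu_read_unlock();
	return NULL;				/* fall back to the mmap_lock path */

so an attached, unlocked vma sails through every check.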
>
>
>
> >
> >
> > >
> > > >
> > > > >
> > > > > Yeah, I think we can simply do this:
> > > > >
> > > > > vms_complete_munmap_vmas
> > > > > vms_clear_ptes
> > > > > remove_vma
> > > > > vma_mark_detached
> > > > > mmap_write_downgrade
> > > > >
> > > > > If my assumption is incorrect, assertion inside vma_mark_detached()
> > > > > should trigger. I tried a quick test and so far nothing exploded.
> > > > >
> > > >
> > > > If they are write locked, then the page faults are not a concern. There
> > > > is also the rmap race that Jann found in mmap_region() [1]. This is
> > > > probably also fine since you are keeping the write lock in place earlier
> > > > on in the gather stage. Note the ptes will already be cleared by the
> > > > time vms_complete_munmap_vmas() is called in this case.
> > > >
> > > > [1] https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/
> > > >
> > > > Thanks,
> > > > Liam
> > > >
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 21:53 ` Suren Baghdasaryan
2024-12-18 21:55 ` Suren Baghdasaryan
@ 2024-12-19 0:35 ` Andrew Morton
2024-12-19 0:47 ` Suren Baghdasaryan
2024-12-19 9:13 ` Peter Zijlstra
2 siblings, 1 reply; 74+ messages in thread
From: Andrew Morton @ 2024-12-19 0:35 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, Peter Zijlstra, willy, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Wed, 18 Dec 2024 13:53:17 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> > There is a loop for_each_vma_range() that does:
> >
> > vma_start_write(next);
> > mas_set(mas_detach, vms->mas_count++);
> > mas_store_gfp(mas_detach, next, GFP_KERNEL);
>
> Ah, ok I see now. I completely misunderstood what for_each_vma_range()
> was doing.
I'll drop the v6 series from mm-unstable.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 0:35 ` Andrew Morton
@ 2024-12-19 0:47 ` Suren Baghdasaryan
0 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-19 0:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R. Howlett, Peter Zijlstra, willy, lorenzo.stoakes, mhocko,
vbabka, hannes, mjguzik, oliver.sang, mgorman, david, peterx,
oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 4:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 18 Dec 2024 13:53:17 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > There is a loop for_each_vma_range() that does:
> > >
> > > vma_start_write(next);
> > > mas_set(mas_detach, vms->mas_count++);
> > > mas_store_gfp(mas_detach, next, GFP_KERNEL);
> >
> > Ah, ok I see now. I completely misunderstood what for_each_vma_range()
> > was doing.
>
> I'll drop the v6 series from mm-unstable.
Sounds good. v7 will be quite different.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 17:58 ` Suren Baghdasaryan
2024-12-18 19:00 ` Liam R. Howlett
@ 2024-12-19 8:53 ` Peter Zijlstra
2024-12-19 16:08 ` Suren Baghdasaryan
1 sibling, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-19 8:53 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 09:58:12AM -0800, Suren Baghdasaryan wrote:
> And remove_vma will be just freeing the vmas. Is that correct?
Yep.
> I'm a bit confused because the original thinking was that
> vma_mark_detached() would drop the last refcnt and if it's 0 we would
> free the vma right there. If that's still what we want to do then I
> think the above sequence should look like this:
Right; sorry about that. So my initial objection to that extra sync was
based on the reasons presented -- but having had to look at the unmap
path again (my mm-foo is somewhat rusty, I've not done much the past few
years) I realized that keeping a VMA alive beyond unmapping PTEs is just
plain daft.
So yes, back to your original semantics, but cleaned up to not need that
extra sync point -- instead relying on the natural placement of
vma_start_write() after unhooking from the mm. And not for reasons of
the race, but for reasons of integrity -- VMA without PTEs is asking for
more trouble.
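(For concreteness, "your original semantics" here amounts to roughly the
fragment below -- vm_refcnt being the counter this series introduces;
the helper spellings are illustrative only, not the final patch:

	vma_start_write(vma);				/* wait out any remaining readers */
	if (refcount_dec_and_test(&vma->vm_refcnt))	/* detaching drops the last reference */
		vm_area_free(vma);			/* ... so the vma is freed right here */

i.e. the vma is gone the moment it is marked detached, so the detach has
to come after the PTEs have been cleared.)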
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 19:29 ` Suren Baghdasaryan
2024-12-18 19:38 ` Liam R. Howlett
@ 2024-12-19 8:55 ` Peter Zijlstra
2024-12-19 16:08 ` Suren Baghdasaryan
1 sibling, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-19 8:55 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 11:29:23AM -0800, Suren Baghdasaryan wrote:
> Yeah, I think we can simply do this:
>
> vms_complete_munmap_vmas
> vms_clear_ptes
> remove_vma
> vma_mark_detached
> mmap_write_downgrade
>
> If my assumption is incorrect, assertion inside vma_mark_detached()
> should trigger. I tried a quick test and so far nothing exploded.
I think that would be unfortunate and could cause regressions. I think
we want to keep vms_clear_ptes() under the read-lock.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-18 21:53 ` Suren Baghdasaryan
2024-12-18 21:55 ` Suren Baghdasaryan
2024-12-19 0:35 ` Andrew Morton
@ 2024-12-19 9:13 ` Peter Zijlstra
2024-12-19 11:20 ` Peter Zijlstra
2024-12-19 16:14 ` Suren Baghdasaryan
2 siblings, 2 replies; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-19 9:13 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Wed, Dec 18, 2024 at 01:53:17PM -0800, Suren Baghdasaryan wrote:
> Ah, ok I see now. I completely misunderstood what for_each_vma_range()
> was doing.
>
> Then I think vma_start_write() should remain inside
> vms_gather_munmap_vmas() and all vmas in mas_detach should be
No, it must not. You really are not modifying anything yet (except for the
splits, which, as we've already noted, write-lock themselves).
> write-locked, even the ones we are not modifying. Otherwise what would
> prevent the race I mentioned before?
>
> __mmap_region
> __mmap_prepare
> vms_gather_munmap_vmas // adds vmas to be unmapped into mas_detach,
> // some locked
> by __split_vma(), some not locked
>
> lock_vma_under_rcu()
> vma = mas_walk // finds
> unlocked vma also in mas_detach
> vma_start_read(vma) //
> succeeds since vma is not locked
> // vma->detached, vm_start,
> vm_end checks pass
> // vma is successfully read-locked
>
> vms_clean_up_area(mas_detach)
> vms_clear_ptes
> // steps on a cleared PTE
So here we have the added complexity that the vma is not unhooked at
all. Is there anything that would prevent a concurrent gup_fast() from
doing the same -- touch a cleared PTE?
AFAICT two threads, one doing overlapping mmap() and the other doing
gup_fast() can result in exactly this scenario.
If we don't care about the GUP case, then I'm thinking we should not
care about the lockless RCU case either.
> __mmap_new_vma
> vma_set_range // installs new vma in the range
> __mmap_complete
> vms_complete_munmap_vmas // vmas are write-locked and detached
> but it's too late
But at this point that old vma really is unhooked, and the
vma_start_write() here will ensure readers are gone and it will clear
PTEs *again*.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 9:13 ` Peter Zijlstra
@ 2024-12-19 11:20 ` Peter Zijlstra
2024-12-19 16:17 ` Suren Baghdasaryan
2024-12-19 17:16 ` Liam R. Howlett
2024-12-19 16:14 ` Suren Baghdasaryan
1 sibling, 2 replies; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-19 11:20 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 10:13:34AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 18, 2024 at 01:53:17PM -0800, Suren Baghdasaryan wrote:
>
> > Ah, ok I see now. I completely misunderstood what for_each_vma_range()
> > was doing.
> >
> > Then I think vma_start_write() should remain inside
> > vms_gather_munmap_vmas() and all vmas in mas_detach should be
>
> No, it must not. You really are not modifying anything yet (except the
> split, which we've already noted mark write themselves).
>
> > write-locked, even the ones we are not modifying. Otherwise what would
> > prevent the race I mentioned before?
> >
> > __mmap_region
> > __mmap_prepare
> > vms_gather_munmap_vmas // adds vmas to be unmapped into mas_detach,
> > // some locked
> > by __split_vma(), some not locked
> >
> > lock_vma_under_rcu()
> > vma = mas_walk // finds
> > unlocked vma also in mas_detach
> > vma_start_read(vma) //
> > succeeds since vma is not locked
> > // vma->detached, vm_start,
> > vm_end checks pass
> > // vma is successfully read-locked
> >
> > vms_clean_up_area(mas_detach)
> > vms_clear_ptes
> > // steps on a cleared PTE
>
> So here we have the added complexity that the vma is not unhooked at
> all. Is there anything that would prevent a concurrent gup_fast() from
> doing the same -- touch a cleared PTE?
>
> AFAICT two threads, one doing overlapping mmap() and the other doing
> gup_fast() can result in exactly this scenario.
>
> If we don't care about the GUP case, when I'm thinking we should not
> care about the lockless RCU case either.
Also, at this point we'll just fail to find a page, and that is nothing
special. The problem with accessing an unmapped VMA is that the
page-table walk will instantiate page-tables.
Given this is an overlapping mmap -- we're going to need those
page-tables anyway, so no harm done.
Only after the VMA is unlinked must we ensure we don't accidentally
re-instantiate page-tables.
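(The "instantiate" part is just the usual top-down allocation the
generic fault path does before it can install a pte -- roughly the
shape of __handle_mm_fault()'s descent, simplified:

	p4d = p4d_alloc(mm, pgd, address);	/* allocates the level if it is missing */
	pud = pud_alloc(mm, p4d, address);
	pmd = pmd_alloc(mm, pud, address);
	/* ... and then handle_pte_fault() fills in the pte */

Harmless while the range is still mapped; a problem once the vma is
unlinked and its page-tables are being torn down.)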
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 8:53 ` Peter Zijlstra
@ 2024-12-19 16:08 ` Suren Baghdasaryan
0 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-19 16:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 12:53 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Dec 18, 2024 at 09:58:12AM -0800, Suren Baghdasaryan wrote:
>
> > And remove_vma will be just freeing the vmas. Is that correct?
>
> Yep.
>
> > I'm a bit confused because the original thinking was that
> > vma_mark_detached() would drop the last refcnt and if it's 0 we would
> > free the vma right there. If that's still what we want to do then I
> > think the above sequence should look like this:
>
> Right; sorry about that. So my initial objection to that extra sync was
> based on the reasons presented -- but having had to look at the unmap
> path again (my mm-foo is somewhat rusty, I've not done much the past few
> years) I realized that keeping a VMA alive beyond unmapping PTEs is just
> plain daft.
>
> So yes, back to your original semantics, but cleaned up to not need that
> extra sync point -- instead relying on the natural placement of
> vma_start_write() after unhooking from the mm. And not for reasons of
> the race, but for reasons of integrity -- VMA without PTEs is asking for
> more trouble.
Ack. Thanks for clarification!
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 8:55 ` Peter Zijlstra
@ 2024-12-19 16:08 ` Suren Baghdasaryan
0 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-19 16:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 12:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Dec 18, 2024 at 11:29:23AM -0800, Suren Baghdasaryan wrote:
>
> > Yeah, I think we can simply do this:
> >
> > vms_complete_munmap_vmas
> > vms_clear_ptes
> > remove_vma
> > vma_mark_detached
> > mmap_write_downgrade
> >
> > If my assumption is incorrect, assertion inside vma_mark_detached()
> > should trigger. I tried a quick test and so far nothing exploded.
>
> I think that would be unfortunate and could cause regressions. I think
> we want to keep vms_clear_ptes() under the read-lock.
Ok, I'll stop considering this option.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 9:13 ` Peter Zijlstra
2024-12-19 11:20 ` Peter Zijlstra
@ 2024-12-19 16:14 ` Suren Baghdasaryan
2024-12-19 17:23 ` Peter Zijlstra
1 sibling, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-19 16:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 1:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Dec 18, 2024 at 01:53:17PM -0800, Suren Baghdasaryan wrote:
>
> > Ah, ok I see now. I completely misunderstood what for_each_vma_range()
> > was doing.
> >
> > Then I think vma_start_write() should remain inside
> > vms_gather_munmap_vmas() and all vmas in mas_detach should be
>
> No, it must not. You really are not modifying anything yet (except the
> split, which we've already noted mark write themselves).
>
> > write-locked, even the ones we are not modifying. Otherwise what would
> > prevent the race I mentioned before?
> >
> > __mmap_region
> > __mmap_prepare
> > vms_gather_munmap_vmas // adds vmas to be unmapped into mas_detach,
> > // some locked
> > by __split_vma(), some not locked
> >
> > lock_vma_under_rcu()
> > vma = mas_walk // finds
> > unlocked vma also in mas_detach
> > vma_start_read(vma) //
> > succeeds since vma is not locked
> > // vma->detached, vm_start,
> > vm_end checks pass
> > // vma is successfully read-locked
> >
> > vms_clean_up_area(mas_detach)
> > vms_clear_ptes
> > // steps on a cleared PTE
>
> So here we have the added complexity that the vma is not unhooked at
> all. Is there anything that would prevent a concurrent gup_fast() from
> doing the same -- touch a cleared PTE?
>
> AFAICT two threads, one doing overlapping mmap() and the other doing
> gup_fast() can result in exactly this scenario.
>
> If we don't care about the GUP case, when I'm thinking we should not
> care about the lockless RCU case either.
>
> > __mmap_new_vma
> > vma_set_range // installs new vma in the range
> > __mmap_complete
> > vms_complete_munmap_vmas // vmas are write-locked and detached
> > but it's too late
>
> But at this point that old vma really is unhooked, and the
> vma_write_start() here will ensure readers are gone and it will clear
> PTEs *again*.
So, to summarize, you want vma_start_write() and vma_mark_detached()
to be done when we are removing the vma from the tree, right?
Something like:
vma_start_write()
vma_iter_store()
vma_mark_detached()
And the race I described is not a real problem since the vma is still
in the tree, so gup_fast() does exactly that and will simply reinstall
the ptes.
>
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 11:20 ` Peter Zijlstra
@ 2024-12-19 16:17 ` Suren Baghdasaryan
2024-12-19 17:16 ` Liam R. Howlett
1 sibling, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-19 16:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 3:20 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 19, 2024 at 10:13:34AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 18, 2024 at 01:53:17PM -0800, Suren Baghdasaryan wrote:
> >
> > > Ah, ok I see now. I completely misunderstood what for_each_vma_range()
> > > was doing.
> > >
> > > Then I think vma_start_write() should remain inside
> > > vms_gather_munmap_vmas() and all vmas in mas_detach should be
> >
> > No, it must not. You really are not modifying anything yet (except the
> > split, which we've already noted mark write themselves).
> >
> > > write-locked, even the ones we are not modifying. Otherwise what would
> > > prevent the race I mentioned before?
> > >
> > > __mmap_region
> > > __mmap_prepare
> > > vms_gather_munmap_vmas // adds vmas to be unmapped into mas_detach,
> > > // some locked
> > > by __split_vma(), some not locked
> > >
> > > lock_vma_under_rcu()
> > > vma = mas_walk // finds
> > > unlocked vma also in mas_detach
> > > vma_start_read(vma) //
> > > succeeds since vma is not locked
> > > // vma->detached, vm_start,
> > > vm_end checks pass
> > > // vma is successfully read-locked
> > >
> > > vms_clean_up_area(mas_detach)
> > > vms_clear_ptes
> > > // steps on a cleared PTE
> >
> > So here we have the added complexity that the vma is not unhooked at
> > all. Is there anything that would prevent a concurrent gup_fast() from
> > doing the same -- touch a cleared PTE?
> >
> > AFAICT two threads, one doing overlapping mmap() and the other doing
> > gup_fast() can result in exactly this scenario.
> >
> > If we don't care about the GUP case, when I'm thinking we should not
> > care about the lockless RCU case either.
>
> Also, at this point we'll just fail to find a page, and that is nothing
> special. The problem with accessing an unmapped VMA is that the
> page-table walk will instantiate page-tables.
>
> Given this is an overlapping mmap -- we're going to need to those
> page-tables anyway, so no harm done.
>
> Only after the VMA is unlinked must we ensure we don't accidentally
> re-instantiate page-tables.
Got it. I'll need some time to digest all the input but I think I
understand more or less the overall direction. Thanks, Peter!
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 11:20 ` Peter Zijlstra
2024-12-19 16:17 ` Suren Baghdasaryan
@ 2024-12-19 17:16 ` Liam R. Howlett
2024-12-19 17:42 ` Peter Zijlstra
1 sibling, 1 reply; 74+ messages in thread
From: Liam R. Howlett @ 2024-12-19 17:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Suren Baghdasaryan, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Peter Zijlstra <peterz@infradead.org> [241219 06:20]:
> On Thu, Dec 19, 2024 at 10:13:34AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 18, 2024 at 01:53:17PM -0800, Suren Baghdasaryan wrote:
> >
> > > Ah, ok I see now. I completely misunderstood what for_each_vma_range()
> > > was doing.
> > >
> > > Then I think vma_start_write() should remain inside
> > > vms_gather_munmap_vmas() and all vmas in mas_detach should be
> >
> > No, it must not. You really are not modifying anything yet (except the
> > split, which we've already noted mark write themselves).
> >
> > > write-locked, even the ones we are not modifying. Otherwise what would
> > > prevent the race I mentioned before?
> > >
> > > __mmap_region
> > > __mmap_prepare
> > > vms_gather_munmap_vmas // adds vmas to be unmapped into mas_detach,
> > > // some locked
> > > by __split_vma(), some not locked
> > >
> > > lock_vma_under_rcu()
> > > vma = mas_walk // finds
> > > unlocked vma also in mas_detach
> > > vma_start_read(vma) //
> > > succeeds since vma is not locked
> > > // vma->detached, vm_start,
> > > vm_end checks pass
> > > // vma is successfully read-locked
> > >
> > > vms_clean_up_area(mas_detach)
> > > vms_clear_ptes
> > > // steps on a cleared PTE
> >
> > So here we have the added complexity that the vma is not unhooked at
> > all.
Well, hold on - it is taken out of the rmap/anon vma chain here. It is
completely unhooked except for the vma tree at this point. We're not adding
complexity; we're dealing with it.
>Is there anything that would prevent a concurrent gup_fast() from
> > doing the same -- touch a cleared PTE?
Where does gup_fast() install PTEs? Doesn't it bail once a READ_ONCE()
on any level returns no PTE?
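(That is, as far as I can tell the lockless walk never allocates
anything; each level is just a READ_ONCE() and a bail-out, something
like:

	pmd_t pmd = READ_ONCE(*pmdp);

	if (pmd_none(pmd))
		return 0;	/* nothing to pin, caller falls back to the slow path */

so the worst case is that gup_fast() simply comes back empty-handed.)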
> >
> > AFAICT two threads, one doing overlapping mmap() and the other doing
> > gup_fast() can result in exactly this scenario.
The mmap() call will race with the gup_fast(), but either the nr_pinned
will be returned from gup_fast() before vms_clean_up_area() removes the
page table (or any higher level), or gup_fast() will find nothing.
> >
> > If we don't care about the GUP case, when I'm thinking we should not
> > care about the lockless RCU case either.
>
> Also, at this point we'll just fail to find a page, and that is nothing
> special. The problem with accessing an unmapped VMA is that the
> page-table walk will instantiate page-tables.
I think there is a problem if we are reinstalling page tables on a vma
that's about to be removed. I think we are avoiding this with our
locking though?
>
> Given this is an overlapping mmap -- we're going to need to those
> page-tables anyway, so no harm done.
Well, maybe? The mapping may now be an anon vma vs a file-backed one, or
maybe it's PROT_NONE?
> Only after the VMA is unlinked must we ensure we don't accidentally
> re-instantiate page-tables.
It's not as simple as that, unfortunately. There are vma callbacks for
drivers (or hugetlbfs, or whatever) that do other things. So we need to
clean up the area before we are able to replace the vma, and part of that
cleanup is the page tables, the anon vma chain, and/or closing a file.
There are other ways of finding the vma as well, besides the vma tree.
We are following the locking so that we are safe from those perspectives
as well, and so the vma has to be unlinked in a few places in a certain
order.
Thanks,
Liam
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 16:14 ` Suren Baghdasaryan
@ 2024-12-19 17:23 ` Peter Zijlstra
0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-19 17:23 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 08:14:24AM -0800, Suren Baghdasaryan wrote:
> On Thu, Dec 19, 2024 at 1:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Dec 18, 2024 at 01:53:17PM -0800, Suren Baghdasaryan wrote:
> >
> > > Ah, ok I see now. I completely misunderstood what for_each_vma_range()
> > > was doing.
> > >
> > > Then I think vma_start_write() should remain inside
> > > vms_gather_munmap_vmas() and all vmas in mas_detach should be
> >
> > No, it must not. You really are not modifying anything yet (except the
> > split, which we've already noted mark write themselves).
> >
> > > write-locked, even the ones we are not modifying. Otherwise what would
> > > prevent the race I mentioned before?
> > >
> > > __mmap_region
> > > __mmap_prepare
> > > vms_gather_munmap_vmas // adds vmas to be unmapped into mas_detach,
> > > // some locked
> > > by __split_vma(), some not locked
> > >
> > > lock_vma_under_rcu()
> > > vma = mas_walk // finds
> > > unlocked vma also in mas_detach
> > > vma_start_read(vma) //
> > > succeeds since vma is not locked
> > > // vma->detached, vm_start,
> > > vm_end checks pass
> > > // vma is successfully read-locked
> > >
> > > vms_clean_up_area(mas_detach)
> > > vms_clear_ptes
> > > // steps on a cleared PTE
> >
> > So here we have the added complexity that the vma is not unhooked at
> > all. Is there anything that would prevent a concurrent gup_fast() from
> > doing the same -- touch a cleared PTE?
> >
> > AFAICT two threads, one doing overlapping mmap() and the other doing
> > gup_fast() can result in exactly this scenario.
> >
> > If we don't care about the GUP case, when I'm thinking we should not
> > care about the lockless RCU case either.
> >
> > > __mmap_new_vma
> > > vma_set_range // installs new vma in the range
> > > __mmap_complete
> > > vms_complete_munmap_vmas // vmas are write-locked and detached
> > > but it's too late
> >
> > But at this point that old vma really is unhooked, and the
> > vma_write_start() here will ensure readers are gone and it will clear
> > PTEs *again*.
>
> So, to summarize, you want vma_start_write() and vma_mark_detached()
> to be done when we are removing the vma from the tree, right?
*after*
> Something like:
vma_iter_store()
vma_start_write()
vma_mark_detached()
By having vma_start_write() after being unlinked you get the guarantee
of no concurrency. New lookups cannot find you (because of that
vma_iter_store()) and existing readers will be waited for.
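Spelled out, with the guarantee each step buys as a comment (a sketch of
the ordering only; 'old_vma' just names the vma being replaced/removed):

	vma_iter_store(...);		/* unlink/replace in the tree: new lookups can't find old_vma */
	vma_start_write(old_vma);	/* waits until every existing vma_start_read() holder is gone */
	vma_mark_detached(old_vma);	/* stragglers still holding a pointer now see it as detached */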
> And the race I described is not a real problem since the vma is still
> in the tree, so gup_fast() does exactly that and will simply reinstall
> the ptes.
Just so.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 17:16 ` Liam R. Howlett
@ 2024-12-19 17:42 ` Peter Zijlstra
2024-12-19 18:18 ` Liam R. Howlett
0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-19 17:42 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, akpm, willy, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 12:16:45PM -0500, Liam R. Howlett wrote:
> Well, hold on - it is taken out of the rmap/anon vma chain here. It is
> completely unhooked except the vma tree at this point. We're not adding
> complexity, we're dealing with it.
So I'm not entirely sure I understand the details here -- this is again
about being able to do rollback when things fail?
There is a comment above the vms_clean_up_area() call in __mmap_prepare(),
but it's not making sense atm.
> >Is there anything that would prevent a concurrent gup_fast() from
> > > doing the same -- touch a cleared PTE?
>
> Where does gup_fast() install PTEs? Doesn't it bail once a READ_ONCE()
> on any level returns no PTE?
I think you're right, GUP doesn't, but any 'normal' page-table walker
will.
> > > AFAICT two threads, one doing overlapping mmap() and the other doing
> > > gup_fast() can result in exactly this scenario.
>
> The mmap() call will race with the gup_fast(), but either the nr_pinned
> will be returned from gup_fast() before vms_clean_up_area() removes the
> page table (or any higher level), or gup_fast() will find nothing.
Agreed.
> > > If we don't care about the GUP case, when I'm thinking we should not
> > > care about the lockless RCU case either.
> >
> > Also, at this point we'll just fail to find a page, and that is nothing
> > special. The problem with accessing an unmapped VMA is that the
> > page-table walk will instantiate page-tables.
>
> I think there is a problem if we are reinstalling page tables on a vma
> that's about to be removed. I think we are avoiding this with our
> locking though?
So this is purely about the overlapping part, right? We need to remove
the old pages, install the new mapping and have new pages populate the
thing.
But either way around, the range stays valid and page-tables stay
needed.
> > Given this is an overlapping mmap -- we're going to need to those
> > page-tables anyway, so no harm done.
>
> Well, maybe? The mapping may now be an anon vma vs a file backed, or
> maybe it's PROT_NONE?
The page-tables don't care about all that no? The only thing where it
matters is for things like THP, because that affects the level of
page-tables, but otherwise it's all page-table content (ptes).
> > Only after the VMA is unlinked must we ensure we don't accidentally
> > re-instantiate page-tables.
>
> It's not as simple as that, unfortunately. There are vma callbacks for
> drivers (or hugetlbfs, or whatever) that do other things. So we need to
> clean up the area before we are able to replace the vma and part of that
> clean up is the page tables, or anon vma chain, and/or closing a file.
>
> There are other ways of finding the vma as well, besides the vma tree.
> We are following the locking so that we are safe from those perspectives
> as well, and so the vma has to be unlinked in a few places in a certain
> order.
For RCU lookups only the mas tree matters -- and it's left present there.
If you really want to block RCU readers, I would suggest punching a hole
in the mm_mt. All the traditional code won't notice anyway; this is all
done with mmap_lock held for writing.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 17:42 ` Peter Zijlstra
@ 2024-12-19 18:18 ` Liam R. Howlett
2024-12-19 18:46 ` Peter Zijlstra
0 siblings, 1 reply; 74+ messages in thread
From: Liam R. Howlett @ 2024-12-19 18:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Suren Baghdasaryan, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Peter Zijlstra <peterz@infradead.org> [241219 12:42]:
> On Thu, Dec 19, 2024 at 12:16:45PM -0500, Liam R. Howlett wrote:
>
> > Well, hold on - it is taken out of the rmap/anon vma chain here. It is
> > completely unhooked except the vma tree at this point. We're not adding
> > complexity, we're dealing with it.
>
> So I'm not entirely sure I understand the details here -- this is again
> about being able to do rollback when things fail?
No, it's so that things are ready for the replacement and won't
cause issues during insertion. We can't have two VMAs in the rmap for
the same area, for instance.
I would like to be able to undo things, but if there's a file that's
closed then we can't undo that.
>
> There is comment above the vms_clean_up_area() call in __mmap_prepare(),
> but its not making sense atm.
>
> > >Is there anything that would prevent a concurrent gup_fast() from
> > > > doing the same -- touch a cleared PTE?
> >
> > Where does gup_fast() install PTEs? Doesn't it bail once a READ_ONCE()
> > on any level returns no PTE?
>
> I think you're right, GUP doesn't, but any 'normal' page-table walker
> will.
Normal page-table walkers will be locked out by the page table lock
while the area is cleared, and it cannot be re-walked after that because of
the vma lock (except by rcu walkers, which is why these new locks are
needed). Any direct page table walking won't be a problem because we've
removed any way to get to it - I think?
>
> > > > AFAICT two threads, one doing overlapping mmap() and the other doing
> > > > gup_fast() can result in exactly this scenario.
> >
> > The mmap() call will race with the gup_fast(), but either the nr_pinned
> > will be returned from gup_fast() before vms_clean_up_area() removes the
> > page table (or any higher level), or gup_fast() will find nothing.
>
> Agreed.
>
> > > > If we don't care about the GUP case, when I'm thinking we should not
> > > > care about the lockless RCU case either.
> > >
> > > Also, at this point we'll just fail to find a page, and that is nothing
> > > special. The problem with accessing an unmapped VMA is that the
> > > page-table walk will instantiate page-tables.
> >
> > I think there is a problem if we are reinstalling page tables on a vma
> > that's about to be removed. I think we are avoiding this with our
> > locking though?
>
> So this is purely about the overlapping part, right? We need to remove
> the old pages, install the new mapping and have new pages populate the
> thing.
No. If we allow rcu readers to re-fault, we may hit a race where the
page fault has found the vma, but doesn't fault things in before the
ptes are removed or the vma is freed/reused. Today that won't happen
because we mark the vma as going to be removed before the tree is
changed, so anyone reaching the vma will see it's not safe.
If we want to use different locking strategies for munmap() vs
MAP_FIXED, then we'd need to be sure the vma that is being freed is
isolated for removal and all readers are done before freeing. I think
this is what you are trying to convey to Suren?
I don't like the idea of another locking strategy in munmap() vs
MAP_FIXED, especially if you think about who would be doing this..
basically no one. There isn't a sane (legitimate) application that's
going to try and page fault in something that's being removed.
>
> But either way around, the range stays valid and page-tables stay
> needed.
>
> > > Given this is an overlapping mmap -- we're going to need to those
> > > page-tables anyway, so no harm done.
> >
> > Well, maybe? The mapping may now be an anon vma vs a file backed, or
> > maybe it's PROT_NONE?
>
> The page-tables don't care about all that no? The only thing where it
> matters is for things like THP, because that affects the level of
> page-tables, but otherwise it's all page-table content (ptes).
It sounds racy; couldn't we read the old page table entry attributes and
act on them after the new attributes are set?
During the switch, after we drop the page table lock but haven't yet
inserted the new vma, then we'd run into a situation where the new
mapping may already be occupied if we don't have some sort of locking
here. Wouldn't that be an issue as well? It seems like there are a
number of things that could go bad?
>
> > > Only after the VMA is unlinked must we ensure we don't accidentally
> > > re-instantiate page-tables.
> >
> > It's not as simple as that, unfortunately. There are vma callbacks for
> > drivers (or hugetlbfs, or whatever) that do other things. So we need to
> > clean up the area before we are able to replace the vma and part of that
> > clean up is the page tables, or anon vma chain, and/or closing a file.
> >
> > There are other ways of finding the vma as well, besides the vma tree.
> > We are following the locking so that we are safe from those perspectives
> > as well, and so the vma has to be unlinked in a few places in a certain
> > order.
>
> For RCU lookups only the mas tree matters -- and its left present there.
>
> If you really want to block RCU readers, I would suggest punching a hole
> in the mm_mt. All the traditional code won't notice anyway, this is all
> with mmap_lock held for writing.
We don't want to block all rcu readers, we want to block the rcu readers
that would see the problem - that is, anyone trying to read a particular
area.
Right now we can page fault in unpopulated vmas while writing other vmas
to the tree. We are also moving more users to rcu reading to use the
vmas they need without waiting on writes to finish.
Maybe I don't understand your suggestion, but I would think punching a
hole would lose this advantage?
Thanks,
Liam
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 18:18 ` Liam R. Howlett
@ 2024-12-19 18:46 ` Peter Zijlstra
2024-12-19 18:55 ` Liam R. Howlett
0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2024-12-19 18:46 UTC (permalink / raw)
To: Liam R. Howlett, Suren Baghdasaryan, akpm, willy, lorenzo.stoakes,
mhocko, vbabka, hannes, mjguzik, oliver.sang, mgorman, david,
peterx, oleg, dave, paulmck, brauner, dhowells, hdanton, hughd,
lokeshgidra, minchan, jannh, shakeel.butt, souravpanda,
pasha.tatashin, klarasmodin, corbet, linux-doc, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 01:18:23PM -0500, Liam R. Howlett wrote:
> > For RCU lookups only the mas tree matters -- and its left present there.
> >
> > If you really want to block RCU readers, I would suggest punching a hole
> > in the mm_mt. All the traditional code won't notice anyway, this is all
> > with mmap_lock held for writing.
>
> We don't want to block all rcu readers, we want to block the rcu readers
> that would see the problem - that is, anyone trying to read a particular
> area.
>
> Right now we can page fault in unpopulated vmas while writing other vmas
> to the tree. We are also moving more users to rcu reading to use the
> vmas they need without waiting on writes to finish.
>
> Maybe I don't understand your suggestion, but I would think punching a
> hole would lose this advantage?
My suggestion was to remove the range stuck in mas_detach from mm_mt.
That is exactly the affected range, no?
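(A hedged illustration only: the wrapper below is invented and not code
from the thread; with the maple tree API, punching such a hole would
amount to something like the following.)

static int punch_hole_in_mm_mt(struct mm_struct *mm,
                               unsigned long start, unsigned long end)
{
        MA_STATE(mas, &mm->mm_mt, start, end - 1);

        /* Store NULL over [start, end): rcu walkers then see a gap there. */
        return mas_store_gfp(&mas, NULL, GFP_KERNEL);
}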
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 18:46 ` Peter Zijlstra
@ 2024-12-19 18:55 ` Liam R. Howlett
2024-12-20 15:22 ` Suren Baghdasaryan
2024-12-23 3:03 ` Suren Baghdasaryan
0 siblings, 2 replies; 74+ messages in thread
From: Liam R. Howlett @ 2024-12-19 18:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Suren Baghdasaryan, akpm, willy, lorenzo.stoakes, mhocko, vbabka,
hannes, mjguzik, oliver.sang, mgorman, david, peterx, oleg, dave,
paulmck, brauner, dhowells, hdanton, hughd, lokeshgidra, minchan,
jannh, shakeel.butt, souravpanda, pasha.tatashin, klarasmodin,
corbet, linux-doc, linux-mm, linux-kernel, kernel-team
* Peter Zijlstra <peterz@infradead.org> [241219 13:47]:
> On Thu, Dec 19, 2024 at 01:18:23PM -0500, Liam R. Howlett wrote:
>
> > > For RCU lookups only the mas tree matters -- and its left present there.
> > >
> > > If you really want to block RCU readers, I would suggest punching a hole
> > > in the mm_mt. All the traditional code won't notice anyway, this is all
> > > with mmap_lock held for writing.
> >
> > We don't want to block all rcu readers, we want to block the rcu readers
> > that would see the problem - that is, anyone trying to read a particular
> > area.
> >
> > Right now we can page fault in unpopulated vmas while writing other vmas
> > to the tree. We are also moving more users to rcu reading to use the
> > vmas they need without waiting on writes to finish.
> >
> > Maybe I don't understand your suggestion, but I would think punching a
> > hole would lose this advantage?
>
> My suggestion was to remove the range stuck in mas_detach from mm_mt.
> That is exactly the affected range, no?
Yes.
But then looping over the vmas will show a gap where there should not be
a gap.
If we stop rcu readers entirely we lose the advantage.
This is exactly the issue that the locking dance was working around :)
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 18:55 ` Liam R. Howlett
@ 2024-12-20 15:22 ` Suren Baghdasaryan
2024-12-23 3:03 ` Suren Baghdasaryan
1 sibling, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-20 15:22 UTC (permalink / raw)
To: Liam R. Howlett, Peter Zijlstra, Suren Baghdasaryan, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 10:55 AM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Peter Zijlstra <peterz@infradead.org> [241219 13:47]:
> > On Thu, Dec 19, 2024 at 01:18:23PM -0500, Liam R. Howlett wrote:
> >
> > > > For RCU lookups only the mas tree matters -- and its left present there.
> > > >
> > > > If you really want to block RCU readers, I would suggest punching a hole
> > > > in the mm_mt. All the traditional code won't notice anyway, this is all
> > > > with mmap_lock held for writing.
> > >
> > > We don't want to block all rcu readers, we want to block the rcu readers
> > > that would see the problem - that is, anyone trying to read a particular
> > > area.
> > >
> > > Right now we can page fault in unpopulated vmas while writing other vmas
> > > to the tree. We are also moving more users to rcu reading to use the
> > > vmas they need without waiting on writes to finish.
> > >
> > > Maybe I don't understand your suggestion, but I would think punching a
> > > hole would lose this advantage?
> >
> > My suggestion was to remove the range stuck in mas_detach from mm_mt.
> > That is exactly the affected range, no?
>
> Yes.
>
> But then looping over the vmas will show a gap where there should not be
> a gap.
>
> If we stop rcu readers entirely we lose the advantage.
>
> This is exactly the issue that the locking dance was working around :)
Sorry for not participating in the discussion, folks. I'm down with a
terrible flu and my brain is not working well. I'll catch up once I
get better.
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-19 18:55 ` Liam R. Howlett
2024-12-20 15:22 ` Suren Baghdasaryan
@ 2024-12-23 3:03 ` Suren Baghdasaryan
2024-12-26 17:12 ` Suren Baghdasaryan
1 sibling, 1 reply; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-23 3:03 UTC (permalink / raw)
To: Liam R. Howlett, Peter Zijlstra, Suren Baghdasaryan, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Thu, Dec 19, 2024 at 10:55 AM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Peter Zijlstra <peterz@infradead.org> [241219 13:47]:
> > On Thu, Dec 19, 2024 at 01:18:23PM -0500, Liam R. Howlett wrote:
> >
> > > > For RCU lookups only the mas tree matters -- and its left present there.
> > > >
> > > > If you really want to block RCU readers, I would suggest punching a hole
> > > > in the mm_mt. All the traditional code won't notice anyway, this is all
> > > > with mmap_lock held for writing.
> > >
> > > We don't want to block all rcu readers, we want to block the rcu readers
> > > that would see the problem - that is, anyone trying to read a particular
> > > area.
> > >
> > > Right now we can page fault in unpopulated vmas while writing other vmas
> > > to the tree. We are also moving more users to rcu reading to use the
> > > vmas they need without waiting on writes to finish.
> > >
> > > Maybe I don't understand your suggestion, but I would think punching a
> > > hole would lose this advantage?
> >
> > My suggestion was to remove the range stuck in mas_detach from mm_mt.
> > That is exactly the affected range, no?
>
> Yes.
>
> But then looping over the vmas will show a gap where there should not be
> a gap.
>
> If we stop rcu readers entirely we lose the advantage.
>
> This is exactly the issue that the locking dance was working around :)
IOW we write-lock the entire range before removing any part of it for
the whole transaction to be atomic, correct?
Peter, you suggested the following pattern for ensuring vma is
detached with no possible readers:
vma_iter_store()
vma_start_write()
vma_mark_detached()
What do you think about this alternative?
vma_start_write()
...
vma_iter_store()
vma_mark_detached()
        vma_assert_write_locked(vma)
        if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt)))
                vma_start_write()
The second vma_start_write() is unlikely to be executed because the
vma is locked; vm_refcnt might be increased only temporarily by
readers before they realize the vma is locked, and that's a very narrow
window. I think performance should not visibly suffer.
OTOH this would let us keep the current locking patterns and would
guarantee that vma_mark_detached() always exits with a detached and
unused vma (fewer possibilities for someone not following an exact
pattern and ending up with a detached but still used vma).
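As a minimal sketch (assumed shape only, not code from the patchset),
the alternative body would look roughly like:

static void vma_mark_detached(struct vm_area_struct *vma)
{
        vma_assert_write_locked(vma);

        /*
         * Drop the "attached" reference.  If a reader raced in and bumped
         * vm_refcnt before noticing the write lock, write-lock the vma
         * again, which waits for those transient readers to drain.
         */
        if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt)))
                vma_start_write(vma);
}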
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count
2024-12-23 3:03 ` Suren Baghdasaryan
@ 2024-12-26 17:12 ` Suren Baghdasaryan
0 siblings, 0 replies; 74+ messages in thread
From: Suren Baghdasaryan @ 2024-12-26 17:12 UTC (permalink / raw)
To: Liam R. Howlett, Peter Zijlstra, Suren Baghdasaryan, akpm, willy,
lorenzo.stoakes, mhocko, vbabka, hannes, mjguzik, oliver.sang,
mgorman, david, peterx, oleg, dave, paulmck, brauner, dhowells,
hdanton, hughd, lokeshgidra, minchan, jannh, shakeel.butt,
souravpanda, pasha.tatashin, klarasmodin, corbet, linux-doc,
linux-mm, linux-kernel, kernel-team
On Sun, Dec 22, 2024 at 7:03 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Dec 19, 2024 at 10:55 AM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> > * Peter Zijlstra <peterz@infradead.org> [241219 13:47]:
> > > On Thu, Dec 19, 2024 at 01:18:23PM -0500, Liam R. Howlett wrote:
> > >
> > > > > For RCU lookups only the mas tree matters -- and its left present there.
> > > > >
> > > > > If you really want to block RCU readers, I would suggest punching a hole
> > > > > in the mm_mt. All the traditional code won't notice anyway, this is all
> > > > > with mmap_lock held for writing.
> > > >
> > > > We don't want to block all rcu readers, we want to block the rcu readers
> > > > that would see the problem - that is, anyone trying to read a particular
> > > > area.
> > > >
> > > > Right now we can page fault in unpopulated vmas while writing other vmas
> > > > to the tree. We are also moving more users to rcu reading to use the
> > > > vmas they need without waiting on writes to finish.
> > > >
> > > > Maybe I don't understand your suggestion, but I would think punching a
> > > > hole would lose this advantage?
> > >
> > > My suggestion was to remove the range stuck in mas_detach from mm_mt.
> > > That is exactly the affected range, no?
> >
> > Yes.
> >
> > But then looping over the vmas will show a gap where there should not be
> > a gap.
> >
> > If we stop rcu readers entirely we lose the advantage.
> >
> > This is exactly the issue that the locking dance was working around :)
>
> IOW we write-lock the entire range before removing any part of it for
> the whole transaction to be atomic, correct?
>
>
> Peter, you suggested the following pattern for ensuring vma is
> detached with no possible readers:
>
> vma_iter_store()
> vma_start_write()
> vma_mark_detached()
>
> What do you think about this alternative?
>
> vma_start_write()
> ...
> vma_iter_store()
> vma_mark_detached()
>         vma_assert_write_locked(vma)
>         if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt)))
>                 vma_start_write()
>
> The second vma_start_write() is unlikely to be executed because the
> vma is locked; vm_refcnt might be increased only temporarily by
> readers before they realize the vma is locked, and that's a very narrow
> window. I think performance should not visibly suffer.
> OTOH this would let us keep the current locking patterns and would
> guarantee that vma_mark_detached() always exits with a detached and
> unused vma (fewer possibilities for someone not following an exact
> pattern and ending up with a detached but still used vma).
I posted v7 of this patchset at
https://lore.kernel.org/all/20241226170710.1159679-1-surenb@google.com/
From the things we discussed, I didn't include the following:
- Changing vma locking patterns
- Changing do_vmi_align_munmap() to avoid reattach_vmas()
It seems we need more discussion for the first one, and the second one
can be done completely independently of this patchset. I feel this
patchset is already quite large, so I'm trying to keep its size
manageable.
Thanks,
Suren.
>
> >
^ permalink raw reply [flat|nested] 74+ messages in thread
end of thread, other threads:[~2024-12-26 17:12 UTC | newest]
Thread overview: 74+ messages
2024-12-16 19:24 [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 01/16] mm: introduce vma_start_read_locked{_nested} helpers Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 02/16] mm: move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 03/16] mm: mark vma as detached until it's added into vma tree Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 04/16] mm/nommu: fix the last places where vma is not locked before being attached Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 05/16] types: move struct rcuwait into types.h Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 06/16] mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail Suren Baghdasaryan
2024-12-17 11:31 ` Lokesh Gidra
2024-12-17 15:51 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 07/16] mm: move mmap_init_lock() out of the header file Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 08/16] mm: uninline the main body of vma_start_write() Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 09/16] refcount: introduce __refcount_{add|inc}_not_zero_limited Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count Suren Baghdasaryan
2024-12-16 20:42 ` Peter Zijlstra
2024-12-16 20:53 ` Suren Baghdasaryan
2024-12-16 21:15 ` Peter Zijlstra
2024-12-16 21:53 ` Suren Baghdasaryan
2024-12-16 22:00 ` Peter Zijlstra
2024-12-16 21:37 ` Peter Zijlstra
2024-12-16 21:44 ` Suren Baghdasaryan
2024-12-17 10:30 ` Peter Zijlstra
2024-12-17 16:27 ` Suren Baghdasaryan
2024-12-18 9:41 ` Peter Zijlstra
2024-12-18 10:06 ` Peter Zijlstra
2024-12-18 15:37 ` Liam R. Howlett
2024-12-18 15:50 ` Suren Baghdasaryan
2024-12-18 16:18 ` Peter Zijlstra
2024-12-18 17:36 ` Suren Baghdasaryan
2024-12-18 17:44 ` Peter Zijlstra
2024-12-18 17:58 ` Suren Baghdasaryan
2024-12-18 19:00 ` Liam R. Howlett
2024-12-18 19:07 ` Suren Baghdasaryan
2024-12-18 19:29 ` Suren Baghdasaryan
2024-12-18 19:38 ` Liam R. Howlett
2024-12-18 20:00 ` Suren Baghdasaryan
2024-12-18 20:38 ` Liam R. Howlett
2024-12-18 21:53 ` Suren Baghdasaryan
2024-12-18 21:55 ` Suren Baghdasaryan
2024-12-19 0:35 ` Andrew Morton
2024-12-19 0:47 ` Suren Baghdasaryan
2024-12-19 9:13 ` Peter Zijlstra
2024-12-19 11:20 ` Peter Zijlstra
2024-12-19 16:17 ` Suren Baghdasaryan
2024-12-19 17:16 ` Liam R. Howlett
2024-12-19 17:42 ` Peter Zijlstra
2024-12-19 18:18 ` Liam R. Howlett
2024-12-19 18:46 ` Peter Zijlstra
2024-12-19 18:55 ` Liam R. Howlett
2024-12-20 15:22 ` Suren Baghdasaryan
2024-12-23 3:03 ` Suren Baghdasaryan
2024-12-26 17:12 ` Suren Baghdasaryan
2024-12-19 16:14 ` Suren Baghdasaryan
2024-12-19 17:23 ` Peter Zijlstra
2024-12-19 8:55 ` Peter Zijlstra
2024-12-19 16:08 ` Suren Baghdasaryan
2024-12-19 8:53 ` Peter Zijlstra
2024-12-19 16:08 ` Suren Baghdasaryan
2024-12-18 15:57 ` Suren Baghdasaryan
2024-12-18 16:13 ` Peter Zijlstra
2024-12-18 15:42 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 11/16] mm: enforce vma to be in detached state before freeing Suren Baghdasaryan
2024-12-16 21:16 ` Peter Zijlstra
2024-12-16 21:18 ` Peter Zijlstra
2024-12-16 21:57 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 12/16] mm: remove extra vma_numab_state_init() call Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 13/16] mm: introduce vma_ensure_detached() Suren Baghdasaryan
2024-12-17 10:26 ` Peter Zijlstra
2024-12-17 15:58 ` Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 14/16] mm: prepare lock_vma_under_rcu() for vma reuse possibility Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 15/16] mm: make vma cache SLAB_TYPESAFE_BY_RCU Suren Baghdasaryan
2024-12-16 19:24 ` [PATCH v6 16/16] docs/mm: document latest changes to vm_lock Suren Baghdasaryan
2024-12-16 19:39 ` [PATCH v6 00/16] move per-vma lock into vm_area_struct Suren Baghdasaryan
2024-12-17 18:42 ` Andrew Morton
2024-12-17 18:49 ` Suren Baghdasaryan