linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
@ 2024-10-10 20:56 Andrii Nakryiko
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-10 20:56 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, mjguzik, brauner, jannh, mhocko, vbabka,
	shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes,
	Andrii Nakryiko

Implement speculative (lockless) resolution of VMA to inode to uprobe,
bypassing the need to take mmap_lock for reads, if possible. Patch #1 by Suren
adds mm_struct helpers that detect whether mm_struct was changed; the uprobe
logic uses them to validate that speculative results can be trusted once the
lookup produces a valid uprobe instance. Patch #2 follows up to make
mm_lock_seq a 64-bit counter (on 64-bit architectures), as requested by Jann
Horn.

Patch #3 is a simplification to uprobe VMA flag checking, suggested by Oleg.

And, finally, patch #4 is the speculative VMA-to-uprobe resolution logic
itself, and is the focal point of this patch set. It makes entry uprobes in
the common case scale very well with the number of CPUs, as we avoid any
locking or cache line bouncing between CPUs. See the corresponding patch for
details and benchmarking results.
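
In a nutshell, the speculative path boils down to the following pattern
(a condensed sketch of what patch #4 implements; variable declarations,
error handling and exact field accesses are elided):

	guard(rcu)();

	if (!mmap_lock_speculation_start(mm, &seq))
		return NULL;	/* mmap_lock is write-locked, don't speculate */

	vma = vma_lookup(mm, bp_vaddr);	/* lockless, results are tentative */
	/* ... read vma->vm_file->f_inode, vm_pgoff, vm_start ... */
	uprobe = find_uprobe_rcu(inode, offset);

	if (!mmap_lock_speculation_end(mm, seq))
		return NULL;	/* mm changed under us, fall back to mmap_lock */

	return uprobe;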

Note, this patch set assumes that FMODE_BACKING files were switched to have
SLAB_TYPESAFE_BY_RCU semantics, which was recently done by Christian Brauner
in [0]. This change can be pulled into perf/core through the stable
tags/vfs-6.13.for-bpf.file tag from [1].

  [0] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.13.for-bpf.file&id=8b1bc2590af61129b82a189e9dc7c2804c34400e
  [1] git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git

v2->v3:
- dropped kfree_rcu() patch (Christian);
- added data_race() annotations for fields of vma and vma->vm_file which could
  be modified during speculative lookup (Oleg);
- fixed int->long problem in stubs for mmap_lock_speculation_{start,end}(),
  caught by Kernel test robot;
v1->v2:
- adjusted vma_end_write_all() comment to point out it should never be called
  manually now, but I wasn't sure how ACQUIRE/RELEASE comments should be
  reworded (previously requested by Jann), so I'd appreciate some help there
  (Jann);
- int -> long change for mm_lock_seq, as agreed at LPC2024 (Jann, Suren, Liam);
- kfree_rcu_mightsleep() for FMODE_BACKING (Suren, Christian);
- vm_flags simplification in find_active_uprobe_rcu() and
  find_active_uprobe_speculative() (Oleg);
- guard(rcu)() simplified find_active_uprobe_speculative() implementation.

Andrii Nakryiko (3):
  mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  uprobes: simplify find_active_uprobe_rcu() VMA checks
  uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution

Suren Baghdasaryan (1):
  mm: introduce mmap_lock_speculation_{start|end}

 include/linux/mm.h        |  6 ++--
 include/linux/mm_types.h  |  7 ++--
 include/linux/mmap_lock.h | 72 ++++++++++++++++++++++++++++++++-------
 kernel/events/uprobes.c   | 52 +++++++++++++++++++++++++++-
 kernel/fork.c             |  3 --
 5 files changed, 119 insertions(+), 21 deletions(-)

-- 
2.43.5



* [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-10 20:56 [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
@ 2024-10-10 20:56 ` Andrii Nakryiko
  2024-10-13  7:56   ` Shakeel Butt
  2024-10-23 20:10   ` Peter Zijlstra
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures Andrii Nakryiko
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-10 20:56 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, mjguzik, brauner, jannh, mhocko, vbabka,
	shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes,
	Andrii Nakryiko

From: Suren Baghdasaryan <surenb@google.com>

Add helper functions to speculatively perform operations without
read-locking mmap_lock, expecting that mmap_lock will not be
write-locked and mm is not modified from under us.
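
For illustration, the intended caller-side pattern looks roughly like this
(a sketch only; the first real user is added later in this series):

	int seq;

	if (mmap_lock_speculation_start(mm, &seq)) {
		/*
		 * Speculatively read mm/VMA state without mmap_lock here,
		 * treating all values as tentative since they may be
		 * modified concurrently.
		 */
		if (mmap_lock_speculation_end(mm, seq)) {
			/* mm_lock_seq unchanged: results can be trusted */
		} else {
			/*
			 * mmap_lock was write-locked in the meantime:
			 * discard results, fall back to mmap_read_lock().
			 */
		}
	}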

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240912210222.186542-1-surenb@google.com
---
 include/linux/mm_types.h  |  3 ++
 include/linux/mmap_lock.h | 72 ++++++++++++++++++++++++++++++++-------
 kernel/fork.c             |  3 --
 3 files changed, 63 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8e38bc..5d8cdebd42bc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -887,6 +887,9 @@ struct mm_struct {
 		 * Roughly speaking, incrementing the sequence number is
 		 * equivalent to releasing locks on VMAs; reading the sequence
 		 * number can be part of taking a read lock on a VMA.
+		 * Incremented every time mmap_lock is write-locked/unlocked.
+		 * Initialized to 0, therefore odd values indicate mmap_lock
+		 * is write-locked and even values that it's released.
 		 *
 		 * Can be modified under write mmap_lock using RELEASE
 		 * semantics.
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index de9dc20b01ba..9d23635bc701 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -71,39 +71,84 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline void init_mm_lock_seq(struct mm_struct *mm)
+{
+	mm->mm_lock_seq = 0;
+}
+
 /*
- * Drop all currently-held per-VMA locks.
- * This is called from the mmap_lock implementation directly before releasing
- * a write-locked mmap_lock (or downgrading it to read-locked).
- * This should normally NOT be called manually from other places.
- * If you want to call this manually anyway, keep in mind that this will release
- * *all* VMA write locks, including ones from further up the stack.
+ * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
+ * or write-unlocked (RELEASE semantics).
  */
-static inline void vma_end_write_all(struct mm_struct *mm)
+static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
 {
 	mmap_assert_write_locked(mm);
 	/*
 	 * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
 	 * mmap_lock being held.
-	 * We need RELEASE semantics here to ensure that preceding stores into
-	 * the VMA take effect before we unlock it with this store.
-	 * Pairs with ACQUIRE semantics in vma_start_read().
 	 */
-	smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
+
+	if (acquire) {
+		WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+		/*
+		 * For ACQUIRE semantics we should ensure no following stores are
+		 * reordered to appear before the mm->mm_lock_seq modification.
+		 */
+		smp_wmb();
+	} else {
+		/*
+		 * We need RELEASE semantics here to ensure that preceding stores
+		 * into the VMA take effect before we unlock it with this store.
+		 * Pairs with ACQUIRE semantics in vma_start_read().
+		 */
+		smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
+	}
+}
+
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
+{
+	/* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
+	*seq = smp_load_acquire(&mm->mm_lock_seq);
+	/* Allow speculation if mmap_lock is not write-locked */
+	return (*seq & 1) == 0;
+}
+
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
+{
+	/* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
+	smp_rmb();
+	return seq == READ_ONCE(mm->mm_lock_seq);
 }
+
 #else
-static inline void vma_end_write_all(struct mm_struct *mm) {}
+static inline void init_mm_lock_seq(struct mm_struct *mm) {}
+static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
 #endif
 
+/*
+ * Drop all currently-held per-VMA locks.
+ * This is called from the mmap_lock implementation directly before releasing
+ * a write-locked mmap_lock (or downgrading it to read-locked).
+ * This should NOT be called manually from other places.
+ */
+static inline void vma_end_write_all(struct mm_struct *mm)
+{
+	inc_mm_lock_seq(mm, false);
+}
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
+	init_mm_lock_seq(mm);
 }
 
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_start_locking(mm, true);
 	down_write(&mm->mmap_lock);
+	inc_mm_lock_seq(mm, true);
 	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
@@ -111,6 +156,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
 {
 	__mmap_lock_trace_start_locking(mm, true);
 	down_write_nested(&mm->mmap_lock, subclass);
+	inc_mm_lock_seq(mm, true);
 	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
@@ -120,6 +166,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
 
 	__mmap_lock_trace_start_locking(mm, true);
 	ret = down_write_killable(&mm->mmap_lock);
+	if (!ret)
+		inc_mm_lock_seq(mm, true);
 	__mmap_lock_trace_acquire_returned(mm, true, ret == 0);
 	return ret;
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af2..dd1bded0294d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1261,9 +1261,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
-#ifdef CONFIG_PER_VMA_LOCK
-	mm->mm_lock_seq = 0;
-#endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
-- 
2.43.5



* [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-10 20:56 [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
@ 2024-10-10 20:56 ` Andrii Nakryiko
  2024-10-13  7:56   ` Shakeel Butt
  2024-10-23 19:31   ` Peter Zijlstra
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-10 20:56 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, mjguzik, brauner, jannh, mhocko, vbabka,
	shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes,
	Andrii Nakryiko

To increase mm->mm_lock_seq robustness, switch it from int to long, so
that it's a 64-bit counter on 64-bit systems and we can stop worrying
about it wrapping around in just ~4 billion iterations. Same goes for
VMA's matching vm_lock_seq, which is derived from mm_lock_seq.
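
For a rough sense of scale (purely illustrative numbers): at, say, a million
mm_lock_seq increments per second, a 32-bit counter wraps after
2^32 / 10^6 ~= 4300 seconds, i.e. a bit over an hour, while a 64-bit counter
at the same rate would take on the order of 585,000 years to wrap.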

I didn't use __u64 outright to keep 32-bit architectures unaffected, but
if it seems important enough, I have nothing against using __u64.

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/mm.h        | 6 +++---
 include/linux/mm_types.h  | 4 ++--
 include/linux/mmap_lock.h | 8 ++++----
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecf63d2b0582..97819437832e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -730,7 +730,7 @@ static inline void vma_end_read(struct vm_area_struct *vma)
 }
 
 /* WARNING! Can only be used if mmap_lock is expected to be write-locked */
-static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
+static bool __is_vma_write_locked(struct vm_area_struct *vma, long *mm_lock_seq)
 {
 	mmap_assert_write_locked(vma->vm_mm);
 
@@ -749,7 +749,7 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
  */
 static inline void vma_start_write(struct vm_area_struct *vma)
 {
-	int mm_lock_seq;
+	long mm_lock_seq;
 
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
@@ -767,7 +767,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
-	int mm_lock_seq;
+	long mm_lock_seq;
 
 	VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma);
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5d8cdebd42bc..0dc57d6cfe38 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -715,7 +715,7 @@ struct vm_area_struct {
 	 * counter reuse can only lead to occasional unnecessary use of the
 	 * slowpath.
 	 */
-	int vm_lock_seq;
+	long vm_lock_seq;
 	/* Unstable RCU readers are allowed to read this. */
 	struct vma_lock *vm_lock;
 #endif
@@ -898,7 +898,7 @@ struct mm_struct {
 		 * Can be read with ACQUIRE semantics if not holding write
 		 * mmap_lock.
 		 */
-		int mm_lock_seq;
+		long mm_lock_seq;
 #endif
 
 
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 9d23635bc701..f8fd6d879aa9 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -105,7 +105,7 @@ static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
 	}
 }
 
-static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, long *seq)
 {
 	/* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
 	*seq = smp_load_acquire(&mm->mm_lock_seq);
@@ -113,7 +113,7 @@ static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
 	return (*seq & 1) == 0;
 }
 
-static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, long seq)
 {
 	/* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
 	smp_rmb();
@@ -123,8 +123,8 @@ static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
 #else
 static inline void init_mm_lock_seq(struct mm_struct *mm) {}
 static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
-static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
-static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, long *seq) { return false; }
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, long seq) { return false; }
 #endif
 
 /*
-- 
2.43.5



* [PATCH v3 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks
  2024-10-10 20:56 [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures Andrii Nakryiko
@ 2024-10-10 20:56 ` Andrii Nakryiko
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
  2024-10-23 17:54 ` [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  4 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-10 20:56 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, mjguzik, brauner, jannh, mhocko, vbabka,
	shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes,
	Andrii Nakryiko

At the point where find_active_uprobe_rcu() is used we know that the VMA in
question has triggered a software breakpoint, so we don't need to validate
vma->vm_flags. Keep only the vma->vm_file NULL check.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2a0059464383..fa1024aad6c4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2057,7 +2057,7 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	mmap_read_lock(mm);
 	vma = vma_lookup(mm, bp_vaddr);
 	if (vma) {
-		if (valid_vma(vma, false)) {
+		if (vma->vm_file) {
 			struct inode *inode = file_inode(vma->vm_file);
 			loff_t offset = vaddr_to_offset(vma, bp_vaddr);
 
-- 
2.43.5



* [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-10 20:56 [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
@ 2024-10-10 20:56 ` Andrii Nakryiko
  2024-10-11  5:01   ` Oleg Nesterov
  2024-10-23 19:22   ` Peter Zijlstra
  2024-10-23 17:54 ` [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  4 siblings, 2 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-10 20:56 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, mjguzik, brauner, jannh, mhocko, vbabka,
	shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes,
	Andrii Nakryiko

Given filp_cachep is marked SLAB_TYPESAFE_BY_RCU (and FMODE_BACKING files, a
special case, now go through RCU-delayed freeing), we can safely access the
vma->vm_file->f_inode field locklessly under just rcu_read_lock() protection,
which enables looking up a uprobe from uprobes_tree completely locklessly and
speculatively, without the need to acquire mmap_lock for reads, at least in
the common case, assuming that there are no parallel mm and/or VMA
modifications. The underlying struct file's memory won't go away from under
us (even if the struct file itself can be reused in the meantime).

We rely on the newly added mmap_lock_speculation_{start,end}() helpers to
validate that mm_struct stays intact for the entire duration of this
speculation. If not, we fall back to the mmap_lock-protected lookup.
The speculative logic is written in such a way that it will safely handle any
garbage values that might be read from the vma or file structs.

Benchmarking results speak for themselves.

BEFORE (latest tip/perf/core)
=============================
uprobe-nop            ( 1 cpus):    3.384 ± 0.004M/s  (  3.384M/s/cpu)
uprobe-nop            ( 2 cpus):    5.456 ± 0.005M/s  (  2.728M/s/cpu)
uprobe-nop            ( 3 cpus):    7.863 ± 0.015M/s  (  2.621M/s/cpu)
uprobe-nop            ( 4 cpus):    9.442 ± 0.008M/s  (  2.360M/s/cpu)
uprobe-nop            ( 5 cpus):   11.036 ± 0.013M/s  (  2.207M/s/cpu)
uprobe-nop            ( 6 cpus):   10.884 ± 0.019M/s  (  1.814M/s/cpu)
uprobe-nop            ( 7 cpus):    7.897 ± 0.145M/s  (  1.128M/s/cpu)
uprobe-nop            ( 8 cpus):   10.021 ± 0.128M/s  (  1.253M/s/cpu)
uprobe-nop            (10 cpus):    9.932 ± 0.170M/s  (  0.993M/s/cpu)
uprobe-nop            (12 cpus):    8.369 ± 0.056M/s  (  0.697M/s/cpu)
uprobe-nop            (14 cpus):    8.678 ± 0.017M/s  (  0.620M/s/cpu)
uprobe-nop            (16 cpus):    7.392 ± 0.003M/s  (  0.462M/s/cpu)
uprobe-nop            (24 cpus):    5.326 ± 0.178M/s  (  0.222M/s/cpu)
uprobe-nop            (32 cpus):    5.426 ± 0.059M/s  (  0.170M/s/cpu)
uprobe-nop            (40 cpus):    5.262 ± 0.070M/s  (  0.132M/s/cpu)
uprobe-nop            (48 cpus):    6.121 ± 0.010M/s  (  0.128M/s/cpu)
uprobe-nop            (56 cpus):    6.252 ± 0.035M/s  (  0.112M/s/cpu)
uprobe-nop            (64 cpus):    7.644 ± 0.023M/s  (  0.119M/s/cpu)
uprobe-nop            (72 cpus):    7.781 ± 0.001M/s  (  0.108M/s/cpu)
uprobe-nop            (80 cpus):    8.992 ± 0.048M/s  (  0.112M/s/cpu)

AFTER
=====
uprobe-nop            ( 1 cpus):    3.534 ± 0.033M/s  (  3.534M/s/cpu)
uprobe-nop            ( 2 cpus):    6.701 ± 0.007M/s  (  3.351M/s/cpu)
uprobe-nop            ( 3 cpus):   10.031 ± 0.007M/s  (  3.344M/s/cpu)
uprobe-nop            ( 4 cpus):   13.003 ± 0.012M/s  (  3.251M/s/cpu)
uprobe-nop            ( 5 cpus):   16.274 ± 0.006M/s  (  3.255M/s/cpu)
uprobe-nop            ( 6 cpus):   19.563 ± 0.024M/s  (  3.261M/s/cpu)
uprobe-nop            ( 7 cpus):   22.696 ± 0.054M/s  (  3.242M/s/cpu)
uprobe-nop            ( 8 cpus):   24.534 ± 0.010M/s  (  3.067M/s/cpu)
uprobe-nop            (10 cpus):   30.475 ± 0.117M/s  (  3.047M/s/cpu)
uprobe-nop            (12 cpus):   33.371 ± 0.017M/s  (  2.781M/s/cpu)
uprobe-nop            (14 cpus):   38.864 ± 0.004M/s  (  2.776M/s/cpu)
uprobe-nop            (16 cpus):   41.476 ± 0.020M/s  (  2.592M/s/cpu)
uprobe-nop            (24 cpus):   64.696 ± 0.021M/s  (  2.696M/s/cpu)
uprobe-nop            (32 cpus):   85.054 ± 0.027M/s  (  2.658M/s/cpu)
uprobe-nop            (40 cpus):  101.979 ± 0.032M/s  (  2.549M/s/cpu)
uprobe-nop            (48 cpus):  110.518 ± 0.056M/s  (  2.302M/s/cpu)
uprobe-nop            (56 cpus):  117.737 ± 0.020M/s  (  2.102M/s/cpu)
uprobe-nop            (64 cpus):  124.613 ± 0.079M/s  (  1.947M/s/cpu)
uprobe-nop            (72 cpus):  133.239 ± 0.032M/s  (  1.851M/s/cpu)
uprobe-nop            (80 cpus):  142.037 ± 0.138M/s  (  1.775M/s/cpu)

Previously, total throughput was maxing out at 11 mln/s and gradually
declining past 8 cores. With this change, it now keeps growing with each
added CPU, reaching 142 mln/s at 80 CPUs (measured on an 80-core Intel(R)
Xeon(R) Gold 6138 CPU @ 2.00GHz).

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 50 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index fa1024aad6c4..9dc6e78975c9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2047,6 +2047,52 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 	return is_trap_insn(&opcode);
 }
 
+static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct uprobe *uprobe = NULL;
+	struct vm_area_struct *vma;
+	struct file *vm_file;
+	struct inode *vm_inode;
+	unsigned long vm_pgoff, vm_start;
+	loff_t offset;
+	long seq;
+
+	guard(rcu)();
+
+	if (!mmap_lock_speculation_start(mm, &seq))
+		return NULL;
+
+	vma = vma_lookup(mm, bp_vaddr);
+	if (!vma)
+		return NULL;
+
+	/* vm_file memory can be reused for another instance of struct file,
+	 * but can't be freed from under us, so it's safe to read fields from
+	 * it, even if the values are some garbage values; ultimately
+	 * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
+	 * that whatever we speculatively found is correct
+	 */
+	vm_file = READ_ONCE(vma->vm_file);
+	if (!vm_file)
+		return NULL;
+
+	vm_pgoff = data_race(vma->vm_pgoff);
+	vm_start = data_race(vma->vm_start);
+	vm_inode = data_race(vm_file->f_inode);
+
+	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
+	uprobe = find_uprobe_rcu(vm_inode, offset);
+	if (!uprobe)
+		return NULL;
+
+	/* now double check that nothing about MM changed */
+	if (!mmap_lock_speculation_end(mm, seq))
+		return NULL;
+
+	return uprobe;
+}
+
 /* assumes being inside RCU protected region */
 static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
 {
@@ -2054,6 +2100,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	struct uprobe *uprobe = NULL;
 	struct vm_area_struct *vma;
 
+	uprobe = find_active_uprobe_speculative(bp_vaddr);
+	if (uprobe)
+		return uprobe;
+
 	mmap_read_lock(mm);
 	vma = vma_lookup(mm, bp_vaddr);
 	if (vma) {
-- 
2.43.5



* Re: [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
@ 2024-10-11  5:01   ` Oleg Nesterov
  2024-10-23 19:22   ` Peter Zijlstra
  1 sibling, 0 replies; 29+ messages in thread
From: Oleg Nesterov @ 2024-10-11  5:01 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, peterz, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On 10/10, Andrii Nakryiko wrote:
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 50 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)

FWIW,

Reviewed-by: Oleg Nesterov <oleg@redhat.com>



* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
@ 2024-10-13  7:56   ` Shakeel Butt
  2024-10-14 20:27     ` Andrii Nakryiko
  2024-10-23 20:10   ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Shakeel Butt @ 2024-10-13  7:56 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, peterz, oleg, rostedt, mhiramat,
	bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, hannes, Liam.Howlett,
	lorenzo.stoakes

On Thu, Oct 10, 2024 at 01:56:41PM GMT, Andrii Nakryiko wrote:
> From: Suren Baghdasaryan <surenb@google.com>
> 
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> Link: https://lore.kernel.org/bpf/20240912210222.186542-1-surenb@google.com

Looks good to me. mmap_lock_speculation_* functions could use kerneldoc
but that can be added later.
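
Something along these lines, perhaps (just a rough sketch of the kind of
kerneldoc I mean, not a concrete proposal):

/**
 * mmap_lock_speculation_start - begin speculative (lockless) access to mm state
 * @mm: mm_struct whose state is about to be read speculatively
 * @seq: output argument; snapshot of mm->mm_lock_seq to be passed to
 *       mmap_lock_speculation_end()
 *
 * Return: true if mmap_lock is not currently write-locked and speculative
 * access may proceed, false otherwise.
 */

/**
 * mmap_lock_speculation_end - validate a speculative access to mm state
 * @mm: mm_struct that was read speculatively
 * @seq: sequence number obtained from mmap_lock_speculation_start()
 *
 * Return: true if mm was not write-locked since the paired
 * mmap_lock_speculation_start(), i.e. the speculatively read state can be
 * trusted, false otherwise.
 */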

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>



* Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures Andrii Nakryiko
@ 2024-10-13  7:56   ` Shakeel Butt
  2024-10-17  2:01     ` Suren Baghdasaryan
  2024-10-23 19:31   ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Shakeel Butt @ 2024-10-13  7:56 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, peterz, oleg, rostedt, mhiramat,
	bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, hannes, Liam.Howlett,
	lorenzo.stoakes

On Thu, Oct 10, 2024 at 01:56:42PM GMT, Andrii Nakryiko wrote:
> To increase mm->mm_lock_seq robustness, switch it from int to long, so
> that it's a 64-bit counter on 64-bit systems and we can stop worrying
> about it wrapping around in just ~4 billion iterations. Same goes for
> VMA's matching vm_lock_seq, which is derived from mm_lock_seq.
> 
> I didn't use __u64 outright to keep 32-bit architectures unaffected, but
> if it seems important enough, I have nothing against using __u64.
> 
> Suggested-by: Jann Horn <jannh@google.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>


* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-13  7:56   ` Shakeel Butt
@ 2024-10-14 20:27     ` Andrii Nakryiko
  2024-10-14 20:48       ` Suren Baghdasaryan
  0 siblings, 1 reply; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-14 20:27 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, peterz, oleg,
	rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, mjguzik, brauner, jannh, mhocko, vbabka, hannes,
	Liam.Howlett, lorenzo.stoakes

On Sun, Oct 13, 2024 at 12:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 10, 2024 at 01:56:41PM GMT, Andrii Nakryiko wrote:
> > From: Suren Baghdasaryan <surenb@google.com>
> >
> > Add helper functions to speculatively perform operations without
> > read-locking mmap_lock, expecting that mmap_lock will not be
> > write-locked and mm is not modified from under us.
> >
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > Link: https://lore.kernel.org/bpf/20240912210222.186542-1-surenb@google.com
>
> Looks good to me. mmap_lock_speculation_* functions could use kerneldoc
> but that can be added later.

Yep, though probably best if Suren can do that in the follow up, as he
knows all the right words to use :)

>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>

Thanks!

>


* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-14 20:27     ` Andrii Nakryiko
@ 2024-10-14 20:48       ` Suren Baghdasaryan
  0 siblings, 0 replies; 29+ messages in thread
From: Suren Baghdasaryan @ 2024-10-14 20:48 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Shakeel Butt, Andrii Nakryiko, linux-trace-kernel, linux-mm,
	peterz, oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa,
	paulmck, willy, akpm, mjguzik, brauner, jannh, mhocko, vbabka,
	hannes, Liam.Howlett, lorenzo.stoakes

On Mon, Oct 14, 2024 at 1:27 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sun, Oct 13, 2024 at 12:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Oct 10, 2024 at 01:56:41PM GMT, Andrii Nakryiko wrote:
> > > From: Suren Baghdasaryan <surenb@google.com>
> > >
> > > Add helper functions to speculatively perform operations without
> > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > write-locked and mm is not modified from under us.
> > >
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > Link: https://lore.kernel.org/bpf/20240912210222.186542-1-surenb@google.com
> >
> > Looks good to me. mmap_lock_speculation_* functions could use kerneldoc
> > but that can be added later.
>
> Yep, though probably best if Suren can do that in the follow up, as he
> knows all the right words to use :)

Will add to my TODO list.

>
> >
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> >
>
> Thanks!
>
> >


* Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-13  7:56   ` Shakeel Butt
@ 2024-10-17  2:01     ` Suren Baghdasaryan
  2024-10-17 18:55       ` Andrii Nakryiko
  2024-10-23 19:02       ` Peter Zijlstra
  0 siblings, 2 replies; 29+ messages in thread
From: Suren Baghdasaryan @ 2024-10-17  2:01 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, peterz, oleg,
	rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, akpm,
	mjguzik, brauner, jannh, mhocko, vbabka, hannes, Liam.Howlett,
	lorenzo.stoakes

On Sun, Oct 13, 2024 at 12:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 10, 2024 at 01:56:42PM GMT, Andrii Nakryiko wrote:
> > To increase mm->mm_lock_seq robustness, switch it from int to long, so
> > that it's a 64-bit counter on 64-bit systems and we can stop worrying
> > about it wrapping around in just ~4 billion iterations. Same goes for
> > VMA's matching vm_lock_seq, which is derived from mm_lock_seq.

vm_lock_seq does not need to be long but for consistency I guess that
makes sense. While at it, can you please change these seq counters to
be unsigned?
Also, did you check with pahole if the vm_area_struct layout change
pushes some members into a different cacheline or creates new gaps?

> >
> > I didn't use __u64 outright to keep 32-bit architectures unaffected, but
> > if it seems important enough, I have nothing against using __u64.
> >
> > Suggested-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>


* Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-17  2:01     ` Suren Baghdasaryan
@ 2024-10-17 18:55       ` Andrii Nakryiko
  2024-10-17 19:42         ` Suren Baghdasaryan
  2024-10-23 19:02       ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-17 18:55 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Shakeel Butt, Andrii Nakryiko, linux-trace-kernel, linux-mm,
	peterz, oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa,
	paulmck, willy, akpm, mjguzik, brauner, jannh, mhocko, vbabka,
	hannes, Liam.Howlett, lorenzo.stoakes

On Wed, Oct 16, 2024 at 7:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sun, Oct 13, 2024 at 12:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Oct 10, 2024 at 01:56:42PM GMT, Andrii Nakryiko wrote:
> > > To increase mm->mm_lock_seq robustness, switch it from int to long, so
> > > that it's a 64-bit counter on 64-bit systems and we can stop worrying
> > > about it wrapping around in just ~4 billion iterations. Same goes for
> > > VMA's matching vm_lock_seq, which is derived from mm_lock_seq.
>
> vm_lock_seq does not need to be long but for consistency I guess that

How come, we literally assign vm_lock_seq from mm_lock_seq and do
direct comparisons. They have to be exactly the same type, no?

> makes sense. While at it, can you please change these seq counters to
> be unsigned?

There is `vma->vm_lock_seq = -1;` in kernel/fork.c, should it be
switched to ULONG_MAX then? In general, unless this is critical for
correctness, I'd very much like stuff like this to be done in the mm
tree afterwards, but it seems trivial enough, so if you insist I'll do
it.

> Also, did you check with pahole if the vm_area_struct layout change
> pushes some members into a difference cacheline or creates new gaps?
>

Just did. We had 3 byte hole after `bool detached;`, it now grew to 7
bytes (so +4) and then vm_lock_seq itself is now 8 bytes (so +4),
which now does push rb and rb_subtree_last into *THE SAME* cache line
(which sounds like an improvement to me). vm_lock_seq and vm_lock stay
in the same cache line. vm_pgoff and vm_file are now in the same cache
line, and given they are probably always accessed together, seems like
a good accidental change as well. See below pahole outputs before and
after.

That singular detached bool looks like a complete waste, tbh. Maybe it
would be better to roll it into vm_flags and save 8 bytes? (not that I
want to do those mm changes in this patch set, of course...).
vm_area_struct is otherwise nicely tightly packed.

tl;dr, seems fine, and detached would be best to get rid of, if
possible (but that's a completely separate thing)

BEFORE
======
struct vm_area_struct {
        union {
                struct {
                        long unsigned int vm_start;      /*     0     8 */
                        long unsigned int vm_end;        /*     8     8 */
                };                                       /*     0    16 */
                struct callback_head vm_rcu;             /*     0    16 */
        } __attribute__((__aligned__(8)));               /*     0    16 */
        struct mm_struct *         vm_mm;                /*    16     8 */
        pgprot_t                   vm_page_prot;         /*    24     8 */
        union {
                const vm_flags_t   vm_flags;             /*    32     8 */
                vm_flags_t         __vm_flags;           /*    32     8 */
        };                                               /*    32     8 */
        bool                       detached;             /*    40     1 */

        /* XXX 3 bytes hole, try to pack */

        int                        vm_lock_seq;          /*    44     4 */
        struct vma_lock *          vm_lock;              /*    48     8 */
        struct {
                struct rb_node     rb;                   /*    56    24 */
                /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
                long unsigned int  rb_subtree_last;      /*    80     8 */
        }                                                /*    56    32 */
        struct list_head           anon_vma_chain;       /*    88    16 */
        struct anon_vma *          anon_vma;             /*   104     8 */
        const struct vm_operations_struct  * vm_ops;     /*   112     8 */
        long unsigned int          vm_pgoff;             /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        struct file *              vm_file;              /*   128     8 */
        void *                     vm_private_data;      /*   136     8 */
        atomic_long_t              swap_readahead_info;  /*   144     8 */
        struct mempolicy *         vm_policy;            /*   152     8 */
        struct vma_numab_state *   numab_state;          /*   160     8 */
        struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     8 */

        /* size: 176, cachelines: 3, members: 18 */
        /* sum members: 173, holes: 1, sum holes: 3 */
        /* forced alignments: 2 */
        /* last cacheline: 48 bytes */
} __attribute__((__aligned__(8)));

AFTER
=====
struct vm_area_struct {
        union {
                struct {
                        long unsigned int vm_start;      /*     0     8 */
                        long unsigned int vm_end;        /*     8     8 */
                };                                       /*     0    16 */
                struct callback_head vm_rcu;             /*     0    16 */
        } __attribute__((__aligned__(8)));               /*     0    16 */
        struct mm_struct *         vm_mm;                /*    16     8 */
        pgprot_t                   vm_page_prot;         /*    24     8 */
        union {
                const vm_flags_t   vm_flags;             /*    32     8 */
                vm_flags_t         __vm_flags;           /*    32     8 */
        };                                               /*    32     8 */
        bool                       detached;             /*    40     1 */

        /* XXX 7 bytes hole, try to pack */

        long int                   vm_lock_seq;          /*    48     8 */
        struct vma_lock *          vm_lock;              /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct {
                struct rb_node     rb;                   /*    64    24 */
                long unsigned int  rb_subtree_last;      /*    88     8 */
        }                                                /*    64    32 */
        struct list_head           anon_vma_chain;       /*    96    16 */
        struct anon_vma *          anon_vma;             /*   112     8 */
        const struct vm_operations_struct  * vm_ops;     /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        long unsigned int          vm_pgoff;             /*   128     8 */
        struct file *              vm_file;              /*   136     8 */
        void *                     vm_private_data;      /*   144     8 */
        atomic_long_t              swap_readahead_info;  /*   152     8 */
        struct mempolicy *         vm_policy;            /*   160     8 */
        struct vma_numab_state *   numab_state;          /*   168     8 */
        struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */

        /* size: 184, cachelines: 3, members: 18 */
        /* sum members: 177, holes: 1, sum holes: 7 */
        /* forced alignments: 2 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));


> > >
> > > I didn't use __u64 outright to keep 32-bit architectures unaffected, but
> > > if it seems important enough, I have nothing against using __u64.
> > >
> > > Suggested-by: Jann Horn <jannh@google.com>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>


* Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-17 18:55       ` Andrii Nakryiko
@ 2024-10-17 19:42         ` Suren Baghdasaryan
  2024-10-17 20:12           ` Andrii Nakryiko
  0 siblings, 1 reply; 29+ messages in thread
From: Suren Baghdasaryan @ 2024-10-17 19:42 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Shakeel Butt, Andrii Nakryiko, linux-trace-kernel, linux-mm,
	peterz, oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa,
	paulmck, willy, akpm, mjguzik, brauner, jannh, mhocko, vbabka,
	hannes, Liam.Howlett, lorenzo.stoakes

On Thu, Oct 17, 2024 at 11:55 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Oct 16, 2024 at 7:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Sun, Oct 13, 2024 at 12:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Oct 10, 2024 at 01:56:42PM GMT, Andrii Nakryiko wrote:
> > > > To increase mm->mm_lock_seq robustness, switch it from int to long, so
> > > > that it's a 64-bit counter on 64-bit systems and we can stop worrying
> > > > about it wrapping around in just ~4 billion iterations. Same goes for
> > > > VMA's matching vm_lock_seq, which is derived from mm_lock_seq.
> >
> > vm_lock_seq does not need to be long but for consistency I guess that
>
> How come, we literally assign vm_lock_seq from mm_lock_seq and do
> direct comparisons. They have to be exactly the same type, no?

Not necessarily. vm_lock_seq is a snapshot of the mm_lock_seq but it
does not have to be a "complete" snapshot. Just something that has a
very high probability of identifying a match and a rare false positive
is not a problem (see comment in
https://elixir.bootlin.com/linux/v6.11.3/source/include/linux/mm.h#L678).
So, something like this for taking and comparing a snapshot would do:

vma->vm_lock_seq = (unsigned int)mm->mm_lock_seq;
if (vma->vm_lock_seq == (unsigned int)mm->mm_lock_seq)

>
> > makes sense. While at it, can you please change these seq counters to
> > be unsigned?
>
> There is `vma->vm_lock_seq = -1;` in kernel/fork.c, should it be
> switched to ULONG_MAX then? In general, unless this is critical for
> correctness, I'd very much like stuff like this to be done in the mm
> tree afterwards, but it seems trivial enough, so if you insist I'll do
> it.

Yeah, ULONG_MAX should work fine here. vma->vm_lock_seq is initialized
to -1 to avoid false initial match with mm->mm_lock_seq which is
initialized to 0. As I said, a false match is not a problem but if we
can avoid it, that's better.

>
> > Also, did you check with pahole if the vm_area_struct layout change
> > pushes some members into a different cacheline or creates new gaps?
> >
>
> Just did. We had 3 byte hole after `bool detached;`, it now grew to 7
> bytes (so +4) and then vm_lock_seq itself is now 8 bytes (so +4),
> which now does push rb and rb_subtree_last into *THE SAME* cache line
> (which sounds like an improvement to me). vm_lock_seq and vm_lock stay
> in the same cache line. vm_pgoff and vm_file are now in the same cache
> line, and given they are probably always accessed together, seems like
> a good accidental change as well. See below pahole outputs before and
> after.

Ok, sounds good to me. Looks like keeping both sequence numbers 64bit
is not an issue. Changing them to unsigned would be nice and trivial
but I don't insist. You can add:

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> That singular detached bool looks like a complete waste, tbh. Maybe it
> would be better to roll it into vm_flags and save 8 bytes? (not that I
> want to do those mm changes in this patch set, of course...).
> vm_area_struct is otherwise nicely tightly packed.
>
> tl;dr, seems fine, and detached would be best to get rid of, if
> possible (but that's a completely separate thing)

Yeah, I'll take a look at that. Thanks!

>
> BEFORE
> ======
> struct vm_area_struct {
>         union {
>                 struct {
>                         long unsigned int vm_start;      /*     0     8 */
>                         long unsigned int vm_end;        /*     8     8 */
>                 };                                       /*     0    16 */
>                 struct callback_head vm_rcu;             /*     0    16 */
>         } __attribute__((__aligned__(8)));               /*     0    16 */
>         struct mm_struct *         vm_mm;                /*    16     8 */
>         pgprot_t                   vm_page_prot;         /*    24     8 */
>         union {
>                 const vm_flags_t   vm_flags;             /*    32     8 */
>                 vm_flags_t         __vm_flags;           /*    32     8 */
>         };                                               /*    32     8 */
>         bool                       detached;             /*    40     1 */
>
>         /* XXX 3 bytes hole, try to pack */
>
>         int                        vm_lock_seq;          /*    44     4 */
>         struct vma_lock *          vm_lock;              /*    48     8 */
>         struct {
>                 struct rb_node     rb;                   /*    56    24 */
>                 /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
>                 long unsigned int  rb_subtree_last;      /*    80     8 */
>         }                                                /*    56    32 */
>         struct list_head           anon_vma_chain;       /*    88    16 */
>         struct anon_vma *          anon_vma;             /*   104     8 */
>         const struct vm_operations_struct  * vm_ops;     /*   112     8 */
>         long unsigned int          vm_pgoff;             /*   120     8 */
>         /* --- cacheline 2 boundary (128 bytes) --- */
>         struct file *              vm_file;              /*   128     8 */
>         void *                     vm_private_data;      /*   136     8 */
>         atomic_long_t              swap_readahead_info;  /*   144     8 */
>         struct mempolicy *         vm_policy;            /*   152     8 */
>         struct vma_numab_state *   numab_state;          /*   160     8 */
>         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     8 */
>
>         /* size: 176, cachelines: 3, members: 18 */
>         /* sum members: 173, holes: 1, sum holes: 3 */
>         /* forced alignments: 2 */
>         /* last cacheline: 48 bytes */
> } __attribute__((__aligned__(8)));
>
> AFTER
> =====
> struct vm_area_struct {
>         union {
>                 struct {
>                         long unsigned int vm_start;      /*     0     8 */
>                         long unsigned int vm_end;        /*     8     8 */
>                 };                                       /*     0    16 */
>                 struct callback_head vm_rcu;             /*     0    16 */
>         } __attribute__((__aligned__(8)));               /*     0    16 */
>         struct mm_struct *         vm_mm;                /*    16     8 */
>         pgprot_t                   vm_page_prot;         /*    24     8 */
>         union {
>                 const vm_flags_t   vm_flags;             /*    32     8 */
>                 vm_flags_t         __vm_flags;           /*    32     8 */
>         };                                               /*    32     8 */
>         bool                       detached;             /*    40     1 */
>
>         /* XXX 7 bytes hole, try to pack */
>
>         long int                   vm_lock_seq;          /*    48     8 */
>         struct vma_lock *          vm_lock;              /*    56     8 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         struct {
>                 struct rb_node     rb;                   /*    64    24 */
>                 long unsigned int  rb_subtree_last;      /*    88     8 */
>         }                                                /*    64    32 */
>         struct list_head           anon_vma_chain;       /*    96    16 */
>         struct anon_vma *          anon_vma;             /*   112     8 */
>         const struct vm_operations_struct  * vm_ops;     /*   120     8 */
>         /* --- cacheline 2 boundary (128 bytes) --- */
>         long unsigned int          vm_pgoff;             /*   128     8 */
>         struct file *              vm_file;              /*   136     8 */
>         void *                     vm_private_data;      /*   144     8 */
>         atomic_long_t              swap_readahead_info;  /*   152     8 */
>         struct mempolicy *         vm_policy;            /*   160     8 */
>         struct vma_numab_state *   numab_state;          /*   168     8 */
>         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */
>
>         /* size: 184, cachelines: 3, members: 18 */
>         /* sum members: 177, holes: 1, sum holes: 7 */
>         /* forced alignments: 2 */
>         /* last cacheline: 56 bytes */
> } __attribute__((__aligned__(8)));
>
>
> > > >
> > > > I didn't use __u64 outright to keep 32-bit architectures unaffected, but
> > > > if it seems important enough, I have nothing against using __u64.
> > > >
> > > > Suggested-by: Jann Horn <jannh@google.com>
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > >
> > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> >


* Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-17 19:42         ` Suren Baghdasaryan
@ 2024-10-17 20:12           ` Andrii Nakryiko
  0 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-17 20:12 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Shakeel Butt, Andrii Nakryiko, linux-trace-kernel, linux-mm,
	peterz, oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa,
	paulmck, willy, akpm, mjguzik, brauner, jannh, mhocko, vbabka,
	hannes, Liam.Howlett, lorenzo.stoakes

On Thu, Oct 17, 2024 at 12:42 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Oct 17, 2024 at 11:55 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Oct 16, 2024 at 7:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Sun, Oct 13, 2024 at 12:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Thu, Oct 10, 2024 at 01:56:42PM GMT, Andrii Nakryiko wrote:
> > > > > To increase mm->mm_lock_seq robustness, switch it from int to long, so
> > > > > that it's a 64-bit counter on 64-bit systems and we can stop worrying
> > > > > about it wrapping around in just ~4 billion iterations. Same goes for
> > > > > VMA's matching vm_lock_seq, which is derived from mm_lock_seq.
> > >
> > > vm_lock_seq does not need to be long but for consistency I guess that
> >
> > How come, we literally assign vm_lock_seq from mm_lock_seq and do
> > direct comparisons. They have to be exactly the same type, no?
>
> Not necessarily. vm_lock_seq is a snapshot of the mm_lock_seq but it
> does not have to be a "complete" snapshot. Just something that has a
> very high probability of identifying a match and a rare false positive
> is not a problem (see comment in
> https://elixir.bootlin.com/linux/v6.11.3/source/include/linux/mm.h#L678).
> So, something like this for taking and comparing a snapshot would do:
>
> vma->vm_lock_seq = (unsigned int)mm->mm_lock_seq;
> if (vma->vm_lock_seq == (unsigned int)mm->mm_lock_seq)

Ah, ok, I see what's the idea behind it, makes sense.

>
> >
> > > makes sense. While at it, can you please change these seq counters to
> > > be unsigned?
> >
> > There is `vma->vm_lock_seq = -1;` in kernel/fork.c, should it be
> > switched to ULONG_MAX then? In general, unless this is critical for
> > correctness, I'd very much like stuff like this to be done in the mm
> > tree afterwards, but it seems trivial enough, so if you insist I'll do
> > it.
>
> Yeah, ULONG_MAX should work fine here. vma->vm_lock_seq is initialized
> to -1 to avoid false initial match with mm->mm_lock_seq which is
> initialized to 0. As I said, a false match is not a problem but if we
> can avoid it, that's better.

ok, ULONG_MAX and unsigned long it is, will update

>
> >
> > > Also, did you check with pahole if the vm_area_struct layout change
> > > pushes some members into a different cacheline or creates new gaps?
> > >
> >
> > Just did. We had 3 byte hole after `bool detached;`, it now grew to 7
> > bytes (so +4) and then vm_lock_seq itself is now 8 bytes (so +4),
> > which now does push rb and rb_subtree_last into *THE SAME* cache line
> > (which sounds like an improvement to me). vm_lock_seq and vm_lock stay
> > in the same cache line. vm_pgoff and vm_file are now in the same cache
> > line, and given they are probably always accessed together, seems like
> > a good accidental change as well. See below pahole outputs before and
> > after.
>
> Ok, sounds good to me. Looks like keeping both sequence numbers 64bit
> is not an issue. Changing them to unsigned would be nice and trivial
> but I don't insist. You can add:
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

thanks, will switch to unsigned in the next revision (next week,
probably, to let some of the pending patches land)

>
> >
> > That singular detached bool looks like a complete waste, tbh. Maybe it
> > would be better to roll it into vm_flags and save 8 bytes? (not that I
> > want to do those mm changes in this patch set, of course...).
> > vm_area_struct is otherwise nicely tightly packed.
> >
> > tl;dr, seems fine, and detached would be best to get rid of, if
> > possible (but that's a completely separate thing)
>
> Yeah, I'll take a look at that. Thanks!
>
> >
> > BEFORE
> > ======
> > struct vm_area_struct {
> >         union {
> >                 struct {
> >                         long unsigned int vm_start;      /*     0     8 */
> >                         long unsigned int vm_end;        /*     8     8 */
> >                 };                                       /*     0    16 */
> >                 struct callback_head vm_rcu;             /*     0    16 */
> >         } __attribute__((__aligned__(8)));               /*     0    16 */
> >         struct mm_struct *         vm_mm;                /*    16     8 */
> >         pgprot_t                   vm_page_prot;         /*    24     8 */
> >         union {
> >                 const vm_flags_t   vm_flags;             /*    32     8 */
> >                 vm_flags_t         __vm_flags;           /*    32     8 */
> >         };                                               /*    32     8 */
> >         bool                       detached;             /*    40     1 */
> >
> >         /* XXX 3 bytes hole, try to pack */
> >
> >         int                        vm_lock_seq;          /*    44     4 */
> >         struct vma_lock *          vm_lock;              /*    48     8 */
> >         struct {
> >                 struct rb_node     rb;                   /*    56    24 */
> >                 /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
> >                 long unsigned int  rb_subtree_last;      /*    80     8 */
> >         }                                                /*    56    32 */
> >         struct list_head           anon_vma_chain;       /*    88    16 */
> >         struct anon_vma *          anon_vma;             /*   104     8 */
> >         const struct vm_operations_struct  * vm_ops;     /*   112     8 */
> >         long unsigned int          vm_pgoff;             /*   120     8 */
> >         /* --- cacheline 2 boundary (128 bytes) --- */
> >         struct file *              vm_file;              /*   128     8 */
> >         void *                     vm_private_data;      /*   136     8 */
> >         atomic_long_t              swap_readahead_info;  /*   144     8 */
> >         struct mempolicy *         vm_policy;            /*   152     8 */
> >         struct vma_numab_state *   numab_state;          /*   160     8 */
> >         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     8 */
> >
> >         /* size: 176, cachelines: 3, members: 18 */
> >         /* sum members: 173, holes: 1, sum holes: 3 */
> >         /* forced alignments: 2 */
> >         /* last cacheline: 48 bytes */
> > } __attribute__((__aligned__(8)));
> >
> > AFTER
> > =====
> > struct vm_area_struct {
> >         union {
> >                 struct {
> >                         long unsigned int vm_start;      /*     0     8 */
> >                         long unsigned int vm_end;        /*     8     8 */
> >                 };                                       /*     0    16 */
> >                 struct callback_head vm_rcu;             /*     0    16 */
> >         } __attribute__((__aligned__(8)));               /*     0    16 */
> >         struct mm_struct *         vm_mm;                /*    16     8 */
> >         pgprot_t                   vm_page_prot;         /*    24     8 */
> >         union {
> >                 const vm_flags_t   vm_flags;             /*    32     8 */
> >                 vm_flags_t         __vm_flags;           /*    32     8 */
> >         };                                               /*    32     8 */
> >         bool                       detached;             /*    40     1 */
> >
> >         /* XXX 7 bytes hole, try to pack */
> >
> >         long int                   vm_lock_seq;          /*    48     8 */
> >         struct vma_lock *          vm_lock;              /*    56     8 */
> >         /* --- cacheline 1 boundary (64 bytes) --- */
> >         struct {
> >                 struct rb_node     rb;                   /*    64    24 */
> >                 long unsigned int  rb_subtree_last;      /*    88     8 */
> >         }                                                /*    64    32 */
> >         struct list_head           anon_vma_chain;       /*    96    16 */
> >         struct anon_vma *          anon_vma;             /*   112     8 */
> >         const struct vm_operations_struct  * vm_ops;     /*   120     8 */
> >         /* --- cacheline 2 boundary (128 bytes) --- */
> >         long unsigned int          vm_pgoff;             /*   128     8 */
> >         struct file *              vm_file;              /*   136     8 */
> >         void *                     vm_private_data;      /*   144     8 */
> >         atomic_long_t              swap_readahead_info;  /*   152     8 */
> >         struct mempolicy *         vm_policy;            /*   160     8 */
> >         struct vma_numab_state *   numab_state;          /*   168     8 */
> >         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */
> >
> >         /* size: 184, cachelines: 3, members: 18 */
> >         /* sum members: 177, holes: 1, sum holes: 7 */
> >         /* forced alignments: 2 */
> >         /* last cacheline: 56 bytes */
> > } __attribute__((__aligned__(8)));
> >
> >
> > > > >
> > > > > I didn't use __u64 outright to keep 32-bit architectures unaffected, but
> > > > > if it seems important enough, I have nothing against using __u64.
> > > > >
> > > > > Suggested-by: Jann Horn <jannh@google.com>
> > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > >
> > > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-10-10 20:56 [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
@ 2024-10-23 17:54 ` Andrii Nakryiko
  4 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-23 17:54 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, peterz, oleg, rostedt, mhiramat,
	bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Thu, Oct 10, 2024 at 1:56 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Implement speculative (lockless) resolution of VMA to inode to uprobe,
> bypassing the need to take mmap_lock for reads, if possible. Patch #1 by Suren
> adds mm_struct helpers that help detect whether mm_struct was changed, which
> is used by uprobe logic to validate that speculative results can be trusted
> after all the lookup logic results in a valid uprobe instance. Patch #2
> follows to make mm_lock_seq into 64-bit counter (on 64-bit architectures), as
> requested by Jann Horn.
>
> Patch #3 is a simplification to uprobe VMA flag checking, suggested by Oleg.
>
> And, finally, patch #4 is the speculative VMA-to-uprobe resolution logic
> itself, and is the focal point of this patch set. It makes entry uprobes in
> common case scale very well with number of CPUs, as we avoid any locking or
> cache line bouncing between CPUs. See corresponding patch for details and
> benchmarking results.
>
> Note, this patch set assumes that FMODE_BACKING files were switched to have
> SLAB_TYPE_SAFE_BY_RCU semantics, which was recently done by Christian Brauner
> in [0]. This change can be pulled into perf/core through stable
> tags/vfs-6.13.for-bpf.file tag from [1].
>
>   [0] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.13.for-bpf.file&id=8b1bc2590af61129b82a189e9dc7c2804c34400e
>   [1] git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
>
> v2->v3:
> - dropped kfree_rcu() patch (Christian);
> - added data_race() annotations for fields of vma and vma->vm_file which could
>   be modified during speculative lookup (Oleg);
> - fixed int->long problem in stubs for mmap_lock_speculation_{start,end}(),
>   caught by Kernel test robot;
> v1->v2:
> - adjusted vma_end_write_all() comment to point out it should never be called
>   manually now, but I wasn't sure how ACQUIRE/RELEASE comments should be
>   reworded (previously requested by Jann), so I'd appreciate some help there
>   (Jann);
> - int -> long change for mm_lock_seq, as agreed at LPC2024 (Jann, Suren, Liam);
> - kfree_rcu_mightsleep() for FMODE_BACKING (Suren, Christian);
> - vm_flags simplification in find_active_uprobe_rcu() and
>   find_active_uprobe_speculative() (Oleg);
> - guard(rcu)() simplified find_active_uprobe_speculative() implementation.
>
> Andrii Nakryiko (3):
>   mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
>   uprobes: simplify find_active_uprobe_rcu() VMA checks
>   uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
>
> Suren Baghdasaryan (1):
>   mm: introduce mmap_lock_speculation_{start|end}
>
>  include/linux/mm.h        |  6 ++--
>  include/linux/mm_types.h  |  7 ++--
>  include/linux/mmap_lock.h | 72 ++++++++++++++++++++++++++++++++-------
>  kernel/events/uprobes.c   | 52 +++++++++++++++++++++++++++-
>  kernel/fork.c             |  3 --
>  5 files changed, 119 insertions(+), 21 deletions(-)
>
> --
> 2.43.5
>

This applies cleanly to tip/perf/core with or without Jiri's patches
([0]). No need to rebase and resend; this is ready to go in.

  [0] https://lore.kernel.org/all/20241018202252.693462-1-jolsa@kernel.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-17  2:01     ` Suren Baghdasaryan
  2024-10-17 18:55       ` Andrii Nakryiko
@ 2024-10-23 19:02       ` Peter Zijlstra
  2024-10-23 19:12         ` Andrii Nakryiko
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2024-10-23 19:02 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Shakeel Butt, Andrii Nakryiko, linux-trace-kernel, linux-mm, oleg,
	rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, akpm,
	mjguzik, brauner, jannh, mhocko, vbabka, hannes, Liam.Howlett,
	lorenzo.stoakes

On Wed, Oct 16, 2024 at 07:01:59PM -0700, Suren Baghdasaryan wrote:
> On Sun, Oct 13, 2024 at 12:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Oct 10, 2024 at 01:56:42PM GMT, Andrii Nakryiko wrote:
> > > To increase mm->mm_lock_seq robustness, switch it from int to long, so
> > > that it's a 64-bit counter on 64-bit systems and we can stop worrying
> > > about it wrapping around in just ~4 billion iterations. Same goes for
> > > VMA's matching vm_lock_seq, which is derived from mm_lock_seq.
> 
> vm_lock_seq does not need to be long but for consistency I guess that
> makes sense. While at it, can you please change these seq counters to
> be unsigned?

Yeah, that. Kees is waging war on signed types that 'overflow'. These
sequence counter thingies are designed to wrap and should very much be
unsigned.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-23 19:02       ` Peter Zijlstra
@ 2024-10-23 19:12         ` Andrii Nakryiko
  0 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-23 19:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, Shakeel Butt, Andrii Nakryiko,
	linux-trace-kernel, linux-mm, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, akpm, mjguzik, brauner,
	jannh, mhocko, vbabka, hannes, Liam.Howlett, lorenzo.stoakes

On Wed, Oct 23, 2024 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 16, 2024 at 07:01:59PM -0700, Suren Baghdasaryan wrote:
> > On Sun, Oct 13, 2024 at 12:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Oct 10, 2024 at 01:56:42PM GMT, Andrii Nakryiko wrote:
> > > > To increase mm->mm_lock_seq robustness, switch it from int to long, so
> > > > that it's a 64-bit counter on 64-bit systems and we can stop worrying
> > > > about it wrapping around in just ~4 billion iterations. Same goes for
> > > > VMA's matching vm_lock_seq, which is derived from mm_lock_seq.
> >
> > vm_lock_seq does not need to be long but for consistency I guess that
> > makes sense. While at it, can you please change these seq counters to
> > be unsigned?
>
> Yeah, that. Kees is waging war on signed types that 'overflow'. These
> sequence counter thingies are designed to wrap and should very much be
> unsigned.

Ah, ok, I already forgot I need to update this. Will do.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
  2024-10-11  5:01   ` Oleg Nesterov
@ 2024-10-23 19:22   ` Peter Zijlstra
  2024-10-23 20:02     ` Andrii Nakryiko
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2024-10-23 19:22 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Thu, Oct 10, 2024 at 01:56:44PM -0700, Andrii Nakryiko wrote:

> Suggested-by: Matthew Wilcox <willy@infradead.org>

I'm fairly sure I've suggested much the same :-)

> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 50 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index fa1024aad6c4..9dc6e78975c9 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2047,6 +2047,52 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
>  	return is_trap_insn(&opcode);
>  }
>  
> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct uprobe *uprobe = NULL;
> +	struct vm_area_struct *vma;
> +	struct file *vm_file;
> +	struct inode *vm_inode;
> +	unsigned long vm_pgoff, vm_start;
> +	loff_t offset;
> +	long seq;
> +
> +	guard(rcu)();
> +
> +	if (!mmap_lock_speculation_start(mm, &seq))
> +		return NULL;

So traditional seqcount assumed non-preemptible lock sides and would
spin-wait for the LSB to clear, but for PREEMPT_RT we added preemptible
seqcount support, which takes the lock to wait. In this case that is
exactly the same as returning NULL and doing the lookup holding
mmap_lock, so yeah.
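
For reference, the classic non-preemptible read side would be (sketch
with a hypothetical seqcount_t, not the helpers from this patch):

        unsigned int seq;

        do {
                seq = read_seqcount_begin(&sc); /* spins while a writer is active */
                /* ... speculative lookup ... */
        } while (read_seqcount_retry(&sc, seq));

whereas here a failed mmap_lock_speculation_start()/_end() pair simply
makes the caller fall back to the mmap_lock-protected lookup instead of
retrying.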

> +
> +	vma = vma_lookup(mm, bp_vaddr);
> +	if (!vma)
> +		return NULL;
> +
> +	/* vm_file memory can be reused for another instance of struct file,

Comment style nit.

> +	 * but can't be freed from under us, so it's safe to read fields from
> +	 * it, even if the values are some garbage values; ultimately
> +	 * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
> +	 * that whatever we speculatively found is correct
> +	 */
> +	vm_file = READ_ONCE(vma->vm_file);
> +	if (!vm_file)
> +		return NULL;
> +
> +	vm_pgoff = data_race(vma->vm_pgoff);
> +	vm_start = data_race(vma->vm_start);
> +	vm_inode = data_race(vm_file->f_inode);

So... seqcount has kcsan annotations other than data_race(). I suppose
this works, but it all feels like a bad copy with random changes.

> +
> +	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> +	uprobe = find_uprobe_rcu(vm_inode, offset);
> +	if (!uprobe)
> +		return NULL;
> +
> +	/* now double check that nothing about MM changed */
> +	if (!mmap_lock_speculation_end(mm, seq))
> +		return NULL;

Typically seqcount does a re-try here.

> +
> +	return uprobe;
> +}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures Andrii Nakryiko
  2024-10-13  7:56   ` Shakeel Butt
@ 2024-10-23 19:31   ` Peter Zijlstra
  1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2024-10-23 19:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Thu, Oct 10, 2024 at 01:56:42PM -0700, Andrii Nakryiko wrote:
> To increase mm->mm_lock_seq robustness, switch it from int to long, so
> that it's a 64-bit counter on 64-bit systems and we can stop worrying
> about it wrapping around in just ~4 billion iterations. Same goes for
> VMA's matching vm_lock_seq, which is derived from mm_lock_seq.
> 
> I didn't use __u64 outright to keep 32-bit architectures unaffected, but
> if it seems important enough, I have nothing against using __u64.

(__uXX are the uapi types)

> 
> Suggested-by: Jann Horn <jannh@google.com>

Jann, do you see problems with the normal seqcount being unsigned (int)?
I suppose especially for preemptible seqcounts it might already be
entirely feasible to wrap them?

Doing u64 is tricky but not impossible; it would require something like
what we do for GUP_GET_PXX_LOW_HIGH. OTOH, I don't think we really care
about 32bit enough to bother.
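
Something along these lines, if anyone cares (sketch only; the _hi/_lo
split fields are made up for illustration):

        u32 lo, hi;
        u64 seq;

        do {
                hi = smp_load_acquire(&mm->mm_lock_seq_hi);
                lo = smp_load_acquire(&mm->mm_lock_seq_lo);
        } while (hi != smp_load_acquire(&mm->mm_lock_seq_hi));
        seq = ((u64)hi << 32) | lo;

with the writer incrementing _lo and carrying into _hi under mmap_lock.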

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-23 19:22   ` Peter Zijlstra
@ 2024-10-23 20:02     ` Andrii Nakryiko
  2024-10-23 20:19       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-23 20:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
	mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Wed, Oct 23, 2024 at 12:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 10, 2024 at 01:56:44PM -0700, Andrii Nakryiko wrote:
>
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
>
> I'm fairly sure I've suggested much the same :-)

I'll add another Suggested-by; didn't mean to rob anyone of credit :)

>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  kernel/events/uprobes.c | 50 +++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 50 insertions(+)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index fa1024aad6c4..9dc6e78975c9 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -2047,6 +2047,52 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> >       return is_trap_insn(&opcode);
> >  }
> >
> > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > +{
> > +     struct mm_struct *mm = current->mm;
> > +     struct uprobe *uprobe = NULL;
> > +     struct vm_area_struct *vma;
> > +     struct file *vm_file;
> > +     struct inode *vm_inode;
> > +     unsigned long vm_pgoff, vm_start;
> > +     loff_t offset;
> > +     long seq;
> > +
> > +     guard(rcu)();
> > +
> > +     if (!mmap_lock_speculation_start(mm, &seq))
> > +             return NULL;
>
> So traditional seqcount assumed non-preemptible lock sides and would
> spin-wait for the LSB to clear, but for PREEMPT_RT we added preemptible
> seqcount support and that takes the lock to wait, which in this case is
> exactly the same as returning NULL and doing the lookup holding
> mmap_lock, so yeah.
>

yep, and on configurations with CONFIG_PER_VMA_LOCK=n this will always
return false
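
(i.e. roughly this shape for the !CONFIG_PER_VMA_LOCK stubs; just a
sketch, the exact signatures are whatever patch #1 ends up with):

        static inline bool mmap_lock_speculation_start(struct mm_struct *mm, long *seq)
        {
                return false;
        }

        static inline bool mmap_lock_speculation_end(struct mm_struct *mm, long seq)
        {
                return false;
        }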


> > +
> > +     vma = vma_lookup(mm, bp_vaddr);
> > +     if (!vma)
> > +             return NULL;
> > +
> > +     /* vm_file memory can be reused for another instance of struct file,
>
> Comment style nit.

muscle memory, sorry, missed this one

>
> > +      * but can't be freed from under us, so it's safe to read fields from
> > +      * it, even if the values are some garbage values; ultimately
> > +      * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
> > +      * that whatever we speculatively found is correct
> > +      */
> > +     vm_file = READ_ONCE(vma->vm_file);
> > +     if (!vm_file)
> > +             return NULL;
> > +
> > +     vm_pgoff = data_race(vma->vm_pgoff);
> > +     vm_start = data_race(vma->vm_start);
> > +     vm_inode = data_race(vm_file->f_inode);
>
> So... seqcount has kcsan annotations other than data_race(). I suppose
> this works, but it all feels like a bad copy with random changes.

I'm not sure what this means... Do I need to change anything? Drop
data_race()? Use READ_ONCE()? Do nothing?

>
> > +
> > +     offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > +     uprobe = find_uprobe_rcu(vm_inode, offset);
> > +     if (!uprobe)
> > +             return NULL;
> > +
> > +     /* now double check that nothing about MM changed */
> > +     if (!mmap_lock_speculation_end(mm, seq))
> > +             return NULL;
>
> Typically seqcount does a re-try here.

I'd like to keep it simple; we have a fallback to the locked version in
case of a race
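
The caller side then looks roughly like this (a sketch of the intended
shape, not the exact code from patch #4):

        struct uprobe *uprobe;

        uprobe = find_active_uprobe_speculative(bp_vaddr);
        if (uprobe)
                return uprobe;

        /* raced with an mmap_lock writer (or speculation disabled): slow path */
        mmap_read_lock(mm);
        /* ... vma_lookup() + find_uprobe_rcu() under the lock ... */
        mmap_read_unlock(mm);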

>
> > +
> > +     return uprobe;
> > +}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-10 20:56 ` [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
  2024-10-13  7:56   ` Shakeel Butt
@ 2024-10-23 20:10   ` Peter Zijlstra
  2024-10-23 22:17     ` Suren Baghdasaryan
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2024-10-23 20:10 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Thu, Oct 10, 2024 at 01:56:41PM -0700, Andrii Nakryiko wrote:
> From: Suren Baghdasaryan <surenb@google.com>
> 
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> Link: https://lore.kernel.org/bpf/20240912210222.186542-1-surenb@google.com
> ---
>  include/linux/mm_types.h  |  3 ++
>  include/linux/mmap_lock.h | 72 ++++++++++++++++++++++++++++++++-------
>  kernel/fork.c             |  3 --
>  3 files changed, 63 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8e38bc..5d8cdebd42bc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -887,6 +887,9 @@ struct mm_struct {
>  		 * Roughly speaking, incrementing the sequence number is
>  		 * equivalent to releasing locks on VMAs; reading the sequence
>  		 * number can be part of taking a read lock on a VMA.
> +		 * Incremented every time mmap_lock is write-locked/unlocked.
> +		 * Initialized to 0, therefore odd values indicate mmap_lock
> +		 * is write-locked and even values that it's released.
>  		 *
>  		 * Can be modified under write mmap_lock using RELEASE
>  		 * semantics.
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index de9dc20b01ba..9d23635bc701 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -71,39 +71,84 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
>  }
>  
>  #ifdef CONFIG_PER_VMA_LOCK
> +static inline void init_mm_lock_seq(struct mm_struct *mm)
> +{
> +	mm->mm_lock_seq = 0;
> +}
> +
>  /*
> - * Drop all currently-held per-VMA locks.
> - * This is called from the mmap_lock implementation directly before releasing
> - * a write-locked mmap_lock (or downgrading it to read-locked).
> - * This should normally NOT be called manually from other places.
> - * If you want to call this manually anyway, keep in mind that this will release
> - * *all* VMA write locks, including ones from further up the stack.
> + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> + * or write-unlocked (RELEASE semantics).
>   */
> -static inline void vma_end_write_all(struct mm_struct *mm)
> +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
>  {
>  	mmap_assert_write_locked(mm);
>  	/*
>  	 * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
>  	 * mmap_lock being held.
> -	 * We need RELEASE semantics here to ensure that preceding stores into
> -	 * the VMA take effect before we unlock it with this store.
> -	 * Pairs with ACQUIRE semantics in vma_start_read().
>  	 */
> -	smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +
> +	if (acquire) {
> +		WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +		/*
> +		 * For ACQUIRE semantics we should ensure no following stores are
> +		 * reordered to appear before the mm->mm_lock_seq modification.
> +		 */
> +		smp_wmb();

Strictly speaking this isn't ACQUIRE, nor do we care about ACQUIRE here.
This really is about subsequent stores, loads are irrelevant.

> +	} else {
> +		/*
> +		 * We need RELEASE semantics here to ensure that preceding stores
> +		 * into the VMA take effect before we unlock it with this store.
> +		 * Pairs with ACQUIRE semantics in vma_start_read().
> +		 */

Again, not strictly true. We don't care about loads. Using RELEASE here
is fine and probably cheaper on a few platforms, but we don't strictly
need/care about RELEASE.

> +		smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +	}
> +}

Also, it might be saner to stick closer to the seqcount naming of
things and use two different functions for these two different things.

/* straight up copy of do_raw_write_seqcount_begin() */
static inline void mm_write_seqcount_begin(struct mm_struct *mm)
{
	kcsan_nestable_atomic_begin();
	mm->mm_lock_seq++;
	smp_wmb();
}

/* straight up copy of do_raw_write_seqcount_end() */
static inline void mm_write_seqcount_end(struct mm_struct *mm)
{
	smp_wmb();
	mm->mm_lock_seq++;
	kcsan_nestable_atomic_end();
}

Or better yet, just use seqcount...

> +
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> +{
> +	/* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> +	*seq = smp_load_acquire(&mm->mm_lock_seq);
> +	/* Allow speculation if mmap_lock is not write-locked */
> +	return (*seq & 1) == 0;
> +}
> +
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> +{
> +	/* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
> +	smp_rmb();
> +	return seq == READ_ONCE(mm->mm_lock_seq);
>  }

Because there's nothing better than well known functions with a randomly
different name and interface I suppose...


Anyway, all the actual code proposed is not wrong. I'm just a bit
annoyed it's a random NIH of seqcount.
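
FWIW, on top of a seqcount_t in mm_struct the speculation helpers would
shrink to something like (sketch; field name made up):

        static inline bool mmap_lock_speculation_start(struct mm_struct *mm, unsigned int *seq)
        {
                /* raw_read_seqcount() does not spin on an active writer */
                *seq = raw_read_seqcount(&mm->mm_lock_seqcount);
                return !(*seq & 1);
        }

        static inline bool mmap_lock_speculation_end(struct mm_struct *mm, unsigned int seq)
        {
                return !read_seqcount_retry(&mm->mm_lock_seqcount, seq);
        }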

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-23 20:02     ` Andrii Nakryiko
@ 2024-10-23 20:19       ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2024-10-23 20:19 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
	mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Wed, Oct 23, 2024 at 01:02:53PM -0700, Andrii Nakryiko wrote:

> > > +      * but can't be freed from under us, so it's safe to read fields from
> > > +      * it, even if the values are some garbage values; ultimately
> > > +      * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
> > > +      * that whatever we speculatively found is correct
> > > +      */
> > > +     vm_file = READ_ONCE(vma->vm_file);
> > > +     if (!vm_file)
> > > +             return NULL;
> > > +
> > > +     vm_pgoff = data_race(vma->vm_pgoff);
> > > +     vm_start = data_race(vma->vm_start);
> > > +     vm_inode = data_race(vm_file->f_inode);
> >
> > So... seqcount has kcsan annotations other than data_race(). I suppose
> > this works, but it all feels like a bad copy with random changes.
> 
> I'm not sure what this means... Do I need to change anything? Drop
> data_race()? Use READ_ONCE()? Do nothing?

Keep for now. I've ranted at 1/n a bit, but unless the response is:
yeah, obviously this should be seqcount (unlikely), this is something
that can be fixed later (*sigh* always later... this todo list is a
problem).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-23 20:10   ` Peter Zijlstra
@ 2024-10-23 22:17     ` Suren Baghdasaryan
  2024-10-24  9:56       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Suren Baghdasaryan @ 2024-10-23 22:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Wed, Oct 23, 2024 at 1:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 10, 2024 at 01:56:41PM -0700, Andrii Nakryiko wrote:
> > From: Suren Baghdasaryan <surenb@google.com>
> >
> > Add helper functions to speculatively perform operations without
> > read-locking mmap_lock, expecting that mmap_lock will not be
> > write-locked and mm is not modified from under us.
> >
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > Link: https://lore.kernel.org/bpf/20240912210222.186542-1-surenb@google.com
> > ---
> >  include/linux/mm_types.h  |  3 ++
> >  include/linux/mmap_lock.h | 72 ++++++++++++++++++++++++++++++++-------
> >  kernel/fork.c             |  3 --
> >  3 files changed, 63 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6e3bdf8e38bc..5d8cdebd42bc 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -887,6 +887,9 @@ struct mm_struct {
> >                * Roughly speaking, incrementing the sequence number is
> >                * equivalent to releasing locks on VMAs; reading the sequence
> >                * number can be part of taking a read lock on a VMA.
> > +              * Incremented every time mmap_lock is write-locked/unlocked.
> > +              * Initialized to 0, therefore odd values indicate mmap_lock
> > +              * is write-locked and even values that it's released.
> >                *
> >                * Can be modified under write mmap_lock using RELEASE
> >                * semantics.
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index de9dc20b01ba..9d23635bc701 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -71,39 +71,84 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
> >  }
> >
> >  #ifdef CONFIG_PER_VMA_LOCK
> > +static inline void init_mm_lock_seq(struct mm_struct *mm)
> > +{
> > +     mm->mm_lock_seq = 0;
> > +}
> > +
> >  /*
> > - * Drop all currently-held per-VMA locks.
> > - * This is called from the mmap_lock implementation directly before releasing
> > - * a write-locked mmap_lock (or downgrading it to read-locked).
> > - * This should normally NOT be called manually from other places.
> > - * If you want to call this manually anyway, keep in mind that this will release
> > - * *all* VMA write locks, including ones from further up the stack.
> > + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> > + * or write-unlocked (RELEASE semantics).
> >   */
> > -static inline void vma_end_write_all(struct mm_struct *mm)
> > +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
> >  {
> >       mmap_assert_write_locked(mm);
> >       /*
> >        * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> >        * mmap_lock being held.
> > -      * We need RELEASE semantics here to ensure that preceding stores into
> > -      * the VMA take effect before we unlock it with this store.
> > -      * Pairs with ACQUIRE semantics in vma_start_read().
> >        */
> > -     smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +
> > +     if (acquire) {
> > +             WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +             /*
> > +              * For ACQUIRE semantics we should ensure no following stores are
> > +              * reordered to appear before the mm->mm_lock_seq modification.
> > +              */
> > +             smp_wmb();
>
> Strictly speaking this isn't ACQUIRE, nor do we care about ACQUIRE here.
> This really is about subsequent stores, loads are irrelevant.
>
> > +     } else {
> > +             /*
> > +              * We need RELEASE semantics here to ensure that preceding stores
> > +              * into the VMA take effect before we unlock it with this store.
> > +              * Pairs with ACQUIRE semantics in vma_start_read().
> > +              */
>
> Again, not strictly true. We don't care about loads. Using RELEASE here
> is fine and probably cheaper on a few platforms, but we don't strictly
> need/care about RELEASE.
>
> > +             smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +     }
> > +}
>
> Also, it might be saner to stick closer to the seqcount naming of
> things and use two different functions for these two different things.
>
> /* straight up copy of do_raw_write_seqcount_begin() */
> static inline void mm_write_seqcount_begin(struct mm_struct *mm)
> {
>         kcsan_nestable_atomic_begin();
>         mm->mm_lock_seq++;
>         smp_wmb();
> }
>
> /* straight up copy of do_raw_write_seqcount_end() */
> static inline void mm_write_seqcount_end(struct mm_struct *mm)
> {
>         smp_wmb();
>         mm->mm_lock_seq++;
>         kcsan_nestable_atomic_end();
> }
>
> Or better yet, just use seqcount...

Yeah, with these changes it does look a lot like seqcount now...
I can take another stab at rewriting this using seqcount_t but one
issue that Jann was concerned about is the counter being int vs long.
seqcount_t uses unsigned, so I'm not sure how to address that if I
were to use seqcount_t. Any suggestions how to address that before I
move forward with a rewrite?

>
> > +
> > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> > +{
> > +     /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > +     *seq = smp_load_acquire(&mm->mm_lock_seq);
> > +     /* Allow speculation if mmap_lock is not write-locked */
> > +     return (*seq & 1) == 0;
> > +}
> > +
> > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > +{
> > +     /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
> > +     smp_rmb();
> > +     return seq == READ_ONCE(mm->mm_lock_seq);
> >  }
>
> Because there's nothing better than well known functions with a randomly
> different name and interface I suppose...
>
>
> Anyway, all the actual code proposed is not wrong. I'm just a bit
> annoyed its a random NIH of seqcount.

Ack. Let's decide what we do about u32 vs u64 issue and I'll rewrite this.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-23 22:17     ` Suren Baghdasaryan
@ 2024-10-24  9:56       ` Peter Zijlstra
  2024-10-24 16:28         ` Suren Baghdasaryan
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2024-10-24  9:56 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Wed, Oct 23, 2024 at 03:17:01PM -0700, Suren Baghdasaryan wrote:

> > Or better yet, just use seqcount...
> 
> Yeah, with these changes it does look a lot like seqcount now...
> I can take another stab at rewriting this using seqcount_t but one
> issue that Jann was concerned about is the counter being int vs long.
> seqcount_t uses unsigned, so I'm not sure how to address that if I
> were to use seqcount_t. Any suggestions how to address that before I
> move forward with a rewrite?

So if that issue is real, it is not specific to this case. Specifically
preemptible seqcount will be similarly affected. So we should probably
address that in the seqcount implementation.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-24  9:56       ` Peter Zijlstra
@ 2024-10-24 16:28         ` Suren Baghdasaryan
  2024-10-24 21:04           ` Suren Baghdasaryan
  0 siblings, 1 reply; 29+ messages in thread
From: Suren Baghdasaryan @ 2024-10-24 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Thu, Oct 24, 2024 at 2:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 23, 2024 at 03:17:01PM -0700, Suren Baghdasaryan wrote:
>
> > > Or better yet, just use seqcount...
> >
> > Yeah, with these changes it does look a lot like seqcount now...
> > I can take another stab at rewriting this using seqcount_t but one
> > issue that Jann was concerned about is the counter being int vs long.
> > seqcount_t uses unsigned, so I'm not sure how to address that if I
> > were to use seqcount_t. Any suggestions how to address that before I
> > move forward with a rewrite?
>
> So if that issue is real, it is not specific to this case. Specifically
> preemptible seqcount will be similarly affected. So we should probably
> address that in the seqcount implementation.

Sounds good. Let me try rewriting this patch using seqcount_t and I'll
work with Jann on a separate patch to change seqcount_t.
Thanks for the feedback!

>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-24 16:28         ` Suren Baghdasaryan
@ 2024-10-24 21:04           ` Suren Baghdasaryan
  2024-10-24 23:20             ` Andrii Nakryiko
  0 siblings, 1 reply; 29+ messages in thread
From: Suren Baghdasaryan @ 2024-10-24 21:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, akpm, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes

On Thu, Oct 24, 2024 at 9:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Oct 24, 2024 at 2:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Oct 23, 2024 at 03:17:01PM -0700, Suren Baghdasaryan wrote:
> >
> > > > Or better yet, just use seqcount...
> > >
> > > Yeah, with these changes it does look a lot like seqcount now...
> > > I can take another stab at rewriting this using seqcount_t but one
> > > issue that Jann was concerned about is the counter being int vs long.
> > > seqcount_t uses unsigned, so I'm not sure how to address that if I
> > > were to use seqcount_t. Any suggestions how to address that before I
> > > move forward with a rewrite?
> >
> > So if that issue is real, it is not specific to this case. Specifically
> > preemptible seqcount will be similarly affected. So we should probably
> > address that in the seqcount implementation.
>
> Sounds good. Let me try rewriting this patch using seqcount_t and I'll
> work with Jann on a separate patch to change seqcount_t.
> Thanks for the feedback!

I posted the patchset to convert mm_lock_seq into seqcount_t and to
add speculative functions at
https://lore.kernel.org/all/20241024205231.1944747-1-surenb@google.com/.

>
> >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-24 21:04           ` Suren Baghdasaryan
@ 2024-10-24 23:20             ` Andrii Nakryiko
  2024-10-24 23:33               ` Suren Baghdasaryan
  0 siblings, 1 reply; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-24 23:20 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Peter Zijlstra, Andrii Nakryiko, linux-trace-kernel, linux-mm,
	oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	akpm, mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt,
	hannes, Liam.Howlett, lorenzo.stoakes

On Thu, Oct 24, 2024 at 2:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Oct 24, 2024 at 9:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Oct 24, 2024 at 2:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Oct 23, 2024 at 03:17:01PM -0700, Suren Baghdasaryan wrote:
> > >
> > > > > Or better yet, just use seqcount...
> > > >
> > > > Yeah, with these changes it does look a lot like seqcount now...
> > > > I can take another stab at rewriting this using seqcount_t but one
> > > > issue that Jann was concerned about is the counter being int vs long.
> > > > seqcount_t uses unsigned, so I'm not sure how to address that if I
> > > > were to use seqcount_t. Any suggestions how to address that before I
> > > > move forward with a rewrite?
> > >
> > > So if that issue is real, it is not specific to this case. Specifically
> > > preemptible seqcount will be similarly affected. So we should probably
> > > address that in the seqcount implementation.
> >
> > Sounds good. Let me try rewriting this patch using seqcount_t and I'll
> > work with Jann on a separate patch to change seqcount_t.
> > Thanks for the feedback!
>
> I posted the patchset to convert mm_lock_seq into seqcount_t and to
> add speculative functions at
> https://lore.kernel.org/all/20241024205231.1944747-1-surenb@google.com/.

Thanks, Suren! Hopefully it can land soon!

>
> >
> > >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-24 23:20             ` Andrii Nakryiko
@ 2024-10-24 23:33               ` Suren Baghdasaryan
  2024-10-25  5:12                 ` Andrii Nakryiko
  0 siblings, 1 reply; 29+ messages in thread
From: Suren Baghdasaryan @ 2024-10-24 23:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Peter Zijlstra, Andrii Nakryiko, linux-trace-kernel, linux-mm,
	oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	akpm, mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt,
	hannes, Liam.Howlett, lorenzo.stoakes

On Thu, Oct 24, 2024 at 4:20 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Oct 24, 2024 at 2:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Oct 24, 2024 at 9:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Oct 24, 2024 at 2:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Wed, Oct 23, 2024 at 03:17:01PM -0700, Suren Baghdasaryan wrote:
> > > >
> > > > > > Or better yet, just use seqcount...
> > > > >
> > > > > Yeah, with these changes it does look a lot like seqcount now...
> > > > > I can take another stab at rewriting this using seqcount_t but one
> > > > > issue that Jann was concerned about is the counter being int vs long.
> > > > > seqcount_t uses unsigned, so I'm not sure how to address that if I
> > > > > were to use seqcount_t. Any suggestions how to address that before I
> > > > > move forward with a rewrite?
> > > >
> > > > So if that issue is real, it is not specific to this case. Specifically
> > > > preemptible seqcount will be similarly affected. So we should probably
> > > > address that in the seqcount implementation.
> > >
> > > Sounds good. Let me try rewriting this patch using seqcount_t and I'll
> > > work with Jann on a separate patch to change seqcount_t.
> > > Thanks for the feedback!
> >
> > I posted the patchset to convert mm_lock_seq into seqcount_t and to
> > add speculative functions at
> > https://lore.kernel.org/all/20241024205231.1944747-1-surenb@google.com/.
>
> Thanks, Suren! Hopefully it can land soon!

Would incorporating them into your patchset speed things up? If so,
feel free to include them into your series.
The only required change in your other patches is the renaming of
mmap_lock_speculation_start() to mmap_lock_speculation_begin().

>
> >
> > >
> > > >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-24 23:33               ` Suren Baghdasaryan
@ 2024-10-25  5:12                 ` Andrii Nakryiko
  0 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2024-10-25  5:12 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Peter Zijlstra, Andrii Nakryiko, linux-trace-kernel, linux-mm,
	oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	akpm, mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt,
	hannes, Liam.Howlett, lorenzo.stoakes

On Thu, Oct 24, 2024 at 4:33 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Oct 24, 2024 at 4:20 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Oct 24, 2024 at 2:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Oct 24, 2024 at 9:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Thu, Oct 24, 2024 at 2:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > >
> > > > > On Wed, Oct 23, 2024 at 03:17:01PM -0700, Suren Baghdasaryan wrote:
> > > > >
> > > > > > > Or better yet, just use seqcount...
> > > > > >
> > > > > > Yeah, with these changes it does look a lot like seqcount now...
> > > > > > I can take another stab at rewriting this using seqcount_t but one
> > > > > > issue that Jann was concerned about is the counter being int vs long.
> > > > > > seqcount_t uses unsigned, so I'm not sure how to address that if I
> > > > > > were to use seqcount_t. Any suggestions how to address that before I
> > > > > > move forward with a rewrite?
> > > > >
> > > > > So if that issue is real, it is not specific to this case. Specifically
> > > > > preemptible seqcount will be similarly affected. So we should probably
> > > > > address that in the seqcount implementation.
> > > >
> > > > Sounds good. Let me try rewriting this patch using seqcount_t and I'll
> > > > work with Jann on a separate patch to change seqcount_t.
> > > > Thanks for the feedback!
> > >
> > > I posted the patchset to convert mm_lock_seq into seqcount_t and to
> > > add speculative functions at
> > > https://lore.kernel.org/all/20241024205231.1944747-1-surenb@google.com/.
> >
> > Thanks, Suren! Hopefully it can land soon!
>
> Would incorporating them into your patchset speed things up? If so,
> feel free to include them into your series.

I don't really think so. At this point the uprobe part is done (the next
revision has a comment style fix, that's all). So I'll just wait for
your patches to be acked and applied, and then do a trivial rebase. It
will be easier for everyone at this point, IMO, not to couple them into
a single patch set with two authors.

Hopefully Peter will take those patches through tip/perf/core, though,
so I don't have to wait for mm and tip trees to converge.

> The only required change in your other patches is the renaming of
> mmap_lock_speculation_start() to mmap_lock_speculation_begin().

Yep, no problem.

>
> >
> > >
> > > >
> > > > >

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-10-25  5:12 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-10 20:56 [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
2024-10-10 20:56 ` [PATCH v3 tip/perf/core 1/4] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
2024-10-13  7:56   ` Shakeel Butt
2024-10-14 20:27     ` Andrii Nakryiko
2024-10-14 20:48       ` Suren Baghdasaryan
2024-10-23 20:10   ` Peter Zijlstra
2024-10-23 22:17     ` Suren Baghdasaryan
2024-10-24  9:56       ` Peter Zijlstra
2024-10-24 16:28         ` Suren Baghdasaryan
2024-10-24 21:04           ` Suren Baghdasaryan
2024-10-24 23:20             ` Andrii Nakryiko
2024-10-24 23:33               ` Suren Baghdasaryan
2024-10-25  5:12                 ` Andrii Nakryiko
2024-10-10 20:56 ` [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures Andrii Nakryiko
2024-10-13  7:56   ` Shakeel Butt
2024-10-17  2:01     ` Suren Baghdasaryan
2024-10-17 18:55       ` Andrii Nakryiko
2024-10-17 19:42         ` Suren Baghdasaryan
2024-10-17 20:12           ` Andrii Nakryiko
2024-10-23 19:02       ` Peter Zijlstra
2024-10-23 19:12         ` Andrii Nakryiko
2024-10-23 19:31   ` Peter Zijlstra
2024-10-10 20:56 ` [PATCH v3 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
2024-10-10 20:56 ` [PATCH v3 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
2024-10-11  5:01   ` Oleg Nesterov
2024-10-23 19:22   ` Peter Zijlstra
2024-10-23 20:02     ` Andrii Nakryiko
2024-10-23 20:19       ` Peter Zijlstra
2024-10-23 17:54 ` [PATCH v3 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).