[PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup

linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
@ 2024-10-28  1:08 Andrii Nakryiko
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount Andrii Nakryiko
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Andrii Nakryiko @ 2024-10-28  1:08 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, akpm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt,
	hannes, Liam.Howlett, lorenzo.stoakes, david, arnd,
	richard.weiyang, zhangpeng.00, linmiaohe, viro, hca,
	Andrii Nakryiko

Implement speculative (lockless) resolution of VMA to inode to uprobe,
bypassing the need to take mmap_lock for reads, if possible. First two patches
by Suren adds mm_struct helpers that help detect whether mm_struct was
changed, which is used by uprobe logic to validate that speculative results
can be trusted after all the lookup logic results in a valid uprobe instance.

Patch #3 is a simplification to uprobe VMA flag checking, suggested by Oleg.

And, finally, patch #4 is the speculative VMA-to-uprobe resolution logic
itself, and is the focal point of this patch set. It makes entry uprobes in
common case scale very well with number of CPUs, as we avoid any locking or
cache line bouncing between CPUs. See corresponding patch for details and
benchmarking results.

Note, this patch set assumes that FMODE_BACKING files were switched to have
SLAB_TYPE_SAFE_BY_RCU semantics, which was recently done by Christian Brauner
in [0]. This change can be pulled into perf/core through stable
tags/vfs-6.13.for-bpf.file tag from [1].

  [0] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.13.for-bpf.file&id=8b1bc2590af61129b82a189e9dc7c2804c34400e
  [1] git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git

v3->v4:
- rebased and dropped data_race(), given mm_struct uses real seqcount (Peter);
v2->v3:
- dropped kfree_rcu() patch (Christian);
- added data_race() annotations for fields of vma and vma->vm_file which could
  be modified during speculative lookup (Oleg);
- fixed int->long problem in stubs for mmap_lock_speculation_{start,end}(),
  caught by Kernel test robot;
v1->v2:
- adjusted vma_end_write_all() comment to point out it should never be called
  manually now, but I wasn't sure how ACQUIRE/RELEASE comments should be
  reworded (previously requested by Jann), so I'd appreciate some help there
  (Jann);
- int -> long change for mm_lock_seq, as agreed at LPC2024 (Jann, Suren, Liam);
- kfree_rcu_mightsleep() for FMODE_BACKING (Suren, Christian);
- vm_flags simplification in find_active_uprobe_rcu() and
  find_active_uprobe_speculative() (Oleg);
- guard(rcu)() simplified find_active_uprobe_speculative() implementation.

Andrii Nakryiko (2):
  uprobes: simplify find_active_uprobe_rcu() VMA checks
  uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution

Suren Baghdasaryan (2):
  mm: Convert mm_lock_seq to a proper seqcount
  mm: Introduce mmap_lock_speculation_{begin|end}

 include/linux/mm.h               | 12 ++---
 include/linux/mm_types.h         |  7 ++-
 include/linux/mmap_lock.h        | 87 ++++++++++++++++++++++++--------
 kernel/events/uprobes.c          | 47 ++++++++++++++++-
 kernel/fork.c                    |  5 +-
 mm/init-mm.c                     |  2 +-
 tools/testing/vma/vma.c          |  4 +-
 tools/testing/vma/vma_internal.h |  4 +-
 8 files changed, 129 insertions(+), 39 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount
  2024-10-28  1:08 [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
@ 2024-10-28  1:08 ` Andrii Nakryiko
  2024-10-29 11:52   ` Vlastimil Babka
  2024-11-21 12:40   ` Peter Zijlstra
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end} Andrii Nakryiko
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Andrii Nakryiko @ 2024-10-28  1:08 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, akpm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt,
	hannes, Liam.Howlett, lorenzo.stoakes, david, arnd,
	richard.weiyang, zhangpeng.00, linmiaohe, viro, hca,
	Andrii Nakryiko

From: Suren Baghdasaryan <surenb@google.com>

Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock
variants to increment it, in-line with the usual seqcount usage pattern.
This lets us check whether the mmap_lock is write-locked by checking
mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be
used when implementing mmap_lock speculation functions.
As a result vm_lock_seq is also change to be unsigned to match the type
of mm_lock_seq.sequence.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/mm.h               | 12 +++----
 include/linux/mm_types.h         |  7 ++--
 include/linux/mmap_lock.h        | 58 +++++++++++++++++++++-----------
 kernel/fork.c                    |  5 +--
 mm/init-mm.c                     |  2 +-
 tools/testing/vma/vma.c          |  4 +--
 tools/testing/vma/vma_internal.h |  4 +--
 7 files changed, 56 insertions(+), 36 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecf63d2b0582..94b537088142 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -698,7 +698,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * we don't rely on for anything - the mm_lock_seq read against which we
 	 * need ordering is below.
 	 */
-	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
 	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
@@ -715,7 +715,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * after it has been unlocked.
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
-	if (unlikely(vma->vm_lock_seq == smp_load_acquire(&vma->vm_mm->mm_lock_seq))) {
+	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
 		up_read(&vma->vm_lock->lock);
 		return false;
 	}
@@ -730,7 +730,7 @@ static inline void vma_end_read(struct vm_area_struct *vma)
 }
 
 /* WARNING! Can only be used if mmap_lock is expected to be write-locked */
-static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
+static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_lock_seq)
 {
 	mmap_assert_write_locked(vma->vm_mm);
 
@@ -738,7 +738,7 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
 	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
 	 * mm->mm_lock_seq can't be concurrently modified.
 	 */
-	*mm_lock_seq = vma->vm_mm->mm_lock_seq;
+	*mm_lock_seq = vma->vm_mm->mm_lock_seq.sequence;
 	return (vma->vm_lock_seq == *mm_lock_seq);
 }
 
@@ -749,7 +749,7 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
  */
 static inline void vma_start_write(struct vm_area_struct *vma)
 {
-	int mm_lock_seq;
+	unsigned int mm_lock_seq;
 
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
@@ -767,7 +767,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
-	int mm_lock_seq;
+	unsigned int mm_lock_seq;
 
 	VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma);
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8e38bc..76e0cdc0462b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -715,7 +715,7 @@ struct vm_area_struct {
 	 * counter reuse can only lead to occasional unnecessary use of the
 	 * slowpath.
 	 */
-	int vm_lock_seq;
+	unsigned int vm_lock_seq;
 	/* Unstable RCU readers are allowed to read this. */
 	struct vma_lock *vm_lock;
 #endif
@@ -887,6 +887,9 @@ struct mm_struct {
 		 * Roughly speaking, incrementing the sequence number is
 		 * equivalent to releasing locks on VMAs; reading the sequence
 		 * number can be part of taking a read lock on a VMA.
+		 * Incremented every time mmap_lock is write-locked/unlocked.
+		 * Initialized to 0, therefore odd values indicate mmap_lock
+		 * is write-locked and even values that it's released.
 		 *
 		 * Can be modified under write mmap_lock using RELEASE
 		 * semantics.
@@ -895,7 +898,7 @@ struct mm_struct {
 		 * Can be read with ACQUIRE semantics if not holding write
 		 * mmap_lock.
 		 */
-		int mm_lock_seq;
+		seqcount_t mm_lock_seq;
 #endif
 
 
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index de9dc20b01ba..6b3272686860 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -71,39 +71,38 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
-/*
- * Drop all currently-held per-VMA locks.
- * This is called from the mmap_lock implementation directly before releasing
- * a write-locked mmap_lock (or downgrading it to read-locked).
- * This should normally NOT be called manually from other places.
- * If you want to call this manually anyway, keep in mind that this will release
- * *all* VMA write locks, including ones from further up the stack.
- */
-static inline void vma_end_write_all(struct mm_struct *mm)
+static inline void mm_lock_seqcount_init(struct mm_struct *mm)
 {
-	mmap_assert_write_locked(mm);
-	/*
-	 * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
-	 * mmap_lock being held.
-	 * We need RELEASE semantics here to ensure that preceding stores into
-	 * the VMA take effect before we unlock it with this store.
-	 * Pairs with ACQUIRE semantics in vma_start_read().
-	 */
-	smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
+	seqcount_init(&mm->mm_lock_seq);
+}
+
+static inline void mm_lock_seqcount_begin(struct mm_struct *mm)
+{
+	do_raw_write_seqcount_begin(&mm->mm_lock_seq);
+}
+
+static inline void mm_lock_seqcount_end(struct mm_struct *mm)
+{
+	do_raw_write_seqcount_end(&mm->mm_lock_seq);
 }
+
 #else
-static inline void vma_end_write_all(struct mm_struct *mm) {}
+static inline void mm_lock_seqcount_init(struct mm_struct *mm) {}
+static inline void mm_lock_seqcount_begin(struct mm_struct *mm) {}
+static inline void mm_lock_seqcount_end(struct mm_struct *mm) {}
 #endif
 
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
+	mm_lock_seqcount_init(mm);
 }
 
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_start_locking(mm, true);
 	down_write(&mm->mmap_lock);
+	mm_lock_seqcount_begin(mm);
 	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
@@ -111,6 +110,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
 {
 	__mmap_lock_trace_start_locking(mm, true);
 	down_write_nested(&mm->mmap_lock, subclass);
+	mm_lock_seqcount_begin(mm);
 	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
@@ -120,10 +120,30 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
 
 	__mmap_lock_trace_start_locking(mm, true);
 	ret = down_write_killable(&mm->mmap_lock);
+	if (!ret)
+		mm_lock_seqcount_begin(mm);
 	__mmap_lock_trace_acquire_returned(mm, true, ret == 0);
 	return ret;
 }
 
+/*
+ * Drop all currently-held per-VMA locks.
+ * This is called from the mmap_lock implementation directly before releasing
+ * a write-locked mmap_lock (or downgrading it to read-locked).
+ * This should normally NOT be called manually from other places.
+ * If you want to call this manually anyway, keep in mind that this will release
+ * *all* VMA write locks, including ones from further up the stack.
+ */
+static inline void vma_end_write_all(struct mm_struct *mm)
+{
+	mmap_assert_write_locked(mm);
+	/*
+	 * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
+	 * mmap_lock being held.
+	 */
+	mm_lock_seqcount_end(mm);
+}
+
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af2..55c4088543dc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -448,7 +448,7 @@ static bool vma_lock_alloc(struct vm_area_struct *vma)
 		return false;
 
 	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = -1;
+	vma->vm_lock_seq = UINT_MAX;
 
 	return true;
 }
@@ -1261,9 +1261,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
-#ifdef CONFIG_PER_VMA_LOCK
-	mm->mm_lock_seq = 0;
-#endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 24c809379274..6af3ad675930 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -40,7 +40,7 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
-	.mm_lock_seq	= 0,
+	.mm_lock_seq	= SEQCNT_ZERO(init_mm.mm_lock_seq),
 #endif
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index c53f220eb6cc..bcdf831dfe3e 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -87,7 +87,7 @@ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm,
 	 * begun. Linking to the tree will have caused this to be incremented,
 	 * which means we will get a false positive otherwise.
 	 */
-	vma->vm_lock_seq = -1;
+	vma->vm_lock_seq = UINT_MAX;
 
 	return vma;
 }
@@ -212,7 +212,7 @@ static bool vma_write_started(struct vm_area_struct *vma)
 	int seq = vma->vm_lock_seq;
 
 	/* We reset after each check. */
-	vma->vm_lock_seq = -1;
+	vma->vm_lock_seq = UINT_MAX;
 
 	/* The vma_start_write() stub simply increments this value. */
 	return seq > -1;
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index c5b9da034511..4007ec580f85 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -231,7 +231,7 @@ struct vm_area_struct {
 	 * counter reuse can only lead to occasional unnecessary use of the
 	 * slowpath.
 	 */
-	int vm_lock_seq;
+	unsigned int vm_lock_seq;
 	struct vma_lock *vm_lock;
 #endif
 
@@ -406,7 +406,7 @@ static inline bool vma_lock_alloc(struct vm_area_struct *vma)
 		return false;
 
 	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = -1;
+	vma->vm_lock_seq = UINT_MAX;
 
 	return true;
 }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount Andrii Nakryiko
@ 2024-10-29 11:52   ` Vlastimil Babka
  2024-11-21 12:40   ` Peter Zijlstra
  1 sibling, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2024-10-29 11:52 UTC (permalink / raw)
  To: Andrii Nakryiko, linux-trace-kernel, linux-mm, akpm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, mjguzik, brauner, jannh, mhocko, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca

On 10/28/24 02:08, Andrii Nakryiko wrote:
> From: Suren Baghdasaryan <surenb@google.com>
> 
> Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock
> variants to increment it, in-line with the usual seqcount usage pattern.
> This lets us check whether the mmap_lock is write-locked by checking
> mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be
> used when implementing mmap_lock speculation functions.
> As a result vm_lock_seq is also change to be unsigned to match the type
> of mm_lock_seq.sequence.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount Andrii Nakryiko
  2024-10-29 11:52   ` Vlastimil Babka
@ 2024-11-21 12:40   ` Peter Zijlstra
  2024-11-21 15:35     ` Suren Baghdasaryan
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2024-11-21 12:40 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, akpm, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, mjguzik, brauner,
	jannh, mhocko, vbabka, shakeel.butt, hannes, Liam.Howlett,
	lorenzo.stoakes, david, arnd, richard.weiyang, zhangpeng.00,
	linmiaohe, viro, hca

On Sun, Oct 27, 2024 at 06:08:15PM -0700, Andrii Nakryiko wrote:
> +/*
> + * Drop all currently-held per-VMA locks.
> + * This is called from the mmap_lock implementation directly before releasing
> + * a write-locked mmap_lock (or downgrading it to read-locked).
> + * This should normally NOT be called manually from other places.
> + * If you want to call this manually anyway, keep in mind that this will release
> + * *all* VMA write locks, including ones from further up the stack.
> + */
> +static inline void vma_end_write_all(struct mm_struct *mm)
> +{
> +	mmap_assert_write_locked(mm);
> +	/*
> +	 * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> +	 * mmap_lock being held.
> +	 */

You can write:

	ASSERT_EXCLUSIVE_WRITER(mm->mm_lock_seq);

instead of that comment. Then KCSAN will validate the claim.

> +	mm_lock_seqcount_end(mm);
> +}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount
  2024-11-21 12:40   ` Peter Zijlstra
@ 2024-11-21 15:35     ` Suren Baghdasaryan
  0 siblings, 0 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-21 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, akpm, oleg,
	rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca

On Thu, Nov 21, 2024 at 4:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 27, 2024 at 06:08:15PM -0700, Andrii Nakryiko wrote:
> > +/*
> > + * Drop all currently-held per-VMA locks.
> > + * This is called from the mmap_lock implementation directly before releasing
> > + * a write-locked mmap_lock (or downgrading it to read-locked).
> > + * This should normally NOT be called manually from other places.
> > + * If you want to call this manually anyway, keep in mind that this will release
> > + * *all* VMA write locks, including ones from further up the stack.
> > + */
> > +static inline void vma_end_write_all(struct mm_struct *mm)
> > +{
> > +     mmap_assert_write_locked(mm);
> > +     /*
> > +      * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> > +      * mmap_lock being held.
> > +      */
>
> You can write:
>
>         ASSERT_EXCLUSIVE_WRITER(mm->mm_lock_seq);
>
> instead of that comment. Then KCSAN will validate the claim.

Thanks for the tip! This one looks not critical but I see there are
more important comments in "mm: Introduce
mmap_lock_speculation_{begin|end}". I'll send a new version shortly.

>
> > +     mm_lock_seqcount_end(mm);
> > +}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end}
  2024-10-28  1:08 [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount Andrii Nakryiko
@ 2024-10-28  1:08 ` Andrii Nakryiko
  2024-10-29 16:48   ` Vlastimil Babka
  2024-11-21 14:44   ` Peter Zijlstra
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Andrii Nakryiko @ 2024-10-28  1:08 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, akpm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt,
	hannes, Liam.Howlett, lorenzo.stoakes, david, arnd,
	richard.weiyang, zhangpeng.00, linmiaohe, viro, hca,
	Andrii Nakryiko

From: Suren Baghdasaryan <surenb@google.com>

Add helper functions to speculatively perform operations without
read-locking mmap_lock, expecting that mmap_lock will not be
write-locked and mm is not modified from under us.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/mmap_lock.h | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 6b3272686860..58dde2e35f7e 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -71,6 +71,7 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
+
 static inline void mm_lock_seqcount_init(struct mm_struct *mm)
 {
 	seqcount_init(&mm->mm_lock_seq);
@@ -86,11 +87,35 @@ static inline void mm_lock_seqcount_end(struct mm_struct *mm)
 	do_raw_write_seqcount_end(&mm->mm_lock_seq);
 }
 
-#else
+static inline bool mmap_lock_speculation_begin(struct mm_struct *mm, unsigned int *seq)
+{
+	*seq = raw_read_seqcount(&mm->mm_lock_seq);
+	/* Allow speculation if mmap_lock is not write-locked */
+	return (*seq & 1) == 0;
+}
+
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, unsigned int seq)
+{
+	return !do_read_seqcount_retry(&mm->mm_lock_seq, seq);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
 static inline void mm_lock_seqcount_init(struct mm_struct *mm) {}
 static inline void mm_lock_seqcount_begin(struct mm_struct *mm) {}
 static inline void mm_lock_seqcount_end(struct mm_struct *mm) {}
-#endif
+
+static inline bool mmap_lock_speculation_begin(struct mm_struct *mm, unsigned int *seq)
+{
+	return false;
+}
+
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, unsigned int seq)
+{
+	return false;
+}
+
+#endif /* CONFIG_PER_VMA_LOCK */
 
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end}
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end} Andrii Nakryiko
@ 2024-10-29 16:48   ` Vlastimil Babka
  2024-11-21 14:44   ` Peter Zijlstra
  1 sibling, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2024-10-29 16:48 UTC (permalink / raw)
  To: Andrii Nakryiko, linux-trace-kernel, linux-mm, akpm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, mjguzik, brauner, jannh, mhocko, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca

On 10/28/24 02:08, Andrii Nakryiko wrote:
> From: Suren Baghdasaryan <surenb@google.com>
> 
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/mmap_lock.h | 29 +++++++++++++++++++++++++++--
>  1 file changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 6b3272686860..58dde2e35f7e 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -71,6 +71,7 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
>  }
>  
>  #ifdef CONFIG_PER_VMA_LOCK
> +
>  static inline void mm_lock_seqcount_init(struct mm_struct *mm)
>  {
>  	seqcount_init(&mm->mm_lock_seq);
> @@ -86,11 +87,35 @@ static inline void mm_lock_seqcount_end(struct mm_struct *mm)
>  	do_raw_write_seqcount_end(&mm->mm_lock_seq);
>  }
>  
> -#else
> +static inline bool mmap_lock_speculation_begin(struct mm_struct *mm, unsigned int *seq)
> +{
> +	*seq = raw_read_seqcount(&mm->mm_lock_seq);
> +	/* Allow speculation if mmap_lock is not write-locked */
> +	return (*seq & 1) == 0;
> +}
> +
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, unsigned int seq)
> +{
> +	return !do_read_seqcount_retry(&mm->mm_lock_seq, seq);
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
>  static inline void mm_lock_seqcount_init(struct mm_struct *mm) {}
>  static inline void mm_lock_seqcount_begin(struct mm_struct *mm) {}
>  static inline void mm_lock_seqcount_end(struct mm_struct *mm) {}
> -#endif
> +
> +static inline bool mmap_lock_speculation_begin(struct mm_struct *mm, unsigned int *seq)
> +{
> +	return false;
> +}
> +
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, unsigned int seq)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_PER_VMA_LOCK */
>  
>  static inline void mmap_init_lock(struct mm_struct *mm)
>  {


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end}
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end} Andrii Nakryiko
  2024-10-29 16:48   ` Vlastimil Babka
@ 2024-11-21 14:44   ` Peter Zijlstra
  2024-11-21 15:22     ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2024-11-21 14:44 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, akpm, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, mjguzik, brauner,
	jannh, mhocko, vbabka, shakeel.butt, hannes, Liam.Howlett,
	lorenzo.stoakes, david, arnd, richard.weiyang, zhangpeng.00,
	linmiaohe, viro, hca

On Sun, Oct 27, 2024 at 06:08:16PM -0700, Andrii Nakryiko wrote:
> From: Suren Baghdasaryan <surenb@google.com>
> 
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/mmap_lock.h | 29 +++++++++++++++++++++++++++--
>  1 file changed, 27 insertions(+), 2 deletions(-)
> 

> @@ -86,11 +87,35 @@ static inline void mm_lock_seqcount_end(struct mm_struct *mm)
>  	do_raw_write_seqcount_end(&mm->mm_lock_seq);
>  }
>  
> -#else
> +static inline bool mmap_lock_speculation_begin(struct mm_struct *mm, unsigned int *seq)
> +{
> +	*seq = raw_read_seqcount(&mm->mm_lock_seq);
> +	/* Allow speculation if mmap_lock is not write-locked */
> +	return (*seq & 1) == 0;
> +}

At the very least this should have more comment; I don't think it
adequately explains the reason for being weird. Perhaps:

	/*
	 * Since mmap_lock is a sleeping lock, and waiting for it to
	 * become unlocked is more or less equivalent with taking it
	 * ourselves, don't bother with the speculative path and take
	 * the slow path, which takes the lock.
	 */
	*seq = raw_read_seqcount(&mm->mm_lock_seq);
	return !(*seq & 1);

But perhaps it makes even more sense to add this functionality to
seqcount itself. The same argument can be made for seqcount_mutex and
seqcount_rwlock users.

> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, unsigned int seq)
> +{
> +	return !do_read_seqcount_retry(&mm->mm_lock_seq, seq);
> +}

This naming is somewhare weird, begin/end do not typically imply boolean
return values.

Perhaps something like? can_speculate, or speculate_try_begin, paired
with speculated_success or speculate_retry ?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end}
  2024-11-21 14:44   ` Peter Zijlstra
@ 2024-11-21 15:22     ` Peter Zijlstra
  2024-11-21 15:36       ` Suren Baghdasaryan
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2024-11-21 15:22 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, akpm, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, mjguzik, brauner,
	jannh, mhocko, vbabka, shakeel.butt, hannes, Liam.Howlett,
	lorenzo.stoakes, david, arnd, richard.weiyang, zhangpeng.00,
	linmiaohe, viro, hca, Thomas Gleixner

On Thu, Nov 21, 2024 at 03:44:42PM +0100, Peter Zijlstra wrote:

> But perhaps it makes even more sense to add this functionality to
> seqcount itself. The same argument can be made for seqcount_mutex and
> seqcount_rwlock users.

Something like so I suppose.

---
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5298765d6ca4..102afdf8c7db 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -318,6 +318,28 @@ SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
 	__seq;								\
 })
 
+/**
+ * raw_seqcount_try_begin() - begin a seqcount_t read critical section
+ *                            w/o lockdep and w/o counter stabilization
+ * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ *
+ * Very like raw_seqcount_begin(), except it enables eliding the critical
+ * section entirely if odd, instead of doing the speculation knowing it will
+ * fail.
+ *
+ * Useful when counter stabilization is more or less equivalent to taking
+ * the lock and there is a slowpath that does that.
+ *
+ * If true, start will be set to the (even) sequence count read.
+ *
+ * Return: true when a read critical section is started.
+ */
+#define raw_seqcount_try_begin(s, start)				\
+({									\
+	start = raw_read_seqcount(s);					\
+	!(start & 1);							\
+})
+
 /**
  * raw_seqcount_begin() - begin a seqcount_t read critical section w/o
  *                        lockdep and w/o counter stabilization

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end}
  2024-11-21 15:22     ` Peter Zijlstra
@ 2024-11-21 15:36       ` Suren Baghdasaryan
  2024-11-21 16:32         ` Suren Baghdasaryan
  0 siblings, 1 reply; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-21 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, akpm, oleg,
	rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca, Thomas Gleixner

On Thu, Nov 21, 2024 at 7:23 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 21, 2024 at 03:44:42PM +0100, Peter Zijlstra wrote:
>
> > But perhaps it makes even more sense to add this functionality to
> > seqcount itself. The same argument can be made for seqcount_mutex and
> > seqcount_rwlock users.
>
> Something like so I suppose.

Ok, let me put this all together. Thanks!

>
> ---
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index 5298765d6ca4..102afdf8c7db 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -318,6 +318,28 @@ SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
>         __seq;                                                          \
>  })
>
> +/**
> + * raw_seqcount_try_begin() - begin a seqcount_t read critical section
> + *                            w/o lockdep and w/o counter stabilization
> + * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
> + *
> + * Very like raw_seqcount_begin(), except it enables eliding the critical
> + * section entirely if odd, instead of doing the speculation knowing it will
> + * fail.
> + *
> + * Useful when counter stabilization is more or less equivalent to taking
> + * the lock and there is a slowpath that does that.
> + *
> + * If true, start will be set to the (even) sequence count read.
> + *
> + * Return: true when a read critical section is started.
> + */
> +#define raw_seqcount_try_begin(s, start)                               \
> +({                                                                     \
> +       start = raw_read_seqcount(s);                                   \
> +       !(start & 1);                                                   \
> +})
> +
>  /**
>   * raw_seqcount_begin() - begin a seqcount_t read critical section w/o
>   *                        lockdep and w/o counter stabilization

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end}
  2024-11-21 15:36       ` Suren Baghdasaryan
@ 2024-11-21 16:32         ` Suren Baghdasaryan
  0 siblings, 0 replies; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-21 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, akpm, oleg,
	rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca, Thomas Gleixner

On Thu, Nov 21, 2024 at 7:36 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Nov 21, 2024 at 7:23 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Nov 21, 2024 at 03:44:42PM +0100, Peter Zijlstra wrote:
> >
> > > But perhaps it makes even more sense to add this functionality to
> > > seqcount itself. The same argument can be made for seqcount_mutex and
> > > seqcount_rwlock users.
> >
> > Something like so I suppose.
>
> Ok, let me put this all together. Thanks!

I posted the new version at
https://lore.kernel.org/all/20241121162826.987947-1-surenb@google.com/
The changes are minimal, only those requested by Peter, so hopefully
they can be accepted quickly.

>
> >
> > ---
> > diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> > index 5298765d6ca4..102afdf8c7db 100644
> > --- a/include/linux/seqlock.h
> > +++ b/include/linux/seqlock.h
> > @@ -318,6 +318,28 @@ SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
> >         __seq;                                                          \
> >  })
> >
> > +/**
> > + * raw_seqcount_try_begin() - begin a seqcount_t read critical section
> > + *                            w/o lockdep and w/o counter stabilization
> > + * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
> > + *
> > + * Very like raw_seqcount_begin(), except it enables eliding the critical
> > + * section entirely if odd, instead of doing the speculation knowing it will
> > + * fail.
> > + *
> > + * Useful when counter stabilization is more or less equivalent to taking
> > + * the lock and there is a slowpath that does that.
> > + *
> > + * If true, start will be set to the (even) sequence count read.
> > + *
> > + * Return: true when a read critical section is started.
> > + */
> > +#define raw_seqcount_try_begin(s, start)                               \
> > +({                                                                     \
> > +       start = raw_read_seqcount(s);                                   \
> > +       !(start & 1);                                                   \
> > +})
> > +
> >  /**
> >   * raw_seqcount_begin() - begin a seqcount_t read critical section w/o
> >   *                        lockdep and w/o counter stabilization

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks
  2024-10-28  1:08 [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount Andrii Nakryiko
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end} Andrii Nakryiko
@ 2024-10-28  1:08 ` Andrii Nakryiko
  2024-10-28  1:51   ` Masami Hiramatsu
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
  2024-11-06  2:01 ` [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  4 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-10-28  1:08 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, akpm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt,
	hannes, Liam.Howlett, lorenzo.stoakes, david, arnd,
	richard.weiyang, zhangpeng.00, linmiaohe, viro, hca,
	Andrii Nakryiko

At the point where find_active_uprobe_rcu() is used we know that VMA in
question has triggered software breakpoint, so we don't need to validate
vma->vm_flags. Keep only vma->vm_file NULL check.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4ef4b51776eb..290c445768fa 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2084,7 +2084,7 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	mmap_read_lock(mm);
 	vma = vma_lookup(mm, bp_vaddr);
 	if (vma) {
-		if (valid_vma(vma, false)) {
+		if (vma->vm_file) {
 			struct inode *inode = file_inode(vma->vm_file);
 			loff_t offset = vaddr_to_offset(vma, bp_vaddr);
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
@ 2024-10-28  1:51   ` Masami Hiramatsu
  0 siblings, 0 replies; 27+ messages in thread
From: Masami Hiramatsu @ 2024-10-28  1:51 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, akpm, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb,
	mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca

On Sun, 27 Oct 2024 18:08:17 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> At the point where find_active_uprobe_rcu() is used we know that VMA in
> question has triggered software breakpoint, so we don't need to validate
> vma->vm_flags. Keep only vma->vm_file NULL check.
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks,

> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 4ef4b51776eb..290c445768fa 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2084,7 +2084,7 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
>  	mmap_read_lock(mm);
>  	vma = vma_lookup(mm, bp_vaddr);
>  	if (vma) {
> -		if (valid_vma(vma, false)) {
> +		if (vma->vm_file) {
>  			struct inode *inode = file_inode(vma->vm_file);
>  			loff_t offset = vaddr_to_offset(vma, bp_vaddr);
>  
> -- 
> 2.43.5
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-28  1:08 [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
@ 2024-10-28  1:08 ` Andrii Nakryiko
  2024-11-12  0:28   ` Masami Hiramatsu
  2024-11-06  2:01 ` [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  4 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-10-28  1:08 UTC (permalink / raw)
  To: linux-trace-kernel, linux-mm, akpm, peterz
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt,
	hannes, Liam.Howlett, lorenzo.stoakes, david, arnd,
	richard.weiyang, zhangpeng.00, linmiaohe, viro, hca,
	Andrii Nakryiko

Given filp_cachep is marked SLAB_TYPESAFE_BY_RCU (and FMODE_BACKING
files, a special case, now goes through RCU-delated freeing), we can
safely access vma->vm_file->f_inode field locklessly under just
rcu_read_lock() protection, which enables looking up uprobe from
uprobes_tree completely locklessly and speculatively without the need to
acquire mmap_lock for reads. In most cases, anyway, assuming that there
are no parallel mm and/or VMA modifications. The underlying struct
file's memory won't go away from under us (even if struct file can be
reused in the meantime).

We rely on newly added mmap_lock_speculation_{begin,end}() helpers to
validate that mm_struct stays intact for entire duration of this
speculation. If not, we fall back to mmap_lock-protected lookup.
The speculative logic is written in such a way that it will safely
handle any garbage values that might be read from vma or file structs.

Benchmarking results speak for themselves.

BEFORE (latest tip/perf/core)
=============================
uprobe-nop            ( 1 cpus):    3.384 ± 0.004M/s  (  3.384M/s/cpu)
uprobe-nop            ( 2 cpus):    5.456 ± 0.005M/s  (  2.728M/s/cpu)
uprobe-nop            ( 3 cpus):    7.863 ± 0.015M/s  (  2.621M/s/cpu)
uprobe-nop            ( 4 cpus):    9.442 ± 0.008M/s  (  2.360M/s/cpu)
uprobe-nop            ( 5 cpus):   11.036 ± 0.013M/s  (  2.207M/s/cpu)
uprobe-nop            ( 6 cpus):   10.884 ± 0.019M/s  (  1.814M/s/cpu)
uprobe-nop            ( 7 cpus):    7.897 ± 0.145M/s  (  1.128M/s/cpu)
uprobe-nop            ( 8 cpus):   10.021 ± 0.128M/s  (  1.253M/s/cpu)
uprobe-nop            (10 cpus):    9.932 ± 0.170M/s  (  0.993M/s/cpu)
uprobe-nop            (12 cpus):    8.369 ± 0.056M/s  (  0.697M/s/cpu)
uprobe-nop            (14 cpus):    8.678 ± 0.017M/s  (  0.620M/s/cpu)
uprobe-nop            (16 cpus):    7.392 ± 0.003M/s  (  0.462M/s/cpu)
uprobe-nop            (24 cpus):    5.326 ± 0.178M/s  (  0.222M/s/cpu)
uprobe-nop            (32 cpus):    5.426 ± 0.059M/s  (  0.170M/s/cpu)
uprobe-nop            (40 cpus):    5.262 ± 0.070M/s  (  0.132M/s/cpu)
uprobe-nop            (48 cpus):    6.121 ± 0.010M/s  (  0.128M/s/cpu)
uprobe-nop            (56 cpus):    6.252 ± 0.035M/s  (  0.112M/s/cpu)
uprobe-nop            (64 cpus):    7.644 ± 0.023M/s  (  0.119M/s/cpu)
uprobe-nop            (72 cpus):    7.781 ± 0.001M/s  (  0.108M/s/cpu)
uprobe-nop            (80 cpus):    8.992 ± 0.048M/s  (  0.112M/s/cpu)

AFTER
=====
uprobe-nop            ( 1 cpus):    3.534 ± 0.033M/s  (  3.534M/s/cpu)
uprobe-nop            ( 2 cpus):    6.701 ± 0.007M/s  (  3.351M/s/cpu)
uprobe-nop            ( 3 cpus):   10.031 ± 0.007M/s  (  3.344M/s/cpu)
uprobe-nop            ( 4 cpus):   13.003 ± 0.012M/s  (  3.251M/s/cpu)
uprobe-nop            ( 5 cpus):   16.274 ± 0.006M/s  (  3.255M/s/cpu)
uprobe-nop            ( 6 cpus):   19.563 ± 0.024M/s  (  3.261M/s/cpu)
uprobe-nop            ( 7 cpus):   22.696 ± 0.054M/s  (  3.242M/s/cpu)
uprobe-nop            ( 8 cpus):   24.534 ± 0.010M/s  (  3.067M/s/cpu)
uprobe-nop            (10 cpus):   30.475 ± 0.117M/s  (  3.047M/s/cpu)
uprobe-nop            (12 cpus):   33.371 ± 0.017M/s  (  2.781M/s/cpu)
uprobe-nop            (14 cpus):   38.864 ± 0.004M/s  (  2.776M/s/cpu)
uprobe-nop            (16 cpus):   41.476 ± 0.020M/s  (  2.592M/s/cpu)
uprobe-nop            (24 cpus):   64.696 ± 0.021M/s  (  2.696M/s/cpu)
uprobe-nop            (32 cpus):   85.054 ± 0.027M/s  (  2.658M/s/cpu)
uprobe-nop            (40 cpus):  101.979 ± 0.032M/s  (  2.549M/s/cpu)
uprobe-nop            (48 cpus):  110.518 ± 0.056M/s  (  2.302M/s/cpu)
uprobe-nop            (56 cpus):  117.737 ± 0.020M/s  (  2.102M/s/cpu)
uprobe-nop            (64 cpus):  124.613 ± 0.079M/s  (  1.947M/s/cpu)
uprobe-nop            (72 cpus):  133.239 ± 0.032M/s  (  1.851M/s/cpu)
uprobe-nop            (80 cpus):  142.037 ± 0.138M/s  (  1.775M/s/cpu)

Previously total throughput was maxing out at 11mln/s, and gradually
declining past 8 cores. With this change, it now keeps growing with each
added CPU, reaching 142mln/s at 80 CPUs (this was measured on a 80-core
Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz).

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 45 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 290c445768fa..efcd62f7051d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2074,6 +2074,47 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 	return is_trap_insn(&opcode);
 }
 
+static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct uprobe *uprobe = NULL;
+	struct vm_area_struct *vma;
+	struct file *vm_file;
+	loff_t offset;
+	unsigned int seq;
+
+	guard(rcu)();
+
+	if (!mmap_lock_speculation_begin(mm, &seq))
+		return NULL;
+
+	vma = vma_lookup(mm, bp_vaddr);
+	if (!vma)
+		return NULL;
+
+	/*
+	 * vm_file memory can be reused for another instance of struct file,
+	 * but can't be freed from under us, so it's safe to read fields from
+	 * it, even if the values are some garbage values; ultimately
+	 * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
+	 * that whatever we speculatively found is correct
+	 */
+	vm_file = READ_ONCE(vma->vm_file);
+	if (!vm_file)
+		return NULL;
+
+	offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
+	uprobe = find_uprobe_rcu(vm_file->f_inode, offset);
+	if (!uprobe)
+		return NULL;
+
+	/* now double check that nothing about MM changed */
+	if (!mmap_lock_speculation_end(mm, seq))
+		return NULL;
+
+	return uprobe;
+}
+
 /* assumes being inside RCU protected region */
 static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
 {
@@ -2081,6 +2122,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	struct uprobe *uprobe = NULL;
 	struct vm_area_struct *vma;
 
+	uprobe = find_active_uprobe_speculative(bp_vaddr);
+	if (uprobe)
+		return uprobe;
+
 	mmap_read_lock(mm);
 	vma = vma_lookup(mm, bp_vaddr);
 	if (vma) {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
@ 2024-11-12  0:28   ` Masami Hiramatsu
  2024-11-12  1:04     ` Suren Baghdasaryan
  0 siblings, 1 reply; 27+ messages in thread
From: Masami Hiramatsu @ 2024-11-12  0:28 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, akpm, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb,
	mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca

On Sun, 27 Oct 2024 18:08:18 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> Given filp_cachep is marked SLAB_TYPESAFE_BY_RCU (and FMODE_BACKING
> files, a special case, now goes through RCU-delated freeing), we can
> safely access vma->vm_file->f_inode field locklessly under just
> rcu_read_lock() protection, which enables looking up uprobe from
> uprobes_tree completely locklessly and speculatively without the need to
> acquire mmap_lock for reads. In most cases, anyway, assuming that there
> are no parallel mm and/or VMA modifications. The underlying struct
> file's memory won't go away from under us (even if struct file can be
> reused in the meantime).
> 
> We rely on newly added mmap_lock_speculation_{begin,end}() helpers to
> validate that mm_struct stays intact for entire duration of this
> speculation. If not, we fall back to mmap_lock-protected lookup.
> The speculative logic is written in such a way that it will safely
> handle any garbage values that might be read from vma or file structs.
> 
> Benchmarking results speak for themselves.
> 
> BEFORE (latest tip/perf/core)
> =============================
> uprobe-nop            ( 1 cpus):    3.384 ± 0.004M/s  (  3.384M/s/cpu)
> uprobe-nop            ( 2 cpus):    5.456 ± 0.005M/s  (  2.728M/s/cpu)
> uprobe-nop            ( 3 cpus):    7.863 ± 0.015M/s  (  2.621M/s/cpu)
> uprobe-nop            ( 4 cpus):    9.442 ± 0.008M/s  (  2.360M/s/cpu)
> uprobe-nop            ( 5 cpus):   11.036 ± 0.013M/s  (  2.207M/s/cpu)
> uprobe-nop            ( 6 cpus):   10.884 ± 0.019M/s  (  1.814M/s/cpu)
> uprobe-nop            ( 7 cpus):    7.897 ± 0.145M/s  (  1.128M/s/cpu)
> uprobe-nop            ( 8 cpus):   10.021 ± 0.128M/s  (  1.253M/s/cpu)
> uprobe-nop            (10 cpus):    9.932 ± 0.170M/s  (  0.993M/s/cpu)
> uprobe-nop            (12 cpus):    8.369 ± 0.056M/s  (  0.697M/s/cpu)
> uprobe-nop            (14 cpus):    8.678 ± 0.017M/s  (  0.620M/s/cpu)
> uprobe-nop            (16 cpus):    7.392 ± 0.003M/s  (  0.462M/s/cpu)
> uprobe-nop            (24 cpus):    5.326 ± 0.178M/s  (  0.222M/s/cpu)
> uprobe-nop            (32 cpus):    5.426 ± 0.059M/s  (  0.170M/s/cpu)
> uprobe-nop            (40 cpus):    5.262 ± 0.070M/s  (  0.132M/s/cpu)
> uprobe-nop            (48 cpus):    6.121 ± 0.010M/s  (  0.128M/s/cpu)
> uprobe-nop            (56 cpus):    6.252 ± 0.035M/s  (  0.112M/s/cpu)
> uprobe-nop            (64 cpus):    7.644 ± 0.023M/s  (  0.119M/s/cpu)
> uprobe-nop            (72 cpus):    7.781 ± 0.001M/s  (  0.108M/s/cpu)
> uprobe-nop            (80 cpus):    8.992 ± 0.048M/s  (  0.112M/s/cpu)
> 
> AFTER
> =====
> uprobe-nop            ( 1 cpus):    3.534 ± 0.033M/s  (  3.534M/s/cpu)
> uprobe-nop            ( 2 cpus):    6.701 ± 0.007M/s  (  3.351M/s/cpu)
> uprobe-nop            ( 3 cpus):   10.031 ± 0.007M/s  (  3.344M/s/cpu)
> uprobe-nop            ( 4 cpus):   13.003 ± 0.012M/s  (  3.251M/s/cpu)
> uprobe-nop            ( 5 cpus):   16.274 ± 0.006M/s  (  3.255M/s/cpu)
> uprobe-nop            ( 6 cpus):   19.563 ± 0.024M/s  (  3.261M/s/cpu)
> uprobe-nop            ( 7 cpus):   22.696 ± 0.054M/s  (  3.242M/s/cpu)
> uprobe-nop            ( 8 cpus):   24.534 ± 0.010M/s  (  3.067M/s/cpu)
> uprobe-nop            (10 cpus):   30.475 ± 0.117M/s  (  3.047M/s/cpu)
> uprobe-nop            (12 cpus):   33.371 ± 0.017M/s  (  2.781M/s/cpu)
> uprobe-nop            (14 cpus):   38.864 ± 0.004M/s  (  2.776M/s/cpu)
> uprobe-nop            (16 cpus):   41.476 ± 0.020M/s  (  2.592M/s/cpu)
> uprobe-nop            (24 cpus):   64.696 ± 0.021M/s  (  2.696M/s/cpu)
> uprobe-nop            (32 cpus):   85.054 ± 0.027M/s  (  2.658M/s/cpu)
> uprobe-nop            (40 cpus):  101.979 ± 0.032M/s  (  2.549M/s/cpu)
> uprobe-nop            (48 cpus):  110.518 ± 0.056M/s  (  2.302M/s/cpu)
> uprobe-nop            (56 cpus):  117.737 ± 0.020M/s  (  2.102M/s/cpu)
> uprobe-nop            (64 cpus):  124.613 ± 0.079M/s  (  1.947M/s/cpu)
> uprobe-nop            (72 cpus):  133.239 ± 0.032M/s  (  1.851M/s/cpu)
> uprobe-nop            (80 cpus):  142.037 ± 0.138M/s  (  1.775M/s/cpu)
> 
> Previously total throughput was maxing out at 11mln/s, and gradually
> declining past 8 cores. With this change, it now keeps growing with each
> added CPU, reaching 142mln/s at 80 CPUs (this was measured on a 80-core
> Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz).
> 

Looks good to me, except one question below.

> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 45 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 290c445768fa..efcd62f7051d 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2074,6 +2074,47 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
>  	return is_trap_insn(&opcode);
>  }
>  
> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct uprobe *uprobe = NULL;
> +	struct vm_area_struct *vma;
> +	struct file *vm_file;
> +	loff_t offset;
> +	unsigned int seq;
> +
> +	guard(rcu)();
> +
> +	if (!mmap_lock_speculation_begin(mm, &seq))
> +		return NULL;
> +
> +	vma = vma_lookup(mm, bp_vaddr);
> +	if (!vma)
> +		return NULL;
> +
> +	/*
> +	 * vm_file memory can be reused for another instance of struct file,
> +	 * but can't be freed from under us, so it's safe to read fields from
> +	 * it, even if the values are some garbage values; ultimately
> +	 * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
> +	 * that whatever we speculatively found is correct

If vm_file is a garbage value, may `vm_file->f_inode` access be dangerous?

> +	 */
> +	vm_file = READ_ONCE(vma->vm_file);
> +	if (!vm_file)
> +		return NULL;
> +
> +	offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
> +	uprobe = find_uprobe_rcu(vm_file->f_inode, offset);
                                       ^^^^ Here

if it only stores vm_file or NULL, there's no problem.

Thank you,

> +	if (!uprobe)
> +		return NULL;
> +
> +	/* now double check that nothing about MM changed */
> +	if (!mmap_lock_speculation_end(mm, seq))
> +		return NULL;
> +
> +	return uprobe;
> +}
> +
>  /* assumes being inside RCU protected region */
>  static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
>  {
> @@ -2081,6 +2122,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
>  	struct uprobe *uprobe = NULL;
>  	struct vm_area_struct *vma;
>  
> +	uprobe = find_active_uprobe_speculative(bp_vaddr);
> +	if (uprobe)
> +		return uprobe;
> +
>  	mmap_read_lock(mm);
>  	vma = vma_lookup(mm, bp_vaddr);
>  	if (vma) {
> -- 
> 2.43.5
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-11-12  0:28   ` Masami Hiramatsu
@ 2024-11-12  1:04     ` Suren Baghdasaryan
  2024-11-12 18:09       ` Andrii Nakryiko
  0 siblings, 1 reply; 27+ messages in thread
From: Suren Baghdasaryan @ 2024-11-12  1:04 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, linux-mm, akpm, peterz, oleg,
	rostedt, bpf, linux-kernel, jolsa, paulmck, willy, mjguzik,
	brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca

On Mon, Nov 11, 2024 at 4:28 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Sun, 27 Oct 2024 18:08:18 -0700
> Andrii Nakryiko <andrii@kernel.org> wrote:
>
> > Given filp_cachep is marked SLAB_TYPESAFE_BY_RCU (and FMODE_BACKING
> > files, a special case, now goes through RCU-delated freeing), we can
> > safely access vma->vm_file->f_inode field locklessly under just
> > rcu_read_lock() protection, which enables looking up uprobe from
> > uprobes_tree completely locklessly and speculatively without the need to
> > acquire mmap_lock for reads. In most cases, anyway, assuming that there
> > are no parallel mm and/or VMA modifications. The underlying struct
> > file's memory won't go away from under us (even if struct file can be
> > reused in the meantime).
> >
> > We rely on newly added mmap_lock_speculation_{begin,end}() helpers to
> > validate that mm_struct stays intact for entire duration of this
> > speculation. If not, we fall back to mmap_lock-protected lookup.
> > The speculative logic is written in such a way that it will safely
> > handle any garbage values that might be read from vma or file structs.
> >
> > Benchmarking results speak for themselves.
> >
> > BEFORE (latest tip/perf/core)
> > =============================
> > uprobe-nop            ( 1 cpus):    3.384 ± 0.004M/s  (  3.384M/s/cpu)
> > uprobe-nop            ( 2 cpus):    5.456 ± 0.005M/s  (  2.728M/s/cpu)
> > uprobe-nop            ( 3 cpus):    7.863 ± 0.015M/s  (  2.621M/s/cpu)
> > uprobe-nop            ( 4 cpus):    9.442 ± 0.008M/s  (  2.360M/s/cpu)
> > uprobe-nop            ( 5 cpus):   11.036 ± 0.013M/s  (  2.207M/s/cpu)
> > uprobe-nop            ( 6 cpus):   10.884 ± 0.019M/s  (  1.814M/s/cpu)
> > uprobe-nop            ( 7 cpus):    7.897 ± 0.145M/s  (  1.128M/s/cpu)
> > uprobe-nop            ( 8 cpus):   10.021 ± 0.128M/s  (  1.253M/s/cpu)
> > uprobe-nop            (10 cpus):    9.932 ± 0.170M/s  (  0.993M/s/cpu)
> > uprobe-nop            (12 cpus):    8.369 ± 0.056M/s  (  0.697M/s/cpu)
> > uprobe-nop            (14 cpus):    8.678 ± 0.017M/s  (  0.620M/s/cpu)
> > uprobe-nop            (16 cpus):    7.392 ± 0.003M/s  (  0.462M/s/cpu)
> > uprobe-nop            (24 cpus):    5.326 ± 0.178M/s  (  0.222M/s/cpu)
> > uprobe-nop            (32 cpus):    5.426 ± 0.059M/s  (  0.170M/s/cpu)
> > uprobe-nop            (40 cpus):    5.262 ± 0.070M/s  (  0.132M/s/cpu)
> > uprobe-nop            (48 cpus):    6.121 ± 0.010M/s  (  0.128M/s/cpu)
> > uprobe-nop            (56 cpus):    6.252 ± 0.035M/s  (  0.112M/s/cpu)
> > uprobe-nop            (64 cpus):    7.644 ± 0.023M/s  (  0.119M/s/cpu)
> > uprobe-nop            (72 cpus):    7.781 ± 0.001M/s  (  0.108M/s/cpu)
> > uprobe-nop            (80 cpus):    8.992 ± 0.048M/s  (  0.112M/s/cpu)
> >
> > AFTER
> > =====
> > uprobe-nop            ( 1 cpus):    3.534 ± 0.033M/s  (  3.534M/s/cpu)
> > uprobe-nop            ( 2 cpus):    6.701 ± 0.007M/s  (  3.351M/s/cpu)
> > uprobe-nop            ( 3 cpus):   10.031 ± 0.007M/s  (  3.344M/s/cpu)
> > uprobe-nop            ( 4 cpus):   13.003 ± 0.012M/s  (  3.251M/s/cpu)
> > uprobe-nop            ( 5 cpus):   16.274 ± 0.006M/s  (  3.255M/s/cpu)
> > uprobe-nop            ( 6 cpus):   19.563 ± 0.024M/s  (  3.261M/s/cpu)
> > uprobe-nop            ( 7 cpus):   22.696 ± 0.054M/s  (  3.242M/s/cpu)
> > uprobe-nop            ( 8 cpus):   24.534 ± 0.010M/s  (  3.067M/s/cpu)
> > uprobe-nop            (10 cpus):   30.475 ± 0.117M/s  (  3.047M/s/cpu)
> > uprobe-nop            (12 cpus):   33.371 ± 0.017M/s  (  2.781M/s/cpu)
> > uprobe-nop            (14 cpus):   38.864 ± 0.004M/s  (  2.776M/s/cpu)
> > uprobe-nop            (16 cpus):   41.476 ± 0.020M/s  (  2.592M/s/cpu)
> > uprobe-nop            (24 cpus):   64.696 ± 0.021M/s  (  2.696M/s/cpu)
> > uprobe-nop            (32 cpus):   85.054 ± 0.027M/s  (  2.658M/s/cpu)
> > uprobe-nop            (40 cpus):  101.979 ± 0.032M/s  (  2.549M/s/cpu)
> > uprobe-nop            (48 cpus):  110.518 ± 0.056M/s  (  2.302M/s/cpu)
> > uprobe-nop            (56 cpus):  117.737 ± 0.020M/s  (  2.102M/s/cpu)
> > uprobe-nop            (64 cpus):  124.613 ± 0.079M/s  (  1.947M/s/cpu)
> > uprobe-nop            (72 cpus):  133.239 ± 0.032M/s  (  1.851M/s/cpu)
> > uprobe-nop            (80 cpus):  142.037 ± 0.138M/s  (  1.775M/s/cpu)
> >
> > Previously total throughput was maxing out at 11mln/s, and gradually
> > declining past 8 cores. With this change, it now keeps growing with each
> > added CPU, reaching 142mln/s at 80 CPUs (this was measured on a 80-core
> > Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz).
> >
>
> Looks good to me, except one question below.
>
> > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  kernel/events/uprobes.c | 45 +++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 45 insertions(+)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 290c445768fa..efcd62f7051d 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -2074,6 +2074,47 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> >       return is_trap_insn(&opcode);
> >  }
> >
> > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > +{
> > +     struct mm_struct *mm = current->mm;
> > +     struct uprobe *uprobe = NULL;
> > +     struct vm_area_struct *vma;
> > +     struct file *vm_file;
> > +     loff_t offset;
> > +     unsigned int seq;
> > +
> > +     guard(rcu)();
> > +
> > +     if (!mmap_lock_speculation_begin(mm, &seq))
> > +             return NULL;
> > +
> > +     vma = vma_lookup(mm, bp_vaddr);
> > +     if (!vma)
> > +             return NULL;
> > +
> > +     /*
> > +      * vm_file memory can be reused for another instance of struct file,
> > +      * but can't be freed from under us, so it's safe to read fields from
> > +      * it, even if the values are some garbage values; ultimately
> > +      * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
> > +      * that whatever we speculatively found is correct
>
> If vm_file is a garbage value, may `vm_file->f_inode` access be dangerous?
>
> > +      */
> > +     vm_file = READ_ONCE(vma->vm_file);
> > +     if (!vm_file)
> > +             return NULL;
> > +
> > +     offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
> > +     uprobe = find_uprobe_rcu(vm_file->f_inode, offset);
>                                        ^^^^ Here
>
> if it only stores vm_file or NULL, there's no problem.

IIRC correctly, vma->vm_file is RCU-safe and we are in the read RCU
section, so it should not contain a garbage value.

>
> Thank you,
>
> > +     if (!uprobe)
> > +             return NULL;
> > +
> > +     /* now double check that nothing about MM changed */
> > +     if (!mmap_lock_speculation_end(mm, seq))
> > +             return NULL;
> > +
> > +     return uprobe;
> > +}
> > +
> >  /* assumes being inside RCU protected region */
> >  static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
> >  {
> > @@ -2081,6 +2122,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> >       struct uprobe *uprobe = NULL;
> >       struct vm_area_struct *vma;
> >
> > +     uprobe = find_active_uprobe_speculative(bp_vaddr);
> > +     if (uprobe)
> > +             return uprobe;
> > +
> >       mmap_read_lock(mm);
> >       vma = vma_lookup(mm, bp_vaddr);
> >       if (vma) {
> > --
> > 2.43.5
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-11-12  1:04     ` Suren Baghdasaryan
@ 2024-11-12 18:09       ` Andrii Nakryiko
  2024-11-12 23:53         ` Masami Hiramatsu
  0 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-11-12 18:09 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Masami Hiramatsu, Andrii Nakryiko, linux-trace-kernel, linux-mm,
	akpm, peterz, oleg, rostedt, bpf, linux-kernel, jolsa, paulmck,
	willy, mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt,
	hannes, Liam.Howlett, lorenzo.stoakes, david, arnd,
	richard.weiyang, zhangpeng.00, linmiaohe, viro, hca

On Mon, Nov 11, 2024 at 5:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Nov 11, 2024 at 4:28 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Sun, 27 Oct 2024 18:08:18 -0700
> > Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > > Given filp_cachep is marked SLAB_TYPESAFE_BY_RCU (and FMODE_BACKING
> > > files, a special case, now goes through RCU-delated freeing), we can
> > > safely access vma->vm_file->f_inode field locklessly under just
> > > rcu_read_lock() protection, which enables looking up uprobe from
> > > uprobes_tree completely locklessly and speculatively without the need to
> > > acquire mmap_lock for reads. In most cases, anyway, assuming that there
> > > are no parallel mm and/or VMA modifications. The underlying struct
> > > file's memory won't go away from under us (even if struct file can be
> > > reused in the meantime).
> > >
> > > We rely on newly added mmap_lock_speculation_{begin,end}() helpers to
> > > validate that mm_struct stays intact for entire duration of this
> > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > The speculative logic is written in such a way that it will safely
> > > handle any garbage values that might be read from vma or file structs.
> > >
> > > Benchmarking results speak for themselves.
> > >
> > > BEFORE (latest tip/perf/core)
> > > =============================
> > > uprobe-nop            ( 1 cpus):    3.384 ± 0.004M/s  (  3.384M/s/cpu)
> > > uprobe-nop            ( 2 cpus):    5.456 ± 0.005M/s  (  2.728M/s/cpu)
> > > uprobe-nop            ( 3 cpus):    7.863 ± 0.015M/s  (  2.621M/s/cpu)
> > > uprobe-nop            ( 4 cpus):    9.442 ± 0.008M/s  (  2.360M/s/cpu)
> > > uprobe-nop            ( 5 cpus):   11.036 ± 0.013M/s  (  2.207M/s/cpu)
> > > uprobe-nop            ( 6 cpus):   10.884 ± 0.019M/s  (  1.814M/s/cpu)
> > > uprobe-nop            ( 7 cpus):    7.897 ± 0.145M/s  (  1.128M/s/cpu)
> > > uprobe-nop            ( 8 cpus):   10.021 ± 0.128M/s  (  1.253M/s/cpu)
> > > uprobe-nop            (10 cpus):    9.932 ± 0.170M/s  (  0.993M/s/cpu)
> > > uprobe-nop            (12 cpus):    8.369 ± 0.056M/s  (  0.697M/s/cpu)
> > > uprobe-nop            (14 cpus):    8.678 ± 0.017M/s  (  0.620M/s/cpu)
> > > uprobe-nop            (16 cpus):    7.392 ± 0.003M/s  (  0.462M/s/cpu)
> > > uprobe-nop            (24 cpus):    5.326 ± 0.178M/s  (  0.222M/s/cpu)
> > > uprobe-nop            (32 cpus):    5.426 ± 0.059M/s  (  0.170M/s/cpu)
> > > uprobe-nop            (40 cpus):    5.262 ± 0.070M/s  (  0.132M/s/cpu)
> > > uprobe-nop            (48 cpus):    6.121 ± 0.010M/s  (  0.128M/s/cpu)
> > > uprobe-nop            (56 cpus):    6.252 ± 0.035M/s  (  0.112M/s/cpu)
> > > uprobe-nop            (64 cpus):    7.644 ± 0.023M/s  (  0.119M/s/cpu)
> > > uprobe-nop            (72 cpus):    7.781 ± 0.001M/s  (  0.108M/s/cpu)
> > > uprobe-nop            (80 cpus):    8.992 ± 0.048M/s  (  0.112M/s/cpu)
> > >
> > > AFTER
> > > =====
> > > uprobe-nop            ( 1 cpus):    3.534 ± 0.033M/s  (  3.534M/s/cpu)
> > > uprobe-nop            ( 2 cpus):    6.701 ± 0.007M/s  (  3.351M/s/cpu)
> > > uprobe-nop            ( 3 cpus):   10.031 ± 0.007M/s  (  3.344M/s/cpu)
> > > uprobe-nop            ( 4 cpus):   13.003 ± 0.012M/s  (  3.251M/s/cpu)
> > > uprobe-nop            ( 5 cpus):   16.274 ± 0.006M/s  (  3.255M/s/cpu)
> > > uprobe-nop            ( 6 cpus):   19.563 ± 0.024M/s  (  3.261M/s/cpu)
> > > uprobe-nop            ( 7 cpus):   22.696 ± 0.054M/s  (  3.242M/s/cpu)
> > > uprobe-nop            ( 8 cpus):   24.534 ± 0.010M/s  (  3.067M/s/cpu)
> > > uprobe-nop            (10 cpus):   30.475 ± 0.117M/s  (  3.047M/s/cpu)
> > > uprobe-nop            (12 cpus):   33.371 ± 0.017M/s  (  2.781M/s/cpu)
> > > uprobe-nop            (14 cpus):   38.864 ± 0.004M/s  (  2.776M/s/cpu)
> > > uprobe-nop            (16 cpus):   41.476 ± 0.020M/s  (  2.592M/s/cpu)
> > > uprobe-nop            (24 cpus):   64.696 ± 0.021M/s  (  2.696M/s/cpu)
> > > uprobe-nop            (32 cpus):   85.054 ± 0.027M/s  (  2.658M/s/cpu)
> > > uprobe-nop            (40 cpus):  101.979 ± 0.032M/s  (  2.549M/s/cpu)
> > > uprobe-nop            (48 cpus):  110.518 ± 0.056M/s  (  2.302M/s/cpu)
> > > uprobe-nop            (56 cpus):  117.737 ± 0.020M/s  (  2.102M/s/cpu)
> > > uprobe-nop            (64 cpus):  124.613 ± 0.079M/s  (  1.947M/s/cpu)
> > > uprobe-nop            (72 cpus):  133.239 ± 0.032M/s  (  1.851M/s/cpu)
> > > uprobe-nop            (80 cpus):  142.037 ± 0.138M/s  (  1.775M/s/cpu)
> > >
> > > Previously total throughput was maxing out at 11mln/s, and gradually
> > > declining past 8 cores. With this change, it now keeps growing with each
> > > added CPU, reaching 142mln/s at 80 CPUs (this was measured on a 80-core
> > > Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz).
> > >
> >
> > Looks good to me, except one question below.
> >
> > > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> > > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  kernel/events/uprobes.c | 45 +++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 45 insertions(+)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 290c445768fa..efcd62f7051d 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -2074,6 +2074,47 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> > >       return is_trap_insn(&opcode);
> > >  }
> > >
> > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > +{
> > > +     struct mm_struct *mm = current->mm;
> > > +     struct uprobe *uprobe = NULL;
> > > +     struct vm_area_struct *vma;
> > > +     struct file *vm_file;
> > > +     loff_t offset;
> > > +     unsigned int seq;
> > > +
> > > +     guard(rcu)();
> > > +
> > > +     if (!mmap_lock_speculation_begin(mm, &seq))
> > > +             return NULL;
> > > +
> > > +     vma = vma_lookup(mm, bp_vaddr);
> > > +     if (!vma)
> > > +             return NULL;
> > > +
> > > +     /*
> > > +      * vm_file memory can be reused for another instance of struct file,
> > > +      * but can't be freed from under us, so it's safe to read fields from
> > > +      * it, even if the values are some garbage values; ultimately
> > > +      * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
> > > +      * that whatever we speculatively found is correct
> >
> > If vm_file is a garbage value, may `vm_file->f_inode` access be dangerous?
> >
> > > +      */
> > > +     vm_file = READ_ONCE(vma->vm_file);
> > > +     if (!vm_file)
> > > +             return NULL;
> > > +
> > > +     offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
> > > +     uprobe = find_uprobe_rcu(vm_file->f_inode, offset);
> >                                        ^^^^ Here
> >
> > if it only stores vm_file or NULL, there's no problem.
>
> IIRC correctly, vma->vm_file is RCU-safe and we are in the read RCU
> section, so it should not contain a garbage value.

Correct. vm_file itself can be either TYPESAFE_BY_RCU for normal
files, or properly RCU protected for FMODE_BACKING ones. Either way,
there is some correct struct file pointed to, and so all this is valid
and won't dereference invalid memory.

>
> >
> > Thank you,
> >
> > > +     if (!uprobe)
> > > +             return NULL;
> > > +
> > > +     /* now double check that nothing about MM changed */
> > > +     if (!mmap_lock_speculation_end(mm, seq))
> > > +             return NULL;
> > > +
> > > +     return uprobe;
> > > +}
> > > +
> > >  /* assumes being inside RCU protected region */
> > >  static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
> > >  {
> > > @@ -2081,6 +2122,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> > >       struct uprobe *uprobe = NULL;
> > >       struct vm_area_struct *vma;
> > >
> > > +     uprobe = find_active_uprobe_speculative(bp_vaddr);
> > > +     if (uprobe)
> > > +             return uprobe;
> > > +
> > >       mmap_read_lock(mm);
> > >       vma = vma_lookup(mm, bp_vaddr);
> > >       if (vma) {
> > > --
> > > 2.43.5
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-11-12 18:09       ` Andrii Nakryiko
@ 2024-11-12 23:53         ` Masami Hiramatsu
  0 siblings, 0 replies; 27+ messages in thread
From: Masami Hiramatsu @ 2024-11-12 23:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Suren Baghdasaryan, Masami Hiramatsu, Andrii Nakryiko,
	linux-trace-kernel, linux-mm, akpm, peterz, oleg, rostedt, bpf,
	linux-kernel, jolsa, paulmck, willy, mjguzik, brauner, jannh,
	mhocko, vbabka, shakeel.butt, hannes, Liam.Howlett,
	lorenzo.stoakes, david, arnd, richard.weiyang, zhangpeng.00,
	linmiaohe, viro, hca

On Tue, 12 Nov 2024 10:09:58 -0800
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Mon, Nov 11, 2024 at 5:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Nov 11, 2024 at 4:28 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > >
> > > On Sun, 27 Oct 2024 18:08:18 -0700
> > > Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > > Given filp_cachep is marked SLAB_TYPESAFE_BY_RCU (and FMODE_BACKING
> > > > files, a special case, now goes through RCU-delated freeing), we can
> > > > safely access vma->vm_file->f_inode field locklessly under just
> > > > rcu_read_lock() protection, which enables looking up uprobe from
> > > > uprobes_tree completely locklessly and speculatively without the need to
> > > > acquire mmap_lock for reads. In most cases, anyway, assuming that there
> > > > are no parallel mm and/or VMA modifications. The underlying struct
> > > > file's memory won't go away from under us (even if struct file can be
> > > > reused in the meantime).
> > > >
> > > > We rely on newly added mmap_lock_speculation_{begin,end}() helpers to
> > > > validate that mm_struct stays intact for entire duration of this
> > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > > The speculative logic is written in such a way that it will safely
> > > > handle any garbage values that might be read from vma or file structs.
> > > >
> > > > Benchmarking results speak for themselves.
> > > >
> > > > BEFORE (latest tip/perf/core)
> > > > =============================
> > > > uprobe-nop            ( 1 cpus):    3.384 ± 0.004M/s  (  3.384M/s/cpu)
> > > > uprobe-nop            ( 2 cpus):    5.456 ± 0.005M/s  (  2.728M/s/cpu)
> > > > uprobe-nop            ( 3 cpus):    7.863 ± 0.015M/s  (  2.621M/s/cpu)
> > > > uprobe-nop            ( 4 cpus):    9.442 ± 0.008M/s  (  2.360M/s/cpu)
> > > > uprobe-nop            ( 5 cpus):   11.036 ± 0.013M/s  (  2.207M/s/cpu)
> > > > uprobe-nop            ( 6 cpus):   10.884 ± 0.019M/s  (  1.814M/s/cpu)
> > > > uprobe-nop            ( 7 cpus):    7.897 ± 0.145M/s  (  1.128M/s/cpu)
> > > > uprobe-nop            ( 8 cpus):   10.021 ± 0.128M/s  (  1.253M/s/cpu)
> > > > uprobe-nop            (10 cpus):    9.932 ± 0.170M/s  (  0.993M/s/cpu)
> > > > uprobe-nop            (12 cpus):    8.369 ± 0.056M/s  (  0.697M/s/cpu)
> > > > uprobe-nop            (14 cpus):    8.678 ± 0.017M/s  (  0.620M/s/cpu)
> > > > uprobe-nop            (16 cpus):    7.392 ± 0.003M/s  (  0.462M/s/cpu)
> > > > uprobe-nop            (24 cpus):    5.326 ± 0.178M/s  (  0.222M/s/cpu)
> > > > uprobe-nop            (32 cpus):    5.426 ± 0.059M/s  (  0.170M/s/cpu)
> > > > uprobe-nop            (40 cpus):    5.262 ± 0.070M/s  (  0.132M/s/cpu)
> > > > uprobe-nop            (48 cpus):    6.121 ± 0.010M/s  (  0.128M/s/cpu)
> > > > uprobe-nop            (56 cpus):    6.252 ± 0.035M/s  (  0.112M/s/cpu)
> > > > uprobe-nop            (64 cpus):    7.644 ± 0.023M/s  (  0.119M/s/cpu)
> > > > uprobe-nop            (72 cpus):    7.781 ± 0.001M/s  (  0.108M/s/cpu)
> > > > uprobe-nop            (80 cpus):    8.992 ± 0.048M/s  (  0.112M/s/cpu)
> > > >
> > > > AFTER
> > > > =====
> > > > uprobe-nop            ( 1 cpus):    3.534 ± 0.033M/s  (  3.534M/s/cpu)
> > > > uprobe-nop            ( 2 cpus):    6.701 ± 0.007M/s  (  3.351M/s/cpu)
> > > > uprobe-nop            ( 3 cpus):   10.031 ± 0.007M/s  (  3.344M/s/cpu)
> > > > uprobe-nop            ( 4 cpus):   13.003 ± 0.012M/s  (  3.251M/s/cpu)
> > > > uprobe-nop            ( 5 cpus):   16.274 ± 0.006M/s  (  3.255M/s/cpu)
> > > > uprobe-nop            ( 6 cpus):   19.563 ± 0.024M/s  (  3.261M/s/cpu)
> > > > uprobe-nop            ( 7 cpus):   22.696 ± 0.054M/s  (  3.242M/s/cpu)
> > > > uprobe-nop            ( 8 cpus):   24.534 ± 0.010M/s  (  3.067M/s/cpu)
> > > > uprobe-nop            (10 cpus):   30.475 ± 0.117M/s  (  3.047M/s/cpu)
> > > > uprobe-nop            (12 cpus):   33.371 ± 0.017M/s  (  2.781M/s/cpu)
> > > > uprobe-nop            (14 cpus):   38.864 ± 0.004M/s  (  2.776M/s/cpu)
> > > > uprobe-nop            (16 cpus):   41.476 ± 0.020M/s  (  2.592M/s/cpu)
> > > > uprobe-nop            (24 cpus):   64.696 ± 0.021M/s  (  2.696M/s/cpu)
> > > > uprobe-nop            (32 cpus):   85.054 ± 0.027M/s  (  2.658M/s/cpu)
> > > > uprobe-nop            (40 cpus):  101.979 ± 0.032M/s  (  2.549M/s/cpu)
> > > > uprobe-nop            (48 cpus):  110.518 ± 0.056M/s  (  2.302M/s/cpu)
> > > > uprobe-nop            (56 cpus):  117.737 ± 0.020M/s  (  2.102M/s/cpu)
> > > > uprobe-nop            (64 cpus):  124.613 ± 0.079M/s  (  1.947M/s/cpu)
> > > > uprobe-nop            (72 cpus):  133.239 ± 0.032M/s  (  1.851M/s/cpu)
> > > > uprobe-nop            (80 cpus):  142.037 ± 0.138M/s  (  1.775M/s/cpu)
> > > >
> > > > Previously total throughput was maxing out at 11mln/s, and gradually
> > > > declining past 8 cores. With this change, it now keeps growing with each
> > > > added CPU, reaching 142mln/s at 80 CPUs (this was measured on a 80-core
> > > > Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz).
> > > >
> > >
> > > Looks good to me, except one question below.
> > >
> > > > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> > > > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > >  kernel/events/uprobes.c | 45 +++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 45 insertions(+)
> > > >
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index 290c445768fa..efcd62f7051d 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -2074,6 +2074,47 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> > > >       return is_trap_insn(&opcode);
> > > >  }
> > > >
> > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > > +{
> > > > +     struct mm_struct *mm = current->mm;
> > > > +     struct uprobe *uprobe = NULL;
> > > > +     struct vm_area_struct *vma;
> > > > +     struct file *vm_file;
> > > > +     loff_t offset;
> > > > +     unsigned int seq;
> > > > +
> > > > +     guard(rcu)();
> > > > +
> > > > +     if (!mmap_lock_speculation_begin(mm, &seq))
> > > > +             return NULL;
> > > > +
> > > > +     vma = vma_lookup(mm, bp_vaddr);
> > > > +     if (!vma)
> > > > +             return NULL;
> > > > +
> > > > +     /*
> > > > +      * vm_file memory can be reused for another instance of struct file,
> > > > +      * but can't be freed from under us, so it's safe to read fields from
> > > > +      * it, even if the values are some garbage values; ultimately
> > > > +      * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
> > > > +      * that whatever we speculatively found is correct
> > >
> > > If vm_file is a garbage value, may `vm_file->f_inode` access be dangerous?
> > >
> > > > +      */
> > > > +     vm_file = READ_ONCE(vma->vm_file);
> > > > +     if (!vm_file)
> > > > +             return NULL;
> > > > +
> > > > +     offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
> > > > +     uprobe = find_uprobe_rcu(vm_file->f_inode, offset);
> > >                                        ^^^^ Here
> > >
> > > if it only stores vm_file or NULL, there's no problem.
> >
> > IIRC correctly, vma->vm_file is RCU-safe and we are in the read RCU
> > section, so it should not contain a garbage value.
> 
> Correct. vm_file itself can be either TYPESAFE_BY_RCU for normal
> files, or properly RCU protected for FMODE_BACKING ones. Either way,
> there is some correct struct file pointed to, and so all this is valid
> and won't dereference invalid memory.

OK, thanks for confirmation! This looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thank you,

> 
> >
> > >
> > > Thank you,
> > >
> > > > +     if (!uprobe)
> > > > +             return NULL;
> > > > +
> > > > +     /* now double check that nothing about MM changed */
> > > > +     if (!mmap_lock_speculation_end(mm, seq))
> > > > +             return NULL;
> > > > +
> > > > +     return uprobe;
> > > > +}
> > > > +
> > > >  /* assumes being inside RCU protected region */
> > > >  static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
> > > >  {
> > > > @@ -2081,6 +2122,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> > > >       struct uprobe *uprobe = NULL;
> > > >       struct vm_area_struct *vma;
> > > >
> > > > +     uprobe = find_active_uprobe_speculative(bp_vaddr);
> > > > +     if (uprobe)
> > > > +             return uprobe;
> > > > +
> > > >       mmap_read_lock(mm);
> > > >       vma = vma_lookup(mm, bp_vaddr);
> > > >       if (vma) {
> > > > --
> > > > 2.43.5
> > > >
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-10-28  1:08 [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2024-10-28  1:08 ` [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
@ 2024-11-06  2:01 ` Andrii Nakryiko
  2024-11-11 17:26   ` Andrii Nakryiko
  4 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-11-06  2:01 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, akpm, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb,
	mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca

On Sun, Oct 27, 2024 at 6:09 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Implement speculative (lockless) resolution of VMA to inode to uprobe,
> bypassing the need to take mmap_lock for reads, if possible. First two patches
> by Suren adds mm_struct helpers that help detect whether mm_struct was
> changed, which is used by uprobe logic to validate that speculative results
> can be trusted after all the lookup logic results in a valid uprobe instance.
>
> Patch #3 is a simplification to uprobe VMA flag checking, suggested by Oleg.
>
> And, finally, patch #4 is the speculative VMA-to-uprobe resolution logic
> itself, and is the focal point of this patch set. It makes entry uprobes in
> common case scale very well with number of CPUs, as we avoid any locking or
> cache line bouncing between CPUs. See corresponding patch for details and
> benchmarking results.
>
> Note, this patch set assumes that FMODE_BACKING files were switched to have
> SLAB_TYPE_SAFE_BY_RCU semantics, which was recently done by Christian Brauner
> in [0]. This change can be pulled into perf/core through stable
> tags/vfs-6.13.for-bpf.file tag from [1].
>
>   [0] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.13.for-bpf.file&id=8b1bc2590af61129b82a189e9dc7c2804c34400e
>   [1] git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
>
> v3->v4:
> - rebased and dropped data_race(), given mm_struct uses real seqcount (Peter);
> v2->v3:
> - dropped kfree_rcu() patch (Christian);
> - added data_race() annotations for fields of vma and vma->vm_file which could
>   be modified during speculative lookup (Oleg);
> - fixed int->long problem in stubs for mmap_lock_speculation_{start,end}(),
>   caught by Kernel test robot;
> v1->v2:
> - adjusted vma_end_write_all() comment to point out it should never be called
>   manually now, but I wasn't sure how ACQUIRE/RELEASE comments should be
>   reworded (previously requested by Jann), so I'd appreciate some help there
>   (Jann);
> - int -> long change for mm_lock_seq, as agreed at LPC2024 (Jann, Suren, Liam);
> - kfree_rcu_mightsleep() for FMODE_BACKING (Suren, Christian);
> - vm_flags simplification in find_active_uprobe_rcu() and
>   find_active_uprobe_speculative() (Oleg);
> - guard(rcu)() simplified find_active_uprobe_speculative() implementation.
>
> Andrii Nakryiko (2):
>   uprobes: simplify find_active_uprobe_rcu() VMA checks
>   uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
>
> Suren Baghdasaryan (2):
>   mm: Convert mm_lock_seq to a proper seqcount
>   mm: Introduce mmap_lock_speculation_{begin|end}
>
>  include/linux/mm.h               | 12 ++---
>  include/linux/mm_types.h         |  7 ++-
>  include/linux/mmap_lock.h        | 87 ++++++++++++++++++++++++--------
>  kernel/events/uprobes.c          | 47 ++++++++++++++++-
>  kernel/fork.c                    |  5 +-
>  mm/init-mm.c                     |  2 +-
>  tools/testing/vma/vma.c          |  4 +-
>  tools/testing/vma/vma_internal.h |  4 +-
>  8 files changed, 129 insertions(+), 39 deletions(-)
>
> --
> 2.43.5
>

Hi!

What's the status of this patch set? Are there any blockers for it to
be applied to perf/core? MM folks are OK with landing the first two
patches in perf/core, so hopefully we should be good to go?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-11-06  2:01 ` [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
@ 2024-11-11 17:26   ` Andrii Nakryiko
  2024-11-20 15:40     ` Andrii Nakryiko
  0 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-11-11 17:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, linux-mm, akpm, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb,
	mjguzik, brauner, jannh, mhocko, vbabka, shakeel.butt, hannes,
	Liam.Howlett, lorenzo.stoakes, david, arnd, richard.weiyang,
	zhangpeng.00, linmiaohe, viro, hca

On Tue, Nov 5, 2024 at 6:01 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sun, Oct 27, 2024 at 6:09 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Implement speculative (lockless) resolution of VMA to inode to uprobe,
> > bypassing the need to take mmap_lock for reads, if possible. First two patches
> > by Suren adds mm_struct helpers that help detect whether mm_struct was
> > changed, which is used by uprobe logic to validate that speculative results
> > can be trusted after all the lookup logic results in a valid uprobe instance.
> >
> > Patch #3 is a simplification to uprobe VMA flag checking, suggested by Oleg.
> >
> > And, finally, patch #4 is the speculative VMA-to-uprobe resolution logic
> > itself, and is the focal point of this patch set. It makes entry uprobes in
> > common case scale very well with number of CPUs, as we avoid any locking or
> > cache line bouncing between CPUs. See corresponding patch for details and
> > benchmarking results.
> >
> > Note, this patch set assumes that FMODE_BACKING files were switched to have
> > SLAB_TYPE_SAFE_BY_RCU semantics, which was recently done by Christian Brauner
> > in [0]. This change can be pulled into perf/core through stable
> > tags/vfs-6.13.for-bpf.file tag from [1].
> >
> >   [0] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.13.for-bpf.file&id=8b1bc2590af61129b82a189e9dc7c2804c34400e
> >   [1] git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> >
> > v3->v4:
> > - rebased and dropped data_race(), given mm_struct uses real seqcount (Peter);
> > v2->v3:
> > - dropped kfree_rcu() patch (Christian);
> > - added data_race() annotations for fields of vma and vma->vm_file which could
> >   be modified during speculative lookup (Oleg);
> > - fixed int->long problem in stubs for mmap_lock_speculation_{start,end}(),
> >   caught by Kernel test robot;
> > v1->v2:
> > - adjusted vma_end_write_all() comment to point out it should never be called
> >   manually now, but I wasn't sure how ACQUIRE/RELEASE comments should be
> >   reworded (previously requested by Jann), so I'd appreciate some help there
> >   (Jann);
> > - int -> long change for mm_lock_seq, as agreed at LPC2024 (Jann, Suren, Liam);
> > - kfree_rcu_mightsleep() for FMODE_BACKING (Suren, Christian);
> > - vm_flags simplification in find_active_uprobe_rcu() and
> >   find_active_uprobe_speculative() (Oleg);
> > - guard(rcu)() simplified find_active_uprobe_speculative() implementation.
> >
> > Andrii Nakryiko (2):
> >   uprobes: simplify find_active_uprobe_rcu() VMA checks
> >   uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
> >
> > Suren Baghdasaryan (2):
> >   mm: Convert mm_lock_seq to a proper seqcount
> >   mm: Introduce mmap_lock_speculation_{begin|end}
> >
> >  include/linux/mm.h               | 12 ++---
> >  include/linux/mm_types.h         |  7 ++-
> >  include/linux/mmap_lock.h        | 87 ++++++++++++++++++++++++--------
> >  kernel/events/uprobes.c          | 47 ++++++++++++++++-
> >  kernel/fork.c                    |  5 +-
> >  mm/init-mm.c                     |  2 +-
> >  tools/testing/vma/vma.c          |  4 +-
> >  tools/testing/vma/vma_internal.h |  4 +-
> >  8 files changed, 129 insertions(+), 39 deletions(-)
> >
> > --
> > 2.43.5
> >
>
> Hi!
>
> What's the status of this patch set? Are there any blockers for it to
> be applied to perf/core? MM folks are OK with landing the first two
> patches in perf/core, so hopefully we should be good to go?

Another week, another ping. Peter, what can I do to make this land? MM
parts are clearly ok with Andrew Morton, uprobe-side logic didn't
change (modulo inconsequential data_race() back and forth) since at
least August, was approved by Oleg, and seems to be very stable in
testing. I think it's time to let me forget about this patch set and
make actual use of it in production, please.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-11-11 17:26   ` Andrii Nakryiko
@ 2024-11-20 15:40     ` Andrii Nakryiko
  2024-11-20 15:43       ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-11-20 15:40 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar
  Cc: linux-trace-kernel, linux-mm, akpm, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb,
	mjguzik, brauner, jannh, mhocko, Andrii Nakryiko, vbabka,
	shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes, david, arnd,
	richard.weiyang, zhangpeng.00, linmiaohe, viro, hca, Mark Rutland,
	Will Deacon, linux-arm-kernel, Catalin Marinas, Kernel Team

Linus,

I'm not sure what's going on here, this patch set seems to be in some
sort of "ignore list" on Peter's side with no indication on its
destiny.

I'd really like for this change to go into the new release with the
rest of uprobe improvements that happened this cycle, as they all
nicely complement each other. This patch set has been done-done since
Oct 24 when Suren sent the final version of mm-side changes ([0]),
which I subsequently resent as part of this mm+uprobe patch set on Oct
27, after coordinating that this will go through uprobe subsystem with
Andrew Morton ([1]). The uprobe part was effectively unchanged since
this summer, when this speculative uprobe lookup logic was posted as
part of an earlier RFC series ([2]). That's just to say that this was
thoroughly reviewed, discussed, and stress-tested, meanwhile, and I
see no reason to delay landing it for so long.

I've even written a separate overview email with a summary of all the
uprobe-related work and how it all fits together ([3]), realizing that
there are a few seemingly independent email threads and patch sets,
trying to engage involved maintainers. The outcome was:
  - two patch sets did land (uretprobe + SRCU and Jiri's uprobe
session prerequisites) after a bunch of extra pings, but that's at
least something;
  - Liao's siglock optimization ([4]) still hasn't landed with no
explanation what's the delay;
  - this patch set is also stuck in limbo for weeks now;
  - there was little engagement on arm64 front for Liao's optimization
of uprobes on STP instructions [5], which is perhaps a separate topic
for another email, but just another instance of maintainers not
engaging in timely fashion.

In short, I hope to get your help with the next steps. What can I do
to help land this patch set (and hopefully also others I mentioned
above)?

More broadly, what should be contributors' expectations on timeliness
of maintainers' engagement? Maintainer record in MAINTAINERS can't be
just a veto power, right? It is also a responsibility before others to
move the kernel development along. I'd like to understand what you
think is reasonable to expect here? Same question for patch handling
(applying, reviewing, rejecting, etc.) latency.

Thank you!


  [0] https://lore.kernel.org/linux-mm/20241024205231.1944747-1-surenb@google.com/
  [1] https://lore.kernel.org/linux-mm/20241028204822.6638f330fad809381eafb49c@linux-foundation.org/
  [2] https://lore.kernel.org/linux-trace-kernel/20240813042917.506057-14-andrii@kernel.org/
  [3] https://lore.kernel.org/linux-trace-kernel/CAEf4BzY-0Eu27jyT_s2kRO1UuUPOkE9_SRrBOqu2gJfmxsv+3A@mail.gmail.com/
  [4] https://lore.kernel.org/linux-trace-kernel/CAEf4BzarhiBHAQXECJzP5e-z0fbSaTpfQNPaSXwdgErz2f0vUA@mail.gmail.com/
  [5] https://lore.kernel.org/linux-trace-kernel/CAEf4BzZ3trjMWjvWX4Zy1GzW5RN1ihXZSnLZax7V-mCzAUg2cg@mail.gmail.com/
  [6] https://lore.kernel.org/all/172074397710.247544.17045299807723238107.stgit@devnote2/


On Mon, Nov 11, 2024 at 9:26 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Nov 5, 2024 at 6:01 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sun, Oct 27, 2024 at 6:09 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Implement speculative (lockless) resolution of VMA to inode to uprobe,
> > > bypassing the need to take mmap_lock for reads, if possible. First two patches
> > > by Suren adds mm_struct helpers that help detect whether mm_struct was
> > > changed, which is used by uprobe logic to validate that speculative results
> > > can be trusted after all the lookup logic results in a valid uprobe instance.
> > >
> > > Patch #3 is a simplification to uprobe VMA flag checking, suggested by Oleg.
> > >
> > > And, finally, patch #4 is the speculative VMA-to-uprobe resolution logic
> > > itself, and is the focal point of this patch set. It makes entry uprobes in
> > > common case scale very well with number of CPUs, as we avoid any locking or
> > > cache line bouncing between CPUs. See corresponding patch for details and
> > > benchmarking results.
> > >
> > > Note, this patch set assumes that FMODE_BACKING files were switched to have
> > > SLAB_TYPE_SAFE_BY_RCU semantics, which was recently done by Christian Brauner
> > > in [0]. This change can be pulled into perf/core through stable
> > > tags/vfs-6.13.for-bpf.file tag from [1].
> > >
> > >   [0] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.13.for-bpf.file&id=8b1bc2590af61129b82a189e9dc7c2804c34400e
> > >   [1] git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> > >
> > > v3->v4:
> > > - rebased and dropped data_race(), given mm_struct uses real seqcount (Peter);
> > > v2->v3:
> > > - dropped kfree_rcu() patch (Christian);
> > > - added data_race() annotations for fields of vma and vma->vm_file which could
> > >   be modified during speculative lookup (Oleg);
> > > - fixed int->long problem in stubs for mmap_lock_speculation_{start,end}(),
> > >   caught by Kernel test robot;
> > > v1->v2:
> > > - adjusted vma_end_write_all() comment to point out it should never be called
> > >   manually now, but I wasn't sure how ACQUIRE/RELEASE comments should be
> > >   reworded (previously requested by Jann), so I'd appreciate some help there
> > >   (Jann);
> > > - int -> long change for mm_lock_seq, as agreed at LPC2024 (Jann, Suren, Liam);
> > > - kfree_rcu_mightsleep() for FMODE_BACKING (Suren, Christian);
> > > - vm_flags simplification in find_active_uprobe_rcu() and
> > >   find_active_uprobe_speculative() (Oleg);
> > > - guard(rcu)() simplified find_active_uprobe_speculative() implementation.
> > >
> > > Andrii Nakryiko (2):
> > >   uprobes: simplify find_active_uprobe_rcu() VMA checks
> > >   uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
> > >
> > > Suren Baghdasaryan (2):
> > >   mm: Convert mm_lock_seq to a proper seqcount
> > >   mm: Introduce mmap_lock_speculation_{begin|end}
> > >
> > >  include/linux/mm.h               | 12 ++---
> > >  include/linux/mm_types.h         |  7 ++-
> > >  include/linux/mmap_lock.h        | 87 ++++++++++++++++++++++++--------
> > >  kernel/events/uprobes.c          | 47 ++++++++++++++++-
> > >  kernel/fork.c                    |  5 +-
> > >  mm/init-mm.c                     |  2 +-
> > >  tools/testing/vma/vma.c          |  4 +-
> > >  tools/testing/vma/vma_internal.h |  4 +-
> > >  8 files changed, 129 insertions(+), 39 deletions(-)
> > >
> > > --
> > > 2.43.5
> > >
> >
> > Hi!
> >
> > What's the status of this patch set? Are there any blockers for it to
> > be applied to perf/core? MM folks are OK with landing the first two
> > patches in perf/core, so hopefully we should be good to go?
>
> Another week, another ping. Peter, what can I do to make this land? MM
> parts are clearly ok with Andrew Morton, uprobe-side logic didn't
> change (modulo inconsequential data_race() back and forth) since at
> least August, was approved by Oleg, and seems to be very stable in
> testing. I think it's time to let me forget about this patch set and
> make actual use of it in production, please.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-11-20 15:40     ` Andrii Nakryiko
@ 2024-11-20 15:43       ` Peter Zijlstra
  2024-11-20 16:03         ` Ingo Molnar
  2024-11-20 17:23         ` Andrii Nakryiko
  0 siblings, 2 replies; 27+ messages in thread
From: Peter Zijlstra @ 2024-11-20 15:43 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Linus Torvalds, Ingo Molnar, linux-trace-kernel, linux-mm, akpm,
	oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, mjguzik, brauner, jannh, mhocko, Andrii Nakryiko, vbabka,
	shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes, david, arnd,
	richard.weiyang, zhangpeng.00, linmiaohe, viro, hca, Mark Rutland,
	Will Deacon, linux-arm-kernel, Catalin Marinas, Kernel Team

On Wed, Nov 20, 2024 at 07:40:15AM -0800, Andrii Nakryiko wrote:
> Linus,
> 
> I'm not sure what's going on here, this patch set seems to be in some
> sort of "ignore list" on Peter's side with no indication on its
> destiny.

*sigh* it is not, but my inbox is like drinking from a firehose :/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-11-20 15:43       ` Peter Zijlstra
@ 2024-11-20 16:03         ` Ingo Molnar
  2024-11-20 17:23           ` Andrii Nakryiko
  2024-11-20 17:23         ` Andrii Nakryiko
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2024-11-20 16:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Linus Torvalds, linux-trace-kernel, linux-mm,
	akpm, oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck,
	willy, surenb, mjguzik, brauner, jannh, mhocko, Andrii Nakryiko,
	vbabka, shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes,
	david, arnd, richard.weiyang, zhangpeng.00, linmiaohe, viro, hca,
	Mark Rutland, Will Deacon, linux-arm-kernel, Catalin Marinas,
	Kernel Team


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 20, 2024 at 07:40:15AM -0800, Andrii Nakryiko wrote:
> > Linus,
> > 
> > I'm not sure what's going on here, this patch set seems to be in some
> > sort of "ignore list" on Peter's side with no indication on its
> > destiny.
> 
> *sigh* it is not, but my inbox is like drinking from a firehose :/

And I've been considering that particular series WIP for two reasons:

 1) Oleg was still unconvinced about patch 5/5 in the v2 discussion. 
    Upon re-reading it I think he might have come around and has agreed 
    to the current approach - but sending a v3 & not seeing Oleg object 
    would ascertain that.

 2) There was a build failure reported against -v2 at:

       https://lore.kernel.org/all/202410050745.2Nuvusy4-lkp@intel.com/t.mbox.gz

    We cannot and will not merge patches with build failures.

Andrii did get some other uprobes scalability work merged in v6.13:

    - Switch to RCU Tasks Trace flavor for better performance (Andrii Nakryiko)

    - Massively increase uretprobe SMP scalability by SRCU-protecting
      the uretprobe lifetime (Andrii Nakryiko)

So we've certainly not been ignoring his patches, to the contrary ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-11-20 16:03         ` Ingo Molnar
@ 2024-11-20 17:23           ` Andrii Nakryiko
  2024-11-21  9:33             ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-11-20 17:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, linux-trace-kernel, linux-mm,
	akpm, oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck,
	willy, surenb, mjguzik, brauner, jannh, mhocko, Andrii Nakryiko,
	vbabka, shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes,
	david, arnd, richard.weiyang, zhangpeng.00, linmiaohe, viro, hca,
	Mark Rutland, Will Deacon, linux-arm-kernel, Catalin Marinas,
	Kernel Team

On Wed, Nov 20, 2024 at 8:03 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Nov 20, 2024 at 07:40:15AM -0800, Andrii Nakryiko wrote:
> > > Linus,
> > >
> > > I'm not sure what's going on here, this patch set seems to be in some
> > > sort of "ignore list" on Peter's side with no indication on its
> > > destiny.
> >
> > *sigh* it is not, but my inbox is like drinking from a firehose :/
>
> And I've been considering that particular series WIP for two reasons:
>
>  1) Oleg was still unconvinced about patch 5/5 in the v2 discussion.
>     Upon re-reading it I think he might have come around and has agreed
>     to the current approach - but sending a v3 & not seeing Oleg object
>     would ascertain that.

Is this about Liao's siglock patch set? We are at v4 (!) already (see
[0]) with Oleg's Acked-by added.

>
>  2) There was a build failure reported against -v2 at:
>
>        https://lore.kernel.org/all/202410050745.2Nuvusy4-lkp@intel.com/t.mbox.gz
>
>     We cannot and will not merge patches with build failures.

This one is about this patch set (speculative uprobe lookup), right?
It is already at v4 ([1]), while you are mentioning v2 as the reason
for this to not yet be applied. Those build failures were fixed *a
long time ago*, v4 itself has been sitting idle for almost a month
(since Oct 27). If there are any other problems, do bring them up,
don't wait for weeks.

>
> Andrii did get some other uprobes scalability work merged in v6.13:
>
>     - Switch to RCU Tasks Trace flavor for better performance (Andrii Nakryiko)
>
>     - Massively increase uretprobe SMP scalability by SRCU-protecting
>       the uretprobe lifetime (Andrii Nakryiko)
>
> So we've certainly not been ignoring his patches, to the contrary ...

Yes, and as I mentioned, this one is a) ready, reviewed, tested and b)
complements the other work you mention. It removes mmap_lock which
limits scalability of the rest of the work. Is there some rule that I
get to land only two patch sets in a single release?

>
> Thanks,
>
>         Ingo


[0] https://lore.kernel.org/linux-trace-kernel/20241022073141.3291245-1-liaochang1@huawei.com/
[1] https://lore.kernel.org/linux-trace-kernel/20241028010818.2487581-1-andrii@kernel.org/
[2] https://lore.kernel.org/linux-trace-kernel/CAEf4BzYPajbgyvcvm7z1EiPgkee1D1r=a8gaqxzd7k13gh9Uzw@mail.gmail.com/
[3] https://lore.kernel.org/linux-trace-kernel/CAEf4Bza=pwrZvd+3dz-a7eiAQMk9rwBDO1Kk_iwXSCM70CAARw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-11-20 17:23           ` Andrii Nakryiko
@ 2024-11-21  9:33             ` Ingo Molnar
  2024-11-21 14:43               ` Andrii Nakryiko
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2024-11-21  9:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Peter Zijlstra, Linus Torvalds, linux-trace-kernel, linux-mm,
	akpm, oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck,
	willy, surenb, mjguzik, brauner, jannh, mhocko, Andrii Nakryiko,
	vbabka, shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes,
	david, arnd, richard.weiyang, zhangpeng.00, linmiaohe, viro, hca,
	Mark Rutland, Will Deacon, linux-arm-kernel, Catalin Marinas,
	Kernel Team


* Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> Is this about Liao's siglock patch set? We are at v4 (!) already (see 
> [0]) with Oleg's Acked-by added.

AFAICS you didn't Cc: me to -v3 and -v4 - and while I'll generally see 
those patches too, eventually, there's a delay.

> > Andrii did get some other uprobes scalability work merged in v6.13:
> >
> >     - Switch to RCU Tasks Trace flavor for better performance 
> >     (Andrii Nakryiko)
> >
> >     - Massively increase uretprobe SMP scalability by 
> >       SRCU-protecting the uretprobe lifetime (Andrii Nakryiko)
> >
> > So we've certainly not been ignoring his patches, to the contrary 
> > ...
> 
> Yes, and as I mentioned, this one is a) ready, reviewed, tested and 
> b) complements the other work you mention.

Sorry, but patchsets that didn't even build a few weeks before the 
development window closed are generally pushed further down the 
backlog. Think of this as rate-limiting the risk of potentially broken 
code entering the kernel. You can avoid this problem by doing more 
testing, or by accepting that sometimes one more cycle is needed to get 
your patchsets merged.

> [...] It removes mmap_lock which limits scalability of the rest of 
> the work. Is there some rule that I get to land only two patch sets 
> in a single release?

Your facetous question and the hostile tone of your emails is not 
appreciated.

Me pointing out that two other patchsets of yours got integrated simply 
demonstrates how your original complaint of an 'ignore list' is not 
just unprofessional on its face, but also demonstrably unfair:

> > > > I'm not sure what's going on here, this patch set seems to be 
> > > > in some sort of "ignore list" on Peter's side with no 
> > > > indication on its destiny.

Trying to pressure maintainers over a patchset that recently had build 
failures isn't going to get your patches applied faster.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-11-21  9:33             ` Ingo Molnar
@ 2024-11-21 14:43               ` Andrii Nakryiko
  0 siblings, 0 replies; 27+ messages in thread
From: Andrii Nakryiko @ 2024-11-21 14:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, linux-trace-kernel, linux-mm,
	akpm, oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck,
	willy, surenb, mjguzik, brauner, jannh, mhocko, Andrii Nakryiko,
	vbabka, shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes,
	david, arnd, richard.weiyang, zhangpeng.00, linmiaohe, viro, hca,
	Mark Rutland, Will Deacon, linux-arm-kernel, Catalin Marinas,
	Kernel Team, Liao Chang

cc'ing Liao for awareness (should have done it earlier, sorry)

On Thu, Nov 21, 2024 at 1:33 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > Is this about Liao's siglock patch set? We are at v4 (!) already (see
> > [0]) with Oleg's Acked-by added.
>
> AFAICS you didn't Cc: me to -v3 and -v4 - and while I'll generally see
> those patches too, eventually, there's a delay.

Ok, I think we are now switching to my patch set here ([0]), because
Liao's v4 ([1]) does have mingo@redhat.com in CC. So, on Liao's
behalf, there wasn't really anything specific pointed out that would
explain a month's delay.

But let's switch to my patch set. Yes, my bad, I didn't CC you
directly, that wasn't in any way intentional, and that's my bad and I
will make sure to CC you on every patch for uprobes subsystem, even
though you are not explicitly listed in UPROBES section of MAINTAINERS
([2]), and Peter was the one who was handling all the uprobe-related
stuff since before this summer.

But let's please not randomly jump between discussing two separate
patch sets here, it's confusing.

>
> > > Andrii did get some other uprobes scalability work merged in v6.13:
> > >
> > >     - Switch to RCU Tasks Trace flavor for better performance
> > >     (Andrii Nakryiko)
> > >
> > >     - Massively increase uretprobe SMP scalability by
> > >       SRCU-protecting the uretprobe lifetime (Andrii Nakryiko)
> > >
> > > So we've certainly not been ignoring his patches, to the contrary
> > > ...
> >
> > Yes, and as I mentioned, this one is a) ready, reviewed, tested and
> > b) complements the other work you mention.
>
> Sorry, but patchsets that didn't even build a few weeks before the
> development window closed are generally pushed further down the
> backlog. Think of this as rate-limiting the risk of potentially broken
> code entering the kernel. You can avoid this problem by doing more
> testing, or by accepting that sometimes one more cycle is needed to get
> your patchsets merged.

Those build failures were happening in stub implementations, which
were used only on !CONFIG_PER_VMA_LOCK configuration, which I'm not
even sure if possible to get on x86-64. When I tried to reproduce this
locally I couldn't even get such configuration. Thankfully we have a
kernel test robot that would test multiple architectures and
configurations and it did catch it on i386 and loongarch64. And was
fixed quickly.

It doesn't seem fair or reasonable to penalize patch sets for *weeks*,
silently and without any notice just for this, IMO.

The patch set was very thoroughly tested, actually. Not just building,
but also running various unit tests (BPF selftests in particular). But
even more so, I built an entire uprobe stress-testing tool just to
test all my uprobe-related. I deployed custom kernels and ran these
stress tests on all uprobe patch sets and their revisions, over many
hours.

Sure, my main platform is x86-64, so that's where all the testing was
done. But you can't accuse me of negligence.

>
> > [...] It removes mmap_lock which limits scalability of the rest of
> > the work. Is there some rule that I get to land only two patch sets
> > in a single release?
>
> Your facetous question and the hostile tone of your emails is not
> appreciated.

I'm sticking to the facts in these emails. And when I get a response
in the style of "you got two patch sets in, why are you complaining",
that's not exactly friendly and fair. I put a lot of effort and time
not just into producing and testing all those patches, but also into
the logistics of it, coordinating with other people working within
uprobes subsystem.

And instead of accusations, I'd like to get an understanding of
expectations I can have in terms of handling patches. Being ignored
for many weeks is not OK. If you don't like something about what I do
or how, procedurally or technically, please call it out and I'll try
to fix whatever the problem might be. Silent treatment is not
productive.

But while on the topic. Those two patch sets you mentioned didn't go
in smoothly and quickly either. "Switch to RCU Tasks Trace flavor for
better performance" in particular should have gone in with the
original patch set one release earlier. But instead that patch was
dropped from the tree after applying it. Silently. I was not notified
at all (5 days that passed before I asked would be plenty of time to
mention this, IMO). It's good I noticed this, inquired with an email
reply (after making sure it's not some transient patch handling
issue), and only after that I got a reply that there was a build
failure I had to fix. You can see the thread at [4].

>
> Me pointing out that two other patchsets of yours got integrated simply
> demonstrates how your original complaint of an 'ignore list' is not
> just unprofessional on its face, but also demonstrably unfair:
>
> > > > > I'm not sure what's going on here, this patch set seems to be
> > > > > in some sort of "ignore list" on Peter's side with no
> > > > > indication on its destiny.

My "ignore list" complaint is specifically about this patch set, which
I explicitly stated above in the quote you provided. So yes, it's a
professional, and demonstrably fair statement, and I provided the
timeline and supporting links.

>
> Trying to pressure maintainers over a patchset that recently had build
> failures isn't going to get your patches applied faster.
>

I'm not asking to apply my patches blindly without critical review or
anything like that. I'm not expecting reviews or even just email
replies within a few days of posting. I *do* expect some sort of
reaction, though, and not after many weeks of inactivity and pinging
from my side, yes. And note, I got replies only after sending an email
straight to Linus.

I'm not pressuring anyone into anything. But as a maintainer myself, I
do think that being a maintainer is not just about having rights, it's
also about responsibilities.

Let's please stop with the excuses and instead discuss constructive solutions.

> Thanks,
>
>         Ingo

[0] https://lore.kernel.org/linux-trace-kernel/20241028010818.2487581-1-andrii@kernel.org/
[1] https://lore.kernel.org/linux-trace-kernel/20241022073141.3291245-1-liaochang1@huawei.com/
[2] https://lore.kernel.org/all/172074397710.247544.17045299807723238107.stgit@devnote2/
[3] https://github.com/libbpf/libbpf-bootstrap/commit/2f88cef90f9728ec8c7bee7bd48fdbcf197806c3
[4] https://lore.kernel.org/all/CAEf4BzZihPPiReE3anhrVOzjoZW5v4vFVouK_Arm8vJexCTT4g@mail.gmail.com/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-11-20 15:43       ` Peter Zijlstra
  2024-11-20 16:03         ` Ingo Molnar
@ 2024-11-20 17:23         ` Andrii Nakryiko
  1 sibling, 0 replies; 27+ messages in thread
From: Andrii Nakryiko @ 2024-11-20 17:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, linux-trace-kernel, linux-mm, akpm,
	oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, mjguzik, brauner, jannh, mhocko, Andrii Nakryiko, vbabka,
	shakeel.butt, hannes, Liam.Howlett, lorenzo.stoakes, david, arnd,
	richard.weiyang, zhangpeng.00, linmiaohe, viro, hca, Mark Rutland,
	Will Deacon, linux-arm-kernel, Catalin Marinas, Kernel Team

On Wed, Nov 20, 2024 at 7:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 20, 2024 at 07:40:15AM -0800, Andrii Nakryiko wrote:
> > Linus,
> >
> > I'm not sure what's going on here, this patch set seems to be in some
> > sort of "ignore list" on Peter's side with no indication on its
> > destiny.
>
> *sigh* it is not, but my inbox is like drinking from a firehose :/

Yet, you had time to look at and reply to much more recent patch sets
(e.g., [0] and [1], which landed 5 and 3 days ago).

And to be clear, your reviews and input there is appreciated, but
there has to be some wider timeliness and fairness here. This
particular patch set has been ready for a month, it's not that much
time to apply patches. Liao's patch set is even more stale. And for
the latter one I did give you a ping as well ([2]), just in case it
slipped through the cracks. That wasn't enough, unfortunately.

I'm not going to advise you on handling emails out of respect, sorry.
I'm sure you can figure it out. But if you feel overloaded and
overwhelmed, consider not *gaining* more responsibilities, like what
happened with the uprobe subsystem ([2]). Work can be shared,
delegated, and, sometimes, maybe just be "let go" and trust others to
do the right thing.

  [0] https://lore.kernel.org/bpf/20241116194202.GR22801@noisy.programming.kicks-ass.net/
  [1] https://lore.kernel.org/bpf/20241119111809.GB2328@noisy.programming.kicks-ass.net/
  [2] https://lore.kernel.org/linux-trace-kernel/CAEf4BzY-0Eu27jyT_s2kRO1UuUPOkE9_SRrBOqu2gJfmxsv+3A@mail.gmail.com/
  [3] https://lore.kernel.org/all/172074397710.247544.17045299807723238107.stgit@devnote2/

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-11-21 16:32 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-28  1:08 [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
2024-10-28  1:08 ` [PATCH v4 tip/perf/core 1/4] mm: Convert mm_lock_seq to a proper seqcount Andrii Nakryiko
2024-10-29 11:52   ` Vlastimil Babka
2024-11-21 12:40   ` Peter Zijlstra
2024-11-21 15:35     ` Suren Baghdasaryan
2024-10-28  1:08 ` [PATCH v4 tip/perf/core 2/4] mm: Introduce mmap_lock_speculation_{begin|end} Andrii Nakryiko
2024-10-29 16:48   ` Vlastimil Babka
2024-11-21 14:44   ` Peter Zijlstra
2024-11-21 15:22     ` Peter Zijlstra
2024-11-21 15:36       ` Suren Baghdasaryan
2024-11-21 16:32         ` Suren Baghdasaryan
2024-10-28  1:08 ` [PATCH v4 tip/perf/core 3/4] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
2024-10-28  1:51   ` Masami Hiramatsu
2024-10-28  1:08 ` [PATCH v4 tip/perf/core 4/4] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
2024-11-12  0:28   ` Masami Hiramatsu
2024-11-12  1:04     ` Suren Baghdasaryan
2024-11-12 18:09       ` Andrii Nakryiko
2024-11-12 23:53         ` Masami Hiramatsu
2024-11-06  2:01 ` [PATCH v4 tip/perf/core 0/4] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
2024-11-11 17:26   ` Andrii Nakryiko
2024-11-20 15:40     ` Andrii Nakryiko
2024-11-20 15:43       ` Peter Zijlstra
2024-11-20 16:03         ` Ingo Molnar
2024-11-20 17:23           ` Andrii Nakryiko
2024-11-21  9:33             ` Ingo Molnar
2024-11-21 14:43               ` Andrii Nakryiko
2024-11-20 17:23         ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).