* [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup
@ 2024-10-01 22:52 Andrii Nakryiko
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 1/5] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-01 22:52 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, mjguzik, brauner, jannh, mhocko, vbabka,
	mingo, Andrii Nakryiko

Implement speculative (lockless) resolution of VMA to inode to uprobe,
bypassing the need to take mmap_lock for reads, if possible. Patch #1 by Suren
adds mm_struct helpers that help detect whether the mm_struct was changed,
which is used by uprobe logic to validate that the speculative results can be
trusted once the lookup logic yields a valid uprobe instance. Patch #2
follows to make mm_lock_seq a 64-bit counter (on 64-bit architectures).

Patch #3 adds back RCU-delayed freeing for FMODE_BACKING files, which is
necessary to make it safe for the speculative logic to access struct file's
memory in any possible situation.

Patch #4 is a simplification to uprobe VMA flag checking, suggested by Oleg.

And, finally, patch #5 is the speculative VMA-to-uprobe resolution logic. See
corresponding patch for details and benchmarking results.

v1->v2:
- adjusted vma_end_write_all() comment to point out it should never be called
  manually now, but I wasn't sure how ACQUIRE/RELEASE comments should be
  reworded (previously requested by Jann), so I'd appreciate some help there
  (Jann);
- int -> long change for mm_lock_seq, as agreed at LPC2024 (Jann, Suren, Liam);
- kfree_rcu_mightsleep() for FMODE_BACKING (Suren, Christian);
- vm_flags simplification in find_active_uprobe_rcu() and
  find_active_uprobe_speculative() (Oleg);
- guard(rcu)() simplified find_active_uprobe_speculative() implementation.

Andrii Nakryiko (4):
  mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  fs: add back RCU-delayed freeing of FMODE_BACKING file
  uprobes: simplify find_active_uprobe_rcu() VMA checks
  uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution

Suren Baghdasaryan (1):
  mm: introduce mmap_lock_speculation_{start|end}

 fs/file_table.c           |  2 +-
 include/linux/mm.h        |  6 ++--
 include/linux/mm_types.h  |  7 ++--
 include/linux/mmap_lock.h | 72 ++++++++++++++++++++++++++++++++-------
 kernel/events/uprobes.c   | 46 ++++++++++++++++++++++++-
 kernel/fork.c             |  3 --
 6 files changed, 114 insertions(+), 22 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 tip/perf/core 1/5] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-01 22:52 [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
@ 2024-10-01 22:52 ` Andrii Nakryiko
  2024-10-07 17:05   ` Andrii Nakryiko
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 2/5] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures Andrii Nakryiko
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-01 22:52 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, mjguzik, brauner, jannh, mhocko, vbabka,
	mingo, Andrii Nakryiko

From: Suren Baghdasaryan <surenb@google.com>

Add helper functions to speculatively perform operations without
read-locking mmap_lock, expecting that mmap_lock will not be
write-locked and mm is not modified from under us.
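
For illustration, the intended calling pattern looks roughly like this
(a sketch only, not part of the patch; the actual uprobe user is added
in patch #5):

	struct mm_struct *mm = current->mm;
	int seq;	/* becomes long later in this series */

	rcu_read_lock();
	if (mmap_lock_speculation_start(mm, &seq)) {
		/* lockless lookups that tolerate concurrent mm changes */
		if (mmap_lock_speculation_end(mm, seq)) {
			/* mmap_lock was not write-locked in between: trust the result */
		} else {
			/* mm may have changed: discard and retry under mmap_read_lock() */
		}
	}
	rcu_read_unlock();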

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240912210222.186542-1-surenb@google.com
---
 include/linux/mm_types.h  |  3 ++
 include/linux/mmap_lock.h | 72 ++++++++++++++++++++++++++++++++-------
 kernel/fork.c             |  3 --
 3 files changed, 63 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..d5e3f907eea4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -876,6 +876,9 @@ struct mm_struct {
 		 * Roughly speaking, incrementing the sequence number is
 		 * equivalent to releasing locks on VMAs; reading the sequence
 		 * number can be part of taking a read lock on a VMA.
+		 * Incremented every time mmap_lock is write-locked/unlocked.
+		 * Initialized to 0, therefore odd values indicate mmap_lock
+		 * is write-locked and even values that it's released.
 		 *
 		 * Can be modified under write mmap_lock using RELEASE
 		 * semantics.
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index de9dc20b01ba..9d23635bc701 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -71,39 +71,84 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline void init_mm_lock_seq(struct mm_struct *mm)
+{
+	mm->mm_lock_seq = 0;
+}
+
 /*
- * Drop all currently-held per-VMA locks.
- * This is called from the mmap_lock implementation directly before releasing
- * a write-locked mmap_lock (or downgrading it to read-locked).
- * This should normally NOT be called manually from other places.
- * If you want to call this manually anyway, keep in mind that this will release
- * *all* VMA write locks, including ones from further up the stack.
+ * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
+ * or write-unlocked (RELEASE semantics).
  */
-static inline void vma_end_write_all(struct mm_struct *mm)
+static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
 {
 	mmap_assert_write_locked(mm);
 	/*
 	 * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
 	 * mmap_lock being held.
-	 * We need RELEASE semantics here to ensure that preceding stores into
-	 * the VMA take effect before we unlock it with this store.
-	 * Pairs with ACQUIRE semantics in vma_start_read().
 	 */
-	smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
+
+	if (acquire) {
+		WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+		/*
+		 * For ACQUIRE semantics we should ensure no following stores are
+		 * reordered to appear before the mm->mm_lock_seq modification.
+		 */
+		smp_wmb();
+	} else {
+		/*
+		 * We need RELEASE semantics here to ensure that preceding stores
+		 * into the VMA take effect before we unlock it with this store.
+		 * Pairs with ACQUIRE semantics in vma_start_read().
+		 */
+		smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
+	}
+}
+
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
+{
+	/* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
+	*seq = smp_load_acquire(&mm->mm_lock_seq);
+	/* Allow speculation if mmap_lock is not write-locked */
+	return (*seq & 1) == 0;
+}
+
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
+{
+	/* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
+	smp_rmb();
+	return seq == READ_ONCE(mm->mm_lock_seq);
 }
+
 #else
-static inline void vma_end_write_all(struct mm_struct *mm) {}
+static inline void init_mm_lock_seq(struct mm_struct *mm) {}
+static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
 #endif
 
+/*
+ * Drop all currently-held per-VMA locks.
+ * This is called from the mmap_lock implementation directly before releasing
+ * a write-locked mmap_lock (or downgrading it to read-locked).
+ * This should NOT be called manually from other places.
+ */
+static inline void vma_end_write_all(struct mm_struct *mm)
+{
+	inc_mm_lock_seq(mm, false);
+}
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
+	init_mm_lock_seq(mm);
 }
 
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_start_locking(mm, true);
 	down_write(&mm->mmap_lock);
+	inc_mm_lock_seq(mm, true);
 	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
@@ -111,6 +156,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
 {
 	__mmap_lock_trace_start_locking(mm, true);
 	down_write_nested(&mm->mmap_lock, subclass);
+	inc_mm_lock_seq(mm, true);
 	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
@@ -120,6 +166,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
 
 	__mmap_lock_trace_start_locking(mm, true);
 	ret = down_write_killable(&mm->mmap_lock);
+	if (!ret)
+		inc_mm_lock_seq(mm, true);
 	__mmap_lock_trace_acquire_returned(mm, true, ret == 0);
 	return ret;
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 18bdc87209d0..c44b71d354ee 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
-#ifdef CONFIG_PER_VMA_LOCK
-	mm->mm_lock_seq = 0;
-#endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 tip/perf/core 2/5] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
  2024-10-01 22:52 [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 1/5] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
@ 2024-10-01 22:52 ` Andrii Nakryiko
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 3/5] fs: add back RCU-delayed freeing of FMODE_BACKING file Andrii Nakryiko
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-01 22:52 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, mjguzik, brauner, jannh, mhocko, vbabka,
	mingo, Andrii Nakryiko

To increase mm->mm_lock_seq robustness, switch it from int to long, so
that it's a 64-bit counter on 64-bit systems and we can stop worrying
about it wrapping around in just ~4 billion iterations. Same goes for
VMA's matching vm_lock_seq, which is derived from mm_lock_seq.

I didn't use __u64 outright to keep 32-bit architectures unaffected, but
if it seems important enough, I have nothing against using __u64.
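
As a rough back-of-the-envelope illustration (numbers picked arbitrarily,
not from any measurement): at 1 million mmap_lock write-lock/unlock cycles
per second, a 32-bit counter (2^32 ~= 4.3e9) wraps after about 4300 seconds,
i.e. a bit over an hour, while a 64-bit counter at the same rate would take
on the order of 580,000 years to wrap.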

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/mm.h        | 6 +++---
 include/linux/mm_types.h  | 4 ++--
 include/linux/mmap_lock.h | 4 ++--
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6549d0979b28..f8e75d0642a8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -716,7 +716,7 @@ static inline void vma_end_read(struct vm_area_struct *vma)
 }
 
 /* WARNING! Can only be used if mmap_lock is expected to be write-locked */
-static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
+static bool __is_vma_write_locked(struct vm_area_struct *vma, long *mm_lock_seq)
 {
 	mmap_assert_write_locked(vma->vm_mm);
 
@@ -735,7 +735,7 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
  */
 static inline void vma_start_write(struct vm_area_struct *vma)
 {
-	int mm_lock_seq;
+	long mm_lock_seq;
 
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
@@ -753,7 +753,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
-	int mm_lock_seq;
+	long mm_lock_seq;
 
 	VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma);
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d5e3f907eea4..c045543f43d9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -705,7 +705,7 @@ struct vm_area_struct {
 	 * counter reuse can only lead to occasional unnecessary use of the
 	 * slowpath.
 	 */
-	int vm_lock_seq;
+	long vm_lock_seq;
 	struct vma_lock *vm_lock;
 #endif
 
@@ -887,7 +887,7 @@ struct mm_struct {
 		 * Can be read with ACQUIRE semantics if not holding write
 		 * mmap_lock.
 		 */
-		int mm_lock_seq;
+		long mm_lock_seq;
 #endif
 
 
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 9d23635bc701..fca527dece63 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -105,7 +105,7 @@ static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
 	}
 }
 
-static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, long *seq)
 {
 	/* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
 	*seq = smp_load_acquire(&mm->mm_lock_seq);
@@ -113,7 +113,7 @@ static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
 	return (*seq & 1) == 0;
 }
 
-static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, long seq)
 {
 	/* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
 	smp_rmb();
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 tip/perf/core 3/5] fs: add back RCU-delayed freeing of FMODE_BACKING file
  2024-10-01 22:52 [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 1/5] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 2/5] mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures Andrii Nakryiko
@ 2024-10-01 22:52 ` Andrii Nakryiko
  2024-10-03  9:13   ` Christian Brauner
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 4/5] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-01 22:52 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, mjguzik, brauner, jannh, mhocko, vbabka,
	mingo, Andrii Nakryiko, Amir Goldstein

6cf41fcfe099 ("backing file: free directly") switched FMODE_BACKING
files to direct freeing as back then there were no use cases requiring
RCU protected access to such files.

Now, with speculative lockless VMA-to-uprobe lookup logic, we do need to
have a guarantee that struct file memory is not going to be freed from
under us during speculative check. So add back RCU-delayed freeing
logic.

We use headless kfree_rcu_mightsleep() variant, as file_free() is only
called for FMODE_BACKING files in might_sleep() context.
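
For illustration, the reader-side pattern this guarantee enables looks
roughly like this (a sketch only; the real lookup is added later in the
series):

	struct file *f;
	struct inode *inode = NULL;

	rcu_read_lock();
	f = READ_ONCE(vma->vm_file);
	if (f) {
		/*
		 * Even if the last reference to *f is dropped concurrently,
		 * RCU-delayed freeing guarantees its memory is not returned
		 * to the allocator until after rcu_read_unlock(), so this
		 * dereference can't hit freed memory. Whether the values we
		 * read are still the right ones is validated separately.
		 */
		inode = f->f_inode;
	}
	rcu_read_unlock();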

Suggested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 fs/file_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index ca7843dde56d..257691d358ee 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -68,7 +68,7 @@ static inline void file_free(struct file *f)
 	put_cred(f->f_cred);
 	if (unlikely(f->f_mode & FMODE_BACKING)) {
 		path_put(backing_file_user_path(f));
-		kfree(backing_file(f));
+		kfree_rcu_mightsleep(backing_file(f));
 	} else {
 		kmem_cache_free(filp_cachep, f);
 	}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 tip/perf/core 4/5] uprobes: simplify find_active_uprobe_rcu() VMA checks
  2024-10-01 22:52 [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 3/5] fs: add back RCU-delayed freeing of FMODE_BACKING file Andrii Nakryiko
@ 2024-10-01 22:52 ` Andrii Nakryiko
  2024-10-02  6:22   ` Oleg Nesterov
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
  2024-10-08 15:20 ` [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup Oleg Nesterov
  5 siblings, 1 reply; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-01 22:52 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, mjguzik, brauner, jannh, mhocko, vbabka,
	mingo, Andrii Nakryiko

At the point where find_active_uprobe_rcu() is used we know that the VMA in
question has triggered a software breakpoint, so we don't need to validate
vma->vm_flags. Keep only the vma->vm_file NULL check.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a2e6a57f79f2..7bd9111b4e8b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2091,7 +2091,7 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	mmap_read_lock(mm);
 	vma = vma_lookup(mm, bp_vaddr);
 	if (vma) {
-		if (valid_vma(vma, false)) {
+		if (vma->vm_file) {
 			struct inode *inode = file_inode(vma->vm_file);
 			loff_t offset = vaddr_to_offset(vma, bp_vaddr);
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-01 22:52 [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 4/5] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
@ 2024-10-01 22:52 ` Andrii Nakryiko
  2024-10-02  7:25   ` Oleg Nesterov
                     ` (2 more replies)
  2024-10-08 15:20 ` [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup Oleg Nesterov
  5 siblings, 3 replies; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-01 22:52 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, mjguzik, brauner, jannh, mhocko, vbabka,
	mingo, Andrii Nakryiko

Given filp_cachep is marked SLAB_TYPESAFE_BY_RCU (and FMODE_BACKING
files, a special case, now go through RCU-delayed freeing), we can
safely access the vma->vm_file->f_inode field locklessly under just
rcu_read_lock() protection, which enables looking up a uprobe from
uprobes_tree completely locklessly and speculatively, without the need
to acquire mmap_lock for reads. This works in most cases, assuming that
there are no parallel mm and/or VMA modifications. The underlying struct
file's memory won't go away from under us (even if the struct file can
be reused in the meantime).

We rely on the newly added mmap_lock_speculation_{start,end}() helpers to
validate that the mm_struct stays intact for the entire duration of this
speculation. If it does not, we fall back to the mmap_lock-protected lookup.
The speculative logic is written in such a way that it will safely
handle any garbage values that might be read from vma or file structs.
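
In short, the control flow added by this patch (condensed from the diff
below, illustrative only) is:

	uprobe = find_active_uprobe_speculative(bp_vaddr);
	if (uprobe)
		return uprobe;	/* lockless fast path succeeded */
	/* otherwise fall back to the existing mmap_read_lock()-protected lookup */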

Benchmarking results speak for themselves.

BEFORE (latest tip/perf/core)
=============================
uprobe-nop            ( 1 cpus):    3.384 ± 0.004M/s  (  3.384M/s/cpu)
uprobe-nop            ( 2 cpus):    5.456 ± 0.005M/s  (  2.728M/s/cpu)
uprobe-nop            ( 3 cpus):    7.863 ± 0.015M/s  (  2.621M/s/cpu)
uprobe-nop            ( 4 cpus):    9.442 ± 0.008M/s  (  2.360M/s/cpu)
uprobe-nop            ( 5 cpus):   11.036 ± 0.013M/s  (  2.207M/s/cpu)
uprobe-nop            ( 6 cpus):   10.884 ± 0.019M/s  (  1.814M/s/cpu)
uprobe-nop            ( 7 cpus):    7.897 ± 0.145M/s  (  1.128M/s/cpu)
uprobe-nop            ( 8 cpus):   10.021 ± 0.128M/s  (  1.253M/s/cpu)
uprobe-nop            (10 cpus):    9.932 ± 0.170M/s  (  0.993M/s/cpu)
uprobe-nop            (12 cpus):    8.369 ± 0.056M/s  (  0.697M/s/cpu)
uprobe-nop            (14 cpus):    8.678 ± 0.017M/s  (  0.620M/s/cpu)
uprobe-nop            (16 cpus):    7.392 ± 0.003M/s  (  0.462M/s/cpu)
uprobe-nop            (24 cpus):    5.326 ± 0.178M/s  (  0.222M/s/cpu)
uprobe-nop            (32 cpus):    5.426 ± 0.059M/s  (  0.170M/s/cpu)
uprobe-nop            (40 cpus):    5.262 ± 0.070M/s  (  0.132M/s/cpu)
uprobe-nop            (48 cpus):    6.121 ± 0.010M/s  (  0.128M/s/cpu)
uprobe-nop            (56 cpus):    6.252 ± 0.035M/s  (  0.112M/s/cpu)
uprobe-nop            (64 cpus):    7.644 ± 0.023M/s  (  0.119M/s/cpu)
uprobe-nop            (72 cpus):    7.781 ± 0.001M/s  (  0.108M/s/cpu)
uprobe-nop            (80 cpus):    8.992 ± 0.048M/s  (  0.112M/s/cpu)

AFTER
=====
uprobe-nop            ( 1 cpus):    3.534 ± 0.033M/s  (  3.534M/s/cpu)
uprobe-nop            ( 2 cpus):    6.701 ± 0.007M/s  (  3.351M/s/cpu)
uprobe-nop            ( 3 cpus):   10.031 ± 0.007M/s  (  3.344M/s/cpu)
uprobe-nop            ( 4 cpus):   13.003 ± 0.012M/s  (  3.251M/s/cpu)
uprobe-nop            ( 5 cpus):   16.274 ± 0.006M/s  (  3.255M/s/cpu)
uprobe-nop            ( 6 cpus):   19.563 ± 0.024M/s  (  3.261M/s/cpu)
uprobe-nop            ( 7 cpus):   22.696 ± 0.054M/s  (  3.242M/s/cpu)
uprobe-nop            ( 8 cpus):   24.534 ± 0.010M/s  (  3.067M/s/cpu)
uprobe-nop            (10 cpus):   30.475 ± 0.117M/s  (  3.047M/s/cpu)
uprobe-nop            (12 cpus):   33.371 ± 0.017M/s  (  2.781M/s/cpu)
uprobe-nop            (14 cpus):   38.864 ± 0.004M/s  (  2.776M/s/cpu)
uprobe-nop            (16 cpus):   41.476 ± 0.020M/s  (  2.592M/s/cpu)
uprobe-nop            (24 cpus):   64.696 ± 0.021M/s  (  2.696M/s/cpu)
uprobe-nop            (32 cpus):   85.054 ± 0.027M/s  (  2.658M/s/cpu)
uprobe-nop            (40 cpus):  101.979 ± 0.032M/s  (  2.549M/s/cpu)
uprobe-nop            (48 cpus):  110.518 ± 0.056M/s  (  2.302M/s/cpu)
uprobe-nop            (56 cpus):  117.737 ± 0.020M/s  (  2.102M/s/cpu)
uprobe-nop            (64 cpus):  124.613 ± 0.079M/s  (  1.947M/s/cpu)
uprobe-nop            (72 cpus):  133.239 ± 0.032M/s  (  1.851M/s/cpu)
uprobe-nop            (80 cpus):  142.037 ± 0.138M/s  (  1.775M/s/cpu)

Previously, total throughput was maxing out at 11 mln/s and gradually
declining past 8 cores. With this change, it now keeps growing with each
added CPU, reaching 142 mln/s at 80 CPUs (this was measured on an 80-core
Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz).

Note, the results above assume that commit [0] from the linux-trace tree
is applied as well; otherwise scalability is limited to about 10 mln/s
total throughput.

  [0] 10cdb82aa77f ("uprobes: turn trace_uprobe's nhit counter to be per-CPU one")

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 44 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7bd9111b4e8b..960130275656 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2081,6 +2081,46 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 	return is_trap_insn(&opcode);
 }
 
+static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct uprobe *uprobe = NULL;
+	struct vm_area_struct *vma;
+	struct file *vm_file;
+	loff_t offset;
+	long seq;
+
+	guard(rcu)();
+
+	if (!mmap_lock_speculation_start(mm, &seq))
+		return NULL;
+
+	vma = vma_lookup(mm, bp_vaddr);
+	if (!vma)
+		return NULL;
+
+	/* vm_file memory can be reused for another instance of struct file,
+	 * but can't be freed from under us, so it's safe to read fields from
+	 * it, even if the values are some garbage values; ultimately
+	 * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
+	 * that whatever we speculatively found is correct
+	 */
+	vm_file = READ_ONCE(vma->vm_file);
+	if (!vm_file)
+		return NULL;
+
+	offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
+	uprobe = find_uprobe_rcu(vm_file->f_inode, offset);
+	if (!uprobe)
+		return NULL;
+
+	/* now double check that nothing about MM changed */
+	if (!mmap_lock_speculation_end(mm, seq))
+		return NULL;
+
+	return uprobe;
+}
+
 /* assumes being inside RCU protected region */
 static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
 {
@@ -2088,6 +2128,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	struct uprobe *uprobe = NULL;
 	struct vm_area_struct *vma;
 
+	uprobe = find_active_uprobe_speculative(bp_vaddr);
+	if (uprobe)
+		return uprobe;
+
 	mmap_read_lock(mm);
 	vma = vma_lookup(mm, bp_vaddr);
 	if (vma) {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 4/5] uprobes: simplify find_active_uprobe_rcu() VMA checks
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 4/5] uprobes: simplify find_active_uprobe_rcu() VMA checks Andrii Nakryiko
@ 2024-10-02  6:22   ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2024-10-02  6:22 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
	jolsa, paulmck, willy, surenb, akpm, linux-mm, mjguzik, brauner,
	jannh, mhocko, vbabka, mingo

On 10/01, Andrii Nakryiko wrote:
>
> At the point where find_active_uprobe_rcu() is used we know that the VMA in
> question has triggered a software breakpoint, so we don't need to validate
> vma->vm_flags. Keep only the vma->vm_file NULL check.
> 
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Acked-by: Oleg Nesterov <oleg@redhat.com>


> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index a2e6a57f79f2..7bd9111b4e8b 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2091,7 +2091,7 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
>  	mmap_read_lock(mm);
>  	vma = vma_lookup(mm, bp_vaddr);
>  	if (vma) {
> -		if (valid_vma(vma, false)) {
> +		if (vma->vm_file) {
>  			struct inode *inode = file_inode(vma->vm_file);
>  			loff_t offset = vaddr_to_offset(vma, bp_vaddr);
>  
> -- 
> 2.43.5
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
@ 2024-10-02  7:25   ` Oleg Nesterov
  2024-10-02 20:02     ` Andrii Nakryiko
  2024-10-04 23:59   ` kernel test robot
  2024-10-05  1:12   ` kernel test robot
  2 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2024-10-02  7:25 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
	jolsa, paulmck, willy, surenb, akpm, linux-mm, mjguzik, brauner,
	jannh, mhocko, vbabka, mingo

On 10/01, Andrii Nakryiko wrote:
>
> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct uprobe *uprobe = NULL;
> +	struct vm_area_struct *vma;
> +	struct file *vm_file;
> +	loff_t offset;
> +	long seq;
> +
> +	guard(rcu)();
> +
> +	if (!mmap_lock_speculation_start(mm, &seq))
> +		return NULL;
> +
> +	vma = vma_lookup(mm, bp_vaddr);
> +	if (!vma)
> +		return NULL;
> +
> +	/* vm_file memory can be reused for another instance of struct file,
> +	 * but can't be freed from under us, so it's safe to read fields from
> +	 * it, even if the values are some garbage values; ultimately
> +	 * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
> +	 * that whatever we speculatively found is correct
> +	 */
> +	vm_file = READ_ONCE(vma->vm_file);
> +	if (!vm_file)
> +		return NULL;
> +
> +	offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);

LGTM. But perhaps vma->vm_pgoff and vma->vm_start need READ_ONCE() as well,
if nothing else to shut up KCSAN if this code races with, say, __split_vma() ?

> +	uprobe = find_uprobe_rcu(vm_file->f_inode, offset);

OK, I guess vm_file->f_inode is fine without READ_ONCE...

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-02  7:25   ` Oleg Nesterov
@ 2024-10-02 20:02     ` Andrii Nakryiko
  2024-10-03  9:32       ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-02 20:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
	bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm,
	mjguzik, brauner, jannh, mhocko, vbabka, mingo

On Wed, Oct 2, 2024 at 12:25 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 10/01, Andrii Nakryiko wrote:
> >
> > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > +{
> > +     struct mm_struct *mm = current->mm;
> > +     struct uprobe *uprobe = NULL;
> > +     struct vm_area_struct *vma;
> > +     struct file *vm_file;
> > +     loff_t offset;
> > +     long seq;
> > +
> > +     guard(rcu)();
> > +
> > +     if (!mmap_lock_speculation_start(mm, &seq))
> > +             return NULL;
> > +
> > +     vma = vma_lookup(mm, bp_vaddr);
> > +     if (!vma)
> > +             return NULL;
> > +
> > +     /* vm_file memory can be reused for another instance of struct file,
> > +      * but can't be freed from under us, so it's safe to read fields from
> > +      * it, even if the values are some garbage values; ultimately
> > +      * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
> > +      * that whatever we speculatively found is correct
> > +      */
> > +     vm_file = READ_ONCE(vma->vm_file);
> > +     if (!vm_file)
> > +             return NULL;
> > +
> > +     offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
>
> LGTM. But perhaps vma->vm_pgoff and vma->vm_start need READ_ONCE() as well,
> if nothing else to shut up KCSAN if this code races with, say, __split_vma() ?

We keep going back and forth between reading directly, using
READ_ONCE(), and annotating with data_race(). I don't think it matters
in terms of correctness or performance, so I'm happy to add whatever
incantations will make everyone satisfied. Let's see what others
think, and I'll incorporate that into the next revision.

>
> > +     uprobe = find_uprobe_rcu(vm_file->f_inode, offset);
>
> OK, I guess vm_file->f_inode is fine without READ_ONCE...
>
> Oleg.
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 3/5] fs: add back RCU-delayed freeing of FMODE_BACKING file
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 3/5] fs: add back RCU-delayed freeing of FMODE_BACKING file Andrii Nakryiko
@ 2024-10-03  9:13   ` Christian Brauner
  2024-10-04  8:01     ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2024-10-03  9:13 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm,
	mjguzik, jannh, mhocko, vbabka, mingo, Amir Goldstein

On Tue, Oct 01, 2024 at 03:52:05PM GMT, Andrii Nakryiko wrote:
> 6cf41fcfe099 ("backing file: free directly") switched FMODE_BACKING
> files to direct freeing as back then there were no use cases requiring
> RCU protected access to such files.
> 
> Now, with speculative lockless VMA-to-uprobe lookup logic, we do need to
> have a guarantee that struct file memory is not going to be freed from
> under us during speculative check. So add back RCU-delayed freeing
> logic.
> 
> We use headless kfree_rcu_mightsleep() variant, as file_free() is only
> called for FMODE_BACKING files in might_sleep() context.
> 
> Suggested-by: Suren Baghdasaryan <surenb@google.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-02 20:02     ` Andrii Nakryiko
@ 2024-10-03  9:32       ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2024-10-03  9:32 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
	bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm,
	mjguzik, brauner, jannh, mhocko, vbabka, mingo

On 10/02, Andrii Nakryiko wrote:
>
> On Wed, Oct 2, 2024 at 12:25 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > +     vm_file = READ_ONCE(vma->vm_file);
> > > +     if (!vm_file)
> > > +             return NULL;
> > > +
> > > +     offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
> >
> > LGTM. But perhaps vma->vm_pgoff and vma->vm_start need READ_ONCE() as well,
> > if nothing else to shut up KCSAN if this code races with, say, __split_vma() ?
>
> We keep going back and forth between reading directly, using
> READ_ONCE(), and annotating with data_race(). I don't think it matters
> in terms of correctness or performance, so I'm happy to add whatever
> incantations that will make everyone satisfied. Let's see what others
> think, and I'll incorporate that into the next revision.

OK, agreed...

And I guess I was wrong anyway, READ_ONCE() alone won't shut up KCSAN in
this case.
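
Something along these lines, perhaps (illustrative only; data_race() being
the usual marker for plain reads that are intentionally racy):

	offset = (loff_t)(data_race(vma->vm_pgoff) << PAGE_SHIFT) +
		 (bp_vaddr - data_race(vma->vm_start));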

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 3/5] fs: add back RCU-delayed freeing of FMODE_BACKING file
  2024-10-03  9:13   ` Christian Brauner
@ 2024-10-04  8:01     ` Christian Brauner
  2024-10-04 19:58       ` Andrii Nakryiko
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2024-10-04  8:01 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm,
	mjguzik, jannh, mhocko, vbabka, mingo, Amir Goldstein

On Thu, Oct 03, 2024 at 11:13:54AM GMT, Christian Brauner wrote:
> On Tue, Oct 01, 2024 at 03:52:05PM GMT, Andrii Nakryiko wrote:
> > 6cf41fcfe099 ("backing file: free directly") switched FMODE_BACKING
> > files to direct freeing as back then there were no use cases requiring
> > RCU protected access to such files.
> > 
> > Now, with speculative lockless VMA-to-uprobe lookup logic, we do need to
> > have a guarantee that struct file memory is not going to be freed from
> > under us during speculative check. So add back RCU-delayed freeing
> > logic.
> > 
> > We use headless kfree_rcu_mightsleep() variant, as file_free() is only
> > called for FMODE_BACKING files in might_sleep() context.
> > 
> > Suggested-by: Suren Baghdasaryan <surenb@google.com>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> 
> Reviewed-by: Christian Brauner <brauner@kernel.org>

Fwiw, I have another patch series for files that I'm testing that will
require me to switch FMODE_BACKING to a SLAB_TYPESAFE_BY_RCU cache. That
shouldn't matter for your use-case though.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 3/5] fs: add back RCU-delayed freeing of FMODE_BACKING file
  2024-10-04  8:01     ` Christian Brauner
@ 2024-10-04 19:58       ` Andrii Nakryiko
  2024-10-09 10:35         ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-04 19:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
	linux-mm, mjguzik, jannh, mhocko, vbabka, mingo, Amir Goldstein

On Fri, Oct 4, 2024 at 1:01 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Oct 03, 2024 at 11:13:54AM GMT, Christian Brauner wrote:
> > On Tue, Oct 01, 2024 at 03:52:05PM GMT, Andrii Nakryiko wrote:
> > > 6cf41fcfe099 ("backing file: free directly") switched FMODE_BACKING
> > > files to direct freeing as back then there were no use cases requiring
> > > RCU protected access to such files.
> > >
> > > Now, with speculative lockless VMA-to-uprobe lookup logic, we do need to
> > > have a guarantee that struct file memory is not going to be freed from
> > > under us during speculative check. So add back RCU-delayed freeing
> > > logic.
> > >
> > > We use headless kfree_rcu_mightsleep() variant, as file_free() is only
> > > called for FMODE_BACKING files in might_sleep() context.
> > >
> > > Suggested-by: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> >
> > Reviewed-by: Christian Brauner <brauner@kernel.org>
>
> Fwiw, I have another patch series for files that I'm testing that will
> require me to switch FMODE_BACKING to a SLAB_TYPESAFE_BY_RCU cache. That
> shouldn't matter for your use-case though.

Correct, we assume SLAB_TYPESAFE_BY_RCU semantics for the common case
anyways. But hopefully my change won't cause major merge conflicts
with your patch set.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
  2024-10-02  7:25   ` Oleg Nesterov
@ 2024-10-04 23:59   ` kernel test robot
  2024-10-05  1:12   ` kernel test robot
  2 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2024-10-04 23:59 UTC (permalink / raw)
  To: Andrii Nakryiko, linux-trace-kernel, peterz, oleg
  Cc: oe-kbuild-all, rostedt, mhiramat, bpf, linux-kernel, jolsa,
	paulmck, willy, surenb, akpm, linux-mm, mjguzik, brauner, jannh,
	mhocko, vbabka, mingo, Andrii Nakryiko

Hi Andrii,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/perf/core]

url:    https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/mm-introduce-mmap_lock_speculation_-start-end/20241002-065354
base:   tip/perf/core
patch link:    https://lore.kernel.org/r/20241001225207.2215639-6-andrii%40kernel.org
patch subject: [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20241005/202410050745.2Nuvusy4-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241005/202410050745.2Nuvusy4-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410050745.2Nuvusy4-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/events/uprobes.c: In function 'find_active_uprobe_speculative':
>> kernel/events/uprobes.c:2098:46: error: passing argument 2 of 'mmap_lock_speculation_start' from incompatible pointer type [-Wincompatible-pointer-types]
    2098 |         if (!mmap_lock_speculation_start(mm, &seq))
         |                                              ^~~~
         |                                              |
         |                                              long int *
   In file included from include/linux/mm.h:16,
                    from arch/loongarch/include/asm/cacheflush.h:8,
                    from include/linux/cacheflush.h:5,
                    from include/linux/highmem.h:8,
                    from kernel/events/uprobes.c:13:
   include/linux/mmap_lock.h:126:75: note: expected 'int *' but argument is of type 'long int *'
     126 | static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
         |                                                                      ~~~~~^~~


vim +/mmap_lock_speculation_start +2098 kernel/events/uprobes.c

  2086	
  2087	static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
  2088	{
  2089		struct mm_struct *mm = current->mm;
  2090		struct uprobe *uprobe = NULL;
  2091		struct vm_area_struct *vma;
  2092		struct file *vm_file;
  2093		loff_t offset;
  2094		long seq;
  2095	
  2096		guard(rcu)();
  2097	
> 2098		if (!mmap_lock_speculation_start(mm, &seq))
  2099			return NULL;
  2100	
  2101		vma = vma_lookup(mm, bp_vaddr);
  2102		if (!vma)
  2103			return NULL;
  2104	
  2105		/* vm_file memory can be reused for another instance of struct file,
  2106		 * but can't be freed from under us, so it's safe to read fields from
  2107		 * it, even if the values are some garbage values; ultimately
  2108		 * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
  2109		 * that whatever we speculatively found is correct
  2110		 */
  2111		vm_file = READ_ONCE(vma->vm_file);
  2112		if (!vm_file)
  2113			return NULL;
  2114	
  2115		offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
  2116		uprobe = find_uprobe_rcu(vm_file->f_inode, offset);
  2117		if (!uprobe)
  2118			return NULL;
  2119	
  2120		/* now double check that nothing about MM changed */
  2121		if (!mmap_lock_speculation_end(mm, seq))
  2122			return NULL;
  2123	
  2124		return uprobe;
  2125	}
  2126	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
  2024-10-02  7:25   ` Oleg Nesterov
  2024-10-04 23:59   ` kernel test robot
@ 2024-10-05  1:12   ` kernel test robot
  2 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2024-10-05  1:12 UTC (permalink / raw)
  To: Andrii Nakryiko, linux-trace-kernel, peterz, oleg
  Cc: llvm, oe-kbuild-all, rostedt, mhiramat, bpf, linux-kernel, jolsa,
	paulmck, willy, surenb, akpm, linux-mm, mjguzik, brauner, jannh,
	mhocko, vbabka, mingo, Andrii Nakryiko

Hi Andrii,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/perf/core]

url:    https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/mm-introduce-mmap_lock_speculation_-start-end/20241002-065354
base:   tip/perf/core
patch link:    https://lore.kernel.org/r/20241001225207.2215639-6-andrii%40kernel.org
patch subject: [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
config: i386-defconfig (https://download.01.org/0day-ci/archive/20241005/202410050846.Zbsm9OI1-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241005/202410050846.Zbsm9OI1-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410050846.Zbsm9OI1-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/events/uprobes.c:2098:39: error: incompatible pointer types passing 'long *' to parameter of type 'int *' [-Werror,-Wincompatible-pointer-types]
    2098 |         if (!mmap_lock_speculation_start(mm, &seq))
         |                                              ^~~~
   include/linux/mmap_lock.h:126:75: note: passing argument to parameter 'seq' here
     126 | static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
         |                                                                           ^
   1 error generated.


vim +2098 kernel/events/uprobes.c

  2086	
  2087	static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
  2088	{
  2089		struct mm_struct *mm = current->mm;
  2090		struct uprobe *uprobe = NULL;
  2091		struct vm_area_struct *vma;
  2092		struct file *vm_file;
  2093		loff_t offset;
  2094		long seq;
  2095	
  2096		guard(rcu)();
  2097	
> 2098		if (!mmap_lock_speculation_start(mm, &seq))
  2099			return NULL;
  2100	
  2101		vma = vma_lookup(mm, bp_vaddr);
  2102		if (!vma)
  2103			return NULL;
  2104	
  2105		/* vm_file memory can be reused for another instance of struct file,
  2106		 * but can't be freed from under us, so it's safe to read fields from
  2107		 * it, even if the values are some garbage values; ultimately
  2108		 * find_uprobe_rcu() + mmap_lock_speculation_end() check will ensure
  2109		 * that whatever we speculatively found is correct
  2110		 */
  2111		vm_file = READ_ONCE(vma->vm_file);
  2112		if (!vm_file)
  2113			return NULL;
  2114	
  2115		offset = (loff_t)(vma->vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vma->vm_start);
  2116		uprobe = find_uprobe_rcu(vm_file->f_inode, offset);
  2117		if (!uprobe)
  2118			return NULL;
  2119	
  2120		/* now double check that nothing about MM changed */
  2121		if (!mmap_lock_speculation_end(mm, seq))
  2122			return NULL;
  2123	
  2124		return uprobe;
  2125	}
  2126	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 1/5] mm: introduce mmap_lock_speculation_{start|end}
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 1/5] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
@ 2024-10-07 17:05   ` Andrii Nakryiko
  0 siblings, 0 replies; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-07 17:05 UTC (permalink / raw)
  To: akpm, willy, mhocko, vbabka, linux-mm
  Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, surenb, mjguzik, brauner, jannh,
	mingo, Andrii Nakryiko, Liam Howlett

+cc Liam, sorry, seems like I forgot to add you to the entire patch
set on initial submission.

On Tue, Oct 1, 2024 at 3:52 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> From: Suren Baghdasaryan <surenb@google.com>
>
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> Link: https://lore.kernel.org/bpf/20240912210222.186542-1-surenb@google.com
> ---
>  include/linux/mm_types.h  |  3 ++
>  include/linux/mmap_lock.h | 72 ++++++++++++++++++++++++++++++++-------
>  kernel/fork.c             |  3 --
>  3 files changed, 63 insertions(+), 15 deletions(-)
>

Are memory-management folks OK with these changes? It would be nice to
get some acks, if so, and I'd include them in a respin, fixing minor
things in the uprobe patches. Thank you!

Note, while this is initially needed for uprobe functionality, having
the ability to quickly check whether the mm_struct changed in between
some speculative querying is generally useful functionality, and I
believe it would help eliminate mmap_lock usage from the /proc/PID/maps
code. That would be a great outcome for everyone, as that mmap_lock
usage can be quite disruptive in production workloads.

So please don't see it as some irrelevant uprobe-related requirement;
the applicability of this is much wider.

[...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup
  2024-10-01 22:52 [PATCH v2 tip/perf/core 0/5] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
                   ` (4 preceding siblings ...)
  2024-10-01 22:52 ` [PATCH v2 tip/perf/core 5/5] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
@ 2024-10-08 15:20 ` Oleg Nesterov
  5 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2024-10-08 15:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
	jolsa, paulmck, willy, surenb, akpm, linux-mm, mjguzik, brauner,
	jannh, mhocko, vbabka, mingo

Well. I am in no position to ack these changes.

But let me say that I personally like this series and I see nothing wrong,
except perhaps 5/5 needs some data_race/etc annotations as we have already
discussed.

Thanks,

Oleg.

On 10/01, Andrii Nakryiko wrote:
>
> Implement speculative (lockless) resolution of VMA to inode to uprobe,
> bypassing the need to take mmap_lock for reads, if possible. Patch #1 by Suren
> adds mm_struct helpers that help detect whether the mm_struct was changed,
> which is used by uprobe logic to validate that the speculative results can be
> trusted once the lookup logic yields a valid uprobe instance. Patch #2
> follows to make mm_lock_seq a 64-bit counter (on 64-bit architectures).
>
> Patch #3 adds back RCU-delayed freeing for FMODE_BACKING files, which is
> necessary to make it safe for the speculative logic to access struct file's
> memory in any possible situation.
>
> Patch #4 is a simplification to uprobe VMA flag checking, suggested by Oleg.
>
> And, finally, patch #5 is the speculative VMA-to-uprobe resolution logic. See
> corresponding patch for details and benchmarking results.
>
> v1->v2:
> - adjusted vma_end_write_all() comment to point out it should never be called
>   manually now, but I wasn't sure how ACQUIRE/RELEASE comments should be
>   reworded (previously requested by Jann), so I'd appreciate some help there
>   (Jann);
> - int -> long change for mm_lock_seq, as agreed at LPC2024 (Jann, Suren, Liam);
> - kfree_rcu_mightsleep() for FMODE_BACKING (Suren, Christian);
> - vm_flags simplification in find_active_uprobe_rcu() and
>   find_active_uprobe_speculative() (Oleg);
> - guard(rcu)() simplified find_active_uprobe_speculative() implementation.
>
> Andrii Nakryiko (4):
>   mm: switch to 64-bit mm_lock_seq/vm_lock_seq on 64-bit architectures
>   fs: add back RCU-delayed freeing of FMODE_BACKING file
>   uprobes: simplify find_active_uprobe_rcu() VMA checks
>   uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
>
> Suren Baghdasaryan (1):
>   mm: introduce mmap_lock_speculation_{start|end}
>
>  fs/file_table.c           |  2 +-
>  include/linux/mm.h        |  6 ++--
>  include/linux/mm_types.h  |  7 ++--
>  include/linux/mmap_lock.h | 72 ++++++++++++++++++++++++++++++++-------
>  kernel/events/uprobes.c   | 46 ++++++++++++++++++++++++-
>  kernel/fork.c             |  3 --
>  6 files changed, 114 insertions(+), 22 deletions(-)
>
> --
> 2.43.5
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 3/5] fs: add back RCU-delayed freeing of FMODE_BACKING file
  2024-10-04 19:58       ` Andrii Nakryiko
@ 2024-10-09 10:35         ` Christian Brauner
  2024-10-09 19:37           ` Andrii Nakryiko
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2024-10-09 10:35 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
	linux-mm, mjguzik, jannh, mhocko, vbabka, mingo, Amir Goldstein

On Fri, Oct 04, 2024 at 12:58:00PM GMT, Andrii Nakryiko wrote:
> On Fri, Oct 4, 2024 at 1:01 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Thu, Oct 03, 2024 at 11:13:54AM GMT, Christian Brauner wrote:
> > > On Tue, Oct 01, 2024 at 03:52:05PM GMT, Andrii Nakryiko wrote:
> > > > 6cf41fcfe099 ("backing file: free directly") switched FMODE_BACKING
> > > > files to direct freeing as back then there were no use cases requiring
> > > > RCU protected access to such files.
> > > >
> > > > Now, with speculative lockless VMA-to-uprobe lookup logic, we do need to
> > > > have a guarantee that struct file memory is not going to be freed from
> > > > under us during speculative check. So add back RCU-delayed freeing
> > > > logic.
> > > >
> > > > We use headless kfree_rcu_mightsleep() variant, as file_free() is only
> > > > called for FMODE_BACKING files in might_sleep() context.
> > > >
> > > > Suggested-by: Suren Baghdasaryan <surenb@google.com>
> > > > Cc: Christian Brauner <brauner@kernel.org>
> > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > >
> > > Reviewed-by: Christian Brauner <brauner@kernel.org>
> >
> > Fwiw, I have another patch series for files that I'm testing that will
> > require me to switch FMODE_BACKING to a SLAB_TYPESAFE_BY_RCU cache. That
> > shouldn't matter for your use-case though.
> 
> Correct, we assume SLAB_TYPESAFE_BY_RCU semantics for the common case
> anyways. But hopefully my change won't cause major merge conflicts
> with your patch set.

Please drop this patch and pull the following tag, which adds
SLAB_TYPESAFE_BY_RCU protection for FMODE_BACKING files, aligning them
with the lifetime of regular files (even though not needed). The branch
the tag is based on is stable and won't change anymore:

git pull -S git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git tags/vfs-6.13.for-bpf.file

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 tip/perf/core 3/5] fs: add back RCU-delayed freeing of FMODE_BACKING file
  2024-10-09 10:35         ` Christian Brauner
@ 2024-10-09 19:37           ` Andrii Nakryiko
  0 siblings, 0 replies; 19+ messages in thread
From: Andrii Nakryiko @ 2024-10-09 19:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
	linux-mm, mjguzik, jannh, mhocko, vbabka, mingo, Amir Goldstein

On Wed, Oct 9, 2024 at 3:36 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, Oct 04, 2024 at 12:58:00PM GMT, Andrii Nakryiko wrote:
> > On Fri, Oct 4, 2024 at 1:01 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Thu, Oct 03, 2024 at 11:13:54AM GMT, Christian Brauner wrote:
> > > > On Tue, Oct 01, 2024 at 03:52:05PM GMT, Andrii Nakryiko wrote:
> > > > > 6cf41fcfe099 ("backing file: free directly") switched FMODE_BACKING
> > > > > files to direct freeing as back then there were no use cases requiring
> > > > > RCU protected access to such files.
> > > > >
> > > > > Now, with speculative lockless VMA-to-uprobe lookup logic, we do need to
> > > > > have a guarantee that struct file memory is not going to be freed from
> > > > > under us during speculative check. So add back RCU-delayed freeing
> > > > > logic.
> > > > >
> > > > > We use headless kfree_rcu_mightsleep() variant, as file_free() is only
> > > > > called for FMODE_BACKING files in might_sleep() context.
> > > > >
> > > > > Suggested-by: Suren Baghdasaryan <surenb@google.com>
> > > > > Cc: Christian Brauner <brauner@kernel.org>
> > > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > > ---
> > > >
> > > > Reviewed-by: Christian Brauner <brauner@kernel.org>
> > >
> > > Fwiw, I have another patch series for files that I'm testing that will
> > > require me to switch FMODE_BACKING to a SLAB_TYPESAFE_BY_RCU cache. That
> > > shouldn't matter for your use-case though.
> >
> > Correct, we assume SLAB_TYPESAFE_BY_RCU semantics for the common case
> > anyways. But hopefully my change won't cause major merge conflicts
> > with your patch set.
>
> Please drop this patch and pull the following tag, which adds
> SLAB_TYPESAFE_BY_RCU protection for FMODE_BACKING files, aligning them
> with the lifetime of regular files (even though not needed). The branch
> the tag is based on is stable and won't change anymore:
>
> git pull -S git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git tags/vfs-6.13.for-bpf.file

Ok, will drop. It will be on Peter to pull this tag into tip/perf/core,
but I'll mention all this in the cover letter (and will pull locally
for testing, of course). Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread
