linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations
@ 2024-09-03 17:45 Andrii Nakryiko
  2024-09-03 17:45 ` [PATCH v5 1/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:45 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, Andrii Nakryiko

This patch set is heavily inspired by Peter Zijlstra's uprobe optimization
patches ([0]) and continues that work, albeit trying to keep complexity to the
minimum, and attepting to reuse existing primitives as much as possible. The
goal here is to optimize common uprobe triggering hot path, while keeping the
rest of locking mostly intact.

I've added uprobe_unregister_sync() into the error handling code path inside
uprobe_unregister(). This is due to recent refactorings from Oleg Nesterov
([1]), which necessitates this addition.

Except for refcounting change patch (which I stongly believe is a good
improvement we should do and forget about quasi-refcounting schema of
uprobe->consumers list), the rest of the changes are similar to Peter's
initial changes in [0].

Main differences would be:
  - no special RCU protection for mmap and fork handling, we just stick to
    refcounts there, as those are infrequent and not a performance-sensitive
    code path, while being complex and thus benefiting from proper locking;
  - the above means we don't need to do any custom SRCU additions to handle
    forking code path;
  - I handled UPROBE_HANDLER_REMOVE problem in handler_chain() differently,
    again, leveraging existing locking scheam;
  - I kept refcount usage for uretprobe and single-stepping uprobes, I plan to
    address that in a separate follow up patches. The plan is to avoid
    task_work with a lockless atomic xchg()/cmpxchg()-based solution;
  - finally, I dutifully was using SRCU throughout all the changes, and only
    last patch switches SRCU to RCU Tasks Trace and demonstrates significant
    performance and scalability gains from this.

The changes in this patch set were tested using BPF selftests and using
uprobe-stress ([2]) tool.

Now, for the benchmarking results. I've used the following script (which
utilizes BPF selftests-based bench tool). The CPU used was 80-core Intel Xeon
Gold 6138 CPU @ 2.00GHz running kernel with production-like config. I minimized
background noise by stopping any service I could identify and stop, so results
are pretty stable and variability is pretty small, overall.

Benchmark script:

#!/bin/bash

set -eufo pipefail

for i in uprobe-nop uretprobe-nop; do
    for p in 1 2 4 8 16 32 64; do
        summary=$(sudo ./bench -w3 -d5 -p$p -a trig-$i | tail -n1)
        total=$(echo "$summary" | cut -d'(' -f1 | cut -d' ' -f3-)
        percpu=$(echo "$summary" | cut -d'(' -f2 | cut -d')' -f1 | cut -d'/' -f1)
        printf "%-15s (%2d cpus): %s (%s/s/cpu)\n" $i $p "$total" "$percpu"
    done
    echo
done

With all the lock-avoiding changes done in this patch set, we get a pretty
decent improvement in performance and scalability of uprobes with number of
CPUs, even though we are still nowhere near linear scalability. This is due to
the remaining mmap_lock, which is currently taken to resolve interrupt address
to inode+offset and then uprobe instance. And, of course, uretprobes still need
similar RCU to avoid refcount in the hot path, which will be addressed in the
follow up patches.

BASELINE (on top of Oleg's clean up patches)
============================================
uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  (  3.032M/s/cpu)
uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  (  1.726M/s/cpu)
uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  (  0.916M/s/cpu)
uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  (  0.465M/s/cpu)
uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  (  0.209M/s/cpu)
uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  (  0.071M/s/cpu)
uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  (  0.050M/s/cpu)

uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  (  1.979M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  (  1.180M/s/cpu)
uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  (  0.577M/s/cpu)
uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  (  0.282M/s/cpu)
uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  (  0.125M/s/cpu)
uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  (  0.051M/s/cpu)
uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  (  0.034M/s/cpu)

Up to second-to-last patch (i.e., SRCU-based optimizations)
===========================================================
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

RCU Tasks Trace
===============
uprobe-nop      ( 1 cpus):    3.396 ± 0.002M/s  (  3.396M/s/cpu)
uprobe-nop      ( 2 cpus):    4.271 ± 0.006M/s  (  2.135M/s/cpu)
uprobe-nop      ( 4 cpus):    8.499 ± 0.015M/s  (  2.125M/s/cpu)
uprobe-nop      ( 8 cpus):   10.355 ± 0.028M/s  (  1.294M/s/cpu)
uprobe-nop      (16 cpus):    7.615 ± 0.099M/s  (  0.476M/s/cpu)
uprobe-nop      (32 cpus):    4.430 ± 0.007M/s  (  0.138M/s/cpu)
uprobe-nop      (64 cpus):    6.887 ± 0.020M/s  (  0.108M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.174 ± 0.001M/s  (  2.174M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.853 ± 0.001M/s  (  1.426M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.913 ± 0.002M/s  (  1.228M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.883 ± 0.002M/s  (  0.735M/s/cpu)
uretprobe-nop   (16 cpus):    5.147 ± 0.001M/s  (  0.322M/s/cpu)
uretprobe-nop   (32 cpus):    3.738 ± 0.008M/s  (  0.117M/s/cpu)
uretprobe-nop   (64 cpus):    4.397 ± 0.002M/s  (  0.069M/s/cpu)

For baseline vs SRCU, peak througput increased from 3.7 M/s (million uprobe
triggerings per second) up to about 8 M/s. For uretprobes it's a bit more
modest with bump from 2.4 M/s to 5 M/s.

For SRCU vs RCU Tasks Trace, peak throughput for uprobes increases further from
8 M/s to 10.3 M/s (+28%!), and for uretprobes from 5.3 M/s to 5.8 M/s (+11%),
as we have more work to do on uretprobes side.

Even single-thread (no contention) performance is slightly better: 3.276 M/s to
3.396 M/s (+3.5%) for uprobes, and 2.055 M/s to 2.174 M/s (+5.8%)
for uretprobes.

  [0] https://lore.kernel.org/linux-trace-kernel/20240711110235.098009979@infradead.org/
  [1] https://lore.kernel.org/linux-trace-kernel/20240729134444.GA12293@redhat.com/
  [2] https://github.com/libbpf/libbpf-bootstrap/tree/uprobe-stress

v4->v5:
  - added missing linux/rcupdate_trace.h include (Kernel Test Bot);
  - added comment in uprobe_consumer about expected consistency between
    handler() returning UPROBE_HANDLER_REMOVE and the need for filter() to
    confirm the decision (Oleg);
  - this time added Oleg's Reviewed-by;
v3->v4:
  - added back consumer_rwsem into consumer_del(), which was accidentally
    omitted earlier (Jiri);
  - left out RFC patches, we can continue discussion on v3 patch set, if
    necessary;
v2->v3:
  - undid rcu and rb_node fields colocation which were causing crashes (Oleg);
  - ensure synchronize_srcu() on registration failure in patch #4 (Oleg);
v1->v2:
  - added back missed kfree() in patch #1 (Oleg);
  - forgot the rest, but there were a few small things here and there.

Andrii Nakryiko (6):
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: protected uprobe lifetime with SRCU
  uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
  uprobes: travers uprobe's consumer list locklessly under SRCU
    protection
  uprobes: perform lockless SRCU-protected uprobes_tree lookup
  uprobes: switch to RCU Tasks Trace flavor for better performance

Peter Zijlstra (2):
  perf/uprobe: split uprobe_unregister()
  rbtree: provide rb_find_rcu() / rb_find_add_rcu()

 include/linux/rbtree.h                        |  67 +++
 include/linux/uprobes.h                       |  28 +-
 kernel/events/uprobes.c                       | 392 +++++++++++-------
 kernel/trace/bpf_trace.c                      |   8 +-
 kernel/trace/trace_uprobe.c                   |  15 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |   3 +-
 6 files changed, 330 insertions(+), 183 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v5 1/8] uprobes: revamp uprobe refcounting and lifetime management
  2024-09-03 17:45 [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
@ 2024-09-03 17:45 ` Andrii Nakryiko
  2024-09-03 17:45 ` [PATCH v5 2/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:45 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, Andrii Nakryiko

Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:
  - uprobes_tree RB tree assumes one refcount when uprobe is registered
    and added to the lookup tree;
  - while uprobe is triggered and kernel is handling it in the breakpoint
    handler code, temporary refcount bump is done to keep uprobe from
    being freed;
  - if we have uretprobe requested on a given struct uprobe instance, we
    take another refcount to keep uprobe alive until user space code
    returns from the function and triggers return handler.

The uprobe_tree's extra refcount of 1 is confusing and problematic. No
matter how many actual consumers are attached, they all share the same
refcount, and we have an extra logic to drop the "last" (which might not
really be last) refcount once uprobe's consumer list becomes empty.

This is unconventional and has to be kept in mind as a special case all
the time. Further, because of this design we have the situations where
find_uprobe() will find uprobe, bump refcount, return it to the caller,
but that uprobe will still need uprobe_is_active() check, after which
the caller is required to drop refcount and try again. This is just too
many details leaking to the higher level logic.

This patch changes refcounting scheme in such a way as to not have
uprobes_tree keeping extra refcount for struct uprobe. Instead, each
uprobe_consumer is assuming its own refcount, which will be dropped
when consumer is unregistered. Other than that, all the active users of
uprobe (entry and return uprobe handling code) keeps exactly the same
refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
of course. This, though, races with uprobe entry handling code in
handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup,
can race with uprobe being destroyed after refcount drops to zero (e.g.,
due to uprobe_consumer unregistering). So we add try_get_uprobe(), which
will attempt to bump refcount, unless it already is zero. Caller needs
to guarantee that uprobe instance won't be freed in parallel, which is
the case while we keep uprobes_treelock (for read or write, doesn't
matter).

Note also, we now don't leak the race between registration and
unregistration, so we remove the retry logic completely. If
find_uprobe() returns valid uprobe, it's guaranteed to remain in
uprobes_tree with properly incremented refcount. The race is handled
inside __insert_uprobe() and put_uprobe() working together:
__insert_uprobe() will remove uprobe from RB-tree, if it can't bump
refcount and will retry to insert the new uprobe instance. put_uprobe()
won't attempt to remove uprobe from RB-tree, if it's already not there.
All that is protected by uprobes_treelock, which keeps things simple.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 179 +++++++++++++++++++++++-----------------
 1 file changed, 101 insertions(+), 78 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 33349cc8de0c..147561c19d57 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -109,6 +109,11 @@ struct xol_area {
 	unsigned long 			vaddr;		/* Page(s) of instruction slots */
 };
 
+static void uprobe_warn(struct task_struct *t, const char *msg)
+{
+	pr_warn("uprobe: %s:%d failed to %s\n", current->comm, current->pid, msg);
+}
+
 /*
  * valid_vma: Verify if the specified vma is an executable vma
  * Relax restrictions while unregistering: vm_flags might have
@@ -587,25 +592,53 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
+/* uprobe should have guaranteed positive refcount */
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
 	refcount_inc(&uprobe->ref);
 	return uprobe;
 }
 
+/*
+ * uprobe should have guaranteed lifetime, which can be either of:
+ *   - caller already has refcount taken (and wants an extra one);
+ *   - uprobe is RCU protected and won't be freed until after grace period;
+ *   - we are holding uprobes_treelock (for read or write, doesn't matter).
+ */
+static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
+{
+	if (refcount_inc_not_zero(&uprobe->ref))
+		return uprobe;
+	return NULL;
+}
+
+static inline bool uprobe_is_active(struct uprobe *uprobe)
+{
+	return !RB_EMPTY_NODE(&uprobe->rb_node);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
-	if (refcount_dec_and_test(&uprobe->ref)) {
-		/*
-		 * If application munmap(exec_vma) before uprobe_unregister()
-		 * gets called, we don't get a chance to remove uprobe from
-		 * delayed_uprobe_list from remove_breakpoint(). Do it here.
-		 */
-		mutex_lock(&delayed_uprobe_lock);
-		delayed_uprobe_remove(uprobe, NULL);
-		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
-	}
+	if (!refcount_dec_and_test(&uprobe->ref))
+		return;
+
+	write_lock(&uprobes_treelock);
+
+	if (uprobe_is_active(uprobe))
+		rb_erase(&uprobe->rb_node, &uprobes_tree);
+
+	write_unlock(&uprobes_treelock);
+
+	/*
+	 * If application munmap(exec_vma) before uprobe_unregister()
+	 * gets called, we don't get a chance to remove uprobe from
+	 * delayed_uprobe_list from remove_breakpoint(). Do it here.
+	 */
+	mutex_lock(&delayed_uprobe_lock);
+	delayed_uprobe_remove(uprobe, NULL);
+	mutex_unlock(&delayed_uprobe_lock);
+
+	kfree(uprobe);
 }
 
 static __always_inline
@@ -656,7 +689,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		return try_get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
@@ -676,26 +709,44 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+/*
+ * Attempt to insert a new uprobe into uprobes_tree.
+ *
+ * If uprobe already exists (for given inode+offset), we just increment
+ * refcount of previously existing uprobe.
+ *
+ * If not, a provided new instance of uprobe is inserted into the tree (with
+ * assumed initial refcount == 1).
+ *
+ * In any case, we return a uprobe instance that ends up being in uprobes_tree.
+ * Caller has to clean up new uprobe instance, if it ended up not being
+ * inserted into the tree.
+ *
+ * We assume that uprobes_treelock is held for writing.
+ */
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
-
+again:
 	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
-	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+	if (node) {
+		struct uprobe *u = __node_2_uprobe(node);
 
-	/* get access + creation ref */
-	refcount_set(&uprobe->ref, 2);
-	return NULL;
+		if (!try_get_uprobe(u)) {
+			rb_erase(node, &uprobes_tree);
+			RB_CLEAR_NODE(&u->rb_node);
+			goto again;
+		}
+
+		return u;
+	}
+
+	return uprobe;
 }
 
 /*
- * Acquire uprobes_treelock.
- * Matching uprobe already exists in rbtree;
- *	increment (access refcount) and return the matching uprobe.
- *
- * No matching uprobe; insert the uprobe in rb_tree;
- *	get a double refcount (access + creation) and return NULL.
+ * Acquire uprobes_treelock and insert uprobe into uprobes_tree
+ * (or reuse existing one, see __insert_uprobe() comments above).
  */
 static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 {
@@ -732,11 +783,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
+	RB_CLEAR_NODE(&uprobe->rb_node);
+	refcount_set(&uprobe->ref, 1);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe) {
+	if (cur_uprobe != uprobe) {
 		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 			put_uprobe(cur_uprobe);
@@ -921,26 +974,6 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad
 	return set_orig_insn(&uprobe->arch, mm, vaddr);
 }
 
-static inline bool uprobe_is_active(struct uprobe *uprobe)
-{
-	return !RB_EMPTY_NODE(&uprobe->rb_node);
-}
-/*
- * There could be threads that have already hit the breakpoint. They
- * will recheck the current insn and restart if find_uprobe() fails.
- * See find_active_uprobe().
- */
-static void delete_uprobe(struct uprobe *uprobe)
-{
-	if (WARN_ON(!uprobe_is_active(uprobe)))
-		return;
-
-	write_lock(&uprobes_treelock);
-	rb_erase(&uprobe->rb_node, &uprobes_tree);
-	write_unlock(&uprobes_treelock);
-	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
-}
-
 struct map_info {
 	struct map_info *next;
 	struct mm_struct *mm;
@@ -1094,17 +1127,13 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	int err;
 
 	down_write(&uprobe->register_rwsem);
-	if (WARN_ON(!consumer_del(uprobe, uc)))
+	if (WARN_ON(!consumer_del(uprobe, uc))) {
 		err = -ENOENT;
-	else
+	} else {
 		err = register_for_each_vma(uprobe, NULL);
-
-	/* TODO : cant unregister? schedule a worker thread */
-	if (!err) {
-		if (!uprobe->consumers)
-			delete_uprobe(uprobe);
-		else
-			err = -EBUSY;
+		/* TODO : cant unregister? schedule a worker thread */
+		if (unlikely(err))
+			uprobe_warn(current, "unregister, leaking uprobe");
 	}
 	up_write(&uprobe->register_rwsem);
 
@@ -1159,27 +1188,16 @@ struct uprobe *uprobe_register(struct inode *inode,
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return ERR_PTR(-EINVAL);
 
- retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (IS_ERR(uprobe))
 		return uprobe;
 
-	/*
-	 * We can race with uprobe_unregister()->delete_uprobe().
-	 * Check uprobe_is_active() and retry if it is false.
-	 */
 	down_write(&uprobe->register_rwsem);
-	ret = -EAGAIN;
-	if (likely(uprobe_is_active(uprobe))) {
-		consumer_add(uprobe, uc);
-		ret = register_for_each_vma(uprobe, uc);
-	}
+	consumer_add(uprobe, uc);
+	ret = register_for_each_vma(uprobe, uc);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 
 	if (ret) {
-		if (unlikely(ret == -EAGAIN))
-			goto retry;
 		uprobe_unregister(uprobe, uc);
 		return ERR_PTR(ret);
 	}
@@ -1286,15 +1304,17 @@ static void build_probe_list(struct inode *inode,
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset < min)
 				break;
-			list_add(&u->pending_list, head);
-			get_uprobe(u);
+			/* if uprobe went away, it's safe to ignore it */
+			if (try_get_uprobe(u))
+				list_add(&u->pending_list, head);
 		}
 		for (t = n; (t = rb_next(t)); ) {
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset > max)
 				break;
-			list_add(&u->pending_list, head);
-			get_uprobe(u);
+			/* if uprobe went away, it's safe to ignore it */
+			if (try_get_uprobe(u))
+				list_add(&u->pending_list, head);
 		}
 	}
 	read_unlock(&uprobes_treelock);
@@ -1752,6 +1772,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 			return -ENOMEM;
 
 		*n = *o;
+		/*
+		 * uprobe's refcnt has to be positive at this point, kept by
+		 * utask->return_instances items; return_instances can't be
+		 * removed right now, as task is blocked due to duping; so
+		 * get_uprobe() is safe to use here.
+		 */
 		get_uprobe(n->uprobe);
 		n->next = NULL;
 
@@ -1763,12 +1789,6 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 	return 0;
 }
 
-static void uprobe_warn(struct task_struct *t, const char *msg)
-{
-	pr_warn("uprobe: %s:%d failed to %s\n",
-			current->comm, current->pid, msg);
-}
-
 static void dup_xol_work(struct callback_head *work)
 {
 	if (current->flags & PF_EXITING)
@@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
-
+	/*
+	 * uprobe's refcnt is positive, held by caller, so it's safe to
+	 * unconditionally bump it one more time here
+	 */
 	ri->uprobe = get_uprobe(uprobe);
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 2/8] uprobes: protected uprobe lifetime with SRCU
  2024-09-03 17:45 [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
  2024-09-03 17:45 ` [PATCH v5 1/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
@ 2024-09-03 17:45 ` Andrii Nakryiko
  2024-09-03 17:45 ` [PATCH v5 3/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:45 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, Andrii Nakryiko

To avoid unnecessarily taking a (brief) refcount on uprobe during
breakpoint handling in handle_swbp for entry uprobes, make find_uprobe()
not take refcount, but protect the lifetime of a uprobe instance with
RCU. This improves scalability, as refcount gets quite expensive due to
cache line bouncing between multiple CPUs.

Specifically, we utilize our own uprobe-specific SRCU instance for this
RCU protection. put_uprobe() will delay actual kfree() using call_srcu().

For now, uretprobe and single-stepping handling will still acquire
refcount as necessary. We'll address these issues in follow up patches
by making them use SRCU with timeout.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 94 +++++++++++++++++++++++------------------
 1 file changed, 54 insertions(+), 40 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 147561c19d57..3e3595753e2c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -41,6 +41,8 @@ static struct rb_root uprobes_tree = RB_ROOT;
 
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
 
+DEFINE_STATIC_SRCU(uprobes_srcu);
+
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
 static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
@@ -59,6 +61,7 @@ struct uprobe {
 	struct list_head	pending_list;
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
+	struct rcu_head		rcu;
 	loff_t			offset;
 	loff_t			ref_ctr_offset;
 	unsigned long		flags;
@@ -617,6 +620,13 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
 	return !RB_EMPTY_NODE(&uprobe->rb_node);
 }
 
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+
+	kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
 	if (!refcount_dec_and_test(&uprobe->ref))
@@ -638,7 +648,7 @@ static void put_uprobe(struct uprobe *uprobe)
 	delayed_uprobe_remove(uprobe, NULL);
 	mutex_unlock(&delayed_uprobe_lock);
 
-	kfree(uprobe);
+	call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
 }
 
 static __always_inline
@@ -680,33 +690,25 @@ static inline int __uprobe_cmp(struct rb_node *a, const struct rb_node *b)
 	return uprobe_cmp(u->inode, u->offset, __node_2_uprobe(b));
 }
 
-static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
+/*
+ * Assumes being inside RCU protected region.
+ * No refcount is taken on returned uprobe.
+ */
+static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
 {
 	struct __uprobe_key key = {
 		.inode = inode,
 		.offset = offset,
 	};
-	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
-
-	if (node)
-		return try_get_uprobe(__node_2_uprobe(node));
-
-	return NULL;
-}
+	struct rb_node *node;
 
-/*
- * Find a uprobe corresponding to a given inode:offset
- * Acquires uprobes_treelock
- */
-static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
-{
-	struct uprobe *uprobe;
+	lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
 
 	read_lock(&uprobes_treelock);
-	uprobe = __find_uprobe(inode, offset);
+	node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 	read_unlock(&uprobes_treelock);
 
-	return uprobe;
+	return node ? __node_2_uprobe(node) : NULL;
 }
 
 /*
@@ -1080,10 +1082,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 			goto free;
 		/*
 		 * We take mmap_lock for writing to avoid the race with
-		 * find_active_uprobe() which takes mmap_lock for reading.
+		 * find_active_uprobe_rcu() which takes mmap_lock for reading.
 		 * Thus this install_breakpoint() can not make
-		 * is_trap_at_addr() true right after find_uprobe()
-		 * returns NULL in find_active_uprobe().
+		 * is_trap_at_addr() true right after find_uprobe_rcu()
+		 * returns NULL in find_active_uprobe_rcu().
 		 */
 		mmap_write_lock(mm);
 		vma = find_vma(mm, info->vaddr);
@@ -1885,9 +1887,13 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		return;
 	}
 
+	/* we need to bump refcount to store uprobe in utask */
+	if (!try_get_uprobe(uprobe))
+		return;
+
 	ri = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
 	if (!ri)
-		return;
+		goto fail;
 
 	trampoline_vaddr = uprobe_get_trampoline_vaddr();
 	orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
@@ -1914,11 +1920,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
-	/*
-	 * uprobe's refcnt is positive, held by caller, so it's safe to
-	 * unconditionally bump it one more time here
-	 */
-	ri->uprobe = get_uprobe(uprobe);
+	ri->uprobe = uprobe;
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
@@ -1929,8 +1931,9 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 	utask->return_instances = ri;
 
 	return;
- fail:
+fail:
 	kfree(ri);
+	put_uprobe(uprobe);
 }
 
 /* Prepare to single-step probed instruction out of line. */
@@ -1945,9 +1948,14 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
 	if (!utask)
 		return -ENOMEM;
 
+	if (!try_get_uprobe(uprobe))
+		return -EINVAL;
+
 	xol_vaddr = xol_get_insn_slot(uprobe);
-	if (!xol_vaddr)
-		return -ENOMEM;
+	if (!xol_vaddr) {
+		err = -ENOMEM;
+		goto err_out;
+	}
 
 	utask->xol_vaddr = xol_vaddr;
 	utask->vaddr = bp_vaddr;
@@ -1955,12 +1963,15 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
 	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
 	if (unlikely(err)) {
 		xol_free_insn_slot(current);
-		return err;
+		goto err_out;
 	}
 
 	utask->active_uprobe = uprobe;
 	utask->state = UTASK_SSTEP;
 	return 0;
+err_out:
+	put_uprobe(uprobe);
+	return err;
 }
 
 /*
@@ -2043,7 +2054,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 	return is_trap_insn(&opcode);
 }
 
-static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
+/* assumes being inside RCU protected region */
+static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
 {
 	struct mm_struct *mm = current->mm;
 	struct uprobe *uprobe = NULL;
@@ -2056,7 +2068,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
 			struct inode *inode = file_inode(vma->vm_file);
 			loff_t offset = vaddr_to_offset(vma, bp_vaddr);
 
-			uprobe = find_uprobe(inode, offset);
+			uprobe = find_uprobe_rcu(inode, offset);
 		}
 
 		if (!uprobe)
@@ -2202,13 +2214,15 @@ static void handle_swbp(struct pt_regs *regs)
 {
 	struct uprobe *uprobe;
 	unsigned long bp_vaddr;
-	int is_swbp;
+	int is_swbp, srcu_idx;
 
 	bp_vaddr = uprobe_get_swbp_addr(regs);
 	if (bp_vaddr == uprobe_get_trampoline_vaddr())
 		return uprobe_handle_trampoline(regs);
 
-	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
+	srcu_idx = srcu_read_lock(&uprobes_srcu);
+
+	uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
 	if (!uprobe) {
 		if (is_swbp > 0) {
 			/* No matching uprobe; signal SIGTRAP. */
@@ -2224,7 +2238,7 @@ static void handle_swbp(struct pt_regs *regs)
 			 */
 			instruction_pointer_set(regs, bp_vaddr);
 		}
-		return;
+		goto out;
 	}
 
 	/* change it in advance for ->handler() and restart */
@@ -2259,12 +2273,12 @@ static void handle_swbp(struct pt_regs *regs)
 	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
 		goto out;
 
-	if (!pre_ssout(uprobe, regs, bp_vaddr))
-		return;
+	if (pre_ssout(uprobe, regs, bp_vaddr))
+		goto out;
 
-	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
 out:
-	put_uprobe(uprobe);
+	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
+	srcu_read_unlock(&uprobes_srcu, srcu_idx);
 }
 
 /*
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 3/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
  2024-09-03 17:45 [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
  2024-09-03 17:45 ` [PATCH v5 1/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
  2024-09-03 17:45 ` [PATCH v5 2/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
@ 2024-09-03 17:45 ` Andrii Nakryiko
  2024-09-03 17:45 ` [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:45 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, Andrii Nakryiko

It serves no purpose beyond adding unnecessray argument passed to the
filter callback. Just get rid of it, no one is actually using it.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h     | 10 +---------
 kernel/events/uprobes.c     | 18 +++++++-----------
 kernel/trace/bpf_trace.c    |  3 +--
 kernel/trace/trace_uprobe.c |  9 +++------
 4 files changed, 12 insertions(+), 28 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 6332c111036e..9cf0dce62e4c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -28,20 +28,12 @@ struct page;
 
 #define MAX_URETPROBE_DEPTH		64
 
-enum uprobe_filter_ctx {
-	UPROBE_FILTER_REGISTER,
-	UPROBE_FILTER_UNREGISTER,
-	UPROBE_FILTER_MMAP,
-};
-
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	int (*ret_handler)(struct uprobe_consumer *self,
 				unsigned long func,
 				struct pt_regs *regs);
-	bool (*filter)(struct uprobe_consumer *self,
-				enum uprobe_filter_ctx ctx,
-				struct mm_struct *mm);
+	bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
 
 	struct uprobe_consumer *next;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 3e3595753e2c..8bdcdc6901b2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -918,21 +918,19 @@ static int prepare_uprobe(struct uprobe *uprobe, struct file *file,
 	return ret;
 }
 
-static inline bool consumer_filter(struct uprobe_consumer *uc,
-				   enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+static inline bool consumer_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
 {
-	return !uc->filter || uc->filter(uc, ctx, mm);
+	return !uc->filter || uc->filter(uc, mm);
 }
 
-static bool filter_chain(struct uprobe *uprobe,
-			 enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
 {
 	struct uprobe_consumer *uc;
 	bool ret = false;
 
 	down_read(&uprobe->consumer_rwsem);
 	for (uc = uprobe->consumers; uc; uc = uc->next) {
-		ret = consumer_filter(uc, ctx, mm);
+		ret = consumer_filter(uc, mm);
 		if (ret)
 			break;
 	}
@@ -1099,12 +1097,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 
 		if (is_register) {
 			/* consult only the "caller", new consumer. */
-			if (consumer_filter(new,
-					UPROBE_FILTER_REGISTER, mm))
+			if (consumer_filter(new, mm))
 				err = install_breakpoint(uprobe, mm, vma, info->vaddr);
 		} else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) {
-			if (!filter_chain(uprobe,
-					UPROBE_FILTER_UNREGISTER, mm))
+			if (!filter_chain(uprobe, mm))
 				err |= remove_breakpoint(uprobe, mm, info->vaddr);
 		}
 
@@ -1387,7 +1383,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
 	 */
 	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
 		if (!fatal_signal_pending(current) &&
-		    filter_chain(uprobe, UPROBE_FILTER_MMAP, vma->vm_mm)) {
+		    filter_chain(uprobe, vma->vm_mm)) {
 			unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset);
 			install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
 		}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 4e391daafa64..73c570b5988b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3320,8 +3320,7 @@ static int uprobe_prog_run(struct bpf_uprobe *uprobe,
 }
 
 static bool
-uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx,
-			 struct mm_struct *mm)
+uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm)
 {
 	struct bpf_uprobe *uprobe;
 
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 52e76a73fa7c..7eb79e0a5352 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1078,9 +1078,7 @@ print_uprobe_event(struct trace_iterator *iter, int flags, struct trace_event *e
 	return trace_handle_return(s);
 }
 
-typedef bool (*filter_func_t)(struct uprobe_consumer *self,
-				enum uprobe_filter_ctx ctx,
-				struct mm_struct *mm);
+typedef bool (*filter_func_t)(struct uprobe_consumer *self, struct mm_struct *mm);
 
 static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
 {
@@ -1339,8 +1337,7 @@ static int uprobe_perf_open(struct trace_event_call *call,
 	return err;
 }
 
-static bool uprobe_perf_filter(struct uprobe_consumer *uc,
-				enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+static bool uprobe_perf_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
 {
 	struct trace_uprobe_filter *filter;
 	struct trace_uprobe *tu;
@@ -1426,7 +1423,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 static int uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs,
 			    struct uprobe_cpu_buffer **ucbp)
 {
-	if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
+	if (!uprobe_perf_filter(&tu->consumer, current->mm))
 		return UPROBE_HANDLER_REMOVE;
 
 	if (!is_ret_probe(tu))
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
  2024-09-03 17:45 [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-09-03 17:45 ` [PATCH v5 3/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
@ 2024-09-03 17:45 ` Andrii Nakryiko
  2024-11-06 12:03   ` Breno Leitao
  2024-09-03 17:46 ` [PATCH v5 5/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:45 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, Andrii Nakryiko

uprobe->register_rwsem is one of a few big bottlenecks to scalability of
uprobes, so we need to get rid of it to improve uprobe performance and
multi-CPU scalability.

First, we turn uprobe's consumer list to a typical doubly-linked list
and utilize existing RCU-aware helpers for traversing such lists, as
well as adding and removing elements from it.

For entry uprobes we already have SRCU protection active since before
uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
won't go away from under us, but we add SRCU protection around consumer
list traversal.

Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple,
we remember whether any removal was requested during handler calls, but
then we double-check the decision under a proper register_rwsem using
consumers' filter callbacks. Handler removal is very rare, so this extra
lock won't hurt performance, overall, but we also avoid the need for any
extra protection (e.g., seqcount locks).

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h |  10 +++-
 kernel/events/uprobes.c | 104 +++++++++++++++++++++++-----------------
 2 files changed, 70 insertions(+), 44 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 9cf0dce62e4c..2785d1bedb74 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -29,13 +29,21 @@ struct page;
 #define MAX_URETPROBE_DEPTH		64
 
 struct uprobe_consumer {
+	/*
+	 * handler() can return UPROBE_HANDLER_REMOVE to signal the need to
+	 * unregister uprobe for current process. If UPROBE_HANDLER_REMOVE is
+	 * returned, filter() callback has to be implemented as well and it
+	 * should return false to "confirm" the decision to uninstall uprobe
+	 * for the current process. If filter() is omitted or returns true,
+	 * UPROBE_HANDLER_REMOVE is effectively ignored.
+	 */
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	int (*ret_handler)(struct uprobe_consumer *self,
 				unsigned long func,
 				struct pt_regs *regs);
 	bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
 
-	struct uprobe_consumer *next;
+	struct list_head cons_node;
 };
 
 #ifdef CONFIG_UPROBES
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8bdcdc6901b2..97e58d160647 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -59,7 +59,7 @@ struct uprobe {
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
-	struct uprobe_consumer	*consumers;
+	struct list_head	consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	struct rcu_head		rcu;
 	loff_t			offset;
@@ -783,6 +783,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->inode = inode;
 	uprobe->offset = offset;
 	uprobe->ref_ctr_offset = ref_ctr_offset;
+	INIT_LIST_HEAD(&uprobe->consumers);
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
 	RB_CLEAR_NODE(&uprobe->rb_node);
@@ -808,32 +809,19 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	down_write(&uprobe->consumer_rwsem);
-	uc->next = uprobe->consumers;
-	uprobe->consumers = uc;
+	list_add_rcu(&uc->cons_node, &uprobe->consumers);
 	up_write(&uprobe->consumer_rwsem);
 }
 
 /*
  * For uprobe @uprobe, delete the consumer @uc.
- * Return true if the @uc is deleted successfully
- * or return false.
+ * Should never be called with consumer that's not part of @uprobe->consumers.
  */
-static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
+static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
-	struct uprobe_consumer **con;
-	bool ret = false;
-
 	down_write(&uprobe->consumer_rwsem);
-	for (con = &uprobe->consumers; *con; con = &(*con)->next) {
-		if (*con == uc) {
-			*con = uc->next;
-			ret = true;
-			break;
-		}
-	}
+	list_del_rcu(&uc->cons_node);
 	up_write(&uprobe->consumer_rwsem);
-
-	return ret;
 }
 
 static int __copy_insn(struct address_space *mapping, struct file *filp,
@@ -929,7 +917,8 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
 	bool ret = false;
 
 	down_read(&uprobe->consumer_rwsem);
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+				 srcu_read_lock_held(&uprobes_srcu)) {
 		ret = consumer_filter(uc, mm);
 		if (ret)
 			break;
@@ -1125,18 +1114,29 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	int err;
 
 	down_write(&uprobe->register_rwsem);
-	if (WARN_ON(!consumer_del(uprobe, uc))) {
-		err = -ENOENT;
-	} else {
-		err = register_for_each_vma(uprobe, NULL);
-		/* TODO : cant unregister? schedule a worker thread */
-		if (unlikely(err))
-			uprobe_warn(current, "unregister, leaking uprobe");
-	}
+	consumer_del(uprobe, uc);
+	err = register_for_each_vma(uprobe, NULL);
 	up_write(&uprobe->register_rwsem);
 
-	if (!err)
-		put_uprobe(uprobe);
+	/* TODO : cant unregister? schedule a worker thread */
+	if (unlikely(err)) {
+		uprobe_warn(current, "unregister, leaking uprobe");
+		goto out_sync;
+	}
+
+	put_uprobe(uprobe);
+
+out_sync:
+	/*
+	 * Now that handler_chain() and handle_uretprobe_chain() iterate over
+	 * uprobe->consumers list under RCU protection without holding
+	 * uprobe->register_rwsem, we need to wait for RCU grace period to
+	 * make sure that we can't call into just unregistered
+	 * uprobe_consumer's callbacks anymore. If we don't do that, fast and
+	 * unlucky enough caller can free consumer's memory and cause
+	 * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
+	 */
+	synchronize_srcu(&uprobes_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
@@ -1214,13 +1214,20 @@ EXPORT_SYMBOL_GPL(uprobe_register);
 int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
 {
 	struct uprobe_consumer *con;
-	int ret = -ENOENT;
+	int ret = -ENOENT, srcu_idx;
 
 	down_write(&uprobe->register_rwsem);
-	for (con = uprobe->consumers; con && con != uc ; con = con->next)
-		;
-	if (con)
-		ret = register_for_each_vma(uprobe, add ? uc : NULL);
+
+	srcu_idx = srcu_read_lock(&uprobes_srcu);
+	list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
+				 srcu_read_lock_held(&uprobes_srcu)) {
+		if (con == uc) {
+			ret = register_for_each_vma(uprobe, add ? uc : NULL);
+			break;
+		}
+	}
+	srcu_read_unlock(&uprobes_srcu, srcu_idx);
+
 	up_write(&uprobe->register_rwsem);
 
 	return ret;
@@ -2085,10 +2092,12 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 	struct uprobe_consumer *uc;
 	int remove = UPROBE_HANDLER_REMOVE;
 	bool need_prep = false; /* prepare return uprobe, when needed */
+	bool has_consumers = false;
 
-	down_read(&uprobe->register_rwsem);
 	current->utask->auprobe = &uprobe->arch;
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+
+	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+				 srcu_read_lock_held(&uprobes_srcu)) {
 		int rc = 0;
 
 		if (uc->handler) {
@@ -2101,17 +2110,24 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 			need_prep = true;
 
 		remove &= rc;
+		has_consumers = true;
 	}
 	current->utask->auprobe = NULL;
 
 	if (need_prep && !remove)
 		prepare_uretprobe(uprobe, regs); /* put bp at return */
 
-	if (remove && uprobe->consumers) {
-		WARN_ON(!uprobe_is_active(uprobe));
-		unapply_uprobe(uprobe, current->mm);
+	if (remove && has_consumers) {
+		down_read(&uprobe->register_rwsem);
+
+		/* re-check that removal is still required, this time under lock */
+		if (!filter_chain(uprobe, current->mm)) {
+			WARN_ON(!uprobe_is_active(uprobe));
+			unapply_uprobe(uprobe, current->mm);
+		}
+
+		up_read(&uprobe->register_rwsem);
 	}
-	up_read(&uprobe->register_rwsem);
 }
 
 static void
@@ -2119,13 +2135,15 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
 {
 	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
+	int srcu_idx;
 
-	down_read(&uprobe->register_rwsem);
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+	srcu_idx = srcu_read_lock(&uprobes_srcu);
+	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+				 srcu_read_lock_held(&uprobes_srcu)) {
 		if (uc->ret_handler)
 			uc->ret_handler(uc, ri->func, regs);
 	}
-	up_read(&uprobe->register_rwsem);
+	srcu_read_unlock(&uprobes_srcu, srcu_idx);
 }
 
 static struct return_instance *find_next_ret_chain(struct return_instance *ri)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 5/8] perf/uprobe: split uprobe_unregister()
  2024-09-03 17:45 [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2024-09-03 17:45 ` [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
@ 2024-09-03 17:46 ` Andrii Nakryiko
  2024-09-03 17:46 ` [PATCH v5 6/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:46 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, Andrii Nakryiko

From: Peter Zijlstra <peterz@infradead.org>

With uprobe_unregister() having grown a synchronize_srcu(), it becomes
fairly slow to call. Esp. since both users of this API call it in a
loop.

Peel off the sync_srcu() and do it once, after the loop.

We also need to add uprobe_unregister_sync() into uprobe_register()'s
error handling path, as we need to be careful about returning to the
caller before we have a guarantee that partially attached consumer won't
be called anymore. This is an unlikely slow path and this should be
totally fine to be slow in the case of a failed attach.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h                       |  8 +++++--
 kernel/events/uprobes.c                       | 21 +++++++++++++------
 kernel/trace/bpf_trace.c                      |  5 ++++-
 kernel/trace/trace_uprobe.c                   |  6 +++++-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  3 ++-
 5 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 2785d1bedb74..968230d81798 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -116,7 +116,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
 extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc);
+extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
+extern void uprobe_unregister_sync(void);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -165,7 +166,10 @@ uprobe_apply(struct uprobe* uprobe, struct uprobe_consumer *uc, bool add)
 	return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
+uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
+{
+}
+static inline void uprobe_unregister_sync(void)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 97e58d160647..e9b755ddf960 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1105,11 +1105,11 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 }
 
 /**
- * uprobe_unregister - unregister an already registered probe.
+ * uprobe_unregister_nosync - unregister an already registered probe.
  * @uprobe: uprobe to remove
  * @uc: identify which probe if multiple probes are colocated.
  */
-void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
+void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	int err;
 
@@ -1121,12 +1121,15 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	/* TODO : cant unregister? schedule a worker thread */
 	if (unlikely(err)) {
 		uprobe_warn(current, "unregister, leaking uprobe");
-		goto out_sync;
+		return;
 	}
 
 	put_uprobe(uprobe);
+}
+EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
 
-out_sync:
+void uprobe_unregister_sync(void)
+{
 	/*
 	 * Now that handler_chain() and handle_uretprobe_chain() iterate over
 	 * uprobe->consumers list under RCU protection without holding
@@ -1138,7 +1141,7 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	 */
 	synchronize_srcu(&uprobes_srcu);
 }
-EXPORT_SYMBOL_GPL(uprobe_unregister);
+EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
 
 /**
  * uprobe_register - register a probe
@@ -1196,7 +1199,13 @@ struct uprobe *uprobe_register(struct inode *inode,
 	up_write(&uprobe->register_rwsem);
 
 	if (ret) {
-		uprobe_unregister(uprobe, uc);
+		uprobe_unregister_nosync(uprobe, uc);
+		/*
+		 * Registration might have partially succeeded, so we can have
+		 * this consumer being called right at this time. We need to
+		 * sync here. It's ok, it's unlikely slow path.
+		 */
+		uprobe_unregister_sync();
 		return ERR_PTR(ret);
 	}
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 73c570b5988b..6b632710c98e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3184,7 +3184,10 @@ static void bpf_uprobe_unregister(struct bpf_uprobe *uprobes, u32 cnt)
 	u32 i;
 
 	for (i = 0; i < cnt; i++)
-		uprobe_unregister(uprobes[i].uprobe, &uprobes[i].consumer);
+		uprobe_unregister_nosync(uprobes[i].uprobe, &uprobes[i].consumer);
+
+	if (cnt)
+		uprobe_unregister_sync();
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 7eb79e0a5352..f7443e996b1b 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1097,6 +1097,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
 static void __probe_event_disable(struct trace_probe *tp)
 {
 	struct trace_uprobe *tu;
+	bool sync = false;
 
 	tu = container_of(tp, struct trace_uprobe, tp);
 	WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
@@ -1105,9 +1106,12 @@ static void __probe_event_disable(struct trace_probe *tp)
 		if (!tu->uprobe)
 			continue;
 
-		uprobe_unregister(tu->uprobe, &tu->consumer);
+		uprobe_unregister_nosync(tu->uprobe, &tu->consumer);
+		sync = true;
 		tu->uprobe = NULL;
 	}
+	if (sync)
+		uprobe_unregister_sync();
 }
 
 static int probe_event_enable(struct trace_event_call *call,
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 3c0515a27842..1fc16657cf42 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -475,7 +475,8 @@ static void testmod_unregister_uprobe(void)
 	mutex_lock(&testmod_uprobe_mutex);
 
 	if (uprobe.uprobe) {
-		uprobe_unregister(uprobe.uprobe, &uprobe.consumer);
+		uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer);
+		uprobe_unregister_sync();
 		path_put(&uprobe.path);
 		uprobe.uprobe = NULL;
 	}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 6/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu()
  2024-09-03 17:45 [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
                   ` (4 preceding siblings ...)
  2024-09-03 17:46 ` [PATCH v5 5/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
@ 2024-09-03 17:46 ` Andrii Nakryiko
  2024-09-03 17:46 ` [PATCH v5 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
  2024-09-03 17:46 ` [PATCH v5 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
  7 siblings, 0 replies; 14+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:46 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, Andrii Nakryiko

From: Peter Zijlstra <peterz@infradead.org>

Much like latch_tree, add two RCU methods for the regular RB-tree,
which can be used in conjunction with a seqcount to provide lockless
lookups.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 include/linux/rbtree.h | 67 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index f7edca369eda..7c173aa64e1e 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -244,6 +244,42 @@ rb_find_add(struct rb_node *node, struct rb_root *tree,
 	return NULL;
 }
 
+/**
+ * rb_find_add_rcu() - find equivalent @node in @tree, or add @node
+ * @node: node to look-for / insert
+ * @tree: tree to search / modify
+ * @cmp: operator defining the node order
+ *
+ * Adds a Store-Release for link_node.
+ *
+ * Returns the rb_node matching @node, or NULL when no match is found and @node
+ * is inserted.
+ */
+static __always_inline struct rb_node *
+rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
+		int (*cmp)(struct rb_node *, const struct rb_node *))
+{
+	struct rb_node **link = &tree->rb_node;
+	struct rb_node *parent = NULL;
+	int c;
+
+	while (*link) {
+		parent = *link;
+		c = cmp(node, parent);
+
+		if (c < 0)
+			link = &parent->rb_left;
+		else if (c > 0)
+			link = &parent->rb_right;
+		else
+			return parent;
+	}
+
+	rb_link_node_rcu(node, parent, link);
+	rb_insert_color(node, tree);
+	return NULL;
+}
+
 /**
  * rb_find() - find @key in tree @tree
  * @key: key to match
@@ -272,6 +308,37 @@ rb_find(const void *key, const struct rb_root *tree,
 	return NULL;
 }
 
+/**
+ * rb_find_rcu() - find @key in tree @tree
+ * @key: key to match
+ * @tree: tree to search
+ * @cmp: operator defining the node order
+ *
+ * Notably, tree descent vs concurrent tree rotations is unsound and can result
+ * in false-negatives.
+ *
+ * Returns the rb_node matching @key or NULL.
+ */
+static __always_inline struct rb_node *
+rb_find_rcu(const void *key, const struct rb_root *tree,
+	    int (*cmp)(const void *key, const struct rb_node *))
+{
+	struct rb_node *node = tree->rb_node;
+
+	while (node) {
+		int c = cmp(key, node);
+
+		if (c < 0)
+			node = rcu_dereference_raw(node->rb_left);
+		else if (c > 0)
+			node = rcu_dereference_raw(node->rb_right);
+		else
+			return node;
+	}
+
+	return NULL;
+}
+
 /**
  * rb_find_first() - find the first @key in @tree
  * @key: key to match
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
  2024-09-03 17:45 [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
                   ` (5 preceding siblings ...)
  2024-09-03 17:46 ` [PATCH v5 6/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
@ 2024-09-03 17:46 ` Andrii Nakryiko
  2024-09-03 17:46 ` [PATCH v5 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
  7 siblings, 0 replies; 14+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:46 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, Andrii Nakryiko

Another big bottleneck to scalablity is uprobe_treelock that's taken in
a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
take advantage of that and make uprobes_tree RB-tree look up lockless.

To make RB-tree RCU-protected lockless lookup correct, we need to take
into account that such RB-tree lookup can return false negatives if there
are parallel RB-tree modifications (rotations) going on. We use seqcount
lock to detect whether RB-tree changed, and if we find nothing while
RB-tree got modified inbetween, we just retry. If uprobe was found, then
it's guaranteed to be a correct lookup.

With all the lock-avoiding changes done, we get a pretty decent
improvement in performance and scalability of uprobes with number of
CPUs, even though we are still nowhere near linear scalability. This is
due to SRCU not really scaling very well with number of CPUs on
a particular hardware that was used for testing (80-core Intel Xeon Gold
6138 CPU @ 2.00GHz), but also due to the remaning mmap_lock, which is
currently taken to resolve interrupt address to inode+offset and then
uprobe instance. And, of course, uretprobes still need similar RCU to
avoid refcount in the hot path, which will be addressed in the follow up
patches.

Nevertheless, the improvement is good. We used BPF selftest-based
uprobe-nop and uretprobe-nop benchmarks to get the below numbers,
varying number of CPUs on which uprobes and uretprobes are triggered.

BASELINE
========
uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  (  3.032M/s/cpu)
uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  (  1.726M/s/cpu)
uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  (  0.916M/s/cpu)
uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  (  0.465M/s/cpu)
uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  (  0.209M/s/cpu)
uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  (  0.071M/s/cpu)
uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  (  0.050M/s/cpu)

uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  (  1.979M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  (  1.180M/s/cpu)
uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  (  0.577M/s/cpu)
uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  (  0.282M/s/cpu)
uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  (  0.125M/s/cpu)
uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  (  0.051M/s/cpu)
uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  (  0.034M/s/cpu)

SRCU CHANGES
============
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

Peak througput increased from 3.7 mln/s (uprobe triggerings) up to about
8 mln/s. For uretprobes it's a bit more modest with bump from 2.4 mln/s
to 5mln/s.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e9b755ddf960..8a464cf38127 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
 DEFINE_STATIC_SRCU(uprobes_srcu);
 
@@ -634,8 +635,11 @@ static void put_uprobe(struct uprobe *uprobe)
 
 	write_lock(&uprobes_treelock);
 
-	if (uprobe_is_active(uprobe))
+	if (uprobe_is_active(uprobe)) {
+		write_seqcount_begin(&uprobes_seqcount);
 		rb_erase(&uprobe->rb_node, &uprobes_tree);
+		write_seqcount_end(&uprobes_seqcount);
+	}
 
 	write_unlock(&uprobes_treelock);
 
@@ -701,14 +705,26 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
 		.offset = offset,
 	};
 	struct rb_node *node;
+	unsigned int seq;
 
 	lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
 
-	read_lock(&uprobes_treelock);
-	node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
-	read_unlock(&uprobes_treelock);
+	do {
+		seq = read_seqcount_begin(&uprobes_seqcount);
+		node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
+		/*
+		 * Lockless RB-tree lookups can result only in false negatives.
+		 * If the element is found, it is correct and can be returned
+		 * under RCU protection. If we find nothing, we need to
+		 * validate that seqcount didn't change. If it did, we have to
+		 * try again as we might have missed the element (false
+		 * negative). If seqcount is unchanged, search truly failed.
+		 */
+		if (node)
+			return __node_2_uprobe(node);
+	} while (read_seqcount_retry(&uprobes_seqcount, seq));
 
-	return node ? __node_2_uprobe(node) : NULL;
+	return NULL;
 }
 
 /*
@@ -730,7 +746,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
 again:
-	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
+	node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
 	if (node) {
 		struct uprobe *u = __node_2_uprobe(node);
 
@@ -755,7 +771,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 	struct uprobe *u;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	u = __insert_uprobe(uprobe);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 
 	return u;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
  2024-09-03 17:45 [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
                   ` (6 preceding siblings ...)
  2024-09-03 17:46 ` [PATCH v5 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
@ 2024-09-03 17:46 ` Andrii Nakryiko
  7 siblings, 0 replies; 14+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:46 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
	surenb, akpm, linux-mm, Andrii Nakryiko

This patch switches uprobes SRCU usage to RCU Tasks Trace flavor, which
is optimized for more lightweight and quick readers (at the expense of
slower writers, which for uprobes is a fine tradeof) and has better
performance and scalability with number of CPUs.

Similarly to baseline vs SRCU, we've benchmarked SRCU-based
implementation vs RCU Tasks Trace implementation.

SRCU
====
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

RCU Tasks Trace
===============
uprobe-nop      ( 1 cpus):    3.396 ± 0.002M/s  (  3.396M/s/cpu)
uprobe-nop      ( 2 cpus):    4.271 ± 0.006M/s  (  2.135M/s/cpu)
uprobe-nop      ( 4 cpus):    8.499 ± 0.015M/s  (  2.125M/s/cpu)
uprobe-nop      ( 8 cpus):   10.355 ± 0.028M/s  (  1.294M/s/cpu)
uprobe-nop      (16 cpus):    7.615 ± 0.099M/s  (  0.476M/s/cpu)
uprobe-nop      (32 cpus):    4.430 ± 0.007M/s  (  0.138M/s/cpu)
uprobe-nop      (64 cpus):    6.887 ± 0.020M/s  (  0.108M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.174 ± 0.001M/s  (  2.174M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.853 ± 0.001M/s  (  1.426M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.913 ± 0.002M/s  (  1.228M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.883 ± 0.002M/s  (  0.735M/s/cpu)
uretprobe-nop   (16 cpus):    5.147 ± 0.001M/s  (  0.322M/s/cpu)
uretprobe-nop   (32 cpus):    3.738 ± 0.008M/s  (  0.117M/s/cpu)
uretprobe-nop   (64 cpus):    4.397 ± 0.002M/s  (  0.069M/s/cpu)

Peak throughput for uprobes increases from 8 mln/s to 10.3 mln/s
(+28%!), and for uretprobes from 5.3 mln/s to 5.8 mln/s (+11%), as we
have more work to do on uretprobes side.

Even single-thread (no contention) performance is slightly better: 3.276
mln/s to 3.396 mln/s (+3.5%) for uprobes, and 2.055 mln/s to 2.174 mln/s
(+5.8%) for uretprobes.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 38 ++++++++++++++++----------------------
 1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8a464cf38127..1b3990dd9c93 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -26,6 +26,7 @@
 #include <linux/task_work.h>
 #include <linux/shmem_fs.h>
 #include <linux/khugepaged.h>
+#include <linux/rcupdate_trace.h>
 
 #include <linux/uprobes.h>
 
@@ -42,8 +43,6 @@ static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
 static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
-DEFINE_STATIC_SRCU(uprobes_srcu);
-
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
 static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
@@ -652,7 +651,7 @@ static void put_uprobe(struct uprobe *uprobe)
 	delayed_uprobe_remove(uprobe, NULL);
 	mutex_unlock(&delayed_uprobe_lock);
 
-	call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
+	call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
 }
 
 static __always_inline
@@ -707,7 +706,7 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
 	struct rb_node *node;
 	unsigned int seq;
 
-	lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
+	lockdep_assert(rcu_read_lock_trace_held());
 
 	do {
 		seq = read_seqcount_begin(&uprobes_seqcount);
@@ -935,8 +934,7 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
 	bool ret = false;
 
 	down_read(&uprobe->consumer_rwsem);
-	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
-				 srcu_read_lock_held(&uprobes_srcu)) {
+	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		ret = consumer_filter(uc, mm);
 		if (ret)
 			break;
@@ -1157,7 +1155,7 @@ void uprobe_unregister_sync(void)
 	 * unlucky enough caller can free consumer's memory and cause
 	 * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
 	 */
-	synchronize_srcu(&uprobes_srcu);
+	synchronize_rcu_tasks_trace();
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
 
@@ -1241,19 +1239,18 @@ EXPORT_SYMBOL_GPL(uprobe_register);
 int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
 {
 	struct uprobe_consumer *con;
-	int ret = -ENOENT, srcu_idx;
+	int ret = -ENOENT;
 
 	down_write(&uprobe->register_rwsem);
 
-	srcu_idx = srcu_read_lock(&uprobes_srcu);
-	list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
-				 srcu_read_lock_held(&uprobes_srcu)) {
+	rcu_read_lock_trace();
+	list_for_each_entry_rcu(con, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		if (con == uc) {
 			ret = register_for_each_vma(uprobe, add ? uc : NULL);
 			break;
 		}
 	}
-	srcu_read_unlock(&uprobes_srcu, srcu_idx);
+	rcu_read_unlock_trace();
 
 	up_write(&uprobe->register_rwsem);
 
@@ -2123,8 +2120,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 
 	current->utask->auprobe = &uprobe->arch;
 
-	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
-				 srcu_read_lock_held(&uprobes_srcu)) {
+	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		int rc = 0;
 
 		if (uc->handler) {
@@ -2162,15 +2158,13 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
 {
 	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
-	int srcu_idx;
 
-	srcu_idx = srcu_read_lock(&uprobes_srcu);
-	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
-				 srcu_read_lock_held(&uprobes_srcu)) {
+	rcu_read_lock_trace();
+	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		if (uc->ret_handler)
 			uc->ret_handler(uc, ri->func, regs);
 	}
-	srcu_read_unlock(&uprobes_srcu, srcu_idx);
+	rcu_read_unlock_trace();
 }
 
 static struct return_instance *find_next_ret_chain(struct return_instance *ri)
@@ -2255,13 +2249,13 @@ static void handle_swbp(struct pt_regs *regs)
 {
 	struct uprobe *uprobe;
 	unsigned long bp_vaddr;
-	int is_swbp, srcu_idx;
+	int is_swbp;
 
 	bp_vaddr = uprobe_get_swbp_addr(regs);
 	if (bp_vaddr == uprobe_get_trampoline_vaddr())
 		return uprobe_handle_trampoline(regs);
 
-	srcu_idx = srcu_read_lock(&uprobes_srcu);
+	rcu_read_lock_trace();
 
 	uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
 	if (!uprobe) {
@@ -2319,7 +2313,7 @@ static void handle_swbp(struct pt_regs *regs)
 
 out:
 	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
-	srcu_read_unlock(&uprobes_srcu, srcu_idx);
+	rcu_read_unlock_trace();
 }
 
 /*
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
  2024-09-03 17:45 ` [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
@ 2024-11-06 12:03   ` Breno Leitao
  2024-11-06 16:25     ` Andrii Nakryiko
  0 siblings, 1 reply; 14+ messages in thread
From: Breno Leitao @ 2024-11-06 12:03 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
	linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm

Hello Andrii,

On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> uprobes, so we need to get rid of it to improve uprobe performance and
> multi-CPU scalability.
> 
> First, we turn uprobe's consumer list to a typical doubly-linked list
> and utilize existing RCU-aware helpers for traversing such lists, as
> well as adding and removing elements from it.
> 
> For entry uprobes we already have SRCU protection active since before
> uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> won't go away from under us, but we add SRCU protection around consumer
> list traversal.

I am seeing the following message in a kernel with RCU_PROVE_LOCKING:

	kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!

It seems the SRCU is not held, when coming from mmap_region ->
uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
not decoding it, but, the stack trace is clear enough).

         WARNING: suspicious RCU usage
           6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
           -----------------------------
           kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
           3 locks held by env/441330:
            #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
            #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
            #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8

stack backtrace:
           CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
           Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
           Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
           Call trace:
            dump_backtrace+0x10c/0x198
            show_stack+0x24/0x38
            __dump_stack+0x28/0x38
            dump_stack_lvl+0x74/0xa8
            dump_stack+0x18/0x28
            lockdep_rcu_suspicious+0x178/0x2c8
            filter_chain+0xdc/0xe8
            uprobe_mmap+0x2e0/0x548
            mmap_region+0x510/0x988
            do_mmap+0x444/0x528
            vm_mmap_pgoff+0xf8/0x1d0
            ksys_mmap_pgoff+0x184/0x2d8


That said, it seems we want to hold the SRCU, before reaching the
filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
solves the problem, but, I might be missing something, since I am not familiar
with this code.

How does the following patch look like?

commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
Author: Breno Leitao <leitao@debian.org>
Date:   Mon Nov 4 06:53:31 2024 -0800

    uprobes: Get SRCU lock before traverseing the list

    list_for_each_entry_srcu() is being called without holding the lock,
    which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
    as:

            kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!

    Get the SRCU uprobes_srcu lock before calling filter_chain(), which
    needs to have the SRCU lock hold, since it is going to call
    list_for_each_entry_srcu().

    Signed-off-by: Breno Leitao <leitao@debian.org>
    Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
 	struct list_head tmp_list;
 	struct uprobe *uprobe, *u;
 	struct inode *inode;
+	int srcu_idx;

 	if (no_uprobe_events())
 		return 0;
@@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)

 	mutex_lock(uprobes_mmap_hash(inode));
 	build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
+	srcu_idx = srcu_read_lock(&uprobes_srcu);
 	/*
 	 * We can race with uprobe_unregister(), this uprobe can be already
 	 * removed. But in this case filter_chain() must return false, all
@@ -1422,6 +1424,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
 		}
 		put_uprobe(uprobe);
 	}
+	srcu_read_unlock(&uprobes_srcu, srcu_idx);
 	mutex_unlock(uprobes_mmap_hash(inode));

 	return 0;


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
  2024-11-06 12:03   ` Breno Leitao
@ 2024-11-06 16:25     ` Andrii Nakryiko
  2024-11-07 11:35       ` Breno Leitao
  0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2024-11-06 16:25 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
	linux-mm

On Wed, Nov 6, 2024 at 4:03 AM Breno Leitao <leitao@debian.org> wrote:
>
> Hello Andrii,
>
> On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > uprobes, so we need to get rid of it to improve uprobe performance and
> > multi-CPU scalability.
> >
> > First, we turn uprobe's consumer list to a typical doubly-linked list
> > and utilize existing RCU-aware helpers for traversing such lists, as
> > well as adding and removing elements from it.
> >
> > For entry uprobes we already have SRCU protection active since before
> > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > won't go away from under us, but we add SRCU protection around consumer
> > list traversal.
>
> I am seeing the following message in a kernel with RCU_PROVE_LOCKING:
>
>         kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
>
> It seems the SRCU is not held, when coming from mmap_region ->
> uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
> not decoding it, but, the stack trace is clear enough).
>
>          WARNING: suspicious RCU usage
>            6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
>            -----------------------------
>            kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!
>
> other info that might help us debug this:
>
> rcu_scheduler_active = 2, debug_locks = 1
>            3 locks held by env/441330:
>             #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
>             #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
>             #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8
>
> stack backtrace:
>            CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
>            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
>            Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
>            Call trace:
>             dump_backtrace+0x10c/0x198
>             show_stack+0x24/0x38
>             __dump_stack+0x28/0x38
>             dump_stack_lvl+0x74/0xa8
>             dump_stack+0x18/0x28
>             lockdep_rcu_suspicious+0x178/0x2c8
>             filter_chain+0xdc/0xe8
>             uprobe_mmap+0x2e0/0x548
>             mmap_region+0x510/0x988
>             do_mmap+0x444/0x528
>             vm_mmap_pgoff+0xf8/0x1d0
>             ksys_mmap_pgoff+0x184/0x2d8
>
>
> That said, it seems we want to hold the SRCU, before reaching the
> filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
> solves the problem, but, I might be missing something, since I am not familiar
> with this code.
>
> How does the following patch look like?
>
> commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
> Author: Breno Leitao <leitao@debian.org>
> Date:   Mon Nov 4 06:53:31 2024 -0800
>
>     uprobes: Get SRCU lock before traverseing the list
>
>     list_for_each_entry_srcu() is being called without holding the lock,
>     which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
>     as:
>
>             kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
>
>     Get the SRCU uprobes_srcu lock before calling filter_chain(), which
>     needs to have the SRCU lock hold, since it is going to call
>     list_for_each_entry_srcu().
>
>     Signed-off-by: Breno Leitao <leitao@debian.org>
>     Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
>         struct list_head tmp_list;
>         struct uprobe *uprobe, *u;
>         struct inode *inode;
> +       int srcu_idx;
>
>         if (no_uprobe_events())
>                 return 0;
> @@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
>
>         mutex_lock(uprobes_mmap_hash(inode));
>         build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
> +       srcu_idx = srcu_read_lock(&uprobes_srcu);

Hey Breno,

Thanks for catching that (production testing FTW, right?!).

But I think you a) adding wrong RCU protection flavor (it has to be
rcu_read_lock_trace()/rcu_read_unlock_trace(), see uprobe_apply() for
an example) and b) I think this is the wrong place to add it. We
should add it inside filter_chain(). filter_chain() is called from
three places, only one of which is already RCU protected (that's the
handler_chain() case). But there is also register_for_each_vma(),
which needs RCU protection as well.

So can you resend the patch as a stand-alone patch, switch to RCU
Tasks Trace flavor, and add the protection inside filter_chain()?
Thank you!

P.S. pending_list traversal that you (accidentally) protect as well in
your patch doesn't need RCU protection, so there is no problem with
moving into filter_chain() for RCU stuff.

>         /*
>          * We can race with uprobe_unregister(), this uprobe can be already
>          * removed. But in this case filter_chain() must return false, all
> @@ -1422,6 +1424,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
>                 }
>                 put_uprobe(uprobe);
>         }
> +       srcu_read_unlock(&uprobes_srcu, srcu_idx);
>         mutex_unlock(uprobes_mmap_hash(inode));
>
>         return 0;
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
  2024-11-06 16:25     ` Andrii Nakryiko
@ 2024-11-07 11:35       ` Breno Leitao
  2024-11-07 16:01         ` Andrii Nakryiko
  0 siblings, 1 reply; 14+ messages in thread
From: Breno Leitao @ 2024-11-07 11:35 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
	linux-mm

Hello Andrii,

On Wed, Nov 06, 2024 at 08:25:25AM -0800, Andrii Nakryiko wrote:
> On Wed, Nov 6, 2024 at 4:03 AM Breno Leitao <leitao@debian.org> wrote:
> > On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> > > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > > uprobes, so we need to get rid of it to improve uprobe performance and
> > > multi-CPU scalability.
> > >
> > > First, we turn uprobe's consumer list to a typical doubly-linked list
> > > and utilize existing RCU-aware helpers for traversing such lists, as
> > > well as adding and removing elements from it.
> > >
> > > For entry uprobes we already have SRCU protection active since before
> > > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > > won't go away from under us, but we add SRCU protection around consumer
> > > list traversal.
> >
> > I am seeing the following message in a kernel with RCU_PROVE_LOCKING:
> >
> >         kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> >
> > It seems the SRCU is not held, when coming from mmap_region ->
> > uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
> > not decoding it, but, the stack trace is clear enough).
> >
> >          WARNING: suspicious RCU usage
> >            6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
> >            -----------------------------
> >            kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!
> >
> > other info that might help us debug this:
> >
> > rcu_scheduler_active = 2, debug_locks = 1
> >            3 locks held by env/441330:
> >             #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
> >             #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
> >             #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8
> >
> > stack backtrace:
> >            CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
> >            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> >            Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
> >            Call trace:
> >             dump_backtrace+0x10c/0x198
> >             show_stack+0x24/0x38
> >             __dump_stack+0x28/0x38
> >             dump_stack_lvl+0x74/0xa8
> >             dump_stack+0x18/0x28
> >             lockdep_rcu_suspicious+0x178/0x2c8
> >             filter_chain+0xdc/0xe8
> >             uprobe_mmap+0x2e0/0x548
> >             mmap_region+0x510/0x988
> >             do_mmap+0x444/0x528
> >             vm_mmap_pgoff+0xf8/0x1d0
> >             ksys_mmap_pgoff+0x184/0x2d8
> >
> >
> > That said, it seems we want to hold the SRCU, before reaching the
> > filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
> > solves the problem, but, I might be missing something, since I am not familiar
> > with this code.
> >
> > How does the following patch look like?
> >
> > commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
> > Author: Breno Leitao <leitao@debian.org>
> > Date:   Mon Nov 4 06:53:31 2024 -0800
> >
> >     uprobes: Get SRCU lock before traverseing the list
> >
> >     list_for_each_entry_srcu() is being called without holding the lock,
> >     which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
> >     as:
> >
> >             kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> >
> >     Get the SRCU uprobes_srcu lock before calling filter_chain(), which
> >     needs to have the SRCU lock hold, since it is going to call
> >     list_for_each_entry_srcu().
> >
> >     Signed-off-by: Breno Leitao <leitao@debian.org>
> >     Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> >         struct list_head tmp_list;
> >         struct uprobe *uprobe, *u;
> >         struct inode *inode;
> > +       int srcu_idx;
> >
> >         if (no_uprobe_events())
> >                 return 0;
> > @@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> >
> >         mutex_lock(uprobes_mmap_hash(inode));
> >         build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
> > +       srcu_idx = srcu_read_lock(&uprobes_srcu);
> 
> Thanks for catching that (production testing FTW, right?!).

Correct. I am running some hosts with RCU_PROVING and I am finding some
cases where RCU protected areas are touched without holding the RCU read
lock.

> But I think you a) adding wrong RCU protection flavor (it has to be
> rcu_read_lock_trace()/rcu_read_unlock_trace(), see uprobe_apply() for
> an example) and b) I think this is the wrong place to add it. We
> should add it inside filter_chain(). filter_chain() is called from
> three places, only one of which is already RCU protected (that's the
> handler_chain() case). But there is also register_for_each_vma(),
> which needs RCU protection as well.

Thanks for the guidance!

My initial plan was to protect filter_chain(), but, handler_chain()
already has the lock. Is it OK to get into a critical section in a
nested form?

The code will be something like:

handle_swbp() {
	rcu_read_lock_trace();
	handler_chain() {
		filter_chain() {
			rcu_read_lock_trace();
			list_for_each_entry_rcu()
			rcu_read_lock_trace();
		}
	}
	rcu_read_lock_trace();
}

Is this nested locking fine?

Thanks
--breno

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
  2024-11-07 11:35       ` Breno Leitao
@ 2024-11-07 16:01         ` Andrii Nakryiko
  2024-11-07 16:13           ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2024-11-07 16:01 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
	mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
	linux-mm

On Thu, Nov 7, 2024 at 3:35 AM Breno Leitao <leitao@debian.org> wrote:
>
> Hello Andrii,
>
> On Wed, Nov 06, 2024 at 08:25:25AM -0800, Andrii Nakryiko wrote:
> > On Wed, Nov 6, 2024 at 4:03 AM Breno Leitao <leitao@debian.org> wrote:
> > > On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> > > > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > > > uprobes, so we need to get rid of it to improve uprobe performance and
> > > > multi-CPU scalability.
> > > >
> > > > First, we turn uprobe's consumer list to a typical doubly-linked list
> > > > and utilize existing RCU-aware helpers for traversing such lists, as
> > > > well as adding and removing elements from it.
> > > >
> > > > For entry uprobes we already have SRCU protection active since before
> > > > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > > > won't go away from under us, but we add SRCU protection around consumer
> > > > list traversal.
> > >
> > > I am seeing the following message in a kernel with RCU_PROVE_LOCKING:
> > >
> > >         kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> > >
> > > It seems the SRCU is not held, when coming from mmap_region ->
> > > uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
> > > not decoding it, but, the stack trace is clear enough).
> > >
> > >          WARNING: suspicious RCU usage
> > >            6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
> > >            -----------------------------
> > >            kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!
> > >
> > > other info that might help us debug this:
> > >
> > > rcu_scheduler_active = 2, debug_locks = 1
> > >            3 locks held by env/441330:
> > >             #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
> > >             #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
> > >             #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8
> > >
> > > stack backtrace:
> > >            CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
> > >            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> > >            Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
> > >            Call trace:
> > >             dump_backtrace+0x10c/0x198
> > >             show_stack+0x24/0x38
> > >             __dump_stack+0x28/0x38
> > >             dump_stack_lvl+0x74/0xa8
> > >             dump_stack+0x18/0x28
> > >             lockdep_rcu_suspicious+0x178/0x2c8
> > >             filter_chain+0xdc/0xe8
> > >             uprobe_mmap+0x2e0/0x548
> > >             mmap_region+0x510/0x988
> > >             do_mmap+0x444/0x528
> > >             vm_mmap_pgoff+0xf8/0x1d0
> > >             ksys_mmap_pgoff+0x184/0x2d8
> > >
> > >
> > > That said, it seems we want to hold the SRCU, before reaching the
> > > filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
> > > solves the problem, but, I might be missing something, since I am not familiar
> > > with this code.
> > >
> > > How does the following patch look like?
> > >
> > > commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
> > > Author: Breno Leitao <leitao@debian.org>
> > > Date:   Mon Nov 4 06:53:31 2024 -0800
> > >
> > >     uprobes: Get SRCU lock before traverseing the list
> > >
> > >     list_for_each_entry_srcu() is being called without holding the lock,
> > >     which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
> > >     as:
> > >
> > >             kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> > >
> > >     Get the SRCU uprobes_srcu lock before calling filter_chain(), which
> > >     needs to have the SRCU lock hold, since it is going to call
> > >     list_for_each_entry_srcu().
> > >
> > >     Signed-off-by: Breno Leitao <leitao@debian.org>
> > >     Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> > >         struct list_head tmp_list;
> > >         struct uprobe *uprobe, *u;
> > >         struct inode *inode;
> > > +       int srcu_idx;
> > >
> > >         if (no_uprobe_events())
> > >                 return 0;
> > > @@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> > >
> > >         mutex_lock(uprobes_mmap_hash(inode));
> > >         build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
> > > +       srcu_idx = srcu_read_lock(&uprobes_srcu);
> >
> > Thanks for catching that (production testing FTW, right?!).
>
> Correct. I am running some hosts with RCU_PROVING and I am finding some
> cases where RCU protected areas are touched without holding the RCU read
> lock.
>
> > But I think you a) adding wrong RCU protection flavor (it has to be
> > rcu_read_lock_trace()/rcu_read_unlock_trace(), see uprobe_apply() for
> > an example) and b) I think this is the wrong place to add it. We
> > should add it inside filter_chain(). filter_chain() is called from
> > three places, only one of which is already RCU protected (that's the
> > handler_chain() case). But there is also register_for_each_vma(),
> > which needs RCU protection as well.
>
> Thanks for the guidance!
>
> My initial plan was to protect filter_chain(), but, handler_chain()
> already has the lock. Is it OK to get into a critical section in a
> nested form?
>
> The code will be something like:
>
> handle_swbp() {
>         rcu_read_lock_trace();
>         handler_chain() {
>                 filter_chain() {
>                         rcu_read_lock_trace();
>                         list_for_each_entry_rcu()
>                         rcu_read_lock_trace();
>                 }
>         }
>         rcu_read_lock_trace();
> }
>
> Is this nested locking fine?
>

Yes, it's totally fine to nest RCU lock regions.

> Thanks
> --breno

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
  2024-11-07 16:01         ` Andrii Nakryiko
@ 2024-11-07 16:13           ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2024-11-07 16:13 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Breno Leitao, Andrii Nakryiko, linux-trace-kernel, peterz, oleg,
	rostedt, mhiramat, bpf, linux-kernel, jolsa, willy, surenb, akpm,
	linux-mm

oN tHU, nov 07, 2024 at 08:01:05AM -0800, Andrii Nakryiko wrote:
> On Thu, Nov 7, 2024 at 3:35 AM Breno Leitao <leitao@debian.org> wrote:
> >
> > Hello Andrii,
> >
> > On Wed, Nov 06, 2024 at 08:25:25AM -0800, Andrii Nakryiko wrote:
> > > On Wed, Nov 6, 2024 at 4:03 AM Breno Leitao <leitao@debian.org> wrote:
> > > > On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> > > > > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > > > > uprobes, so we need to get rid of it to improve uprobe performance and
> > > > > multi-CPU scalability.
> > > > >
> > > > > First, we turn uprobe's consumer list to a typical doubly-linked list
> > > > > and utilize existing RCU-aware helpers for traversing such lists, as
> > > > > well as adding and removing elements from it.
> > > > >
> > > > > For entry uprobes we already have SRCU protection active since before
> > > > > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > > > > won't go away from under us, but we add SRCU protection around consumer
> > > > > list traversal.
> > > >
> > > > I am seeing the following message in a kernel with RCU_PROVE_LOCKING:
> > > >
> > > >         kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> > > >
> > > > It seems the SRCU is not held, when coming from mmap_region ->
> > > > uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
> > > > not decoding it, but, the stack trace is clear enough).
> > > >
> > > >          WARNING: suspicious RCU usage
> > > >            6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
> > > >            -----------------------------
> > > >            kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!
> > > >
> > > > other info that might help us debug this:
> > > >
> > > > rcu_scheduler_active = 2, debug_locks = 1
> > > >            3 locks held by env/441330:
> > > >             #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
> > > >             #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
> > > >             #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8
> > > >
> > > > stack backtrace:
> > > >            CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
> > > >            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> > > >            Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
> > > >            Call trace:
> > > >             dump_backtrace+0x10c/0x198
> > > >             show_stack+0x24/0x38
> > > >             __dump_stack+0x28/0x38
> > > >             dump_stack_lvl+0x74/0xa8
> > > >             dump_stack+0x18/0x28
> > > >             lockdep_rcu_suspicious+0x178/0x2c8
> > > >             filter_chain+0xdc/0xe8
> > > >             uprobe_mmap+0x2e0/0x548
> > > >             mmap_region+0x510/0x988
> > > >             do_mmap+0x444/0x528
> > > >             vm_mmap_pgoff+0xf8/0x1d0
> > > >             ksys_mmap_pgoff+0x184/0x2d8
> > > >
> > > >
> > > > That said, it seems we want to hold the SRCU, before reaching the
> > > > filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
> > > > solves the problem, but, I might be missing something, since I am not familiar
> > > > with this code.
> > > >
> > > > How does the following patch look like?
> > > >
> > > > commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
> > > > Author: Breno Leitao <leitao@debian.org>
> > > > Date:   Mon Nov 4 06:53:31 2024 -0800
> > > >
> > > >     uprobes: Get SRCU lock before traverseing the list
> > > >
> > > >     list_for_each_entry_srcu() is being called without holding the lock,
> > > >     which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
> > > >     as:
> > > >
> > > >             kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> > > >
> > > >     Get the SRCU uprobes_srcu lock before calling filter_chain(), which
> > > >     needs to have the SRCU lock hold, since it is going to call
> > > >     list_for_each_entry_srcu().
> > > >
> > > >     Signed-off-by: Breno Leitao <leitao@debian.org>
> > > >     Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")
> > > >
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> > > >         struct list_head tmp_list;
> > > >         struct uprobe *uprobe, *u;
> > > >         struct inode *inode;
> > > > +       int srcu_idx;
> > > >
> > > >         if (no_uprobe_events())
> > > >                 return 0;
> > > > @@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> > > >
> > > >         mutex_lock(uprobes_mmap_hash(inode));
> > > >         build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
> > > > +       srcu_idx = srcu_read_lock(&uprobes_srcu);
> > >
> > > Thanks for catching that (production testing FTW, right?!).
> >
> > Correct. I am running some hosts with RCU_PROVING and I am finding some
> > cases where RCU protected areas are touched without holding the RCU read
> > lock.
> >
> > > But I think you a) adding wrong RCU protection flavor (it has to be
> > > rcu_read_lock_trace()/rcu_read_unlock_trace(), see uprobe_apply() for
> > > an example) and b) I think this is the wrong place to add it. We
> > > should add it inside filter_chain(). filter_chain() is called from
> > > three places, only one of which is already RCU protected (that's the
> > > handler_chain() case). But there is also register_for_each_vma(),
> > > which needs RCU protection as well.
> >
> > Thanks for the guidance!
> >
> > My initial plan was to protect filter_chain(), but, handler_chain()
> > already has the lock. Is it OK to get into a critical section in a
> > nested form?
> >
> > The code will be something like:
> >
> > handle_swbp() {
> >         rcu_read_lock_trace();
> >         handler_chain() {
> >                 filter_chain() {
> >                         rcu_read_lock_trace();
> >                         list_for_each_entry_rcu()
> >                         rcu_read_lock_trace();
> >                 }
> >         }
> >         rcu_read_lock_trace();
> > }
> >
> > Is this nested locking fine?
> 
> Yes, it's totally fine to nest RCU lock regions.

As long as you don't nest them more than 255 deep in CONFIG_PREEMPT=n
kernels that also have CONFIG_PREEMPT_COUNT=y, or more than 2G deep in
CONFIG_PREEMPT=y kernels.  For a limited time only, in CONFIG_PREEMPT=n
kernels that also have CONFIG_PREEMPT_COUNT=n, you can nest as deeply
as you want.  ;-)

Sorry, couldn't resist...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-11-07 16:13 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-03 17:45 [PATCH v5 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
2024-09-03 17:45 ` [PATCH v5 1/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
2024-09-03 17:45 ` [PATCH v5 2/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
2024-09-03 17:45 ` [PATCH v5 3/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
2024-09-03 17:45 ` [PATCH v5 4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
2024-11-06 12:03   ` Breno Leitao
2024-11-06 16:25     ` Andrii Nakryiko
2024-11-07 11:35       ` Breno Leitao
2024-11-07 16:01         ` Andrii Nakryiko
2024-11-07 16:13           ` Paul E. McKenney
2024-09-03 17:46 ` [PATCH v5 5/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
2024-09-03 17:46 ` [PATCH v5 6/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
2024-09-03 17:46 ` [PATCH v5 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
2024-09-03 17:46 ` [PATCH v5 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).