* [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations
@ 2024-08-29 18:37 Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 1/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
` (8 more replies)
0 siblings, 9 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 18:37 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, Andrii Nakryiko
This patch set is heavily inspired by Peter Zijlstra's uprobe optimization
patches ([0]) and continues that work, albeit trying to keep complexity to the
minimum, and attempting to reuse existing primitives as much as possible. The
goal here is to optimize the common uprobe triggering hot path, while keeping
the rest of the locking mostly intact.
I've added uprobe_unregister_sync() into the error handling code path inside
uprobe_register(). This is due to recent refactorings from Oleg Nesterov
([1]), which necessitate this addition.
Except for the refcounting change patch (which I strongly believe is a good
improvement we should do, and forget about the quasi-refcounting scheme of
the uprobe->consumers list), the rest of the changes are similar to Peter's
initial changes in [0].
Main differences would be:
- no special RCU protection for mmap and fork handling, we just stick to
refcounts there, as those are infrequent and not a performance-sensitive
code path, while being complex and thus benefiting from proper locking;
- the above means we don't need to do any custom SRCU additions to handle
forking code path;
- I handled the UPROBE_HANDLER_REMOVE problem in handler_chain() differently,
again leveraging the existing locking scheme;
- I kept refcount usage for uretprobe and single-stepping uprobes; I plan to
address that in separate follow-up patches. The plan is to avoid
task_work with a lockless atomic xchg()/cmpxchg()-based solution;
- finally, I dutifully used SRCU throughout all the changes, and only the
last patch switches SRCU to RCU Tasks Trace and demonstrates significant
performance and scalability gains from this.
The changes in this patch set were tested using BPF selftests and using
uprobe-stress ([2]) tool.
Now, for the benchmarking results. I've used the following script (which
utilizes the BPF selftests-based bench tool). The CPU used was an 80-core
Intel Xeon Gold 6138 CPU @ 2.00GHz running a kernel with a production-like
config. I minimized background noise by stopping any service I could identify
and stop, so results are pretty stable and variability is pretty small, overall.
Benchmark script:
#!/bin/bash
set -eufo pipefail
for i in uprobe-nop uretprobe-nop; do
for p in 1 2 4 8 16 32 64; do
summary=$(sudo ./bench -w3 -d5 -p$p -a trig-$i | tail -n1)
total=$(echo "$summary" | cut -d'(' -f1 | cut -d' ' -f3-)
percpu=$(echo "$summary" | cut -d'(' -f2 | cut -d')' -f1 | cut -d'/' -f1)
printf "%-15s (%2d cpus): %s (%s/s/cpu)\n" $i $p "$total" "$percpu"
done
echo
done
With all the lock-avoiding changes done in this patch set, we get a pretty
decent improvement in performance and scalability of uprobes with the number
of CPUs, even though we are still nowhere near linear scalability. This is due
to the remaining mmap_lock, which is currently taken to resolve the
interrupted address to inode+offset and then to a uprobe instance. And, of
course, uretprobes still need similar RCU to avoid refcounting in the hot
path, which will be addressed in follow-up patches.
BASELINE (on top of Oleg's clean-up patches)
============================================
uprobe-nop ( 1 cpus): 3.032 ± 0.023M/s ( 3.032M/s/cpu)
uprobe-nop ( 2 cpus): 3.452 ± 0.005M/s ( 1.726M/s/cpu)
uprobe-nop ( 4 cpus): 3.663 ± 0.005M/s ( 0.916M/s/cpu)
uprobe-nop ( 8 cpus): 3.718 ± 0.038M/s ( 0.465M/s/cpu)
uprobe-nop (16 cpus): 3.344 ± 0.008M/s ( 0.209M/s/cpu)
uprobe-nop (32 cpus): 2.288 ± 0.021M/s ( 0.071M/s/cpu)
uprobe-nop (64 cpus): 3.205 ± 0.004M/s ( 0.050M/s/cpu)
uretprobe-nop ( 1 cpus): 1.979 ± 0.005M/s ( 1.979M/s/cpu)
uretprobe-nop ( 2 cpus): 2.361 ± 0.005M/s ( 1.180M/s/cpu)
uretprobe-nop ( 4 cpus): 2.309 ± 0.002M/s ( 0.577M/s/cpu)
uretprobe-nop ( 8 cpus): 2.253 ± 0.001M/s ( 0.282M/s/cpu)
uretprobe-nop (16 cpus): 2.007 ± 0.000M/s ( 0.125M/s/cpu)
uretprobe-nop (32 cpus): 1.624 ± 0.003M/s ( 0.051M/s/cpu)
uretprobe-nop (64 cpus): 2.149 ± 0.001M/s ( 0.034M/s/cpu)
Up to second-to-last patch (i.e., SRCU-based optimizations)
===========================================================
uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu)
uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu)
uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu)
uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu)
uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu)
uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu)
uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu)
uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu)
uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu)
uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu)
uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu)
uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu)
uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu)
uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu)
RCU Tasks Trace
===============
uprobe-nop ( 1 cpus): 3.396 ± 0.002M/s ( 3.396M/s/cpu)
uprobe-nop ( 2 cpus): 4.271 ± 0.006M/s ( 2.135M/s/cpu)
uprobe-nop ( 4 cpus): 8.499 ± 0.015M/s ( 2.125M/s/cpu)
uprobe-nop ( 8 cpus): 10.355 ± 0.028M/s ( 1.294M/s/cpu)
uprobe-nop (16 cpus): 7.615 ± 0.099M/s ( 0.476M/s/cpu)
uprobe-nop (32 cpus): 4.430 ± 0.007M/s ( 0.138M/s/cpu)
uprobe-nop (64 cpus): 6.887 ± 0.020M/s ( 0.108M/s/cpu)
uretprobe-nop ( 1 cpus): 2.174 ± 0.001M/s ( 2.174M/s/cpu)
uretprobe-nop ( 2 cpus): 2.853 ± 0.001M/s ( 1.426M/s/cpu)
uretprobe-nop ( 4 cpus): 4.913 ± 0.002M/s ( 1.228M/s/cpu)
uretprobe-nop ( 8 cpus): 5.883 ± 0.002M/s ( 0.735M/s/cpu)
uretprobe-nop (16 cpus): 5.147 ± 0.001M/s ( 0.322M/s/cpu)
uretprobe-nop (32 cpus): 3.738 ± 0.008M/s ( 0.117M/s/cpu)
uretprobe-nop (64 cpus): 4.397 ± 0.002M/s ( 0.069M/s/cpu)
For baseline vs SRCU, peak throughput increased from 3.7 M/s (million uprobe
triggerings per second) up to about 8 M/s. For uretprobes it's a bit more
modest, with a bump from 2.4 M/s to 5 M/s.
For SRCU vs RCU Tasks Trace, peak throughput for uprobes increases further from
8 M/s to 10.3 M/s (+28%!), and for uretprobes from 5.3 M/s to 5.8 M/s (+11%),
as we have more work to do on the uretprobe side.
Even single-thread (no contention) performance is slightly better: 3.276 M/s to
3.396 M/s (+3.5%) for uprobes, and 2.055 M/s to 2.174 M/s (+5.8%)
for uretprobes.
[0] https://lore.kernel.org/linux-trace-kernel/20240711110235.098009979@infradead.org/
[1] https://lore.kernel.org/linux-trace-kernel/20240729134444.GA12293@redhat.com/
[2] https://github.com/libbpf/libbpf-bootstrap/tree/uprobe-stress
v3->v4:
- added back consumer_rwsem into consumer_del(), which was accidentally
omitted earlier (Jiri);
- left out RFC patches, we can continue discussion on v3 patch set, if
necessary;
v2->v3:
- undid rcu and rb_node fields colocation, which was causing crashes (Oleg);
- ensure synchronize_srcu() on registration failure in patch #4 (Oleg);
v1->v2:
- added back missed kfree() in patch #1 (Oleg);
- forgot the rest, but there were a few small things here and there.
Andrii Nakryiko (6):
uprobes: revamp uprobe refcounting and lifetime management
uprobes: protected uprobe lifetime with SRCU
uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
uprobes: traverse uprobe's consumer list locklessly under SRCU
protection
uprobes: perform lockless SRCU-protected uprobes_tree lookup
uprobes: switch to RCU Tasks Trace flavor for better performance
Peter Zijlstra (2):
perf/uprobe: split uprobe_unregister()
rbtree: provide rb_find_rcu() / rb_find_add_rcu()
include/linux/rbtree.h | 67 +++
include/linux/uprobes.h | 20 +-
kernel/events/uprobes.c | 391 +++++++++++-------
kernel/trace/bpf_trace.c | 8 +-
kernel/trace/trace_uprobe.c | 15 +-
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 3 +-
6 files changed, 321 insertions(+), 183 deletions(-)
--
2.43.5
* [PATCH v4 1/8] uprobes: revamp uprobe refcounting and lifetime management
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
@ 2024-08-29 18:37 ` Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 2/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
` (7 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 18:37 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, Andrii Nakryiko
Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.
Right now, there are a few possible "owners" of uprobe refcount:
- uprobes_tree RB tree assumes one refcount when uprobe is registered
and added to the lookup tree;
- while uprobe is triggered and kernel is handling it in the breakpoint
handler code, temporary refcount bump is done to keep uprobe from
being freed;
- if we have uretprobe requested on a given struct uprobe instance, we
take another refcount to keep uprobe alive until user space code
returns from the function and triggers return handler.
The uprobes_tree's extra refcount of 1 is confusing and problematic. No
matter how many actual consumers are attached, they all share the same
refcount, and we have extra logic to drop the "last" (which might not
really be last) refcount once uprobe's consumer list becomes empty.
This is unconventional and has to be kept in mind as a special case all
the time. Further, because of this design we have situations where
find_uprobe() will find a uprobe, bump its refcount, and return it to the
caller, but that uprobe will still need an uprobe_is_active() check; if that
check fails, the caller is required to drop the refcount and try again. This
is just too many details leaking into the higher-level logic.
This patch changes the refcounting scheme in such a way as to not have
uprobes_tree keeping an extra refcount for struct uprobe. Instead, each
uprobe_consumer assumes its own refcount, which will be dropped
when the consumer is unregistered. Other than that, all the active users of
uprobe (entry and return uprobe handling code) keep exactly the same
refcounting approach.
With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes the uprobe from uprobes_tree,
of course. This, though, races with the uprobe entry handling code in
handle_swbp(), whose find_active_uprobe()->find_uprobe() lookup can
observe a uprobe that is being destroyed once its refcount drops to zero
(e.g., due to uprobe_consumer unregistering). So we add try_get_uprobe(),
which will attempt to bump the refcount, unless it is already zero. The
caller needs to guarantee that the uprobe instance won't be freed in
parallel, which is the case while we hold uprobes_treelock (for read or
write, doesn't matter).
Note also that the race between registration and unregistration no longer
leaks out to callers, so we remove the retry logic completely. If
find_uprobe() returns a valid uprobe, it's guaranteed to remain in
uprobes_tree with a properly incremented refcount. The race is handled
inside __insert_uprobe() and put_uprobe() working together:
__insert_uprobe() will remove a uprobe from the RB-tree if it can't bump
its refcount, and will retry inserting the new uprobe instance.
put_uprobe() won't attempt to remove a uprobe from the RB-tree if it's
already not there. All that is protected by uprobes_treelock, which keeps
things simple.
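In condensed form (a sketch only; the real code is in the diff below), the
insertion side of that cooperation looks like this, with put_uprobe() only
erasing the RB-tree node if uprobe_is_active() still holds:
    /* runs under write_lock(&uprobes_treelock) */
    static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
    {
            struct rb_node *node;
    again:
            node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
            if (node) {
                    struct uprobe *u = __node_2_uprobe(node);

                    /* raced with a dying uprobe: evict it and retry */
                    if (!try_get_uprobe(u)) {
                            rb_erase(node, &uprobes_tree);
                            RB_CLEAR_NODE(&u->rb_node);
                            goto again;
                    }
                    return u;       /* reuse the existing uprobe */
            }
            return uprobe;          /* new instance was inserted */
    }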
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
kernel/events/uprobes.c | 179 +++++++++++++++++++++++-----------------
1 file changed, 101 insertions(+), 78 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 33349cc8de0c..147561c19d57 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -109,6 +109,11 @@ struct xol_area {
unsigned long vaddr; /* Page(s) of instruction slots */
};
+static void uprobe_warn(struct task_struct *t, const char *msg)
+{
+ pr_warn("uprobe: %s:%d failed to %s\n", current->comm, current->pid, msg);
+}
+
/*
* valid_vma: Verify if the specified vma is an executable vma
* Relax restrictions while unregistering: vm_flags might have
@@ -587,25 +592,53 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
*(uprobe_opcode_t *)&auprobe->insn);
}
+/* uprobe should have guaranteed positive refcount */
static struct uprobe *get_uprobe(struct uprobe *uprobe)
{
refcount_inc(&uprobe->ref);
return uprobe;
}
+/*
+ * uprobe should have guaranteed lifetime, which can be either of:
+ * - caller already has refcount taken (and wants an extra one);
+ * - uprobe is RCU protected and won't be freed until after grace period;
+ * - we are holding uprobes_treelock (for read or write, doesn't matter).
+ */
+static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
+{
+ if (refcount_inc_not_zero(&uprobe->ref))
+ return uprobe;
+ return NULL;
+}
+
+static inline bool uprobe_is_active(struct uprobe *uprobe)
+{
+ return !RB_EMPTY_NODE(&uprobe->rb_node);
+}
+
static void put_uprobe(struct uprobe *uprobe)
{
- if (refcount_dec_and_test(&uprobe->ref)) {
- /*
- * If application munmap(exec_vma) before uprobe_unregister()
- * gets called, we don't get a chance to remove uprobe from
- * delayed_uprobe_list from remove_breakpoint(). Do it here.
- */
- mutex_lock(&delayed_uprobe_lock);
- delayed_uprobe_remove(uprobe, NULL);
- mutex_unlock(&delayed_uprobe_lock);
- kfree(uprobe);
- }
+ if (!refcount_dec_and_test(&uprobe->ref))
+ return;
+
+ write_lock(&uprobes_treelock);
+
+ if (uprobe_is_active(uprobe))
+ rb_erase(&uprobe->rb_node, &uprobes_tree);
+
+ write_unlock(&uprobes_treelock);
+
+ /*
+ * If application munmap(exec_vma) before uprobe_unregister()
+ * gets called, we don't get a chance to remove uprobe from
+ * delayed_uprobe_list from remove_breakpoint(). Do it here.
+ */
+ mutex_lock(&delayed_uprobe_lock);
+ delayed_uprobe_remove(uprobe, NULL);
+ mutex_unlock(&delayed_uprobe_lock);
+
+ kfree(uprobe);
}
static __always_inline
@@ -656,7 +689,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
if (node)
- return get_uprobe(__node_2_uprobe(node));
+ return try_get_uprobe(__node_2_uprobe(node));
return NULL;
}
@@ -676,26 +709,44 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
return uprobe;
}
+/*
+ * Attempt to insert a new uprobe into uprobes_tree.
+ *
+ * If uprobe already exists (for given inode+offset), we just increment
+ * refcount of previously existing uprobe.
+ *
+ * If not, a provided new instance of uprobe is inserted into the tree (with
+ * assumed initial refcount == 1).
+ *
+ * In any case, we return a uprobe instance that ends up being in uprobes_tree.
+ * Caller has to clean up new uprobe instance, if it ended up not being
+ * inserted into the tree.
+ *
+ * We assume that uprobes_treelock is held for writing.
+ */
static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
{
struct rb_node *node;
-
+again:
node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
- if (node)
- return get_uprobe(__node_2_uprobe(node));
+ if (node) {
+ struct uprobe *u = __node_2_uprobe(node);
- /* get access + creation ref */
- refcount_set(&uprobe->ref, 2);
- return NULL;
+ if (!try_get_uprobe(u)) {
+ rb_erase(node, &uprobes_tree);
+ RB_CLEAR_NODE(&u->rb_node);
+ goto again;
+ }
+
+ return u;
+ }
+
+ return uprobe;
}
/*
- * Acquire uprobes_treelock.
- * Matching uprobe already exists in rbtree;
- * increment (access refcount) and return the matching uprobe.
- *
- * No matching uprobe; insert the uprobe in rb_tree;
- * get a double refcount (access + creation) and return NULL.
+ * Acquire uprobes_treelock and insert uprobe into uprobes_tree
+ * (or reuse existing one, see __insert_uprobe() comments above).
*/
static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
@@ -732,11 +783,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
uprobe->ref_ctr_offset = ref_ctr_offset;
init_rwsem(&uprobe->register_rwsem);
init_rwsem(&uprobe->consumer_rwsem);
+ RB_CLEAR_NODE(&uprobe->rb_node);
+ refcount_set(&uprobe->ref, 1);
/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
- if (cur_uprobe) {
+ if (cur_uprobe != uprobe) {
if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
put_uprobe(cur_uprobe);
@@ -921,26 +974,6 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad
return set_orig_insn(&uprobe->arch, mm, vaddr);
}
-static inline bool uprobe_is_active(struct uprobe *uprobe)
-{
- return !RB_EMPTY_NODE(&uprobe->rb_node);
-}
-/*
- * There could be threads that have already hit the breakpoint. They
- * will recheck the current insn and restart if find_uprobe() fails.
- * See find_active_uprobe().
- */
-static void delete_uprobe(struct uprobe *uprobe)
-{
- if (WARN_ON(!uprobe_is_active(uprobe)))
- return;
-
- write_lock(&uprobes_treelock);
- rb_erase(&uprobe->rb_node, &uprobes_tree);
- write_unlock(&uprobes_treelock);
- RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
-}
-
struct map_info {
struct map_info *next;
struct mm_struct *mm;
@@ -1094,17 +1127,13 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
int err;
down_write(&uprobe->register_rwsem);
- if (WARN_ON(!consumer_del(uprobe, uc)))
+ if (WARN_ON(!consumer_del(uprobe, uc))) {
err = -ENOENT;
- else
+ } else {
err = register_for_each_vma(uprobe, NULL);
-
- /* TODO : cant unregister? schedule a worker thread */
- if (!err) {
- if (!uprobe->consumers)
- delete_uprobe(uprobe);
- else
- err = -EBUSY;
+ /* TODO : cant unregister? schedule a worker thread */
+ if (unlikely(err))
+ uprobe_warn(current, "unregister, leaking uprobe");
}
up_write(&uprobe->register_rwsem);
@@ -1159,27 +1188,16 @@ struct uprobe *uprobe_register(struct inode *inode,
if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
return ERR_PTR(-EINVAL);
- retry:
uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
if (IS_ERR(uprobe))
return uprobe;
- /*
- * We can race with uprobe_unregister()->delete_uprobe().
- * Check uprobe_is_active() and retry if it is false.
- */
down_write(&uprobe->register_rwsem);
- ret = -EAGAIN;
- if (likely(uprobe_is_active(uprobe))) {
- consumer_add(uprobe, uc);
- ret = register_for_each_vma(uprobe, uc);
- }
+ consumer_add(uprobe, uc);
+ ret = register_for_each_vma(uprobe, uc);
up_write(&uprobe->register_rwsem);
- put_uprobe(uprobe);
if (ret) {
- if (unlikely(ret == -EAGAIN))
- goto retry;
uprobe_unregister(uprobe, uc);
return ERR_PTR(ret);
}
@@ -1286,15 +1304,17 @@ static void build_probe_list(struct inode *inode,
u = rb_entry(t, struct uprobe, rb_node);
if (u->inode != inode || u->offset < min)
break;
- list_add(&u->pending_list, head);
- get_uprobe(u);
+ /* if uprobe went away, it's safe to ignore it */
+ if (try_get_uprobe(u))
+ list_add(&u->pending_list, head);
}
for (t = n; (t = rb_next(t)); ) {
u = rb_entry(t, struct uprobe, rb_node);
if (u->inode != inode || u->offset > max)
break;
- list_add(&u->pending_list, head);
- get_uprobe(u);
+ /* if uprobe went away, it's safe to ignore it */
+ if (try_get_uprobe(u))
+ list_add(&u->pending_list, head);
}
}
read_unlock(&uprobes_treelock);
@@ -1752,6 +1772,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
return -ENOMEM;
*n = *o;
+ /*
+ * uprobe's refcnt has to be positive at this point, kept by
+ * utask->return_instances items; return_instances can't be
+ * removed right now, as task is blocked due to duping; so
+ * get_uprobe() is safe to use here.
+ */
get_uprobe(n->uprobe);
n->next = NULL;
@@ -1763,12 +1789,6 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
return 0;
}
-static void uprobe_warn(struct task_struct *t, const char *msg)
-{
- pr_warn("uprobe: %s:%d failed to %s\n",
- current->comm, current->pid, msg);
-}
-
static void dup_xol_work(struct callback_head *work)
{
if (current->flags & PF_EXITING)
@@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
}
orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
}
-
+ /*
+ * uprobe's refcnt is positive, held by caller, so it's safe to
+ * unconditionally bump it one more time here
+ */
ri->uprobe = get_uprobe(uprobe);
ri->func = instruction_pointer(regs);
ri->stack = user_stack_pointer(regs);
--
2.43.5
* [PATCH v4 2/8] uprobes: protected uprobe lifetime with SRCU
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 1/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
@ 2024-08-29 18:37 ` Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 3/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
` (6 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 18:37 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, Andrii Nakryiko
To avoid unnecessarily taking a (brief) refcount on uprobe during
breakpoint handling in handle_swbp() for entry uprobes, make find_uprobe()
not take a refcount, but protect the lifetime of a uprobe instance with
RCU instead. This improves scalability, as refcounting gets quite expensive
due to cache line bouncing between multiple CPUs.
Specifically, we utilize our own uprobe-specific SRCU instance for this
RCU protection. put_uprobe() will delay the actual kfree() using call_srcu().
For now, uretprobe and single-stepping handling will still acquire a
refcount as necessary. We'll address these issues in follow-up patches by
making them use SRCU with timeout.
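In condensed form, the hot path protection looks like this (a sketch only;
see the diff below for the real code):
    srcu_idx = srcu_read_lock(&uprobes_srcu);

    /* the lookup takes no refcount; SRCU keeps the uprobe alive */
    uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
    if (uprobe) {
            /* ... handle the breakpoint ... */
    }

    srcu_read_unlock(&uprobes_srcu, srcu_idx);
while put_uprobe() defers the actual freeing past any in-flight SRCU
readers:
    call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);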
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
kernel/events/uprobes.c | 94 +++++++++++++++++++++++------------------
1 file changed, 54 insertions(+), 40 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 147561c19d57..3e3595753e2c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -41,6 +41,8 @@ static struct rb_root uprobes_tree = RB_ROOT;
static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
+DEFINE_STATIC_SRCU(uprobes_srcu);
+
#define UPROBES_HASH_SZ 13
/* serialize uprobe->pending_list */
static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
@@ -59,6 +61,7 @@ struct uprobe {
struct list_head pending_list;
struct uprobe_consumer *consumers;
struct inode *inode; /* Also hold a ref to inode */
+ struct rcu_head rcu;
loff_t offset;
loff_t ref_ctr_offset;
unsigned long flags;
@@ -617,6 +620,13 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
return !RB_EMPTY_NODE(&uprobe->rb_node);
}
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+ struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+
+ kfree(uprobe);
+}
+
static void put_uprobe(struct uprobe *uprobe)
{
if (!refcount_dec_and_test(&uprobe->ref))
@@ -638,7 +648,7 @@ static void put_uprobe(struct uprobe *uprobe)
delayed_uprobe_remove(uprobe, NULL);
mutex_unlock(&delayed_uprobe_lock);
- kfree(uprobe);
+ call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
}
static __always_inline
@@ -680,33 +690,25 @@ static inline int __uprobe_cmp(struct rb_node *a, const struct rb_node *b)
return uprobe_cmp(u->inode, u->offset, __node_2_uprobe(b));
}
-static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
+/*
+ * Assumes being inside RCU protected region.
+ * No refcount is taken on returned uprobe.
+ */
+static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
{
struct __uprobe_key key = {
.inode = inode,
.offset = offset,
};
- struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
-
- if (node)
- return try_get_uprobe(__node_2_uprobe(node));
-
- return NULL;
-}
+ struct rb_node *node;
-/*
- * Find a uprobe corresponding to a given inode:offset
- * Acquires uprobes_treelock
- */
-static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
-{
- struct uprobe *uprobe;
+ lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
read_lock(&uprobes_treelock);
- uprobe = __find_uprobe(inode, offset);
+ node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
read_unlock(&uprobes_treelock);
- return uprobe;
+ return node ? __node_2_uprobe(node) : NULL;
}
/*
@@ -1080,10 +1082,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
goto free;
/*
* We take mmap_lock for writing to avoid the race with
- * find_active_uprobe() which takes mmap_lock for reading.
+ * find_active_uprobe_rcu() which takes mmap_lock for reading.
* Thus this install_breakpoint() can not make
- * is_trap_at_addr() true right after find_uprobe()
- * returns NULL in find_active_uprobe().
+ * is_trap_at_addr() true right after find_uprobe_rcu()
+ * returns NULL in find_active_uprobe_rcu().
*/
mmap_write_lock(mm);
vma = find_vma(mm, info->vaddr);
@@ -1885,9 +1887,13 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
return;
}
+ /* we need to bump refcount to store uprobe in utask */
+ if (!try_get_uprobe(uprobe))
+ return;
+
ri = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
if (!ri)
- return;
+ goto fail;
trampoline_vaddr = uprobe_get_trampoline_vaddr();
orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
@@ -1914,11 +1920,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
}
orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
}
- /*
- * uprobe's refcnt is positive, held by caller, so it's safe to
- * unconditionally bump it one more time here
- */
- ri->uprobe = get_uprobe(uprobe);
+ ri->uprobe = uprobe;
ri->func = instruction_pointer(regs);
ri->stack = user_stack_pointer(regs);
ri->orig_ret_vaddr = orig_ret_vaddr;
@@ -1929,8 +1931,9 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
utask->return_instances = ri;
return;
- fail:
+fail:
kfree(ri);
+ put_uprobe(uprobe);
}
/* Prepare to single-step probed instruction out of line. */
@@ -1945,9 +1948,14 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
if (!utask)
return -ENOMEM;
+ if (!try_get_uprobe(uprobe))
+ return -EINVAL;
+
xol_vaddr = xol_get_insn_slot(uprobe);
- if (!xol_vaddr)
- return -ENOMEM;
+ if (!xol_vaddr) {
+ err = -ENOMEM;
+ goto err_out;
+ }
utask->xol_vaddr = xol_vaddr;
utask->vaddr = bp_vaddr;
@@ -1955,12 +1963,15 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
err = arch_uprobe_pre_xol(&uprobe->arch, regs);
if (unlikely(err)) {
xol_free_insn_slot(current);
- return err;
+ goto err_out;
}
utask->active_uprobe = uprobe;
utask->state = UTASK_SSTEP;
return 0;
+err_out:
+ put_uprobe(uprobe);
+ return err;
}
/*
@@ -2043,7 +2054,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
return is_trap_insn(&opcode);
}
-static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
+/* assumes being inside RCU protected region */
+static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
{
struct mm_struct *mm = current->mm;
struct uprobe *uprobe = NULL;
@@ -2056,7 +2068,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
struct inode *inode = file_inode(vma->vm_file);
loff_t offset = vaddr_to_offset(vma, bp_vaddr);
- uprobe = find_uprobe(inode, offset);
+ uprobe = find_uprobe_rcu(inode, offset);
}
if (!uprobe)
@@ -2202,13 +2214,15 @@ static void handle_swbp(struct pt_regs *regs)
{
struct uprobe *uprobe;
unsigned long bp_vaddr;
- int is_swbp;
+ int is_swbp, srcu_idx;
bp_vaddr = uprobe_get_swbp_addr(regs);
if (bp_vaddr == uprobe_get_trampoline_vaddr())
return uprobe_handle_trampoline(regs);
- uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
+ srcu_idx = srcu_read_lock(&uprobes_srcu);
+
+ uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
if (!uprobe) {
if (is_swbp > 0) {
/* No matching uprobe; signal SIGTRAP. */
@@ -2224,7 +2238,7 @@ static void handle_swbp(struct pt_regs *regs)
*/
instruction_pointer_set(regs, bp_vaddr);
}
- return;
+ goto out;
}
/* change it in advance for ->handler() and restart */
@@ -2259,12 +2273,12 @@ static void handle_swbp(struct pt_regs *regs)
if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
goto out;
- if (!pre_ssout(uprobe, regs, bp_vaddr))
- return;
+ if (pre_ssout(uprobe, regs, bp_vaddr))
+ goto out;
- /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
out:
- put_uprobe(uprobe);
+ /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
+ srcu_read_unlock(&uprobes_srcu, srcu_idx);
}
/*
--
2.43.5
* [PATCH v4 3/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 1/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 2/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
@ 2024-08-29 18:37 ` Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
` (5 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 18:37 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, Andrii Nakryiko
It serves no purpose beyond adding an unnecessary argument passed to the
filter callback. Just get rid of it; no one is actually using it.
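With the ctx argument gone, a consumer's filter reduces to the following
shape (a hypothetical consumer for illustration only; my_filter and
my_traced_mm are made-up names):
    static bool my_filter(struct uprobe_consumer *self, struct mm_struct *mm)
    {
            /* e.g., only keep breakpoints installed in the mm we trace */
            return mm == my_traced_mm;
    }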
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
include/linux/uprobes.h | 10 +---------
kernel/events/uprobes.c | 18 +++++++-----------
kernel/trace/bpf_trace.c | 3 +--
kernel/trace/trace_uprobe.c | 9 +++------
4 files changed, 12 insertions(+), 28 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 6332c111036e..9cf0dce62e4c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -28,20 +28,12 @@ struct page;
#define MAX_URETPROBE_DEPTH 64
-enum uprobe_filter_ctx {
- UPROBE_FILTER_REGISTER,
- UPROBE_FILTER_UNREGISTER,
- UPROBE_FILTER_MMAP,
-};
-
struct uprobe_consumer {
int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
int (*ret_handler)(struct uprobe_consumer *self,
unsigned long func,
struct pt_regs *regs);
- bool (*filter)(struct uprobe_consumer *self,
- enum uprobe_filter_ctx ctx,
- struct mm_struct *mm);
+ bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
struct uprobe_consumer *next;
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 3e3595753e2c..8bdcdc6901b2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -918,21 +918,19 @@ static int prepare_uprobe(struct uprobe *uprobe, struct file *file,
return ret;
}
-static inline bool consumer_filter(struct uprobe_consumer *uc,
- enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+static inline bool consumer_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
{
- return !uc->filter || uc->filter(uc, ctx, mm);
+ return !uc->filter || uc->filter(uc, mm);
}
-static bool filter_chain(struct uprobe *uprobe,
- enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
{
struct uprobe_consumer *uc;
bool ret = false;
down_read(&uprobe->consumer_rwsem);
for (uc = uprobe->consumers; uc; uc = uc->next) {
- ret = consumer_filter(uc, ctx, mm);
+ ret = consumer_filter(uc, mm);
if (ret)
break;
}
@@ -1099,12 +1097,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
if (is_register) {
/* consult only the "caller", new consumer. */
- if (consumer_filter(new,
- UPROBE_FILTER_REGISTER, mm))
+ if (consumer_filter(new, mm))
err = install_breakpoint(uprobe, mm, vma, info->vaddr);
} else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) {
- if (!filter_chain(uprobe,
- UPROBE_FILTER_UNREGISTER, mm))
+ if (!filter_chain(uprobe, mm))
err |= remove_breakpoint(uprobe, mm, info->vaddr);
}
@@ -1387,7 +1383,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
*/
list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
if (!fatal_signal_pending(current) &&
- filter_chain(uprobe, UPROBE_FILTER_MMAP, vma->vm_mm)) {
+ filter_chain(uprobe, vma->vm_mm)) {
unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset);
install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 4e391daafa64..73c570b5988b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3320,8 +3320,7 @@ static int uprobe_prog_run(struct bpf_uprobe *uprobe,
}
static bool
-uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx,
- struct mm_struct *mm)
+uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm)
{
struct bpf_uprobe *uprobe;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 52e76a73fa7c..7eb79e0a5352 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1078,9 +1078,7 @@ print_uprobe_event(struct trace_iterator *iter, int flags, struct trace_event *e
return trace_handle_return(s);
}
-typedef bool (*filter_func_t)(struct uprobe_consumer *self,
- enum uprobe_filter_ctx ctx,
- struct mm_struct *mm);
+typedef bool (*filter_func_t)(struct uprobe_consumer *self, struct mm_struct *mm);
static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
{
@@ -1339,8 +1337,7 @@ static int uprobe_perf_open(struct trace_event_call *call,
return err;
}
-static bool uprobe_perf_filter(struct uprobe_consumer *uc,
- enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+static bool uprobe_perf_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
{
struct trace_uprobe_filter *filter;
struct trace_uprobe *tu;
@@ -1426,7 +1423,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
static int uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs,
struct uprobe_cpu_buffer **ucbp)
{
- if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
+ if (!uprobe_perf_filter(&tu->consumer, current->mm))
return UPROBE_HANDLER_REMOVE;
if (!is_ret_probe(tu))
--
2.43.5
* [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (2 preceding siblings ...)
2024-08-29 18:37 ` [PATCH v4 3/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
@ 2024-08-29 18:37 ` Andrii Nakryiko
2024-08-29 23:09 ` Jiri Olsa
2024-08-29 18:37 ` [PATCH v4 5/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
` (4 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 18:37 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, Andrii Nakryiko
uprobe->register_rwsem is one of a few big bottlenecks to scalability of
uprobes, so we need to get rid of it to improve uprobe performance and
multi-CPU scalability.
First, we turn uprobe's consumer list into a typical doubly-linked list
and utilize existing RCU-aware helpers for traversing such lists, as
well as for adding and removing elements from them.
For entry uprobes, we already have SRCU protection active since before
the uprobe lookup. For uretprobes we keep the refcount, guaranteeing that
the uprobe won't go away from under us, but we add SRCU protection around
the consumer list traversal.
Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple,
we remember whether any removal was requested during handler calls, but
then we double-check the decision under a proper register_rwsem using
consumers' filter callbacks. Handler removal is very rare, so this extra
lock won't hurt performance overall, but we also avoid the need for any
extra protection (e.g., seqcount locks).
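Condensed from the diff below, handler_chain()'s removal logic now has
this shape (a sketch only):
    list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
                             srcu_read_lock_held(&uprobes_srcu)) {
            int rc = 0;

            if (uc->handler)
                    rc = uc->handler(uc, regs);

            remove &= rc;           /* every handler must request removal */
            has_consumers = true;
    }

    if (remove && has_consumers) {
            down_read(&uprobe->register_rwsem);

            /* re-check that removal is still required, now under lock */
            if (!filter_chain(uprobe, current->mm))
                    unapply_uprobe(uprobe, current->mm);

            up_read(&uprobe->register_rwsem);
    }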
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
include/linux/uprobes.h | 2 +-
kernel/events/uprobes.c | 104 +++++++++++++++++++++++-----------------
2 files changed, 62 insertions(+), 44 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 9cf0dce62e4c..29c935b0d504 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -35,7 +35,7 @@ struct uprobe_consumer {
struct pt_regs *regs);
bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
- struct uprobe_consumer *next;
+ struct list_head cons_node;
};
#ifdef CONFIG_UPROBES
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8bdcdc6901b2..97e58d160647 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -59,7 +59,7 @@ struct uprobe {
struct rw_semaphore register_rwsem;
struct rw_semaphore consumer_rwsem;
struct list_head pending_list;
- struct uprobe_consumer *consumers;
+ struct list_head consumers;
struct inode *inode; /* Also hold a ref to inode */
struct rcu_head rcu;
loff_t offset;
@@ -783,6 +783,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
uprobe->inode = inode;
uprobe->offset = offset;
uprobe->ref_ctr_offset = ref_ctr_offset;
+ INIT_LIST_HEAD(&uprobe->consumers);
init_rwsem(&uprobe->register_rwsem);
init_rwsem(&uprobe->consumer_rwsem);
RB_CLEAR_NODE(&uprobe->rb_node);
@@ -808,32 +809,19 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
{
down_write(&uprobe->consumer_rwsem);
- uc->next = uprobe->consumers;
- uprobe->consumers = uc;
+ list_add_rcu(&uc->cons_node, &uprobe->consumers);
up_write(&uprobe->consumer_rwsem);
}
/*
* For uprobe @uprobe, delete the consumer @uc.
- * Return true if the @uc is deleted successfully
- * or return false.
+ * Should never be called with consumer that's not part of @uprobe->consumers.
*/
-static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
+static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
{
- struct uprobe_consumer **con;
- bool ret = false;
-
down_write(&uprobe->consumer_rwsem);
- for (con = &uprobe->consumers; *con; con = &(*con)->next) {
- if (*con == uc) {
- *con = uc->next;
- ret = true;
- break;
- }
- }
+ list_del_rcu(&uc->cons_node);
up_write(&uprobe->consumer_rwsem);
-
- return ret;
}
static int __copy_insn(struct address_space *mapping, struct file *filp,
@@ -929,7 +917,8 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
bool ret = false;
down_read(&uprobe->consumer_rwsem);
- for (uc = uprobe->consumers; uc; uc = uc->next) {
+ list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+ srcu_read_lock_held(&uprobes_srcu)) {
ret = consumer_filter(uc, mm);
if (ret)
break;
@@ -1125,18 +1114,29 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
int err;
down_write(&uprobe->register_rwsem);
- if (WARN_ON(!consumer_del(uprobe, uc))) {
- err = -ENOENT;
- } else {
- err = register_for_each_vma(uprobe, NULL);
- /* TODO : cant unregister? schedule a worker thread */
- if (unlikely(err))
- uprobe_warn(current, "unregister, leaking uprobe");
- }
+ consumer_del(uprobe, uc);
+ err = register_for_each_vma(uprobe, NULL);
up_write(&uprobe->register_rwsem);
- if (!err)
- put_uprobe(uprobe);
+ /* TODO : cant unregister? schedule a worker thread */
+ if (unlikely(err)) {
+ uprobe_warn(current, "unregister, leaking uprobe");
+ goto out_sync;
+ }
+
+ put_uprobe(uprobe);
+
+out_sync:
+ /*
+ * Now that handler_chain() and handle_uretprobe_chain() iterate over
+ * uprobe->consumers list under RCU protection without holding
+ * uprobe->register_rwsem, we need to wait for RCU grace period to
+ * make sure that we can't call into just unregistered
+ * uprobe_consumer's callbacks anymore. If we don't do that, fast and
+ * unlucky enough caller can free consumer's memory and cause
+ * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
+ */
+ synchronize_srcu(&uprobes_srcu);
}
EXPORT_SYMBOL_GPL(uprobe_unregister);
@@ -1214,13 +1214,20 @@ EXPORT_SYMBOL_GPL(uprobe_register);
int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
{
struct uprobe_consumer *con;
- int ret = -ENOENT;
+ int ret = -ENOENT, srcu_idx;
down_write(&uprobe->register_rwsem);
- for (con = uprobe->consumers; con && con != uc ; con = con->next)
- ;
- if (con)
- ret = register_for_each_vma(uprobe, add ? uc : NULL);
+
+ srcu_idx = srcu_read_lock(&uprobes_srcu);
+ list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
+ srcu_read_lock_held(&uprobes_srcu)) {
+ if (con == uc) {
+ ret = register_for_each_vma(uprobe, add ? uc : NULL);
+ break;
+ }
+ }
+ srcu_read_unlock(&uprobes_srcu, srcu_idx);
+
up_write(&uprobe->register_rwsem);
return ret;
@@ -2085,10 +2092,12 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
struct uprobe_consumer *uc;
int remove = UPROBE_HANDLER_REMOVE;
bool need_prep = false; /* prepare return uprobe, when needed */
+ bool has_consumers = false;
- down_read(&uprobe->register_rwsem);
current->utask->auprobe = &uprobe->arch;
- for (uc = uprobe->consumers; uc; uc = uc->next) {
+
+ list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+ srcu_read_lock_held(&uprobes_srcu)) {
int rc = 0;
if (uc->handler) {
@@ -2101,17 +2110,24 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
need_prep = true;
remove &= rc;
+ has_consumers = true;
}
current->utask->auprobe = NULL;
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
- if (remove && uprobe->consumers) {
- WARN_ON(!uprobe_is_active(uprobe));
- unapply_uprobe(uprobe, current->mm);
+ if (remove && has_consumers) {
+ down_read(&uprobe->register_rwsem);
+
+ /* re-check that removal is still required, this time under lock */
+ if (!filter_chain(uprobe, current->mm)) {
+ WARN_ON(!uprobe_is_active(uprobe));
+ unapply_uprobe(uprobe, current->mm);
+ }
+
+ up_read(&uprobe->register_rwsem);
}
- up_read(&uprobe->register_rwsem);
}
static void
@@ -2119,13 +2135,15 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
{
struct uprobe *uprobe = ri->uprobe;
struct uprobe_consumer *uc;
+ int srcu_idx;
- down_read(&uprobe->register_rwsem);
- for (uc = uprobe->consumers; uc; uc = uc->next) {
+ srcu_idx = srcu_read_lock(&uprobes_srcu);
+ list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+ srcu_read_lock_held(&uprobes_srcu)) {
if (uc->ret_handler)
uc->ret_handler(uc, ri->func, regs);
}
- up_read(&uprobe->register_rwsem);
+ srcu_read_unlock(&uprobes_srcu, srcu_idx);
}
static struct return_instance *find_next_ret_chain(struct return_instance *ri)
--
2.43.5
* [PATCH v4 5/8] perf/uprobe: split uprobe_unregister()
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (3 preceding siblings ...)
2024-08-29 18:37 ` [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
@ 2024-08-29 18:37 ` Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 6/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
` (3 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 18:37 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, Andrii Nakryiko
From: Peter Zijlstra <peterz@infradead.org>
With uprobe_unregister() having grown a synchronize_srcu(), it becomes
fairly slow to call. Esp. since both users of this API call it in a
loop.
Peel off the sync_srcu() and do it once, after the loop.
We also need to add uprobe_unregister_sync() into uprobe_register()'s
error handling path, as we need to be careful not to return to the
caller before we have a guarantee that a partially attached consumer
won't be called anymore. This is an unlikely slow path, and it should be
totally fine to be slow in the case of a failed attach.
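The resulting caller pattern, condensed from the bpf_trace.c hunk below:
    for (i = 0; i < cnt; i++)
            uprobe_unregister_nosync(uprobes[i].uprobe, &uprobes[i].consumer);

    /* a single synchronize_srcu() covers the whole batch */
    if (cnt)
            uprobe_unregister_sync();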
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
include/linux/uprobes.h | 8 +++++--
kernel/events/uprobes.c | 21 +++++++++++++------
kernel/trace/bpf_trace.c | 5 ++++-
kernel/trace/trace_uprobe.c | 6 +++++-
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 3 ++-
5 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 29c935b0d504..e41cdae5597b 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -108,7 +108,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc);
+extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
+extern void uprobe_unregister_sync(void);
extern int uprobe_mmap(struct vm_area_struct *vma);
extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
extern void uprobe_start_dup_mmap(void);
@@ -157,7 +158,10 @@ uprobe_apply(struct uprobe* uprobe, struct uprobe_consumer *uc, bool add)
return -ENOSYS;
}
static inline void
-uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
+uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
+{
+}
+static inline void uprobe_unregister_sync(void)
{
}
static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 97e58d160647..e9b755ddf960 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1105,11 +1105,11 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
}
/**
- * uprobe_unregister - unregister an already registered probe.
+ * uprobe_unregister_nosync - unregister an already registered probe.
* @uprobe: uprobe to remove
* @uc: identify which probe if multiple probes are colocated.
*/
-void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
+void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
{
int err;
@@ -1121,12 +1121,15 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
/* TODO : cant unregister? schedule a worker thread */
if (unlikely(err)) {
uprobe_warn(current, "unregister, leaking uprobe");
- goto out_sync;
+ return;
}
put_uprobe(uprobe);
+}
+EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
-out_sync:
+void uprobe_unregister_sync(void)
+{
/*
* Now that handler_chain() and handle_uretprobe_chain() iterate over
* uprobe->consumers list under RCU protection without holding
@@ -1138,7 +1141,7 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
*/
synchronize_srcu(&uprobes_srcu);
}
-EXPORT_SYMBOL_GPL(uprobe_unregister);
+EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
/**
* uprobe_register - register a probe
@@ -1196,7 +1199,13 @@ struct uprobe *uprobe_register(struct inode *inode,
up_write(&uprobe->register_rwsem);
if (ret) {
- uprobe_unregister(uprobe, uc);
+ uprobe_unregister_nosync(uprobe, uc);
+ /*
+ * Registration might have partially succeeded, so we can have
+ * this consumer being called right at this time. We need to
+ * sync here. It's ok, it's unlikely slow path.
+ */
+ uprobe_unregister_sync();
return ERR_PTR(ret);
}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 73c570b5988b..6b632710c98e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3184,7 +3184,10 @@ static void bpf_uprobe_unregister(struct bpf_uprobe *uprobes, u32 cnt)
u32 i;
for (i = 0; i < cnt; i++)
- uprobe_unregister(uprobes[i].uprobe, &uprobes[i].consumer);
+ uprobe_unregister_nosync(uprobes[i].uprobe, &uprobes[i].consumer);
+
+ if (cnt)
+ uprobe_unregister_sync();
}
static void bpf_uprobe_multi_link_release(struct bpf_link *link)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 7eb79e0a5352..f7443e996b1b 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1097,6 +1097,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
static void __probe_event_disable(struct trace_probe *tp)
{
struct trace_uprobe *tu;
+ bool sync = false;
tu = container_of(tp, struct trace_uprobe, tp);
WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
@@ -1105,9 +1106,12 @@ static void __probe_event_disable(struct trace_probe *tp)
if (!tu->uprobe)
continue;
- uprobe_unregister(tu->uprobe, &tu->consumer);
+ uprobe_unregister_nosync(tu->uprobe, &tu->consumer);
+ sync = true;
tu->uprobe = NULL;
}
+ if (sync)
+ uprobe_unregister_sync();
}
static int probe_event_enable(struct trace_event_call *call,
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 3c0515a27842..1fc16657cf42 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -475,7 +475,8 @@ static void testmod_unregister_uprobe(void)
mutex_lock(&testmod_uprobe_mutex);
if (uprobe.uprobe) {
- uprobe_unregister(uprobe.uprobe, &uprobe.consumer);
+ uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer);
+ uprobe_unregister_sync();
path_put(&uprobe.path);
uprobe.uprobe = NULL;
}
--
2.43.5
* [PATCH v4 6/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu()
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (4 preceding siblings ...)
2024-08-29 18:37 ` [PATCH v4 5/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
@ 2024-08-29 18:37 ` Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
` (2 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 18:37 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, Andrii Nakryiko
From: Peter Zijlstra <peterz@infradead.org>
Much like latch_tree, add two RCU methods for the regular RB-tree,
which can be used in conjunction with a seqcount to provide lockless
lookups.
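A sketch of the intended pairing with a seqcount (names like "tree" and
"tree_seqcount" are placeholders; the next patch applies exactly this
shape to uprobes_tree):
    /* reader: lockless descent; concurrent rotations can only produce
     * false negatives, so retry if the tree changed in the meantime
     */
    do {
            seq = read_seqcount_begin(&tree_seqcount);
            node = rb_find_rcu(&key, &tree, cmp_key);
            if (node)
                    return node;    /* a hit is always a valid match */
    } while (read_seqcount_retry(&tree_seqcount, seq));
    return NULL;

    /* writer: modifications happen under the writers' lock and are
     * bracketed by the seqcount
     */
    write_seqcount_begin(&tree_seqcount);
    node = rb_find_add_rcu(&new->rb_node, &tree, cmp);
    write_seqcount_end(&tree_seqcount);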
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
include/linux/rbtree.h | 67 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)
diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index f7edca369eda..7c173aa64e1e 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -244,6 +244,42 @@ rb_find_add(struct rb_node *node, struct rb_root *tree,
return NULL;
}
+/**
+ * rb_find_add_rcu() - find equivalent @node in @tree, or add @node
+ * @node: node to look-for / insert
+ * @tree: tree to search / modify
+ * @cmp: operator defining the node order
+ *
+ * Adds a Store-Release for link_node.
+ *
+ * Returns the rb_node matching @node, or NULL when no match is found and @node
+ * is inserted.
+ */
+static __always_inline struct rb_node *
+rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
+ int (*cmp)(struct rb_node *, const struct rb_node *))
+{
+ struct rb_node **link = &tree->rb_node;
+ struct rb_node *parent = NULL;
+ int c;
+
+ while (*link) {
+ parent = *link;
+ c = cmp(node, parent);
+
+ if (c < 0)
+ link = &parent->rb_left;
+ else if (c > 0)
+ link = &parent->rb_right;
+ else
+ return parent;
+ }
+
+ rb_link_node_rcu(node, parent, link);
+ rb_insert_color(node, tree);
+ return NULL;
+}
+
/**
* rb_find() - find @key in tree @tree
* @key: key to match
@@ -272,6 +308,37 @@ rb_find(const void *key, const struct rb_root *tree,
return NULL;
}
+/**
+ * rb_find_rcu() - find @key in tree @tree
+ * @key: key to match
+ * @tree: tree to search
+ * @cmp: operator defining the node order
+ *
+ * Notably, tree descent vs concurrent tree rotations is unsound and can result
+ * in false-negatives.
+ *
+ * Returns the rb_node matching @key or NULL.
+ */
+static __always_inline struct rb_node *
+rb_find_rcu(const void *key, const struct rb_root *tree,
+ int (*cmp)(const void *key, const struct rb_node *))
+{
+ struct rb_node *node = tree->rb_node;
+
+ while (node) {
+ int c = cmp(key, node);
+
+ if (c < 0)
+ node = rcu_dereference_raw(node->rb_left);
+ else if (c > 0)
+ node = rcu_dereference_raw(node->rb_right);
+ else
+ return node;
+ }
+
+ return NULL;
+}
+
/**
* rb_find_first() - find the first @key in @tree
* @key: key to match
--
2.43.5
* [PATCH v4 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (5 preceding siblings ...)
2024-08-29 18:37 ` [PATCH v4 6/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
@ 2024-08-29 18:37 ` Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
2024-08-30 10:24 ` [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Oleg Nesterov
8 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 18:37 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, Andrii Nakryiko
Another big bottleneck to scalability is uprobes_treelock, which is taken in
a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
take advantage of that and make the uprobes_tree RB-tree lookup lockless.
To make the RCU-protected lockless RB-tree lookup correct, we need to take
into account that such a lookup can return false negatives if there are
parallel RB-tree modifications (rotations) going on. We use a seqcount to
detect whether the RB-tree changed, and if we find nothing while the
RB-tree got modified in between, we just retry. If a uprobe was found, then
it's guaranteed to be a correct lookup.
With all the lock-avoiding changes done, we get a pretty decent
improvement in performance and scalability of uprobes with the number of
CPUs, even though we are still nowhere near linear scalability. This is
due to SRCU not really scaling very well with the number of CPUs on
the particular hardware that was used for testing (80-core Intel Xeon Gold
6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock, which is
currently taken to resolve the interrupted address to inode+offset and
then to a uprobe instance. And, of course, uretprobes still need similar
RCU to avoid refcounting in the hot path, which will be addressed in
follow-up patches.
Nevertheless, the improvement is good. We used BPF selftest-based
uprobe-nop and uretprobe-nop benchmarks to get the numbers below,
varying the number of CPUs on which uprobes and uretprobes are triggered.
BASELINE
========
uprobe-nop ( 1 cpus): 3.032 ± 0.023M/s ( 3.032M/s/cpu)
uprobe-nop ( 2 cpus): 3.452 ± 0.005M/s ( 1.726M/s/cpu)
uprobe-nop ( 4 cpus): 3.663 ± 0.005M/s ( 0.916M/s/cpu)
uprobe-nop ( 8 cpus): 3.718 ± 0.038M/s ( 0.465M/s/cpu)
uprobe-nop (16 cpus): 3.344 ± 0.008M/s ( 0.209M/s/cpu)
uprobe-nop (32 cpus): 2.288 ± 0.021M/s ( 0.071M/s/cpu)
uprobe-nop (64 cpus): 3.205 ± 0.004M/s ( 0.050M/s/cpu)
uretprobe-nop ( 1 cpus): 1.979 ± 0.005M/s ( 1.979M/s/cpu)
uretprobe-nop ( 2 cpus): 2.361 ± 0.005M/s ( 1.180M/s/cpu)
uretprobe-nop ( 4 cpus): 2.309 ± 0.002M/s ( 0.577M/s/cpu)
uretprobe-nop ( 8 cpus): 2.253 ± 0.001M/s ( 0.282M/s/cpu)
uretprobe-nop (16 cpus): 2.007 ± 0.000M/s ( 0.125M/s/cpu)
uretprobe-nop (32 cpus): 1.624 ± 0.003M/s ( 0.051M/s/cpu)
uretprobe-nop (64 cpus): 2.149 ± 0.001M/s ( 0.034M/s/cpu)
SRCU CHANGES
============
uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu)
uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu)
uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu)
uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu)
uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu)
uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu)
uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu)
uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu)
uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu)
uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu)
uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu)
uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu)
uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu)
uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu)
Peak throughput increased from 3.7 mln/s (uprobe triggerings) up to about
8 mln/s. For uretprobes the improvement is a bit more modest, with a bump
from 2.4 mln/s to 5 mln/s.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
kernel/events/uprobes.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e9b755ddf960..8a464cf38127 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
#define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
DEFINE_STATIC_SRCU(uprobes_srcu);
@@ -634,8 +635,11 @@ static void put_uprobe(struct uprobe *uprobe)
write_lock(&uprobes_treelock);
- if (uprobe_is_active(uprobe))
+ if (uprobe_is_active(uprobe)) {
+ write_seqcount_begin(&uprobes_seqcount);
rb_erase(&uprobe->rb_node, &uprobes_tree);
+ write_seqcount_end(&uprobes_seqcount);
+ }
write_unlock(&uprobes_treelock);
@@ -701,14 +705,26 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
.offset = offset,
};
struct rb_node *node;
+ unsigned int seq;
lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
- read_lock(&uprobes_treelock);
- node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
- read_unlock(&uprobes_treelock);
+ do {
+ seq = read_seqcount_begin(&uprobes_seqcount);
+ node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
+ /*
+ * Lockless RB-tree lookups can result only in false negatives.
+ * If the element is found, it is correct and can be returned
+ * under RCU protection. If we find nothing, we need to
+ * validate that seqcount didn't change. If it did, we have to
+ * try again as we might have missed the element (false
+ * negative). If seqcount is unchanged, search truly failed.
+ */
+ if (node)
+ return __node_2_uprobe(node);
+ } while (read_seqcount_retry(&uprobes_seqcount, seq));
- return node ? __node_2_uprobe(node) : NULL;
+ return NULL;
}
/*
@@ -730,7 +746,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
{
struct rb_node *node;
again:
- node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
+ node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
if (node) {
struct uprobe *u = __node_2_uprobe(node);
@@ -755,7 +771,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
struct uprobe *u;
write_lock(&uprobes_treelock);
+ write_seqcount_begin(&uprobes_seqcount);
u = __insert_uprobe(uprobe);
+ write_seqcount_end(&uprobes_seqcount);
write_unlock(&uprobes_treelock);
return u;
--
2.43.5
* [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (6 preceding siblings ...)
2024-08-29 18:37 ` [PATCH v4 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
@ 2024-08-29 18:37 ` Andrii Nakryiko
2024-08-30 17:41 ` kernel test robot
2024-08-30 20:36 ` kernel test robot
2024-08-30 10:24 ` [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Oleg Nesterov
8 siblings, 2 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 18:37 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, Andrii Nakryiko
This patch switches uprobes' SRCU usage to the RCU Tasks Trace flavor, which
is optimized for lightweight and quick readers (at the expense of slower
writers, which is a fine tradeoff for uprobes) and has better performance
and scalability with the number of CPUs.
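The conversion itself is mostly mechanical. A minimal sketch of the new
read-side pattern (the actual changes are in the diff below):

	#include <linux/rcupdate_trace.h>

	rcu_read_lock_trace();
	/* ... look up uprobe, iterate consumers, run handlers ... */
	rcu_read_unlock_trace();

The update side waits for readers with synchronize_rcu_tasks_trace() or
defers freeing with call_rcu_tasks_trace(), accepting longer grace periods
in exchange for nearly free read-side critical sections.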
Similarly to baseline vs SRCU, we've benchmarked the SRCU-based
implementation vs the RCU Tasks Trace implementation.
SRCU
====
uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu)
uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu)
uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu)
uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu)
uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu)
uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu)
uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu)
uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu)
uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu)
uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu)
uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu)
uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu)
uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu)
uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu)
RCU Tasks Trace
===============
uprobe-nop ( 1 cpus): 3.396 ± 0.002M/s ( 3.396M/s/cpu)
uprobe-nop ( 2 cpus): 4.271 ± 0.006M/s ( 2.135M/s/cpu)
uprobe-nop ( 4 cpus): 8.499 ± 0.015M/s ( 2.125M/s/cpu)
uprobe-nop ( 8 cpus): 10.355 ± 0.028M/s ( 1.294M/s/cpu)
uprobe-nop (16 cpus): 7.615 ± 0.099M/s ( 0.476M/s/cpu)
uprobe-nop (32 cpus): 4.430 ± 0.007M/s ( 0.138M/s/cpu)
uprobe-nop (64 cpus): 6.887 ± 0.020M/s ( 0.108M/s/cpu)
uretprobe-nop ( 1 cpus): 2.174 ± 0.001M/s ( 2.174M/s/cpu)
uretprobe-nop ( 2 cpus): 2.853 ± 0.001M/s ( 1.426M/s/cpu)
uretprobe-nop ( 4 cpus): 4.913 ± 0.002M/s ( 1.228M/s/cpu)
uretprobe-nop ( 8 cpus): 5.883 ± 0.002M/s ( 0.735M/s/cpu)
uretprobe-nop (16 cpus): 5.147 ± 0.001M/s ( 0.322M/s/cpu)
uretprobe-nop (32 cpus): 3.738 ± 0.008M/s ( 0.117M/s/cpu)
uretprobe-nop (64 cpus): 4.397 ± 0.002M/s ( 0.069M/s/cpu)
Peak throughput for uprobes increases from 8 mln/s to 10.3 mln/s
(+28%!), and for uretprobes from 5.3 mln/s to 5.8 mln/s (+11%), as we
have more work to do on the uretprobe side.
Even single-thread (no contention) performance is slightly better: 3.276
mln/s to 3.396 mln/s (+3.5%) for uprobes, and 2.055 mln/s to 2.174 mln/s
(+5.8%) for uretprobes.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
kernel/events/uprobes.c | 37 +++++++++++++++----------------------
1 file changed, 15 insertions(+), 22 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8a464cf38127..a5d39cec53d5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -42,8 +42,6 @@ static struct rb_root uprobes_tree = RB_ROOT;
static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
-DEFINE_STATIC_SRCU(uprobes_srcu);
-
#define UPROBES_HASH_SZ 13
/* serialize uprobe->pending_list */
static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
@@ -652,7 +650,7 @@ static void put_uprobe(struct uprobe *uprobe)
delayed_uprobe_remove(uprobe, NULL);
mutex_unlock(&delayed_uprobe_lock);
- call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
+ call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
}
static __always_inline
@@ -707,7 +705,7 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
struct rb_node *node;
unsigned int seq;
- lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
+ lockdep_assert(rcu_read_lock_trace_held());
do {
seq = read_seqcount_begin(&uprobes_seqcount);
@@ -935,8 +933,7 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
bool ret = false;
down_read(&uprobe->consumer_rwsem);
- list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
- srcu_read_lock_held(&uprobes_srcu)) {
+ list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
ret = consumer_filter(uc, mm);
if (ret)
break;
@@ -1157,7 +1154,7 @@ void uprobe_unregister_sync(void)
* unlucky enough caller can free consumer's memory and cause
* handler_chain() or handle_uretprobe_chain() to do an use-after-free.
*/
- synchronize_srcu(&uprobes_srcu);
+ synchronize_rcu_tasks_trace();
}
EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
@@ -1241,19 +1238,18 @@ EXPORT_SYMBOL_GPL(uprobe_register);
int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
{
struct uprobe_consumer *con;
- int ret = -ENOENT, srcu_idx;
+ int ret = -ENOENT;
down_write(&uprobe->register_rwsem);
- srcu_idx = srcu_read_lock(&uprobes_srcu);
- list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
- srcu_read_lock_held(&uprobes_srcu)) {
+ rcu_read_lock_trace();
+ list_for_each_entry_rcu(con, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
if (con == uc) {
ret = register_for_each_vma(uprobe, add ? uc : NULL);
break;
}
}
- srcu_read_unlock(&uprobes_srcu, srcu_idx);
+ rcu_read_unlock_trace();
up_write(&uprobe->register_rwsem);
@@ -2123,8 +2119,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
current->utask->auprobe = &uprobe->arch;
- list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
- srcu_read_lock_held(&uprobes_srcu)) {
+ list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
int rc = 0;
if (uc->handler) {
@@ -2162,15 +2157,13 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
{
struct uprobe *uprobe = ri->uprobe;
struct uprobe_consumer *uc;
- int srcu_idx;
- srcu_idx = srcu_read_lock(&uprobes_srcu);
- list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
- srcu_read_lock_held(&uprobes_srcu)) {
+ rcu_read_lock_trace();
+ list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
if (uc->ret_handler)
uc->ret_handler(uc, ri->func, regs);
}
- srcu_read_unlock(&uprobes_srcu, srcu_idx);
+ rcu_read_unlock_trace();
}
static struct return_instance *find_next_ret_chain(struct return_instance *ri)
@@ -2255,13 +2248,13 @@ static void handle_swbp(struct pt_regs *regs)
{
struct uprobe *uprobe;
unsigned long bp_vaddr;
- int is_swbp, srcu_idx;
+ int is_swbp;
bp_vaddr = uprobe_get_swbp_addr(regs);
if (bp_vaddr == uprobe_get_trampoline_vaddr())
return uprobe_handle_trampoline(regs);
- srcu_idx = srcu_read_lock(&uprobes_srcu);
+ rcu_read_lock_trace();
uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
if (!uprobe) {
@@ -2319,7 +2312,7 @@ static void handle_swbp(struct pt_regs *regs)
out:
/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
- srcu_read_unlock(&uprobes_srcu, srcu_idx);
+ rcu_read_unlock_trace();
}
/*
--
2.43.5
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-29 18:37 ` [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
@ 2024-08-29 23:09 ` Jiri Olsa
2024-08-29 23:31 ` Andrii Nakryiko
0 siblings, 1 reply; 32+ messages in thread
From: Jiri Olsa @ 2024-08-29 23:09 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, paulmck, willy, surenb, akpm, linux-mm
On Thu, Aug 29, 2024 at 11:37:37AM -0700, Andrii Nakryiko wrote:
> uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> uprobes, so we need to get rid of it to improve uprobe performance and
> multi-CPU scalability.
>
> First, we turn uprobe's consumer list to a typical doubly-linked list
> and utilize existing RCU-aware helpers for traversing such lists, as
> well as adding and removing elements from it.
>
> For entry uprobes we already have SRCU protection active since before
> uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> won't go away from under us, but we add SRCU protection around consumer
> list traversal.
>
> Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple,
> we remember whether any removal was requested during handler calls, but
> then we double-check the decision under a proper register_rwsem using
> consumers' filter callbacks. Handler removal is very rare, so this extra
> lock won't hurt performance, overall, but we also avoid the need for any
> extra protection (e.g., seqcount locks).
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> include/linux/uprobes.h | 2 +-
> kernel/events/uprobes.c | 104 +++++++++++++++++++++++-----------------
> 2 files changed, 62 insertions(+), 44 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 9cf0dce62e4c..29c935b0d504 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -35,7 +35,7 @@ struct uprobe_consumer {
> struct pt_regs *regs);
> bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
>
> - struct uprobe_consumer *next;
> + struct list_head cons_node;
> };
>
> #ifdef CONFIG_UPROBES
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 8bdcdc6901b2..97e58d160647 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -59,7 +59,7 @@ struct uprobe {
> struct rw_semaphore register_rwsem;
> struct rw_semaphore consumer_rwsem;
> struct list_head pending_list;
> - struct uprobe_consumer *consumers;
> + struct list_head consumers;
> struct inode *inode; /* Also hold a ref to inode */
> struct rcu_head rcu;
> loff_t offset;
> @@ -783,6 +783,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> uprobe->inode = inode;
> uprobe->offset = offset;
> uprobe->ref_ctr_offset = ref_ctr_offset;
> + INIT_LIST_HEAD(&uprobe->consumers);
> init_rwsem(&uprobe->register_rwsem);
> init_rwsem(&uprobe->consumer_rwsem);
> RB_CLEAR_NODE(&uprobe->rb_node);
> @@ -808,32 +809,19 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
> {
> down_write(&uprobe->consumer_rwsem);
> - uc->next = uprobe->consumers;
> - uprobe->consumers = uc;
> + list_add_rcu(&uc->cons_node, &uprobe->consumers);
> up_write(&uprobe->consumer_rwsem);
> }
>
> /*
> * For uprobe @uprobe, delete the consumer @uc.
> - * Return true if the @uc is deleted successfully
> - * or return false.
> + * Should never be called with consumer that's not part of @uprobe->consumers.
> */
> -static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
> +static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
> {
> - struct uprobe_consumer **con;
> - bool ret = false;
> -
> down_write(&uprobe->consumer_rwsem);
> - for (con = &uprobe->consumers; *con; con = &(*con)->next) {
> - if (*con == uc) {
> - *con = uc->next;
> - ret = true;
> - break;
> - }
> - }
> + list_del_rcu(&uc->cons_node);
> up_write(&uprobe->consumer_rwsem);
> -
> - return ret;
> }
>
> static int __copy_insn(struct address_space *mapping, struct file *filp,
> @@ -929,7 +917,8 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
> bool ret = false;
>
> down_read(&uprobe->consumer_rwsem);
> - for (uc = uprobe->consumers; uc; uc = uc->next) {
> + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> + srcu_read_lock_held(&uprobes_srcu)) {
> ret = consumer_filter(uc, mm);
> if (ret)
> break;
> @@ -1125,18 +1114,29 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> int err;
>
> down_write(&uprobe->register_rwsem);
> - if (WARN_ON(!consumer_del(uprobe, uc))) {
> - err = -ENOENT;
> - } else {
> - err = register_for_each_vma(uprobe, NULL);
> - /* TODO : cant unregister? schedule a worker thread */
> - if (unlikely(err))
> - uprobe_warn(current, "unregister, leaking uprobe");
> - }
> + consumer_del(uprobe, uc);
> + err = register_for_each_vma(uprobe, NULL);
> up_write(&uprobe->register_rwsem);
>
> - if (!err)
> - put_uprobe(uprobe);
> + /* TODO : cant unregister? schedule a worker thread */
> + if (unlikely(err)) {
> + uprobe_warn(current, "unregister, leaking uprobe");
> + goto out_sync;
> + }
> +
> + put_uprobe(uprobe);
> +
> +out_sync:
> + /*
> + * Now that handler_chain() and handle_uretprobe_chain() iterate over
> + * uprobe->consumers list under RCU protection without holding
> + * uprobe->register_rwsem, we need to wait for RCU grace period to
> + * make sure that we can't call into just unregistered
> + * uprobe_consumer's callbacks anymore. If we don't do that, fast and
> + * unlucky enough caller can free consumer's memory and cause
> + * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
> + */
> + synchronize_srcu(&uprobes_srcu);
> }
> EXPORT_SYMBOL_GPL(uprobe_unregister);
>
> @@ -1214,13 +1214,20 @@ EXPORT_SYMBOL_GPL(uprobe_register);
> int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
> {
> struct uprobe_consumer *con;
> - int ret = -ENOENT;
> + int ret = -ENOENT, srcu_idx;
>
> down_write(&uprobe->register_rwsem);
> - for (con = uprobe->consumers; con && con != uc ; con = con->next)
> - ;
> - if (con)
> - ret = register_for_each_vma(uprobe, add ? uc : NULL);
> +
> + srcu_idx = srcu_read_lock(&uprobes_srcu);
> + list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
> + srcu_read_lock_held(&uprobes_srcu)) {
> + if (con == uc) {
> + ret = register_for_each_vma(uprobe, add ? uc : NULL);
> + break;
> + }
> + }
> + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> +
> up_write(&uprobe->register_rwsem);
>
> return ret;
> @@ -2085,10 +2092,12 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> struct uprobe_consumer *uc;
> int remove = UPROBE_HANDLER_REMOVE;
> bool need_prep = false; /* prepare return uprobe, when needed */
> + bool has_consumers = false;
>
> - down_read(&uprobe->register_rwsem);
> current->utask->auprobe = &uprobe->arch;
> - for (uc = uprobe->consumers; uc; uc = uc->next) {
> +
> + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> + srcu_read_lock_held(&uprobes_srcu)) {
> int rc = 0;
>
> if (uc->handler) {
> @@ -2101,17 +2110,24 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> need_prep = true;
>
> remove &= rc;
> + has_consumers = true;
> }
> current->utask->auprobe = NULL;
>
> if (need_prep && !remove)
> prepare_uretprobe(uprobe, regs); /* put bp at return */
>
> - if (remove && uprobe->consumers) {
> - WARN_ON(!uprobe_is_active(uprobe));
> - unapply_uprobe(uprobe, current->mm);
> + if (remove && has_consumers) {
> + down_read(&uprobe->register_rwsem);
> +
> + /* re-check that removal is still required, this time under lock */
> + if (!filter_chain(uprobe, current->mm)) {
sorry for the late question, but I do not follow this change..
at this point we got 1 as the handler's return value from all of the uprobe's
consumers, so why do we need to call filter_chain() here.. IIUC this will
likely skip over the removal?
with a single uprobe_multi consumer:
handler_chain
uprobe_multi_link_handler
uprobe_prog_run
bpf_prog returns 1
remove = 1
if (remove && has_consumers) {
filter_chain - uprobe_multi_link_filter returns true.. so the uprobe stays?
maybe I just need to write a test for it ;-)
thanks,
jirka
> + WARN_ON(!uprobe_is_active(uprobe));
> + unapply_uprobe(uprobe, current->mm);
> + }
> +
> + up_read(&uprobe->register_rwsem);
> }
> - up_read(&uprobe->register_rwsem);
> }
>
> static void
> @@ -2119,13 +2135,15 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
> {
> struct uprobe *uprobe = ri->uprobe;
> struct uprobe_consumer *uc;
> + int srcu_idx;
>
> - down_read(&uprobe->register_rwsem);
> - for (uc = uprobe->consumers; uc; uc = uc->next) {
> + srcu_idx = srcu_read_lock(&uprobes_srcu);
> + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> + srcu_read_lock_held(&uprobes_srcu)) {
> if (uc->ret_handler)
> uc->ret_handler(uc, ri->func, regs);
> }
> - up_read(&uprobe->register_rwsem);
> + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> }
>
> static struct return_instance *find_next_ret_chain(struct return_instance *ri)
> --
> 2.43.5
>
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-29 23:09 ` Jiri Olsa
@ 2024-08-29 23:31 ` Andrii Nakryiko
2024-08-30 13:45 ` Jiri Olsa
2024-08-30 14:18 ` Oleg Nesterov
0 siblings, 2 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-29 23:31 UTC (permalink / raw)
To: Jiri Olsa
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On Thu, Aug 29, 2024 at 4:10 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Aug 29, 2024 at 11:37:37AM -0700, Andrii Nakryiko wrote:
> > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > uprobes, so we need to get rid of it to improve uprobe performance and
> > multi-CPU scalability.
> >
> > First, we turn uprobe's consumer list to a typical doubly-linked list
> > and utilize existing RCU-aware helpers for traversing such lists, as
> > well as adding and removing elements from it.
> >
> > For entry uprobes we already have SRCU protection active since before
> > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > won't go away from under us, but we add SRCU protection around consumer
> > list traversal.
> >
> > Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple,
> > we remember whether any removal was requested during handler calls, but
> > then we double-check the decision under a proper register_rwsem using
> > consumers' filter callbacks. Handler removal is very rare, so this extra
> > lock won't hurt performance, overall, but we also avoid the need for any
> > extra protection (e.g., seqcount locks).
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> > include/linux/uprobes.h | 2 +-
> > kernel/events/uprobes.c | 104 +++++++++++++++++++++++-----------------
> > 2 files changed, 62 insertions(+), 44 deletions(-)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 9cf0dce62e4c..29c935b0d504 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -35,7 +35,7 @@ struct uprobe_consumer {
> > struct pt_regs *regs);
> > bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
> >
> > - struct uprobe_consumer *next;
> > + struct list_head cons_node;
> > };
> >
> > #ifdef CONFIG_UPROBES
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 8bdcdc6901b2..97e58d160647 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -59,7 +59,7 @@ struct uprobe {
> > struct rw_semaphore register_rwsem;
> > struct rw_semaphore consumer_rwsem;
> > struct list_head pending_list;
> > - struct uprobe_consumer *consumers;
> > + struct list_head consumers;
> > struct inode *inode; /* Also hold a ref to inode */
> > struct rcu_head rcu;
> > loff_t offset;
> > @@ -783,6 +783,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> > uprobe->inode = inode;
> > uprobe->offset = offset;
> > uprobe->ref_ctr_offset = ref_ctr_offset;
> > + INIT_LIST_HEAD(&uprobe->consumers);
> > init_rwsem(&uprobe->register_rwsem);
> > init_rwsem(&uprobe->consumer_rwsem);
> > RB_CLEAR_NODE(&uprobe->rb_node);
> > @@ -808,32 +809,19 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> > static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > {
> > down_write(&uprobe->consumer_rwsem);
> > - uc->next = uprobe->consumers;
> > - uprobe->consumers = uc;
> > + list_add_rcu(&uc->cons_node, &uprobe->consumers);
> > up_write(&uprobe->consumer_rwsem);
> > }
> >
> > /*
> > * For uprobe @uprobe, delete the consumer @uc.
> > - * Return true if the @uc is deleted successfully
> > - * or return false.
> > + * Should never be called with consumer that's not part of @uprobe->consumers.
> > */
> > -static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > +static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > {
> > - struct uprobe_consumer **con;
> > - bool ret = false;
> > -
> > down_write(&uprobe->consumer_rwsem);
> > - for (con = &uprobe->consumers; *con; con = &(*con)->next) {
> > - if (*con == uc) {
> > - *con = uc->next;
> > - ret = true;
> > - break;
> > - }
> > - }
> > + list_del_rcu(&uc->cons_node);
> > up_write(&uprobe->consumer_rwsem);
> > -
> > - return ret;
> > }
> >
> > static int __copy_insn(struct address_space *mapping, struct file *filp,
> > @@ -929,7 +917,8 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
> > bool ret = false;
> >
> > down_read(&uprobe->consumer_rwsem);
> > - for (uc = uprobe->consumers; uc; uc = uc->next) {
> > + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> > + srcu_read_lock_held(&uprobes_srcu)) {
> > ret = consumer_filter(uc, mm);
> > if (ret)
> > break;
> > @@ -1125,18 +1114,29 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > int err;
> >
> > down_write(&uprobe->register_rwsem);
> > - if (WARN_ON(!consumer_del(uprobe, uc))) {
> > - err = -ENOENT;
> > - } else {
> > - err = register_for_each_vma(uprobe, NULL);
> > - /* TODO : cant unregister? schedule a worker thread */
> > - if (unlikely(err))
> > - uprobe_warn(current, "unregister, leaking uprobe");
> > - }
> > + consumer_del(uprobe, uc);
> > + err = register_for_each_vma(uprobe, NULL);
> > up_write(&uprobe->register_rwsem);
> >
> > - if (!err)
> > - put_uprobe(uprobe);
> > + /* TODO : cant unregister? schedule a worker thread */
> > + if (unlikely(err)) {
> > + uprobe_warn(current, "unregister, leaking uprobe");
> > + goto out_sync;
> > + }
> > +
> > + put_uprobe(uprobe);
> > +
> > +out_sync:
> > + /*
> > + * Now that handler_chain() and handle_uretprobe_chain() iterate over
> > + * uprobe->consumers list under RCU protection without holding
> > + * uprobe->register_rwsem, we need to wait for RCU grace period to
> > + * make sure that we can't call into just unregistered
> > + * uprobe_consumer's callbacks anymore. If we don't do that, fast and
> > + * unlucky enough caller can free consumer's memory and cause
> > + * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
> > + */
> > + synchronize_srcu(&uprobes_srcu);
> > }
> > EXPORT_SYMBOL_GPL(uprobe_unregister);
> >
> > @@ -1214,13 +1214,20 @@ EXPORT_SYMBOL_GPL(uprobe_register);
> > int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
> > {
> > struct uprobe_consumer *con;
> > - int ret = -ENOENT;
> > + int ret = -ENOENT, srcu_idx;
> >
> > down_write(&uprobe->register_rwsem);
> > - for (con = uprobe->consumers; con && con != uc ; con = con->next)
> > - ;
> > - if (con)
> > - ret = register_for_each_vma(uprobe, add ? uc : NULL);
> > +
> > + srcu_idx = srcu_read_lock(&uprobes_srcu);
> > + list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
> > + srcu_read_lock_held(&uprobes_srcu)) {
> > + if (con == uc) {
> > + ret = register_for_each_vma(uprobe, add ? uc : NULL);
> > + break;
> > + }
> > + }
> > + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> > +
> > up_write(&uprobe->register_rwsem);
> >
> > return ret;
> > @@ -2085,10 +2092,12 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> > struct uprobe_consumer *uc;
> > int remove = UPROBE_HANDLER_REMOVE;
> > bool need_prep = false; /* prepare return uprobe, when needed */
> > + bool has_consumers = false;
> >
> > - down_read(&uprobe->register_rwsem);
> > current->utask->auprobe = &uprobe->arch;
> > - for (uc = uprobe->consumers; uc; uc = uc->next) {
> > +
> > + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> > + srcu_read_lock_held(&uprobes_srcu)) {
> > int rc = 0;
> >
> > if (uc->handler) {
> > @@ -2101,17 +2110,24 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> > need_prep = true;
> >
> > remove &= rc;
> > + has_consumers = true;
> > }
> > current->utask->auprobe = NULL;
> >
> > if (need_prep && !remove)
> > prepare_uretprobe(uprobe, regs); /* put bp at return */
> >
> > - if (remove && uprobe->consumers) {
> > - WARN_ON(!uprobe_is_active(uprobe));
> > - unapply_uprobe(uprobe, current->mm);
> > + if (remove && has_consumers) {
> > + down_read(&uprobe->register_rwsem);
> > +
> > + /* re-check that removal is still required, this time under lock */
> > + if (!filter_chain(uprobe, current->mm)) {
>
> sorry for late question, but I do not follow this change..
>
> at this point we got 1 as handler's return value from all the uprobe's consumers,
> why do we need to call filter_chain in here.. IIUC this will likely skip over
> the removal?
>
Because we don't hold register_rwsem, we are now racing with
registration. So while every consumer may have requested deletion at the
time we were iterating over the consumer list, a parallel CPU can add
another consumer that needs this uprobe+PID combination. If we don't
double-check, we risk ending up with a consumer that will not be
triggered for the desired process.
Does it make sense? Given removal is rare, it's OK to take the lock if we
*suspect* removal, and then check authoritatively again under the lock.
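To make the race concrete, a rough interleaving sketch (pseudo-code, not
the exact function bodies):

  CPU 0 (handler_chain, lockless)     CPU 1 (uprobe_register)
  -------------------------------     -----------------------
  iterate consumers under SRCU;
  every handler returns
  UPROBE_HANDLER_REMOVE
                                      down_write(&uprobe->register_rwsem);
                                      consumer_add(uprobe, new_uc);
                                      up_write(&uprobe->register_rwsem);
  down_read(&uprobe->register_rwsem);
  filter_chain() now sees new_uc and
  returns true, so unapply_uprobe()
  is skipped and new_uc keeps its
  breakpoint
  up_read(&uprobe->register_rwsem);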
> with single uprobe_multi consumer:
>
> handler_chain
> uprobe_multi_link_handler
> uprobe_prog_run
> bpf_prog returns 1
>
> remove = 1
>
> if (remove && has_consumers) {
>
> filter_chain - uprobe_multi_link_filter returns true.. so the uprobe stays?
>
> maybe I just need to write test for it ;-)
>
> thanks,
> jirka
>
>
> > + WARN_ON(!uprobe_is_active(uprobe));
> > + unapply_uprobe(uprobe, current->mm);
> > + }
> > +
> > + up_read(&uprobe->register_rwsem);
> > }
> > - up_read(&uprobe->register_rwsem);
> > }
> >
> > static void
> > @@ -2119,13 +2135,15 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
> > {
> > struct uprobe *uprobe = ri->uprobe;
> > struct uprobe_consumer *uc;
> > + int srcu_idx;
> >
> > - down_read(&uprobe->register_rwsem);
> > - for (uc = uprobe->consumers; uc; uc = uc->next) {
> > + srcu_idx = srcu_read_lock(&uprobes_srcu);
> > + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> > + srcu_read_lock_held(&uprobes_srcu)) {
> > if (uc->ret_handler)
> > uc->ret_handler(uc, ri->func, regs);
> > }
> > - up_read(&uprobe->register_rwsem);
> > + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> > }
> >
> > static struct return_instance *find_next_ret_chain(struct return_instance *ri)
> > --
> > 2.43.5
> >
* Re: [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (7 preceding siblings ...)
2024-08-29 18:37 ` [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
@ 2024-08-30 10:24 ` Oleg Nesterov
2024-09-03 13:21 ` Peter Zijlstra
8 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2024-08-30 10:24 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck, willy, surenb, akpm, linux-mm
On 08/29, Andrii Nakryiko wrote:
>
> v3->v4:
> - added back consumer_rwsem into consumer_del(), which was accidentally
> omitted earlier (Jiri);
still,
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-29 23:31 ` Andrii Nakryiko
@ 2024-08-30 13:45 ` Jiri Olsa
2024-08-30 14:31 ` Oleg Nesterov
2024-08-31 17:25 ` Oleg Nesterov
2024-08-30 14:18 ` Oleg Nesterov
1 sibling, 2 replies; 32+ messages in thread
From: Jiri Olsa @ 2024-08-30 13:45 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, oleg,
rostedt, mhiramat, bpf, linux-kernel, paulmck, willy, surenb,
akpm, linux-mm
On Thu, Aug 29, 2024 at 04:31:18PM -0700, Andrii Nakryiko wrote:
> On Thu, Aug 29, 2024 at 4:10 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Thu, Aug 29, 2024 at 11:37:37AM -0700, Andrii Nakryiko wrote:
> > > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > > uprobes, so we need to get rid of it to improve uprobe performance and
> > > multi-CPU scalability.
> > >
> > > First, we turn uprobe's consumer list to a typical doubly-linked list
> > > and utilize existing RCU-aware helpers for traversing such lists, as
> > > well as adding and removing elements from it.
> > >
> > > For entry uprobes we already have SRCU protection active since before
> > > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > > won't go away from under us, but we add SRCU protection around consumer
> > > list traversal.
> > >
> > > Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple,
> > > we remember whether any removal was requested during handler calls, but
> > > then we double-check the decision under a proper register_rwsem using
> > > consumers' filter callbacks. Handler removal is very rare, so this extra
> > > lock won't hurt performance, overall, but we also avoid the need for any
> > > extra protection (e.g., seqcount locks).
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > > include/linux/uprobes.h | 2 +-
> > > kernel/events/uprobes.c | 104 +++++++++++++++++++++++-----------------
> > > 2 files changed, 62 insertions(+), 44 deletions(-)
> > >
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index 9cf0dce62e4c..29c935b0d504 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -35,7 +35,7 @@ struct uprobe_consumer {
> > > struct pt_regs *regs);
> > > bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
> > >
> > > - struct uprobe_consumer *next;
> > > + struct list_head cons_node;
> > > };
> > >
> > > #ifdef CONFIG_UPROBES
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 8bdcdc6901b2..97e58d160647 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -59,7 +59,7 @@ struct uprobe {
> > > struct rw_semaphore register_rwsem;
> > > struct rw_semaphore consumer_rwsem;
> > > struct list_head pending_list;
> > > - struct uprobe_consumer *consumers;
> > > + struct list_head consumers;
> > > struct inode *inode; /* Also hold a ref to inode */
> > > struct rcu_head rcu;
> > > loff_t offset;
> > > @@ -783,6 +783,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> > > uprobe->inode = inode;
> > > uprobe->offset = offset;
> > > uprobe->ref_ctr_offset = ref_ctr_offset;
> > > + INIT_LIST_HEAD(&uprobe->consumers);
> > > init_rwsem(&uprobe->register_rwsem);
> > > init_rwsem(&uprobe->consumer_rwsem);
> > > RB_CLEAR_NODE(&uprobe->rb_node);
> > > @@ -808,32 +809,19 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> > > static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > > {
> > > down_write(&uprobe->consumer_rwsem);
> > > - uc->next = uprobe->consumers;
> > > - uprobe->consumers = uc;
> > > + list_add_rcu(&uc->cons_node, &uprobe->consumers);
> > > up_write(&uprobe->consumer_rwsem);
> > > }
> > >
> > > /*
> > > * For uprobe @uprobe, delete the consumer @uc.
> > > - * Return true if the @uc is deleted successfully
> > > - * or return false.
> > > + * Should never be called with consumer that's not part of @uprobe->consumers.
> > > */
> > > -static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > > +static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > > {
> > > - struct uprobe_consumer **con;
> > > - bool ret = false;
> > > -
> > > down_write(&uprobe->consumer_rwsem);
> > > - for (con = &uprobe->consumers; *con; con = &(*con)->next) {
> > > - if (*con == uc) {
> > > - *con = uc->next;
> > > - ret = true;
> > > - break;
> > > - }
> > > - }
> > > + list_del_rcu(&uc->cons_node);
> > > up_write(&uprobe->consumer_rwsem);
> > > -
> > > - return ret;
> > > }
> > >
> > > static int __copy_insn(struct address_space *mapping, struct file *filp,
> > > @@ -929,7 +917,8 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
> > > bool ret = false;
> > >
> > > down_read(&uprobe->consumer_rwsem);
> > > - for (uc = uprobe->consumers; uc; uc = uc->next) {
> > > + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> > > + srcu_read_lock_held(&uprobes_srcu)) {
> > > ret = consumer_filter(uc, mm);
> > > if (ret)
> > > break;
> > > @@ -1125,18 +1114,29 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > > int err;
> > >
> > > down_write(&uprobe->register_rwsem);
> > > - if (WARN_ON(!consumer_del(uprobe, uc))) {
> > > - err = -ENOENT;
> > > - } else {
> > > - err = register_for_each_vma(uprobe, NULL);
> > > - /* TODO : cant unregister? schedule a worker thread */
> > > - if (unlikely(err))
> > > - uprobe_warn(current, "unregister, leaking uprobe");
> > > - }
> > > + consumer_del(uprobe, uc);
> > > + err = register_for_each_vma(uprobe, NULL);
> > > up_write(&uprobe->register_rwsem);
> > >
> > > - if (!err)
> > > - put_uprobe(uprobe);
> > > + /* TODO : cant unregister? schedule a worker thread */
> > > + if (unlikely(err)) {
> > > + uprobe_warn(current, "unregister, leaking uprobe");
> > > + goto out_sync;
> > > + }
> > > +
> > > + put_uprobe(uprobe);
> > > +
> > > +out_sync:
> > > + /*
> > > + * Now that handler_chain() and handle_uretprobe_chain() iterate over
> > > + * uprobe->consumers list under RCU protection without holding
> > > + * uprobe->register_rwsem, we need to wait for RCU grace period to
> > > + * make sure that we can't call into just unregistered
> > > + * uprobe_consumer's callbacks anymore. If we don't do that, fast and
> > > + * unlucky enough caller can free consumer's memory and cause
> > > + * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
> > > + */
> > > + synchronize_srcu(&uprobes_srcu);
> > > }
> > > EXPORT_SYMBOL_GPL(uprobe_unregister);
> > >
> > > @@ -1214,13 +1214,20 @@ EXPORT_SYMBOL_GPL(uprobe_register);
> > > int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
> > > {
> > > struct uprobe_consumer *con;
> > > - int ret = -ENOENT;
> > > + int ret = -ENOENT, srcu_idx;
> > >
> > > down_write(&uprobe->register_rwsem);
> > > - for (con = uprobe->consumers; con && con != uc ; con = con->next)
> > > - ;
> > > - if (con)
> > > - ret = register_for_each_vma(uprobe, add ? uc : NULL);
> > > +
> > > + srcu_idx = srcu_read_lock(&uprobes_srcu);
> > > + list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
> > > + srcu_read_lock_held(&uprobes_srcu)) {
> > > + if (con == uc) {
> > > + ret = register_for_each_vma(uprobe, add ? uc : NULL);
> > > + break;
> > > + }
> > > + }
> > > + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> > > +
> > > up_write(&uprobe->register_rwsem);
> > >
> > > return ret;
> > > @@ -2085,10 +2092,12 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> > > struct uprobe_consumer *uc;
> > > int remove = UPROBE_HANDLER_REMOVE;
> > > bool need_prep = false; /* prepare return uprobe, when needed */
> > > + bool has_consumers = false;
> > >
> > > - down_read(&uprobe->register_rwsem);
> > > current->utask->auprobe = &uprobe->arch;
> > > - for (uc = uprobe->consumers; uc; uc = uc->next) {
> > > +
> > > + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> > > + srcu_read_lock_held(&uprobes_srcu)) {
> > > int rc = 0;
> > >
> > > if (uc->handler) {
> > > @@ -2101,17 +2110,24 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> > > need_prep = true;
> > >
> > > remove &= rc;
> > > + has_consumers = true;
> > > }
> > > current->utask->auprobe = NULL;
> > >
> > > if (need_prep && !remove)
> > > prepare_uretprobe(uprobe, regs); /* put bp at return */
> > >
> > > - if (remove && uprobe->consumers) {
> > > - WARN_ON(!uprobe_is_active(uprobe));
> > > - unapply_uprobe(uprobe, current->mm);
> > > + if (remove && has_consumers) {
> > > + down_read(&uprobe->register_rwsem);
> > > +
> > > + /* re-check that removal is still required, this time under lock */
> > > + if (!filter_chain(uprobe, current->mm)) {
> >
> > sorry for late question, but I do not follow this change..
> >
> > at this point we got 1 as handler's return value from all the uprobe's consumers,
> > why do we need to call filter_chain in here.. IIUC this will likely skip over
> > the removal?
> >
>
> Because we don't hold register_rwsem we are now racing with
> registration. So while we can get all consumers at the time we were
> iterating over the consumer list to request deletion, a parallel CPU
> can add another consumer that needs this uprobe+PID combination. So if
> we don't double-check, we are risking having a consumer that will not
> be triggered for the desired process.
>
> Does it make sense? Given removal is rare, it's ok to take lock if we
> *suspect* removal, and then check authoritatively again under lock.
with this change the probe will not get removed in the attached test;
it'll get 2 hits, without this change just 1 hit
but I'm not sure it's a big problem, because it seems like that's not the
intended way the removal should be used anyway, as explained by Oleg [1]
jirka
[1] https://lore.kernel.org/linux-trace-kernel/ZtHKTtn7sqaLeVxV@krava/T/#m07cdc37307cfd06f17f5755a067c9b300a19ee78
---
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
index bf6ca8e3eb13..86d37a8e6169 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_multi_test.c
@@ -6,6 +6,7 @@
#include "uprobe_multi.skel.h"
#include "uprobe_multi_bench.skel.h"
#include "uprobe_multi_usdt.skel.h"
+#include "uprobe_multi_removal.skel.h"
#include "bpf/libbpf_internal.h"
#include "testing_helpers.h"
#include "../sdt.h"
@@ -687,6 +688,28 @@ static void test_bench_attach_usdt(void)
printf("%s: detached in %7.3lfs\n", __func__, detach_delta);
}
+static void test_removal(void)
+{
+ struct uprobe_multi_removal *skel = NULL;
+ int err;
+
+ skel = uprobe_multi_removal__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_multi_removal__open_and_load"))
+ return;
+
+ err = uprobe_multi_removal__attach(skel);
+ if (!ASSERT_OK(err, "uprobe_multi_removal__attach"))
+ goto cleanup;
+
+ uprobe_multi_func_1();
+ uprobe_multi_func_1();
+
+ ASSERT_EQ(skel->bss->test, 1, "test");
+
+cleanup:
+ uprobe_multi_removal__destroy(skel);
+}
+
void test_uprobe_multi_test(void)
{
if (test__start_subtest("skel_api"))
@@ -703,4 +726,6 @@ void test_uprobe_multi_test(void)
test_bench_attach_usdt();
if (test__start_subtest("attach_api_fails"))
test_attach_api_fails();
+ if (test__start_subtest("removal"))
+ test_removal();
}
diff --git a/tools/testing/selftests/bpf/progs/uprobe_multi_removal.c b/tools/testing/selftests/bpf/progs/uprobe_multi_removal.c
new file mode 100644
index 000000000000..0a948cc1e05b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uprobe_multi_removal.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/usdt.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+int test;
+
+SEC("uprobe.multi//proc/self/exe:uprobe_multi_func_1")
+int uprobe(struct pt_regs *ctx)
+{
+ test++;
+ return 1;
+}
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-29 23:31 ` Andrii Nakryiko
2024-08-30 13:45 ` Jiri Olsa
@ 2024-08-30 14:18 ` Oleg Nesterov
1 sibling, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2024-08-30 14:18 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On 08/29, Andrii Nakryiko wrote:
>
> On Thu, Aug 29, 2024 at 4:10 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > > @@ -2101,17 +2110,24 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> > > need_prep = true;
> > >
> > > remove &= rc;
> > > + has_consumers = true;
> > > }
> > > current->utask->auprobe = NULL;
> > >
> > > if (need_prep && !remove)
> > > prepare_uretprobe(uprobe, regs); /* put bp at return */
> > >
> > > - if (remove && uprobe->consumers) {
> > > - WARN_ON(!uprobe_is_active(uprobe));
> > > - unapply_uprobe(uprobe, current->mm);
> > > + if (remove && has_consumers) {
> > > + down_read(&uprobe->register_rwsem);
> > > +
> > > + /* re-check that removal is still required, this time under lock */
> > > + if (!filter_chain(uprobe, current->mm)) {
> >
> > sorry for late question, but I do not follow this change..
> >
> > at this point we got 1 as handler's return value from all the uprobe's consumers,
> > why do we need to call filter_chain in here.. IIUC this will likely skip over
> > the removal?
> >
>
> Because we don't hold register_rwsem we are now racing with
> registration. So while we can get all consumers at the time we were
> iterating over the consumer list to request deletion, a parallel CPU
> can add another consumer that needs this uprobe+PID combination. So if
> we don't double-check, we are risking having a consumer that will not
> be triggered for the desired process.
Oh, yes, but this logic is wrong in that it assumes that uc->filter != NULL.
At least it adds a noticeable change in behaviour.
Suppose we have a single consumer UC with ->filter == NULL. Now suppose
that UC->handler() returns UPROBE_HANDLER_REMOVE.
Before this patch handler_chain() calls unapply_uprobe(), and I think
we should keep this behaviour.
After this patch unapply_uprobe() won't be called: consumer_filter(UC)
returns true, UC->filter == NULL means "probe everything". But I think
that UPROBE_HANDLER_REMOVE must be respected in this case anyway.
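(For reference, consumer_filter() in this series boils down to roughly the
following -- reconstructed from the patches, so treat it as approximate:

	static bool consumer_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
	{
		return !uc->filter || uc->filter(uc, mm);
	}

i.e. a NULL ->filter always reads as "interested", which is why filter_chain()
returns true and the removal is skipped.)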
Thanks Jiri, I missed that too :/
Oleg.
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-30 13:45 ` Jiri Olsa
@ 2024-08-30 14:31 ` Oleg Nesterov
2024-08-30 15:44 ` Andrii Nakryiko
2024-08-31 17:25 ` Oleg Nesterov
1 sibling, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2024-08-30 14:31 UTC (permalink / raw)
To: Jiri Olsa
Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, peterz,
rostedt, mhiramat, bpf, linux-kernel, paulmck, willy, surenb,
akpm, linux-mm
On 08/30, Jiri Olsa wrote:
>
> with this change the probe will not get removed in the attached test,
> it'll get 2 hits, without this change just 1 hit
I don't understand the code in tools/...bpf../ at all, can't comment,
> but I'm not sure it's a big problem, because seems like that's not the
> intended way the removal should be used anyway, as explained by Oleg [1]
It seems that I confused you again ;)
No, I think you found a problem. UPROBE_HANDLER_REMOVE can be lost if
uc->filter == NULL or if it returns true. See the other reply I sent a
minute ago.
I think the fix is simple, plus we need to cleanup this logic anyway,
I'll try to send some code on Monday.
Oleg.
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-30 14:31 ` Oleg Nesterov
@ 2024-08-30 15:44 ` Andrii Nakryiko
2024-08-30 20:20 ` Oleg Nesterov
0 siblings, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-30 15:44 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On Fri, Aug 30, 2024 at 7:33 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/30, Jiri Olsa wrote:
> >
> > with this change the probe will not get removed in the attached test,
> > it'll get 2 hits, without this change just 1 hit
>
> I don't understand the code in tools/...bpf../ at all, can't comment,
>
> > but I'm not sure it's a big problem, because seems like that's not the
> > intended way the removal should be used anyway, as explained by Oleg [1]
>
> It seems that I confused you again ;)
>
> No, I think you found a problem. UPROBE_HANDLER_REMOVE can be lost if
> uc->filter == NULL of if it returns true. See another reply I sent a
> minute ago.
>
For better or worse, I think there is (or has to be) an implicit
contract that if a uprobe (or uretprobe, for that matter, but
that's a separate issue) handler can return UPROBE_HANDLER_REMOVE,
then it *has to* also provide a filter. If it doesn't provide a filter
callback, it doesn't care about PID filtering and thus can't and
shouldn't cause unregistration.
In an ideal world, we wouldn't need the handler to do the filtering;
instead, generic uprobe/uretprobe code would just call uc->filter to
know whether to trigger the consumer or not. Unfortunately, that's a lot
of overhead due to the indirect function call, especially with retpolines
and stuff like that.
So I think it's reasonable to have an (implicit, yeah...) contract
that whoever cares about UPROBE_HANDLER_REMOVE has to provide a filter;
they go together.
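A consumer honoring that contract would look roughly like this (hypothetical
sketch; my_target_mm and should_still_trace() are made up for illustration):

	static bool my_filter(struct uprobe_consumer *self, struct mm_struct *mm)
	{
		/* still interested in this process? */
		return mm == my_target_mm;
	}

	static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
	{
		if (!should_still_trace())
			return UPROBE_HANDLER_REMOVE;	/* safe only because ->filter exists */
		return 0;
	}

	static struct uprobe_consumer my_consumer = {
		.handler = my_handler,
		.filter  = my_filter,
	};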
Jiri, the fact that a uprobe/uretprobe BPF program can cause detachment by
returning 1 is a bug; we should not allow that. But that's a separate
issue which we can fix in the bpf-next tree. Please send a patch.
> I think the fix is simple, plus we need to cleanup this logic anyway,
> I'll try to send some code on Monday.
Can we please let me land these patches first? It's been a while. I
don't think anything is really broken with the logic.
>
> Oleg.
>
* Re: [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
2024-08-29 18:37 ` [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
@ 2024-08-30 17:41 ` kernel test robot
2024-08-30 17:55 ` Andrii Nakryiko
2024-08-30 20:36 ` kernel test robot
1 sibling, 1 reply; 32+ messages in thread
From: kernel test robot @ 2024-08-30 17:41 UTC (permalink / raw)
To: Andrii Nakryiko, linux-trace-kernel, peterz, oleg
Cc: llvm, oe-kbuild-all, rostedt, mhiramat, bpf, linux-kernel, jolsa,
paulmck, willy, surenb, akpm, linux-mm, Andrii Nakryiko
Hi Andrii,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/perf/core]
[also build test ERROR on next-20240830]
[cannot apply to perf-tools-next/perf-tools-next perf-tools/perf-tools linus/master acme/perf/core v6.11-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/uprobes-revamp-uprobe-refcounting-and-lifetime-management/20240830-024135
base: tip/perf/core
patch link: https://lore.kernel.org/r/20240829183741.3331213-9-andrii%40kernel.org
patch subject: [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
config: i386-buildonly-randconfig-004-20240830 (https://download.01.org/0day-ci/archive/20240831/202408310130.t9EBKteQ-lkp@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240831/202408310130.t9EBKteQ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408310130.t9EBKteQ-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/events/uprobes.c:1157:2: error: call to undeclared function 'synchronize_rcu_tasks_trace'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1157 | synchronize_rcu_tasks_trace();
| ^
kernel/events/uprobes.c:1157:2: note: did you mean 'synchronize_rcu_tasks_rude'?
include/linux/rcupdate.h:206:6: note: 'synchronize_rcu_tasks_rude' declared here
206 | void synchronize_rcu_tasks_rude(void);
| ^
1 error generated.
vim +/synchronize_rcu_tasks_trace +1157 kernel/events/uprobes.c
1145
1146 void uprobe_unregister_sync(void)
1147 {
1148 /*
1149 * Now that handler_chain() and handle_uretprobe_chain() iterate over
1150 * uprobe->consumers list under RCU protection without holding
1151 * uprobe->register_rwsem, we need to wait for RCU grace period to
1152 * make sure that we can't call into just unregistered
1153 * uprobe_consumer's callbacks anymore. If we don't do that, fast and
1154 * unlucky enough caller can free consumer's memory and cause
1155 * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
1156 */
> 1157 synchronize_rcu_tasks_trace();
1158 }
1159 EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
1160
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
2024-08-30 17:41 ` kernel test robot
@ 2024-08-30 17:55 ` Andrii Nakryiko
0 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-30 17:55 UTC (permalink / raw)
To: kernel test robot
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, llvm,
oe-kbuild-all, rostedt, mhiramat, bpf, linux-kernel, jolsa,
paulmck, willy, surenb, akpm, linux-mm
On Fri, Aug 30, 2024 at 10:42 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Andrii,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on tip/perf/core]
> [also build test ERROR on next-20240830]
> [cannot apply to perf-tools-next/perf-tools-next perf-tools/perf-tools linus/master acme/perf/core v6.11-rc5]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/uprobes-revamp-uprobe-refcounting-and-lifetime-management/20240830-024135
> base: tip/perf/core
> patch link: https://lore.kernel.org/r/20240829183741.3331213-9-andrii%40kernel.org
> patch subject: [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
> config: i386-buildonly-randconfig-004-20240830 (https://download.01.org/0day-ci/archive/20240831/202408310130.t9EBKteQ-lkp@intel.com/config)
> compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240831/202408310130.t9EBKteQ-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202408310130.t9EBKteQ-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> >> kernel/events/uprobes.c:1157:2: error: call to undeclared function 'synchronize_rcu_tasks_trace'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> 1157 | synchronize_rcu_tasks_trace();
> | ^
> kernel/events/uprobes.c:1157:2: note: did you mean 'synchronize_rcu_tasks_rude'?
> include/linux/rcupdate.h:206:6: note: 'synchronize_rcu_tasks_rude' declared here
> 206 | void synchronize_rcu_tasks_rude(void);
> | ^
> 1 error generated.
Missing #include <linux/rcupdate_trace.h>, will add.
>
>
> vim +/synchronize_rcu_tasks_trace +1157 kernel/events/uprobes.c
>
> 1145
> 1146 void uprobe_unregister_sync(void)
> 1147 {
> 1148 /*
> 1149 * Now that handler_chain() and handle_uretprobe_chain() iterate over
> 1150 * uprobe->consumers list under RCU protection without holding
> 1151 * uprobe->register_rwsem, we need to wait for RCU grace period to
> 1152 * make sure that we can't call into just unregistered
> 1153 * uprobe_consumer's callbacks anymore. If we don't do that, fast and
> 1154 * unlucky enough caller can free consumer's memory and cause
> 1155 * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
> 1156 */
> > 1157 synchronize_rcu_tasks_trace();
> 1158 }
> 1159 EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
> 1160
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-30 15:44 ` Andrii Nakryiko
@ 2024-08-30 20:20 ` Oleg Nesterov
2024-08-30 20:43 ` Andrii Nakryiko
0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2024-08-30 20:20 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On 08/30, Andrii Nakryiko wrote:
>
Andrii, let me reply to your email "out of order". First of all:
> Can we please let me land these patches first? It's been a while. I
> don't think anything is really broken with the logic.
OK, agreed.
I'll probably write another email (too late for me today), but I agree
that "avoid register_rwsem in handler_chain" is obviously a good goal,
let's discuss the possible cleanups or even fixlets later, when this
series is already applied.
> On Fri, Aug 30, 2024 at 7:33 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > No, I think you found a problem. UPROBE_HANDLER_REMOVE can be lost if
> > uc->filter == NULL or if it returns true. See another reply I sent a
> > minute ago.
> >
>
> For better or worse, I think there is (or has to be) an implicit
> contract that if uprobe (or uretprobe for that matter as well, but
> that's a separate issue) handler can return UPROBE_HANDLER_REMOVE,
> then it *has to* also provide filter.
IOW, uc->handler and uc->filter must be consistent. But the current API
doesn't require this contract, so this patch adds a difference which I
didn't notice when I reviewed this change.
(In fact I noticed the difference, but I thought that it should be fine).
> If it doesn't provide filter
> callback, it doesn't care about PID filtering and thus can't and
> shouldn't cause unregistration.
At first glance I disagree, but see above.
> > I think the fix is simple, plus we need to cleanup this logic anyway,
> > I'll try to send some code on Monday.
Damn I am stupid. Nothing new ;) The "simple" fix I had in mind can't work.
But we can do other things which we can discuss later.
Oleg.
* Re: [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
2024-08-29 18:37 ` [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
2024-08-30 17:41 ` kernel test robot
@ 2024-08-30 20:36 ` kernel test robot
1 sibling, 0 replies; 32+ messages in thread
From: kernel test robot @ 2024-08-30 20:36 UTC (permalink / raw)
To: Andrii Nakryiko, linux-trace-kernel, peterz, oleg
Cc: oe-kbuild-all, rostedt, mhiramat, bpf, linux-kernel, jolsa,
paulmck, willy, surenb, akpm, linux-mm, Andrii Nakryiko
Hi Andrii,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/perf/core]
[also build test ERROR on next-20240830]
[cannot apply to perf-tools-next/perf-tools-next perf-tools/perf-tools linus/master acme/perf/core v6.11-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/uprobes-revamp-uprobe-refcounting-and-lifetime-management/20240830-024135
base: tip/perf/core
patch link: https://lore.kernel.org/r/20240829183741.3331213-9-andrii%40kernel.org
patch subject: [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20240831/202408310111.2dkrylJ9-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240831/202408310111.2dkrylJ9-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408310111.2dkrylJ9-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/events/uprobes.c: In function 'uprobe_unregister_sync':
>> kernel/events/uprobes.c:1157:9: error: implicit declaration of function 'synchronize_rcu_tasks_trace'; did you mean 'synchronize_rcu_tasks'? [-Werror=implicit-function-declaration]
1157 | synchronize_rcu_tasks_trace();
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
| synchronize_rcu_tasks
cc1: some warnings being treated as errors
vim +1157 kernel/events/uprobes.c
1145
1146 void uprobe_unregister_sync(void)
1147 {
1148 /*
1149 * Now that handler_chain() and handle_uretprobe_chain() iterate over
1150 * uprobe->consumers list under RCU protection without holding
1151 * uprobe->register_rwsem, we need to wait for RCU grace period to
1152 * make sure that we can't call into just unregistered
1153 * uprobe_consumer's callbacks anymore. If we don't do that, fast and
1154 * unlucky enough caller can free consumer's memory and cause
1155 * handler_chain() or handle_uretprobe_chain() to do a use-after-free.
1156 */
> 1157 synchronize_rcu_tasks_trace();
1158 }
1159 EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
1160
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-30 20:20 ` Oleg Nesterov
@ 2024-08-30 20:43 ` Andrii Nakryiko
2024-08-31 16:19 ` Oleg Nesterov
0 siblings, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2024-08-30 20:43 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On Fri, Aug 30, 2024 at 1:21 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/30, Andrii Nakryiko wrote:
> >
>
> Andrii, let me reply to your email "out of order". First of all:
>
> > Can we please let me land these patches first? It's been a while. I
> > don't think anything is really broken with the logic.
>
> OK, agreed.
>
> I'll probably write another email (too late for me today), but I agree
> that "avoid register_rwsem in handler_chain" is obviously a good goal,
> let's discuss the possible cleanups or even fixlets later, when this
> series is already applied.
>
Sounds good. It seems like I'll need another revision due to missing
include, so if there is any reasonably straightforward clean up we
should do, I can just incorporate that into my series.
>
>
> > On Fri, Aug 30, 2024 at 7:33 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > No, I think you found a problem. UPROBE_HANDLER_REMOVE can be lost if
> > > uc->filter == NULL or if it returns true. See another reply I sent a
> > > minute ago.
> > >
> >
> > For better or worse, I think there is (or has to be) an implicit
> > contract that if uprobe (or uretprobe for that matter as well, but
> > that's a separate issue) handler can return UPROBE_HANDLER_REMOVE,
> > then it *has to* also provide filter.
>
> IOW, uc->handler and uc->filter must be consistent. But the current API
> doesn't require this contract, so this patch adds a difference which I
> didn't notice when I reviewed this change.
>
> (In fact I noticed the difference, but I thought that it should be fine).
>
> > If it doesn't provide filter
> > callback, it doesn't care about PID filtering and thus can't and
> > shouldn't cause unregistration.
>
> At first glance I disagree, but see above.
I still think it's fine, tbh. Which uprobe user violates this contract
in the kernel? Maybe we should just fix that while we're at it? Because
otherwise we are allowing some frankly too-dynamic and unnecessarily
complicated behavior where we can dynamically unsubscribe without
actually having corresponding filter logic.
As I mentioned earlier, I actually considered calling filter
explicitly to enforce this contract, but then got concerned about
indirect callback overhead and dropped that idea.
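To make the contract concrete, a conforming consumer would look roughly like
this (a sketch with made-up names, assuming the two-argument filter()
signature from patch 3/8, and ignoring per-mm bookkeeping for brevity):

  static atomic_t my_hits = ATOMIC_INIT(0);

  /* claim interest only until the first hit has been seen */
  static bool my_filter(struct uprobe_consumer *self, struct mm_struct *mm)
  {
          return atomic_read(&my_hits) == 0;
  }

  static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
  {
          atomic_inc(&my_hits);
          /* my_filter() now returns false, "confirming" the removal */
          return UPROBE_HANDLER_REMOVE;
  }

  static struct uprobe_consumer my_consumer = {
          .handler = my_handler,
          .filter  = my_filter,
  };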
>
> > > I think the fix is simple, plus we need to cleanup this logic anyway,
> > > I'll try to send some code on Monday.
>
> Damn I am stupid. Nothing new ;) The "simple" fix I had in mind can't work.
> But we can do other things which we can discuss later.
>
I actually don't see how anything reasonably simple and
straightforward (short of just taking register_rwsem) can work if we
want this weird out-of-sync filter and dynamic UPROBE_HANDLER_REMOVE
behavior to remain. But again, does anyone actually rely on that and
should it be even allowed?
> Oleg.
>
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-30 20:43 ` Andrii Nakryiko
@ 2024-08-31 16:19 ` Oleg Nesterov
2024-09-02 9:14 ` Jiri Olsa
2024-09-03 17:27 ` Andrii Nakryiko
0 siblings, 2 replies; 32+ messages in thread
From: Oleg Nesterov @ 2024-08-31 16:19 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On 08/30, Andrii Nakryiko wrote:
>
> On Fri, Aug 30, 2024 at 1:21 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I'll probably write another email (too late for me today), but I agree
> > that "avoid register_rwsem in handler_chain" is obviously a good goal,
> > let's discuss the possible cleanups or even fixlets later, when this
> > series is already applied.
> >
>
> Sounds good. It seems like I'll need another revision due to missing
> include, so if there is any reasonably straightforward clean up we
> should do, I can just incorporate that into my series.
I was thinking about another seq counter incremented in register(), so
that handler_chain() can detect the race with uprobe_register() and skip
unapply_uprobe() in this case. This is what Peter did in one of his series.
Still changes the current behaviour, but not too much.
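Roughly (an untested sketch of that idea, names made up; the real
handler_chain() of course accumulates the return codes of all consumers):

  static atomic_t uprobes_reg_seq; /* bumped by every uprobe_register() */

  static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
  {
          int seq = atomic_read(&uprobes_reg_seq);
          bool remove = true;

          /* ... iterate consumers under RCU, clear 'remove' unless all
             handlers returned UPROBE_HANDLER_REMOVE ... */

          if (remove && seq == atomic_read(&uprobes_reg_seq))
                  unapply_uprobe(uprobe, current->mm);
          /* else: raced with register(), keep the breakpoint installed */
  }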
But see below,
> I still think it's fine, tbh.
and perhaps you are right,
> Which uprobe user violates this contract
> in the kernel?
The only in-kernel user of UPROBE_HANDLER_REMOVE is perf, and it is fine.
But there are out-of-tree users, say systemtap, I have no idea if this
change can affect them.
And in general, this change makes the API less "flexible".
But once again, I agree that it would be better to apply your series first,
then add the fixes in (unlikely) case it breaks something.
But. Since you are going to send another version, may I ask you to add a
note into the changelog to explain that this patch assumes (and enforces)
the rule about handler/filter consistency?
Oleg.
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-30 13:45 ` Jiri Olsa
2024-08-30 14:31 ` Oleg Nesterov
@ 2024-08-31 17:25 ` Oleg Nesterov
2024-09-01 9:24 ` Jiri Olsa
1 sibling, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2024-08-31 17:25 UTC (permalink / raw)
To: Jiri Olsa, Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, paulmck, willy, surenb, akpm, linux-mm
On 08/30, Jiri Olsa wrote:
>
> with this change the probe will not get removed in the attached test,
> it'll get 2 hits, without this change just 1 hit
Thanks again for pointing out the subtle change in behaviour, but could
you add more details for me? ;)
I was going to read the test below today, but no. As I said many times
I know nothing about bpf, I simply can't understand what this test-case
actually does in kernel-space.
According to git grep, the only in-kernel user of UPROBE_HANDLER_REMOVE
is uprobe_perf_func(), but if it returns UPROBE_HANDLER_REMOVE then
consumer->filter == uprobe_perf_filter() should return false?
So could you explain how/why exactly this change affects your test-case?
But perhaps it uses bpf_uprobe_multi_link_attach() and ->handler is
uprobe_multi_link_handler() ? But uprobe_prog_run() returns zero if
current->mm != link->task->mm.
OTOH, otherwise it returns the error code from bpf_prog_run() and this looks
confusing to me. I have no idea what prog->bpf_func(ctx, insnsi) can return
in this case, but note the WARN(rc & ~UPROBE_HANDLER_MASK) in handler_chain...
Hmm... looking at your test-case again,
> +SEC("uprobe.multi//proc/self/exe:uprobe_multi_func_1")
> +int uprobe(struct pt_regs *ctx)
> +{
> + test++;
> + return 1;
> +}
So maybe this (compiled to eBPF) is what prog->bpf_func() actually executes?
If yes, everything is clear. And this "proves" that the patch makes the current
API less flexible, as I mentioned in my reply to Andrii.
If I got it right, I'd suggest adding a comment to this code to explain
that we return UPROBE_HANDLER_REMOVE after the 1st hit, for git-grep.
Oleg.
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-31 17:25 ` Oleg Nesterov
@ 2024-09-01 9:24 ` Jiri Olsa
0 siblings, 0 replies; 32+ messages in thread
From: Jiri Olsa @ 2024-09-01 9:24 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Jiri Olsa, Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel,
peterz, rostedt, mhiramat, bpf, linux-kernel, paulmck, willy,
surenb, akpm, linux-mm
On Sat, Aug 31, 2024 at 07:25:44PM +0200, Oleg Nesterov wrote:
> On 08/30, Jiri Olsa wrote:
> >
> > with this change the probe will not get removed in the attached test,
> > it'll get 2 hits, without this change just 1 hit
>
> Thanks again for pointing out the subtle change in behaviour, but could
> you add more details for me? ;)
>
> I was going to read the test below today, but no. As I said many times
> I know nothing about bpf, I simply can't understand what this test-case
> actually does in kernel-space.
>
> According to git grep, the only in-kernel user of UPROBE_HANDLER_REMOVE
> is uprobe_perf_func(), but if it returns UPROBE_HANDLER_REMOVE then
> consumer->filter == uprobe_perf_filter() should return false?
>
> So could you explain how/why exactly this change affects your test-case?
>
>
> But perhaps it uses bpf_uprobe_multi_link_attach() and ->handler is
> uprobe_multi_link_handler() ? But uprobe_prog_run() returns zero if
> current->mm != link->task->mm.
>
> OTOH, otherwise it returns the error code from bpf_prog_run() and this looks
> confusing to me. I have no idea what prog->bpf_func(ctx, insnsi) can return
> in this case, but note the WARN(rc & ~UPROBE_HANDLER_MASK) in handler_chain...
>
> Hmm... looking at your test-case again,
>
> > +SEC("uprobe.multi//proc/self/exe:uprobe_multi_func_1")
> > +int uprobe(struct pt_regs *ctx)
> > +{
> > + test++;
> > + return 1;
> > +}
>
> So maybe this (compiled to eBPF) is what prog->bpf_func() actually executes?
yep, that's correct, it goes like:
  uprobe_multi_link_handler()
    uprobe_prog_run()
    {
            err = bpf_prog_run(); /* runs the above BPF program and returns
                                     its return value, here 1, which is
                                     UPROBE_HANDLER_REMOVE */
            return err;
    }
> If yes, everything is clear. And this "proves" that the patch makes the current
> API less flexible, as I mentioned in my reply to Andrii.
>
> If I got it right, I'd suggest adding a comment to this code to explain
> that we return UPROBE_HANDLER_REMOVE after the 1st hit, for git-grep.
ok, I'll add a comment with that
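Something like this, maybe (the test body is unchanged from the patch above,
only the comment is new):

  SEC("uprobe.multi//proc/self/exe:uprobe_multi_func_1")
  int uprobe(struct pt_regs *ctx)
  {
          test++;
          /* return 1 (UPROBE_HANDLER_REMOVE) so the uprobe is
           * unsubscribed after the first hit */
          return 1;
  }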
thanks,
jirka
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-31 16:19 ` Oleg Nesterov
@ 2024-09-02 9:14 ` Jiri Olsa
2024-09-03 17:27 ` Andrii Nakryiko
1 sibling, 0 replies; 32+ messages in thread
From: Jiri Olsa @ 2024-09-02 9:14 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, Jiri Olsa, Andrii Nakryiko, linux-trace-kernel,
peterz, rostedt, mhiramat, bpf, linux-kernel, paulmck, willy,
surenb, akpm, linux-mm
On Sat, Aug 31, 2024 at 06:19:15PM +0200, Oleg Nesterov wrote:
> On 08/30, Andrii Nakryiko wrote:
> >
> > On Fri, Aug 30, 2024 at 1:21 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > I'll probably write another email (too late for me today), but I agree
> > > that "avoid register_rwsem in handler_chain" is obviously a good goal,
> > > let's discuss the possible cleanups or even fixlets later, when this
> > > series is already applied.
> > >
> >
> > Sounds good. It seems like I'll need another revision due to missing
> > include, so if there is any reasonably straightforward clean up we
> > should do, I can just incorporate that into my series.
>
> I was thinking about another seq counter incremented in register(), so
> that handler_chain() can detect the race with uprobe_register() and skip
> unapply_uprobe() in this case. This is what Peter did in one of his series.
> Still changes the current behaviour, but not too much.
>
> But see below,
>
> > I still think it's fine, tbh.
>
> and perhaps you are right,
>
> > Which uprobe user violates this contract
> > in the kernel?
>
> The only in-kernel user of UPROBE_HANDLER_REMOVE is perf, and it is fine.
>
> But there are out-of-tree users, say systemtap, I have no idea if this
> change can affect them.
>
> And in general, this change makes the API less "flexible".
>
> But once again, I agree that it would be better to apply your series first,
> then add the fixes in (unlikely) case it breaks something.
FWIW I (strongly) agree with merging this change and fixing the rest as a follow-up
thanks,
jirka
>
> But. Since you are going to send another version, may I ask you to add a
> note into the changelog to explain that this patch assumes (and enforces)
> the rule about handler/filter consistency?
>
> Oleg.
>
* Re: [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations
2024-08-30 10:24 ` [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Oleg Nesterov
@ 2024-09-03 13:21 ` Peter Zijlstra
2024-09-03 13:59 ` Oleg Nesterov
0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2024-09-03 13:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm
On Fri, Aug 30, 2024 at 12:24:01PM +0200, Oleg Nesterov wrote:
> On 08/29, Andrii Nakryiko wrote:
> >
> > v3->v4:
> > - added back consumer_rwsem into consumer_del(), which was accidentally
> > omitted earlier (Jiri);
>
> still,
>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
>
Let me go queue this in perf/core then. Thanks!
* Re: [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations
2024-09-03 13:21 ` Peter Zijlstra
@ 2024-09-03 13:59 ` Oleg Nesterov
2024-09-03 14:03 ` Peter Zijlstra
0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2024-09-03 13:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm
On 09/03, Peter Zijlstra wrote:
>
> On Fri, Aug 30, 2024 at 12:24:01PM +0200, Oleg Nesterov wrote:
> > On 08/29, Andrii Nakryiko wrote:
> > >
> > > v3->v4:
> > > - added back consumer_rwsem into consumer_del(), which was accidentally
> > > omitted earlier (Jiri);
> >
> > still,
> >
> > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> >
>
> Let me go queue this in perf/core then. Thanks!
FYI, Andrii was going to send another revision due to missing include
linux/rcupdate_trace.h in kernel/events/uprobes.c.
See the build failure reported kernel test robot:
https://lore.kernel.org/all/202408310130.t9EBKteQ-lkp@intel.com/
Oleg.
* Re: [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations
2024-09-03 13:59 ` Oleg Nesterov
@ 2024-09-03 14:03 ` Peter Zijlstra
0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2024-09-03 14:03 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm
On Tue, Sep 03, 2024 at 03:59:43PM +0200, Oleg Nesterov wrote:
> On 09/03, Peter Zijlstra wrote:
> >
> > On Fri, Aug 30, 2024 at 12:24:01PM +0200, Oleg Nesterov wrote:
> > > On 08/29, Andrii Nakryiko wrote:
> > > >
> > > > v3->v4:
> > > > - added back consumer_rwsem into consumer_del(), which was accidentally
> > > > omitted earlier (Jiri);
> > >
> > > still,
> > >
> > > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> > >
> >
> > Let me go queue this in perf/core then. Thanks!
>
> FYI, Andrii was going to send another revision due to missing include
> linux/rcupdate_trace.h in kernel/events/uprobes.c.
>
> See the build failure reported kernel test robot:
> https://lore.kernel.org/all/202408310130.t9EBKteQ-lkp@intel.com/
No problem, I'll sit on it.
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-08-31 16:19 ` Oleg Nesterov
2024-09-02 9:14 ` Jiri Olsa
@ 2024-09-03 17:27 ` Andrii Nakryiko
2024-09-03 17:35 ` Andrii Nakryiko
2024-09-03 18:25 ` Oleg Nesterov
1 sibling, 2 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:27 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On Sat, Aug 31, 2024 at 9:19 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/30, Andrii Nakryiko wrote:
> >
> > On Fri, Aug 30, 2024 at 1:21 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > I'll probably write another email (too late for me today), but I agree
> > > that "avoid register_rwsem in handler_chain" is obviously a good goal,
> > > let's discuss the possible cleanups or even fixlets later, when this
> > > series is already applied.
> > >
> >
> > Sounds good. It seems like I'll need another revision due to missing
> > include, so if there is any reasonably straightforward clean up we
> > should do, I can just incorporate that into my series.
>
> I was thinking about another seq counter incremented in register(), so
> that handler_chain() can detect the race with uprobe_register() and skip
> unapply_uprobe() in this case. This is what Peter did in one of his series.
> Still changes the current behaviour, but not too much.
We could do that, but then worst case, when we do detect registration
race, what do we do? We still have to do the same. So instead of
polluting the logic with a seq counter it's best to just codify the
protocol and take advantage of that.
But as you said, this all can/should be addressed as a follow-up
discussion. You mentioned some clean ups you wanted to do, let's
discuss all that as part of that?
>
> But see below,
>
> > I still think it's fine, tbh.
>
> and perhaps you are right,
>
> > Which uprobe user violates this contract
> > in the kernel?
>
> The only in-kernel user of UPROBE_HANDLER_REMOVE is perf, and it is fine.
>
Well, a BPF program can accidentally trigger this as well, but that's a
bug, we should fix it ASAP in the bpf tree.
> But there are out-of-tree users, say systemtap, I have no idea if this
> change can affect them.
>
> And in general, this change makes the API less "flexible".
it maybe makes a weird and too-flexible case a bit more work to
implement. If a consumer wants to be that flexible, it can still
define a filter() that is coordinated with its handler()
implementation.
>
> But once again, I agree that it would be better to apply your series first,
> then add the fixes in (unlikely) case it breaks something.
Yep, agreed, thanks! Will send a new version ASAP, so we have a common
base to work on top of.
>
> But. Since you are going to send another version, may I ask you to add a
> note into the changelog to explain that this patch assumes (and enforces)
> the rule about handler/filter consistency?
Yep, will do. I will also leave a comment next to the filter callback
definition in uprobe_consumer about this.
>
> Oleg.
>
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-09-03 17:27 ` Andrii Nakryiko
@ 2024-09-03 17:35 ` Andrii Nakryiko
2024-09-03 18:27 ` Oleg Nesterov
2024-09-03 18:25 ` Oleg Nesterov
1 sibling, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 17:35 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On Tue, Sep 3, 2024 at 10:27 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sat, Aug 31, 2024 at 9:19 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 08/30, Andrii Nakryiko wrote:
> > >
> > > On Fri, Aug 30, 2024 at 1:21 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > I'll probably write another email (too late for me today), but I agree
> > > > that "avoid register_rwsem in handler_chain" is obviously a good goal,
> > > > let's discuss the possible cleanups or even fixlets later, when this
> > > > series is already applied.
> > > >
> > >
> > > Sounds good. It seems like I'll need another revision due to missing
> > > include, so if there is any reasonably straightforward clean up we
> > > should do, I can just incorporate that into my series.
> >
> > I was thinking about another seq counter incremented in register(), so
> > that handler_chain() can detect the race with uprobe_register() and skip
> > unapply_uprobe() in this case. This is what Peter did in one of his series.
> > Still changes the current behaviour, but not too much.
>
> We could do that, but then worst case, when we do detect registration
> race, what do we do? We still have to do the same. So instead of
> polluting the logic with a seq counter it's best to just codify the
> protocol and take advantage of that.
>
> But as you said, this all can/should be addressed as a follow-up
> discussion. You mentioned some clean ups you wanted to do, let's
> discuss all that as part of that?
>
> >
> > But see below,
> >
> > > I still think it's fine, tbh.
> >
> > and perhaps you are right,
> >
> > > Which uprobe user violates this contract
> > > in the kernel?
> >
> > The only in-kernel user of UPROBE_HANDLER_REMOVE is perf, and it is fine.
> >
>
> Well, a BPF program can accidentally trigger this as well, but that's a
> bug, we should fix it ASAP in the bpf tree.
>
>
> > But there are out-of-tree users, say systemtap, I have no idea if this
> > change can affect them.
> >
> > And in general, this change makes the API less "flexible".
>
> it maybe makes a weird and too-flexible case a bit more work to
> implement. If a consumer wants to be that flexible, it can still
> define a filter() that is coordinated with its handler()
> implementation.
>
> >
> > But once again, I agree that it would be better to apply your series first,
> > then add the fixes in (unlikely) case it breaks something.
>
> Yep, agreed, thanks! Will send a new version ASAP, so we have a common
> base to work on top of.
>
> >
> > But. Since you are going to send another version, may I ask you to add a
> > note into the changelog to explain that this patch assumes (and enforces)
> > the rule about handler/filter consistency?
>
> Yep, will do. I will also leave a comment next to the filter callback
> definition in uprobe_consumer about this.
>
Ok, I'm adding this:
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 29c935b0d504..33236d689d60 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -29,6 +29,14 @@ struct page;
#define MAX_URETPROBE_DEPTH 64
struct uprobe_consumer {
+ /*
+ * handler() can return UPROBE_HANDLER_REMOVE to signal the need to
+ * unregister uprobe for current process. If UPROBE_HANDLER_REMOVE is
+ * returned, filter() callback has to be implemented as well and it
+ * should return false to "confirm" the decision to uninstall uprobe
+ * for the current process. If filter() is omitted or returns true,
+ * UPROBE_HANDLER_REMOVE is effectively ignored.
+ */
int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
int (*ret_handler)(struct uprobe_consumer *self,
unsigned long func,
> >
> > Oleg.
> >
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-09-03 17:27 ` Andrii Nakryiko
2024-09-03 17:35 ` Andrii Nakryiko
@ 2024-09-03 18:25 ` Oleg Nesterov
1 sibling, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2024-09-03 18:25 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On 09/03, Andrii Nakryiko wrote:
>
> On Sat, Aug 31, 2024 at 9:19 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I was thinking about another seq counter incremented in register(), so
> > that handler_chain() can detect the race with uprobe_register() and skip
> > unapply_uprobe() in this case. This is what Peter did in one of his series.
> > Still changes the current behaviour, but not too much.
>
> We could do that, but then worst case, when we do detect registration
> race, what do we do?
Do nothing and skip unapply_uprobe().
> But as you said, this all can/should be addressed as a follow-up
> discussion.
Yes, yes,
> You mentioned some clean ups you wanted to do, let's
> discuss all that as part of that?
Yes, sure.
And please note that in reply to myself I also mentioned that I am stupid
and these cleanups can't help to change/improve this behaviour ;)
> > The only in-kernel user of UPROBE_HANDLER_REMOVE is perf, and it is fine.
> >
>
> Well, a BPF program can accidentally trigger this as well, but that's a
> bug, we should fix it ASAP in the bpf tree.
not sure, but...
> > And in general, this change makes the API less "flexible".
>
> it maybe makes a weird and too-flexible case a bit more work to
> implement. If a consumer wants to be that flexible, it can still
> define a filter() that is coordinated with its handler()
> implementation.
perhaps, but let's discuss this later, on top of your series.
> > But once again, I agree that it would be better to apply your series first,
> > then add the fixes in (unlikely) case it breaks something.
>
> Yep, agreed, thanks! Will send a new version ASAP, so we have a common
> base to work on top of.
Thanks. Hopefully Peter will queue your V5 soon.
Oleg.
* Re: [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection
2024-09-03 17:35 ` Andrii Nakryiko
@ 2024-09-03 18:27 ` Oleg Nesterov
0 siblings, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2024-09-03 18:27 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, paulmck, willy, surenb, akpm,
linux-mm
On 09/03, Andrii Nakryiko wrote:
>
> On Tue, Sep 3, 2024 at 10:27 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sat, Aug 31, 2024 at 9:19 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > But. Since you are going to send another version, may I ask you to add a
> > > note into the changelog to explain that this patch assumes (and enforces)
> > > the rule about handler/filter consistency?
> >
> > Yep, will do. I will also leave a comment next to the filter callback
> > definition in uprobe_consumer about this.
> >
>
> Ok, I'm adding this:
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 29c935b0d504..33236d689d60 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -29,6 +29,14 @@ struct page;
> #define MAX_URETPROBE_DEPTH 64
>
> struct uprobe_consumer {
> + /*
> + * handler() can return UPROBE_HANDLER_REMOVE to signal the need to
> + * unregister uprobe for current process. If UPROBE_HANDLER_REMOVE is
> + * returned, filter() callback has to be implemented as well and it
> + * should return false to "confirm" the decision to uninstall uprobe
> + * for the current process. If filter() is omitted or returns true,
> + * UPROBE_HANDLER_REMOVE is effectively ignored.
> + */
Thanks, LGTM.
Oleg.
end of thread [~2024-09-03 18:27 UTC]
Thread overview: 32+ messages
2024-08-29 18:37 [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 1/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 2/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 3/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 4/8] uprobes: traverse uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
2024-08-29 23:09 ` Jiri Olsa
2024-08-29 23:31 ` Andrii Nakryiko
2024-08-30 13:45 ` Jiri Olsa
2024-08-30 14:31 ` Oleg Nesterov
2024-08-30 15:44 ` Andrii Nakryiko
2024-08-30 20:20 ` Oleg Nesterov
2024-08-30 20:43 ` Andrii Nakryiko
2024-08-31 16:19 ` Oleg Nesterov
2024-09-02 9:14 ` Jiri Olsa
2024-09-03 17:27 ` Andrii Nakryiko
2024-09-03 17:35 ` Andrii Nakryiko
2024-09-03 18:27 ` Oleg Nesterov
2024-09-03 18:25 ` Oleg Nesterov
2024-08-31 17:25 ` Oleg Nesterov
2024-09-01 9:24 ` Jiri Olsa
2024-08-30 14:18 ` Oleg Nesterov
2024-08-29 18:37 ` [PATCH v4 5/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 6/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
2024-08-29 18:37 ` [PATCH v4 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
2024-08-30 17:41 ` kernel test robot
2024-08-30 17:55 ` Andrii Nakryiko
2024-08-30 20:36 ` kernel test robot
2024-08-30 10:24 ` [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations Oleg Nesterov
2024-09-03 13:21 ` Peter Zijlstra
2024-09-03 13:59 ` Oleg Nesterov
2024-09-03 14:03 ` Peter Zijlstra