* [PATCH 0/8] uprobes: RCU-protected hot path optimizations
@ 2024-07-31 21:42 Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 1/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
` (8 more replies)
0 siblings, 9 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-07-31 21:42 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg, rostedt, mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck, Andrii Nakryiko
This patch set is heavily inspired by Peter Zijlstra's uprobe optimization
patches ([0]) and continues that work, albeit trying to keep complexity to
a minimum and attempting to reuse existing primitives as much as possible. The
goal here is to optimize the obvious uprobe triggering hot path, while keeping
the rest of the locking mostly intact.
I've reused the rb_find_rcu()/rb_find_add_rcu() patches as is, and the "split
uprobe_unregister()" one is mostly intact, but I've added uprobe_unregister_sync()
into the error handling code path inside uprobe_register(). This is due to
recent refactorings from Oleg Nesterov ([1]), which necessitate this
addition. I'm not sure I got the Co-developed-by/SOB pieces right, for which
I apologize in advance.
Except for the refcounting change patch (which I strongly believe is a good
improvement that lets us do away with the quasi-refcounting scheme around the
uprobe->consumers list), the rest of the changes are similar to Peter's
initial changes in [0].
The main differences are:
- no special RCU protection for mmap and fork handling, we just stick to
refcounts there, as those are infrequent and not performance-sensitive
code, while being complex and thus benefiting from proper locking;
- the above means we don't need to do any custom SRCU additions to handle
forking code path;
- I handled the UPROBE_HANDLER_REMOVE problem in handler_chain() differently,
again leveraging the existing locking scheme;
- I kept refcount usage for uretprobes and single-stepping uprobes; I plan to
address that in separate follow-up patches. The plan is to avoid
task_work, but I need to sit down to write and test the code.
- finally, I dutifully used SRCU throughout all the changes, and only the
last patch switches SRCU to RCU Tasks Trace and demonstrates significant
performance and scalability gains from doing so.
The changes in this patch set were tested using BPF selftests and the
uprobe-stress ([2]) tool. One BPF selftest (uprobe_multi/consumers), only
recently added by Jiri Olsa, will need a single-line adjustment to its
counting logic, but that patch is in bpf-next/master, so we'll have to
address it once the linux-trace (or tip) and bpf-next trees merge. I'll take
care of that when it happens.
Now, for the benchmarking results. I've used the following script (which
utilizes the BPF selftests-based bench tool). The machine was an 80-core Intel
Xeon Gold 6138 CPU @ 2.00GHz running a kernel with a production-like config.
I minimized background noise by stopping any service I could identify and
stop, so the results are pretty stable and variability is small, overall.
Benchmark script:
#!/bin/bash
set -eufo pipefail
for i in uprobe-nop uretprobe-nop; do
    for p in 1 2 4 8 16 32 64; do
        summary=$(sudo ./bench -w3 -d5 -p$p -a trig-$i | tail -n1)
        total=$(echo "$summary" | cut -d'(' -f1 | cut -d' ' -f3-)
        percpu=$(echo "$summary" | cut -d'(' -f2 | cut -d')' -f1 | cut -d'/' -f1)
        printf "%-15s (%2d cpus): %s (%s/s/cpu)\n" $i $p "$total" "$percpu"
    done
    echo
done
With all the lock-avoiding changes done in this patch set, we get a pretty
decent improvement in uprobe performance and scalability with the number of
CPUs, even though we are still nowhere near linear scalability. This is due to
the remaining mmap_lock, which is currently taken to resolve the interrupt
address to inode+offset and then to a uprobe instance. And, of course,
uretprobes still need similar RCU protection to avoid the refcount in the hot
path, which will be addressed in follow-up patches.
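For reference, here is a rough sketch of the per-hit address resolution that
still takes mmap_lock even after this series (simplified from what
find_active_uprobe_rcu() ends up doing; illustrative only, not verbatim
kernel code, and the function name below is made up for the example):

	/*
	 * Sketch only: each uprobe hit still takes mmap_lock for reading
	 * to translate the trapping address into inode+offset before the
	 * (now lockless) uprobes_tree lookup.
	 */
	static struct uprobe *resolve_uprobe_sketch(unsigned long bp_vaddr)
	{
		struct mm_struct *mm = current->mm;
		struct uprobe *uprobe = NULL;
		struct vm_area_struct *vma;

		mmap_read_lock(mm);
		vma = find_vma(mm, bp_vaddr);
		if (vma && vma->vm_file) {
			struct inode *inode = file_inode(vma->vm_file);
			loff_t offset = vaddr_to_offset(vma, bp_vaddr);

			uprobe = find_uprobe_rcu(inode, offset);
		}
		mmap_read_unlock(mm);

		return uprobe;
	}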
BASELINE (on top of Oleg's clean up patches)
============================================
uprobe-nop ( 1 cpus): 3.032 ± 0.023M/s ( 3.032M/s/cpu)
uprobe-nop ( 2 cpus): 3.452 ± 0.005M/s ( 1.726M/s/cpu)
uprobe-nop ( 4 cpus): 3.663 ± 0.005M/s ( 0.916M/s/cpu)
uprobe-nop ( 8 cpus): 3.718 ± 0.038M/s ( 0.465M/s/cpu)
uprobe-nop (16 cpus): 3.344 ± 0.008M/s ( 0.209M/s/cpu)
uprobe-nop (32 cpus): 2.288 ± 0.021M/s ( 0.071M/s/cpu)
uprobe-nop (64 cpus): 3.205 ± 0.004M/s ( 0.050M/s/cpu)
uretprobe-nop ( 1 cpus): 1.979 ± 0.005M/s ( 1.979M/s/cpu)
uretprobe-nop ( 2 cpus): 2.361 ± 0.005M/s ( 1.180M/s/cpu)
uretprobe-nop ( 4 cpus): 2.309 ± 0.002M/s ( 0.577M/s/cpu)
uretprobe-nop ( 8 cpus): 2.253 ± 0.001M/s ( 0.282M/s/cpu)
uretprobe-nop (16 cpus): 2.007 ± 0.000M/s ( 0.125M/s/cpu)
uretprobe-nop (32 cpus): 1.624 ± 0.003M/s ( 0.051M/s/cpu)
uretprobe-nop (64 cpus): 2.149 ± 0.001M/s ( 0.034M/s/cpu)
Up to second-to-last patch (i.e., SRCU-based optimizations)
===========================================================
uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu)
uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu)
uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu)
uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu)
uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu)
uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu)
uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu)
uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu)
uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu)
uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu)
uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu)
uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu)
uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu)
uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu)
RCU Tasks Trace
===============
uprobe-nop ( 1 cpus): 3.396 ± 0.002M/s ( 3.396M/s/cpu)
uprobe-nop ( 2 cpus): 4.271 ± 0.006M/s ( 2.135M/s/cpu)
uprobe-nop ( 4 cpus): 8.499 ± 0.015M/s ( 2.125M/s/cpu)
uprobe-nop ( 8 cpus): 10.355 ± 0.028M/s ( 1.294M/s/cpu)
uprobe-nop (16 cpus): 7.615 ± 0.099M/s ( 0.476M/s/cpu)
uprobe-nop (32 cpus): 4.430 ± 0.007M/s ( 0.138M/s/cpu)
uprobe-nop (64 cpus): 6.887 ± 0.020M/s ( 0.108M/s/cpu)
uretprobe-nop ( 1 cpus): 2.174 ± 0.001M/s ( 2.174M/s/cpu)
uretprobe-nop ( 2 cpus): 2.853 ± 0.001M/s ( 1.426M/s/cpu)
uretprobe-nop ( 4 cpus): 4.913 ± 0.002M/s ( 1.228M/s/cpu)
uretprobe-nop ( 8 cpus): 5.883 ± 0.002M/s ( 0.735M/s/cpu)
uretprobe-nop (16 cpus): 5.147 ± 0.001M/s ( 0.322M/s/cpu)
uretprobe-nop (32 cpus): 3.738 ± 0.008M/s ( 0.117M/s/cpu)
uretprobe-nop (64 cpus): 4.397 ± 0.002M/s ( 0.069M/s/cpu)
For baseline vs SRCU, peak throughput increased from 3.7 M/s (million uprobe
triggerings per second) up to about 8 M/s. For uretprobes it's a bit more
modest, with a bump from 2.4 M/s to 5 M/s.
For SRCU vs RCU Tasks Trace, peak throughput for uprobes increases further from
8 M/s to 10.3 M/s (+28%!), and for uretprobes from 5.3 M/s to 5.8 M/s (+11%),
as we have more work to do on the uretprobe side.
Even single-thread (no contention) performance is slightly better: 3.276 M/s to
3.396 M/s (+3.5%) for uprobes, and 2.055 M/s to 2.174 M/s (+5.8%)
for uretprobes.
[0] https://lore.kernel.org/linux-trace-kernel/20240711110235.098009979@infradead.org/
[1] https://lore.kernel.org/linux-trace-kernel/20240729134444.GA12293@redhat.com/
[2] https://github.com/libbpf/libbpf-bootstrap/tree/uprobe-stress
Andrii Nakryiko (6):
uprobes: revamp uprobe refcounting and lifetime management
uprobes: protected uprobe lifetime with SRCU
uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
uprobes: travers uprobe's consumer list locklessly under SRCU
protection
uprobes: perform lockless SRCU-protected uprobes_tree lookup
uprobes: switch to RCU Tasks Trace flavor for better performance
Peter Zijlstra (2):
rbtree: provide rb_find_rcu() / rb_find_add_rcu()
perf/uprobe: split uprobe_unregister()
include/linux/rbtree.h | 67 ++++
include/linux/uprobes.h | 20 +-
kernel/events/uprobes.c | 375 ++++++++++--------
kernel/trace/bpf_trace.c | 8 +-
kernel/trace/trace_uprobe.c | 15 +-
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 3 +-
6 files changed, 305 insertions(+), 183 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 1/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu()
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
@ 2024-07-31 21:42 ` Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
` (7 subsequent siblings)
8 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-07-31 21:42 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg, rostedt, mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck, Andrii Nakryiko
From: Peter Zijlstra <peterz@infradead.org>
Much like latch_tree, add two RCU methods for the regular RB-tree,
which can be used in conjunction with a seqcount to provide lockless
lookups.
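For illustration, a minimal usage sketch (not part of this patch; patch 7 does
essentially this for uprobes_tree), assuming the tree is only modified under
the write lock paired with write_seqcount_begin()/write_seqcount_end() on the
same seqcount:

	/*
	 * Sketch: a lockless reader that retries on a possible false
	 * negative. Concurrent rotations can hide a node from the RCU
	 * descent, so an empty result is only trusted if the seqcount
	 * did not change; a hit is always valid.
	 */
	static struct rb_node *lookup_rcu(const void *key, struct rb_root *tree,
					  seqcount_rwlock_t *seq,
					  int (*cmp)(const void *, const struct rb_node *))
	{
		struct rb_node *node;
		unsigned int s;

		do {
			s = read_seqcount_begin(seq);
			node = rb_find_rcu(key, tree, cmp);
			if (node)
				return node;
		} while (read_seqcount_retry(seq, s));

		return NULL;
	}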
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
include/linux/rbtree.h | 67 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)
diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index f7edca369eda..7c173aa64e1e 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -244,6 +244,42 @@ rb_find_add(struct rb_node *node, struct rb_root *tree,
return NULL;
}
+/**
+ * rb_find_add_rcu() - find equivalent @node in @tree, or add @node
+ * @node: node to look-for / insert
+ * @tree: tree to search / modify
+ * @cmp: operator defining the node order
+ *
+ * Adds a Store-Release for link_node.
+ *
+ * Returns the rb_node matching @node, or NULL when no match is found and @node
+ * is inserted.
+ */
+static __always_inline struct rb_node *
+rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
+ int (*cmp)(struct rb_node *, const struct rb_node *))
+{
+ struct rb_node **link = &tree->rb_node;
+ struct rb_node *parent = NULL;
+ int c;
+
+ while (*link) {
+ parent = *link;
+ c = cmp(node, parent);
+
+ if (c < 0)
+ link = &parent->rb_left;
+ else if (c > 0)
+ link = &parent->rb_right;
+ else
+ return parent;
+ }
+
+ rb_link_node_rcu(node, parent, link);
+ rb_insert_color(node, tree);
+ return NULL;
+}
+
/**
* rb_find() - find @key in tree @tree
* @key: key to match
@@ -272,6 +308,37 @@ rb_find(const void *key, const struct rb_root *tree,
return NULL;
}
+/**
+ * rb_find_rcu() - find @key in tree @tree
+ * @key: key to match
+ * @tree: tree to search
+ * @cmp: operator defining the node order
+ *
+ * Notably, tree descent vs concurrent tree rotations is unsound and can result
+ * in false-negatives.
+ *
+ * Returns the rb_node matching @key or NULL.
+ */
+static __always_inline struct rb_node *
+rb_find_rcu(const void *key, const struct rb_root *tree,
+ int (*cmp)(const void *key, const struct rb_node *))
+{
+ struct rb_node *node = tree->rb_node;
+
+ while (node) {
+ int c = cmp(key, node);
+
+ if (c < 0)
+ node = rcu_dereference_raw(node->rb_left);
+ else if (c > 0)
+ node = rcu_dereference_raw(node->rb_right);
+ else
+ return node;
+ }
+
+ return NULL;
+}
+
/**
* rb_find_first() - find the first @key in @tree
* @key: key to match
--
2.43.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 1/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
@ 2024-07-31 21:42 ` Andrii Nakryiko
2024-08-01 11:09 ` Jiri Olsa
` (3 more replies)
2024-07-31 21:42 ` [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
` (6 subsequent siblings)
8 siblings, 4 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-07-31 21:42 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg, rostedt, mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck, Andrii Nakryiko
Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.
Right now, there are a few possible "owners" of uprobe refcount:
- uprobes_tree RB tree assumes one refcount when uprobe is registered
and added to the lookup tree;
- while uprobe is triggered and kernel is handling it in the breakpoint
handler code, temporary refcount bump is done to keep uprobe from
being freed;
- if we have uretprobe requested on a given struct uprobe instance, we
take another refcount to keep uprobe alive until user space code
returns from the function and triggers return handler.
The uprobes_tree's extra refcount of 1 is confusing and problematic. No
matter how many actual consumers are attached, they all share the same
refcount, and we have extra logic to drop the "last" (which might not
really be the last) refcount once uprobe's consumer list becomes empty.
This is unconventional and has to be kept in mind as a special case all
the time. Further, because of this design we have situations where
find_uprobe() will find a uprobe, bump its refcount, and return it to the
caller, but that uprobe still needs a uprobe_is_active() check, after which
the caller is required to drop the refcount and try again. This is just too
many details leaking to the higher level logic.
This patch changes the refcounting scheme in such a way that uprobes_tree no
longer keeps an extra refcount for struct uprobe. Instead, each
uprobe_consumer assumes its own refcount, which is dropped when the consumer
is unregistered. Other than that, all the active users of uprobe (entry and
return uprobe handling code) keep exactly the same refcounting approach.
With the above setup, once uprobe's refcount drops to zero, uprobe's
"destructor" has to remove uprobe from uprobes_tree, of course. This, though,
races with the uprobe entry handling code in handle_swbp(), whose
find_active_uprobe()->find_uprobe() lookup can observe a uprobe that is about
to be destroyed because its refcount dropped to zero (e.g., due to
uprobe_consumer unregistering). So we add try_get_uprobe(), which attempts to
bump the refcount unless it already is zero. The caller needs to guarantee
that the uprobe instance won't be freed in parallel, which is the case while
we hold uprobes_treelock (for read or write, doesn't matter).
Note also, the race between registration and unregistration no longer leaks
to callers, so we remove the retry logic completely. If find_uprobe() returns
a valid uprobe, it's guaranteed to remain in uprobes_tree with a properly
incremented refcount. The race is handled inside __insert_uprobe() and
put_uprobe() working together: __insert_uprobe() will remove a stale uprobe
from the RB-tree if it can't bump its refcount, and will then retry inserting
the new uprobe instance; put_uprobe() won't attempt to remove the uprobe from
the RB-tree if it's already not there. All that is protected by
uprobes_treelock, which keeps things simple.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
kernel/events/uprobes.c | 163 +++++++++++++++++++++++-----------------
1 file changed, 93 insertions(+), 70 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f88b7ff20587..23dde3ec5b09 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -587,25 +587,51 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
*(uprobe_opcode_t *)&auprobe->insn);
}
+/* uprobe should have guaranteed positive refcount */
static struct uprobe *get_uprobe(struct uprobe *uprobe)
{
refcount_inc(&uprobe->ref);
return uprobe;
}
+/*
+ * uprobe should have guaranteed lifetime, which can be either of:
+ * - caller already has refcount taken (and wants an extra one);
+ * - uprobe is RCU protected and won't be freed until after grace period;
+ * - we are holding uprobes_treelock (for read or write, doesn't matter).
+ */
+static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
+{
+ if (refcount_inc_not_zero(&uprobe->ref))
+ return uprobe;
+ return NULL;
+}
+
+static inline bool uprobe_is_active(struct uprobe *uprobe)
+{
+ return !RB_EMPTY_NODE(&uprobe->rb_node);
+}
+
static void put_uprobe(struct uprobe *uprobe)
{
- if (refcount_dec_and_test(&uprobe->ref)) {
- /*
- * If application munmap(exec_vma) before uprobe_unregister()
- * gets called, we don't get a chance to remove uprobe from
- * delayed_uprobe_list from remove_breakpoint(). Do it here.
- */
- mutex_lock(&delayed_uprobe_lock);
- delayed_uprobe_remove(uprobe, NULL);
- mutex_unlock(&delayed_uprobe_lock);
- kfree(uprobe);
- }
+ if (!refcount_dec_and_test(&uprobe->ref))
+ return;
+
+ write_lock(&uprobes_treelock);
+
+ if (uprobe_is_active(uprobe))
+ rb_erase(&uprobe->rb_node, &uprobes_tree);
+
+ write_unlock(&uprobes_treelock);
+
+ /*
+ * If application munmap(exec_vma) before uprobe_unregister()
+ * gets called, we don't get a chance to remove uprobe from
+ * delayed_uprobe_list from remove_breakpoint(). Do it here.
+ */
+ mutex_lock(&delayed_uprobe_lock);
+ delayed_uprobe_remove(uprobe, NULL);
+ mutex_unlock(&delayed_uprobe_lock);
}
static __always_inline
@@ -656,7 +682,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
if (node)
- return get_uprobe(__node_2_uprobe(node));
+ return try_get_uprobe(__node_2_uprobe(node));
return NULL;
}
@@ -676,26 +702,44 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
return uprobe;
}
+/*
+ * Attempt to insert a new uprobe into uprobes_tree.
+ *
+ * If uprobe already exists (for given inode+offset), we just increment
+ * refcount of previously existing uprobe.
+ *
+ * If not, a provided new instance of uprobe is inserted into the tree (with
+ * assumed initial refcount == 1).
+ *
+ * In any case, we return a uprobe instance that ends up being in uprobes_tree.
+ * Caller has to clean up new uprobe instance, if it ended up not being
+ * inserted into the tree.
+ *
+ * We assume that uprobes_treelock is held for writing.
+ */
static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
{
struct rb_node *node;
-
+again:
node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
- if (node)
- return get_uprobe(__node_2_uprobe(node));
+ if (node) {
+ struct uprobe *u = __node_2_uprobe(node);
- /* get access + creation ref */
- refcount_set(&uprobe->ref, 2);
- return NULL;
+ if (!try_get_uprobe(u)) {
+ rb_erase(node, &uprobes_tree);
+ RB_CLEAR_NODE(&u->rb_node);
+ goto again;
+ }
+
+ return u;
+ }
+
+ return uprobe;
}
/*
- * Acquire uprobes_treelock.
- * Matching uprobe already exists in rbtree;
- * increment (access refcount) and return the matching uprobe.
- *
- * No matching uprobe; insert the uprobe in rb_tree;
- * get a double refcount (access + creation) and return NULL.
+ * Acquire uprobes_treelock and insert uprobe into uprobes_tree
+ * (or reuse existing one, see __insert_uprobe() comments above).
*/
static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
@@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
uprobe->ref_ctr_offset = ref_ctr_offset;
init_rwsem(&uprobe->register_rwsem);
init_rwsem(&uprobe->consumer_rwsem);
+ RB_CLEAR_NODE(&uprobe->rb_node);
+ refcount_set(&uprobe->ref, 1);
/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
/* a uprobe exists for this inode:offset combination */
- if (cur_uprobe) {
+ if (cur_uprobe != uprobe) {
if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
ref_ctr_mismatch_warn(cur_uprobe, uprobe);
put_uprobe(cur_uprobe);
@@ -921,26 +967,6 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad
return set_orig_insn(&uprobe->arch, mm, vaddr);
}
-static inline bool uprobe_is_active(struct uprobe *uprobe)
-{
- return !RB_EMPTY_NODE(&uprobe->rb_node);
-}
-/*
- * There could be threads that have already hit the breakpoint. They
- * will recheck the current insn and restart if find_uprobe() fails.
- * See find_active_uprobe().
- */
-static void delete_uprobe(struct uprobe *uprobe)
-{
- if (WARN_ON(!uprobe_is_active(uprobe)))
- return;
-
- write_lock(&uprobes_treelock);
- rb_erase(&uprobe->rb_node, &uprobes_tree);
- write_unlock(&uprobes_treelock);
- RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
-}
-
struct map_info {
struct map_info *next;
struct mm_struct *mm;
@@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
int err;
down_write(&uprobe->register_rwsem);
- if (WARN_ON(!consumer_del(uprobe, uc)))
+ if (WARN_ON(!consumer_del(uprobe, uc))) {
err = -ENOENT;
- else
+ } else {
err = register_for_each_vma(uprobe, NULL);
-
- /* TODO : cant unregister? schedule a worker thread */
- if (!err) {
- if (!uprobe->consumers)
- delete_uprobe(uprobe);
- else
- err = -EBUSY;
+ /* TODO : cant unregister? schedule a worker thread */
+ WARN(err, "leaking uprobe due to failed unregistration");
}
up_write(&uprobe->register_rwsem);
@@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
return ERR_PTR(-EINVAL);
- retry:
uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
if (IS_ERR(uprobe))
return uprobe;
- /*
- * We can race with uprobe_unregister()->delete_uprobe().
- * Check uprobe_is_active() and retry if it is false.
- */
down_write(&uprobe->register_rwsem);
- ret = -EAGAIN;
- if (likely(uprobe_is_active(uprobe))) {
- consumer_add(uprobe, uc);
- ret = register_for_each_vma(uprobe, uc);
- }
+ consumer_add(uprobe, uc);
+ ret = register_for_each_vma(uprobe, uc);
up_write(&uprobe->register_rwsem);
- put_uprobe(uprobe);
if (ret) {
- if (unlikely(ret == -EAGAIN))
- goto retry;
uprobe_unregister(uprobe, uc);
return ERR_PTR(ret);
}
@@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
u = rb_entry(t, struct uprobe, rb_node);
if (u->inode != inode || u->offset < min)
break;
+ u = try_get_uprobe(u);
+ if (!u) /* uprobe already went away, safe to ignore */
+ continue;
list_add(&u->pending_list, head);
- get_uprobe(u);
}
for (t = n; (t = rb_next(t)); ) {
u = rb_entry(t, struct uprobe, rb_node);
if (u->inode != inode || u->offset > max)
break;
+ u = try_get_uprobe(u);
+ if (!u) /* uprobe already went away, safe to ignore */
+ continue;
list_add(&u->pending_list, head);
- get_uprobe(u);
}
}
read_unlock(&uprobes_treelock);
@@ -1752,6 +1766,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
return -ENOMEM;
*n = *o;
+ /*
+ * uprobe's refcnt has to be positive at this point, kept by
+ * utask->return_instances items; return_instances can't be
+ * removed right now, as task is blocked due to duping; so
+ * get_uprobe() is safe to use here.
+ */
get_uprobe(n->uprobe);
n->next = NULL;
@@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
}
orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
}
-
+ /*
+ * uprobe's refcnt is positive, held by caller, so it's safe to
+ * unconditionally bump it one more time here
+ */
ri->uprobe = get_uprobe(uprobe);
ri->func = instruction_pointer(regs);
ri->stack = user_stack_pointer(regs);
--
2.43.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 1/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
@ 2024-07-31 21:42 ` Andrii Nakryiko
2024-08-01 12:23 ` Liao, Chang
2024-08-05 14:51 ` Oleg Nesterov
2024-07-31 21:42 ` [PATCH 4/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
` (5 subsequent siblings)
8 siblings, 2 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-07-31 21:42 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg, rostedt, mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck, Andrii Nakryiko
To avoid unnecessarily taking a (brief) refcount on uprobe during
breakpoint handling in handle_swbp() for entry uprobes, make find_uprobe()
not take a refcount, and instead protect the lifetime of a uprobe instance
with RCU. This improves scalability, as the refcount gets quite expensive
due to cache line bouncing between multiple CPUs.
Specifically, we utilize our own uprobe-specific SRCU instance for this
RCU protection. put_uprobe() will delay actual kfree() using call_srcu().
For now, uretprobe and single-stepping handling will still acquire a
refcount as necessary. We'll address these issues in follow-up patches
by making them use SRCU with a timeout.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
kernel/events/uprobes.c | 93 ++++++++++++++++++++++++-----------------
1 file changed, 55 insertions(+), 38 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 23dde3ec5b09..6d5c3f4b210f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -41,6 +41,8 @@ static struct rb_root uprobes_tree = RB_ROOT;
static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
+DEFINE_STATIC_SRCU(uprobes_srcu);
+
#define UPROBES_HASH_SZ 13
/* serialize uprobe->pending_list */
static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
@@ -59,6 +61,7 @@ struct uprobe {
struct list_head pending_list;
struct uprobe_consumer *consumers;
struct inode *inode; /* Also hold a ref to inode */
+ struct rcu_head rcu;
loff_t offset;
loff_t ref_ctr_offset;
unsigned long flags;
@@ -612,6 +615,13 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
return !RB_EMPTY_NODE(&uprobe->rb_node);
}
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+ struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+
+ kfree(uprobe);
+}
+
static void put_uprobe(struct uprobe *uprobe)
{
if (!refcount_dec_and_test(&uprobe->ref))
@@ -632,6 +642,8 @@ static void put_uprobe(struct uprobe *uprobe)
mutex_lock(&delayed_uprobe_lock);
delayed_uprobe_remove(uprobe, NULL);
mutex_unlock(&delayed_uprobe_lock);
+
+ call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
}
static __always_inline
@@ -673,33 +685,25 @@ static inline int __uprobe_cmp(struct rb_node *a, const struct rb_node *b)
return uprobe_cmp(u->inode, u->offset, __node_2_uprobe(b));
}
-static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
+/*
+ * Assumes being inside RCU protected region.
+ * No refcount is taken on returned uprobe.
+ */
+static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
{
struct __uprobe_key key = {
.inode = inode,
.offset = offset,
};
- struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
-
- if (node)
- return try_get_uprobe(__node_2_uprobe(node));
+ struct rb_node *node;
- return NULL;
-}
-
-/*
- * Find a uprobe corresponding to a given inode:offset
- * Acquires uprobes_treelock
- */
-static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
-{
- struct uprobe *uprobe;
+ lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
read_lock(&uprobes_treelock);
- uprobe = __find_uprobe(inode, offset);
+ node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
read_unlock(&uprobes_treelock);
- return uprobe;
+ return node ? __node_2_uprobe(node) : NULL;
}
/*
@@ -1073,10 +1077,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
goto free;
/*
* We take mmap_lock for writing to avoid the race with
- * find_active_uprobe() which takes mmap_lock for reading.
+ * find_active_uprobe_rcu() which takes mmap_lock for reading.
* Thus this install_breakpoint() can not make
- * is_trap_at_addr() true right after find_uprobe()
- * returns NULL in find_active_uprobe().
+ * is_trap_at_addr() true right after find_uprobe_rcu()
+ * returns NULL in find_active_uprobe_rcu().
*/
mmap_write_lock(mm);
vma = find_vma(mm, info->vaddr);
@@ -1885,9 +1889,13 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
return;
}
+ /* we need to bump refcount to store uprobe in utask */
+ if (!try_get_uprobe(uprobe))
+ return;
+
ri = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
if (!ri)
- return;
+ goto fail;
trampoline_vaddr = uprobe_get_trampoline_vaddr();
orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
@@ -1914,11 +1922,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
}
orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
}
- /*
- * uprobe's refcnt is positive, held by caller, so it's safe to
- * unconditionally bump it one more time here
- */
- ri->uprobe = get_uprobe(uprobe);
+ ri->uprobe = uprobe;
ri->func = instruction_pointer(regs);
ri->stack = user_stack_pointer(regs);
ri->orig_ret_vaddr = orig_ret_vaddr;
@@ -1929,8 +1933,9 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
utask->return_instances = ri;
return;
- fail:
+fail:
kfree(ri);
+ put_uprobe(uprobe);
}
/* Prepare to single-step probed instruction out of line. */
@@ -1945,9 +1950,14 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
if (!utask)
return -ENOMEM;
+ if (!try_get_uprobe(uprobe))
+ return -EINVAL;
+
xol_vaddr = xol_get_insn_slot(uprobe);
- if (!xol_vaddr)
- return -ENOMEM;
+ if (!xol_vaddr) {
+ err = -ENOMEM;
+ goto err_out;
+ }
utask->xol_vaddr = xol_vaddr;
utask->vaddr = bp_vaddr;
@@ -1955,12 +1965,15 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
err = arch_uprobe_pre_xol(&uprobe->arch, regs);
if (unlikely(err)) {
xol_free_insn_slot(current);
- return err;
+ goto err_out;
}
utask->active_uprobe = uprobe;
utask->state = UTASK_SSTEP;
return 0;
+err_out:
+ put_uprobe(uprobe);
+ return err;
}
/*
@@ -2044,7 +2057,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
return is_trap_insn(&opcode);
}
-static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
+/* assumes being inside RCU protected region */
+static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
{
struct mm_struct *mm = current->mm;
struct uprobe *uprobe = NULL;
@@ -2057,7 +2071,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
struct inode *inode = file_inode(vma->vm_file);
loff_t offset = vaddr_to_offset(vma, bp_vaddr);
- uprobe = find_uprobe(inode, offset);
+ uprobe = find_uprobe_rcu(inode, offset);
}
if (!uprobe)
@@ -2201,13 +2215,15 @@ static void handle_swbp(struct pt_regs *regs)
{
struct uprobe *uprobe;
unsigned long bp_vaddr;
- int is_swbp;
+ int is_swbp, srcu_idx;
bp_vaddr = uprobe_get_swbp_addr(regs);
if (bp_vaddr == uprobe_get_trampoline_vaddr())
return uprobe_handle_trampoline(regs);
- uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
+ srcu_idx = srcu_read_lock(&uprobes_srcu);
+
+ uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
if (!uprobe) {
if (is_swbp > 0) {
/* No matching uprobe; signal SIGTRAP. */
@@ -2223,6 +2239,7 @@ static void handle_swbp(struct pt_regs *regs)
*/
instruction_pointer_set(regs, bp_vaddr);
}
+ srcu_read_unlock(&uprobes_srcu, srcu_idx);
return;
}
@@ -2258,12 +2275,12 @@ static void handle_swbp(struct pt_regs *regs)
if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
goto out;
- if (!pre_ssout(uprobe, regs, bp_vaddr))
- return;
+ if (pre_ssout(uprobe, regs, bp_vaddr))
+ goto out;
- /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
out:
- put_uprobe(uprobe);
+ /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
+ srcu_read_unlock(&uprobes_srcu, srcu_idx);
}
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 4/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (2 preceding siblings ...)
2024-07-31 21:42 ` [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
@ 2024-07-31 21:42 ` Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
` (4 subsequent siblings)
8 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-07-31 21:42 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg, rostedt, mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck, Andrii Nakryiko
It serves no purpose beyond adding an unnecessary argument passed to the
filter callback. Just get rid of it; no one is actually using it.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
include/linux/uprobes.h | 10 +---------
kernel/events/uprobes.c | 18 +++++++-----------
kernel/trace/bpf_trace.c | 3 +--
kernel/trace/trace_uprobe.c | 9 +++------
4 files changed, 12 insertions(+), 28 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 137ddfc0b2f8..8d5bbad2048c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -28,20 +28,12 @@ struct page;
#define MAX_URETPROBE_DEPTH 64
-enum uprobe_filter_ctx {
- UPROBE_FILTER_REGISTER,
- UPROBE_FILTER_UNREGISTER,
- UPROBE_FILTER_MMAP,
-};
-
struct uprobe_consumer {
int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
int (*ret_handler)(struct uprobe_consumer *self,
unsigned long func,
struct pt_regs *regs);
- bool (*filter)(struct uprobe_consumer *self,
- enum uprobe_filter_ctx ctx,
- struct mm_struct *mm);
+ bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
struct uprobe_consumer *next;
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6d5c3f4b210f..71a8886608b1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -913,21 +913,19 @@ static int prepare_uprobe(struct uprobe *uprobe, struct file *file,
return ret;
}
-static inline bool consumer_filter(struct uprobe_consumer *uc,
- enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+static inline bool consumer_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
{
- return !uc->filter || uc->filter(uc, ctx, mm);
+ return !uc->filter || uc->filter(uc, mm);
}
-static bool filter_chain(struct uprobe *uprobe,
- enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
{
struct uprobe_consumer *uc;
bool ret = false;
down_read(&uprobe->consumer_rwsem);
for (uc = uprobe->consumers; uc; uc = uc->next) {
- ret = consumer_filter(uc, ctx, mm);
+ ret = consumer_filter(uc, mm);
if (ret)
break;
}
@@ -1094,12 +1092,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
if (is_register) {
/* consult only the "caller", new consumer. */
- if (consumer_filter(new,
- UPROBE_FILTER_REGISTER, mm))
+ if (consumer_filter(new, mm))
err = install_breakpoint(uprobe, mm, vma, info->vaddr);
} else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) {
- if (!filter_chain(uprobe,
- UPROBE_FILTER_UNREGISTER, mm))
+ if (!filter_chain(uprobe, mm))
err |= remove_breakpoint(uprobe, mm, info->vaddr);
}
@@ -1383,7 +1379,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
*/
list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
if (!fatal_signal_pending(current) &&
- filter_chain(uprobe, UPROBE_FILTER_MMAP, vma->vm_mm)) {
+ filter_chain(uprobe, vma->vm_mm)) {
unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset);
install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 4e391daafa64..73c570b5988b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3320,8 +3320,7 @@ static int uprobe_prog_run(struct bpf_uprobe *uprobe,
}
static bool
-uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx,
- struct mm_struct *mm)
+uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm)
{
struct bpf_uprobe *uprobe;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 52e76a73fa7c..7eb79e0a5352 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1078,9 +1078,7 @@ print_uprobe_event(struct trace_iterator *iter, int flags, struct trace_event *e
return trace_handle_return(s);
}
-typedef bool (*filter_func_t)(struct uprobe_consumer *self,
- enum uprobe_filter_ctx ctx,
- struct mm_struct *mm);
+typedef bool (*filter_func_t)(struct uprobe_consumer *self, struct mm_struct *mm);
static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
{
@@ -1339,8 +1337,7 @@ static int uprobe_perf_open(struct trace_event_call *call,
return err;
}
-static bool uprobe_perf_filter(struct uprobe_consumer *uc,
- enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+static bool uprobe_perf_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
{
struct trace_uprobe_filter *filter;
struct trace_uprobe *tu;
@@ -1426,7 +1423,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
static int uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs,
struct uprobe_cpu_buffer **ucbp)
{
- if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
+ if (!uprobe_perf_filter(&tu->consumer, current->mm))
return UPROBE_HANDLER_REMOVE;
if (!is_ret_probe(tu))
--
2.43.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (3 preceding siblings ...)
2024-07-31 21:42 ` [PATCH 4/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
@ 2024-07-31 21:42 ` Andrii Nakryiko
2024-08-01 14:27 ` Jiri Olsa
2024-08-05 15:59 ` Oleg Nesterov
2024-07-31 21:42 ` [PATCH 6/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
` (3 subsequent siblings)
8 siblings, 2 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-07-31 21:42 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg, rostedt, mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck, Andrii Nakryiko
uprobe->register_rwsem is one of a few big bottlenecks to scalability of
uprobes, so we need to get rid of it to improve uprobe performance and
multi-CPU scalability.
First, we turn uprobe's consumer list into a typical doubly-linked list
and utilize existing RCU-aware helpers for traversing such lists, as
well as for adding and removing elements from it.
For entry uprobes we already have SRCU protection active from before the
uprobe lookup. For uretprobes we keep the refcount, guaranteeing that the
uprobe won't go away from under us, but we add SRCU protection around the
consumer list traversal.
Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple,
we remember whether any removal was requested during handler calls, but
then double-check that decision under a proper register_rwsem using the
consumers' filter callbacks. Handler removal is very rare, so this extra
lock won't hurt performance overall, and we avoid the need for any
extra protection (e.g., seqcount locks).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
include/linux/uprobes.h | 2 +-
kernel/events/uprobes.c | 97 ++++++++++++++++++++---------------------
2 files changed, 48 insertions(+), 51 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 8d5bbad2048c..a1686c1ebcb6 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -35,7 +35,7 @@ struct uprobe_consumer {
struct pt_regs *regs);
bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
- struct uprobe_consumer *next;
+ struct list_head cons_node;
};
#ifdef CONFIG_UPROBES
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 71a8886608b1..3b42fd355256 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -59,7 +59,7 @@ struct uprobe {
struct rw_semaphore register_rwsem;
struct rw_semaphore consumer_rwsem;
struct list_head pending_list;
- struct uprobe_consumer *consumers;
+ struct list_head consumers;
struct inode *inode; /* Also hold a ref to inode */
struct rcu_head rcu;
loff_t offset;
@@ -778,6 +778,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
uprobe->inode = inode;
uprobe->offset = offset;
uprobe->ref_ctr_offset = ref_ctr_offset;
+ INIT_LIST_HEAD(&uprobe->consumers);
init_rwsem(&uprobe->register_rwsem);
init_rwsem(&uprobe->consumer_rwsem);
RB_CLEAR_NODE(&uprobe->rb_node);
@@ -803,34 +804,10 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
{
down_write(&uprobe->consumer_rwsem);
- uc->next = uprobe->consumers;
- uprobe->consumers = uc;
+ list_add_rcu(&uc->cons_node, &uprobe->consumers);
up_write(&uprobe->consumer_rwsem);
}
-/*
- * For uprobe @uprobe, delete the consumer @uc.
- * Return true if the @uc is deleted successfully
- * or return false.
- */
-static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
-{
- struct uprobe_consumer **con;
- bool ret = false;
-
- down_write(&uprobe->consumer_rwsem);
- for (con = &uprobe->consumers; *con; con = &(*con)->next) {
- if (*con == uc) {
- *con = uc->next;
- ret = true;
- break;
- }
- }
- up_write(&uprobe->consumer_rwsem);
-
- return ret;
-}
-
static int __copy_insn(struct address_space *mapping, struct file *filp,
void *insn, int nbytes, loff_t offset)
{
@@ -924,7 +901,8 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
bool ret = false;
down_read(&uprobe->consumer_rwsem);
- for (uc = uprobe->consumers; uc; uc = uc->next) {
+ list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+ srcu_read_lock_held(&uprobes_srcu)) {
ret = consumer_filter(uc, mm);
if (ret)
break;
@@ -1120,17 +1098,19 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
int err;
down_write(&uprobe->register_rwsem);
- if (WARN_ON(!consumer_del(uprobe, uc))) {
- err = -ENOENT;
- } else {
- err = register_for_each_vma(uprobe, NULL);
- /* TODO : cant unregister? schedule a worker thread */
- WARN(err, "leaking uprobe due to failed unregistration");
- }
+
+ list_del_rcu(&uc->cons_node);
+ err = register_for_each_vma(uprobe, NULL);
+
up_write(&uprobe->register_rwsem);
- if (!err)
- put_uprobe(uprobe);
+ /* TODO : cant unregister? schedule a worker thread */
+ if (WARN(err, "leaking uprobe due to failed unregistration"))
+ return;
+
+ put_uprobe(uprobe);
+
+ synchronize_srcu(&uprobes_srcu);
}
EXPORT_SYMBOL_GPL(uprobe_unregister);
@@ -1208,13 +1188,20 @@ EXPORT_SYMBOL_GPL(uprobe_register);
int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
{
struct uprobe_consumer *con;
- int ret = -ENOENT;
+ int ret = -ENOENT, srcu_idx;
down_write(&uprobe->register_rwsem);
- for (con = uprobe->consumers; con && con != uc ; con = con->next)
- ;
- if (con)
- ret = register_for_each_vma(uprobe, add ? uc : NULL);
+
+ srcu_idx = srcu_read_lock(&uprobes_srcu);
+ list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
+ srcu_read_lock_held(&uprobes_srcu)) {
+ if (con == uc) {
+ ret = register_for_each_vma(uprobe, add ? uc : NULL);
+ break;
+ }
+ }
+ srcu_read_unlock(&uprobes_srcu, srcu_idx);
+
up_write(&uprobe->register_rwsem);
return ret;
@@ -2088,9 +2075,10 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
struct uprobe_consumer *uc;
int remove = UPROBE_HANDLER_REMOVE;
bool need_prep = false; /* prepare return uprobe, when needed */
+ bool has_consumers = false;
- down_read(&uprobe->register_rwsem);
- for (uc = uprobe->consumers; uc; uc = uc->next) {
+ list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+ srcu_read_lock_held(&uprobes_srcu)) {
int rc = 0;
if (uc->handler) {
@@ -2103,16 +2091,23 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
need_prep = true;
remove &= rc;
+ has_consumers = true;
}
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
- if (remove && uprobe->consumers) {
- WARN_ON(!uprobe_is_active(uprobe));
- unapply_uprobe(uprobe, current->mm);
+ if (remove && has_consumers) {
+ down_read(&uprobe->register_rwsem);
+
+ /* re-check that removal is still required, this time under lock */
+ if (!filter_chain(uprobe, current->mm)) {
+ WARN_ON(!uprobe_is_active(uprobe));
+ unapply_uprobe(uprobe, current->mm);
+ }
+
+ up_read(&uprobe->register_rwsem);
}
- up_read(&uprobe->register_rwsem);
}
static void
@@ -2120,13 +2115,15 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
{
struct uprobe *uprobe = ri->uprobe;
struct uprobe_consumer *uc;
+ int srcu_idx;
- down_read(&uprobe->register_rwsem);
- for (uc = uprobe->consumers; uc; uc = uc->next) {
+ srcu_idx = srcu_read_lock(&uprobes_srcu);
+ list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+ srcu_read_lock_held(&uprobes_srcu)) {
if (uc->ret_handler)
uc->ret_handler(uc, ri->func, regs);
}
- up_read(&uprobe->register_rwsem);
+ srcu_read_unlock(&uprobes_srcu, srcu_idx);
}
static struct return_instance *find_next_ret_chain(struct return_instance *ri)
--
2.43.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 6/8] perf/uprobe: split uprobe_unregister()
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (4 preceding siblings ...)
2024-07-31 21:42 ` [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
@ 2024-07-31 21:42 ` Andrii Nakryiko
2024-08-02 2:41 ` Liao, Chang
2024-08-07 13:17 ` Oleg Nesterov
2024-07-31 21:42 ` [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
` (2 subsequent siblings)
8 siblings, 2 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-07-31 21:42 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg, rostedt, mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck, Andrii Nakryiko
From: Peter Zijlstra <peterz@infradead.org>
With uprobe_unregister() having grown a synchronize_srcu(), it becomes
fairly slow to call. Esp. since both users of this API call it in a
loop.
Peel off the sync_srcu() and do it once, after the loop.
With uprobe_register()'s error handling now reusing the full
uprobe_unregister() call, we need to be careful not to return to the
caller before we have a guarantee that the partially attached consumer won't
be called anymore. So add uprobe_unregister_sync() in the error handling
path. This is an unlikely slow path, and it should be totally fine for it
to be slow in the case of a failed attach.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
include/linux/uprobes.h | 8 ++++++--
kernel/events/uprobes.c | 18 ++++++++++++++----
kernel/trace/bpf_trace.c | 5 ++++-
kernel/trace/trace_uprobe.c | 6 +++++-
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 3 ++-
5 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a1686c1ebcb6..8f1999eb9d9f 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -105,7 +105,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc);
+extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
+extern void uprobe_unregister_sync(void);
extern int uprobe_mmap(struct vm_area_struct *vma);
extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
extern void uprobe_start_dup_mmap(void);
@@ -154,7 +155,10 @@ uprobe_apply(struct uprobe* uprobe, struct uprobe_consumer *uc, bool add)
return -ENOSYS;
}
static inline void
-uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
+uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
+{
+}
+static inline void uprobes_unregister_sync(void)
{
}
static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 3b42fd355256..b0488d356399 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1089,11 +1089,11 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
}
/**
- * uprobe_unregister - unregister an already registered probe.
+ * uprobe_unregister_nosync - unregister an already registered probe.
* @uprobe: uprobe to remove
* @uc: identify which probe if multiple probes are colocated.
*/
-void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
+void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
{
int err;
@@ -1109,10 +1109,14 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
return;
put_uprobe(uprobe);
+}
+EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
+void uprobe_unregister_sync(void)
+{
synchronize_srcu(&uprobes_srcu);
}
-EXPORT_SYMBOL_GPL(uprobe_unregister);
+EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
/**
* uprobe_register - register a probe
@@ -1170,7 +1174,13 @@ struct uprobe *uprobe_register(struct inode *inode,
up_write(&uprobe->register_rwsem);
if (ret) {
- uprobe_unregister(uprobe, uc);
+ uprobe_unregister_nosync(uprobe, uc);
+ /*
+ * Registration might have partially succeeded, so we can have
+ * this consumer being called right at this time. We need to
+ * sync here. It's ok, it's unlikely slow path.
+ */
+ uprobe_unregister_sync();
return ERR_PTR(ret);
}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 73c570b5988b..6b632710c98e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3184,7 +3184,10 @@ static void bpf_uprobe_unregister(struct bpf_uprobe *uprobes, u32 cnt)
u32 i;
for (i = 0; i < cnt; i++)
- uprobe_unregister(uprobes[i].uprobe, &uprobes[i].consumer);
+ uprobe_unregister_nosync(uprobes[i].uprobe, &uprobes[i].consumer);
+
+ if (cnt)
+ uprobe_unregister_sync();
}
static void bpf_uprobe_multi_link_release(struct bpf_link *link)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 7eb79e0a5352..f7443e996b1b 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1097,6 +1097,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
static void __probe_event_disable(struct trace_probe *tp)
{
struct trace_uprobe *tu;
+ bool sync = false;
tu = container_of(tp, struct trace_uprobe, tp);
WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
@@ -1105,9 +1106,12 @@ static void __probe_event_disable(struct trace_probe *tp)
if (!tu->uprobe)
continue;
- uprobe_unregister(tu->uprobe, &tu->consumer);
+ uprobe_unregister_nosync(tu->uprobe, &tu->consumer);
+ sync = true;
tu->uprobe = NULL;
}
+ if (sync)
+ uprobe_unregister_sync();
}
static int probe_event_enable(struct trace_event_call *call,
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 73a6b041bcce..928c73cde32e 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -478,7 +478,8 @@ static void testmod_unregister_uprobe(void)
mutex_lock(&testmod_uprobe_mutex);
if (uprobe.uprobe) {
- uprobe_unregister(uprobe.uprobe, &uprobe.consumer);
+ uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer);
+ uprobe_unregister_sync();
uprobe.offset = 0;
uprobe.uprobe = NULL;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (5 preceding siblings ...)
2024-07-31 21:42 ` [PATCH 6/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
@ 2024-07-31 21:42 ` Andrii Nakryiko
2024-08-07 17:14 ` Andrii Nakryiko
2024-08-08 13:40 ` Oleg Nesterov
2024-07-31 21:42 ` [PATCH 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
2024-08-07 13:29 ` [PATCH 0/8] uprobes: RCU-protected hot path optimizations Oleg Nesterov
8 siblings, 2 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-07-31 21:42 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg, rostedt, mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck, Andrii Nakryiko
Another big bottleneck to scalability is uprobes_treelock, which is taken in
a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
take advantage of that and make the uprobes_tree RB-tree lookup lockless.
To make the RCU-protected lockless RB-tree lookup correct, we need to take
into account that such a lookup can return false negatives if there
are parallel RB-tree modifications (rotations) going on. We use a seqcount
lock to detect whether the RB-tree changed, and if we find nothing while the
RB-tree got modified in between, we just retry. If a uprobe was found, then
it's guaranteed to be a correct lookup.
With all the lock-avoiding changes done, we get a pretty decent
improvement in uprobe performance and scalability with the number of
CPUs, even though we are still nowhere near linear scalability. This is
due to SRCU not really scaling very well with the number of CPUs on the
particular hardware used for testing (80-core Intel Xeon Gold
6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock, which is
currently taken to resolve the interrupt address to inode+offset and then
to a uprobe instance. And, of course, uretprobes still need similar RCU
protection to avoid the refcount in the hot path, which will be addressed
in follow-up patches.
Nevertheless, the improvement is good. We used BPF selftest-based
uprobe-nop and uretprobe-nop benchmarks to get the below numbers,
varying number of CPUs on which uprobes and uretprobes are triggered.
BASELINE
========
uprobe-nop ( 1 cpus): 3.032 ± 0.023M/s ( 3.032M/s/cpu)
uprobe-nop ( 2 cpus): 3.452 ± 0.005M/s ( 1.726M/s/cpu)
uprobe-nop ( 4 cpus): 3.663 ± 0.005M/s ( 0.916M/s/cpu)
uprobe-nop ( 8 cpus): 3.718 ± 0.038M/s ( 0.465M/s/cpu)
uprobe-nop (16 cpus): 3.344 ± 0.008M/s ( 0.209M/s/cpu)
uprobe-nop (32 cpus): 2.288 ± 0.021M/s ( 0.071M/s/cpu)
uprobe-nop (64 cpus): 3.205 ± 0.004M/s ( 0.050M/s/cpu)
uretprobe-nop ( 1 cpus): 1.979 ± 0.005M/s ( 1.979M/s/cpu)
uretprobe-nop ( 2 cpus): 2.361 ± 0.005M/s ( 1.180M/s/cpu)
uretprobe-nop ( 4 cpus): 2.309 ± 0.002M/s ( 0.577M/s/cpu)
uretprobe-nop ( 8 cpus): 2.253 ± 0.001M/s ( 0.282M/s/cpu)
uretprobe-nop (16 cpus): 2.007 ± 0.000M/s ( 0.125M/s/cpu)
uretprobe-nop (32 cpus): 1.624 ± 0.003M/s ( 0.051M/s/cpu)
uretprobe-nop (64 cpus): 2.149 ± 0.001M/s ( 0.034M/s/cpu)
SRCU CHANGES
============
uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu)
uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu)
uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu)
uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu)
uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu)
uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu)
uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu)
uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu)
uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu)
uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu)
uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu)
uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu)
uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu)
uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu)
Peak throughput increased from 3.7 mln/s (uprobe triggerings) up to about
8 mln/s. For uretprobes it's a bit more modest, with a bump from 2.4 mln/s
to 5 mln/s.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
kernel/events/uprobes.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index b0488d356399..d03962cc96de 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
#define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
DEFINE_STATIC_SRCU(uprobes_srcu);
@@ -629,8 +630,11 @@ static void put_uprobe(struct uprobe *uprobe)
write_lock(&uprobes_treelock);
- if (uprobe_is_active(uprobe))
+ if (uprobe_is_active(uprobe)) {
+ write_seqcount_begin(&uprobes_seqcount);
rb_erase(&uprobe->rb_node, &uprobes_tree);
+ write_seqcount_end(&uprobes_seqcount);
+ }
write_unlock(&uprobes_treelock);
@@ -696,14 +700,26 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
.offset = offset,
};
struct rb_node *node;
+ unsigned int seq;
lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
- read_lock(&uprobes_treelock);
- node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
- read_unlock(&uprobes_treelock);
+ do {
+ seq = read_seqcount_begin(&uprobes_seqcount);
+ node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
+ /*
+ * Lockless RB-tree lookups can result only in false negatives.
+ * If the element is found, it is correct and can be returned
+ * under RCU protection. If we find nothing, we need to
+ * validate that seqcount didn't change. If it did, we have to
+ * try again as we might have missed the element (false
+ * negative). If seqcount is unchanged, search truly failed.
+ */
+ if (node)
+ return __node_2_uprobe(node);
+ } while (read_seqcount_retry(&uprobes_seqcount, seq));
- return node ? __node_2_uprobe(node) : NULL;
+ return NULL;
}
/*
@@ -725,7 +741,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
{
struct rb_node *node;
again:
- node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
+ node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
if (node) {
struct uprobe *u = __node_2_uprobe(node);
@@ -750,7 +766,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
struct uprobe *u;
write_lock(&uprobes_treelock);
+ write_seqcount_begin(&uprobes_seqcount);
u = __insert_uprobe(uprobe);
+ write_seqcount_end(&uprobes_seqcount);
write_unlock(&uprobes_treelock);
return u;
--
2.43.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (6 preceding siblings ...)
2024-07-31 21:42 ` [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
@ 2024-07-31 21:42 ` Andrii Nakryiko
2024-08-01 9:35 ` Peter Zijlstra
2024-08-07 13:29 ` [PATCH 0/8] uprobes: RCU-protected hot path optimizations Oleg Nesterov
8 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-07-31 21:42 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg, rostedt, mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck, Andrii Nakryiko
This patch switches uprobes SRCU usage to RCU Tasks Trace flavor, which
is optimized for more lightweight and quick readers (at the expense of
slower writers, which for uprobes is a fine trade-off) and has better
performance and scalability with the number of CPUs.
Similarly to baseline vs SRCU, we've benchmarked SRCU-based
implementation vs RCU Tasks Trace implementation.
SRCU
====
uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu)
uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu)
uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu)
uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu)
uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu)
uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu)
uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu)
uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu)
uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu)
uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu)
uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu)
uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu)
uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu)
uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu)
RCU Tasks Trace
===============
uprobe-nop ( 1 cpus): 3.396 ± 0.002M/s ( 3.396M/s/cpu)
uprobe-nop ( 2 cpus): 4.271 ± 0.006M/s ( 2.135M/s/cpu)
uprobe-nop ( 4 cpus): 8.499 ± 0.015M/s ( 2.125M/s/cpu)
uprobe-nop ( 8 cpus): 10.355 ± 0.028M/s ( 1.294M/s/cpu)
uprobe-nop (16 cpus): 7.615 ± 0.099M/s ( 0.476M/s/cpu)
uprobe-nop (32 cpus): 4.430 ± 0.007M/s ( 0.138M/s/cpu)
uprobe-nop (64 cpus): 6.887 ± 0.020M/s ( 0.108M/s/cpu)
uretprobe-nop ( 1 cpus): 2.174 ± 0.001M/s ( 2.174M/s/cpu)
uretprobe-nop ( 2 cpus): 2.853 ± 0.001M/s ( 1.426M/s/cpu)
uretprobe-nop ( 4 cpus): 4.913 ± 0.002M/s ( 1.228M/s/cpu)
uretprobe-nop ( 8 cpus): 5.883 ± 0.002M/s ( 0.735M/s/cpu)
uretprobe-nop (16 cpus): 5.147 ± 0.001M/s ( 0.322M/s/cpu)
uretprobe-nop (32 cpus): 3.738 ± 0.008M/s ( 0.117M/s/cpu)
uretprobe-nop (64 cpus): 4.397 ± 0.002M/s ( 0.069M/s/cpu)
Peak throughput for uprobes increases from 8 mln/s to 10.3 mln/s
(+28%!), and for uretprobes from 5.3 mln/s to 5.8 mln/s (+11%), as we
have more work to do on the uretprobes side.
Even single-thread (no contention) performance is slightly better: 3.276
mln/s to 3.396 mln/s (+3.5%) for uprobes, and 2.055 mln/s to 2.174 mln/s
(+5.8%) for uretprobes.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
kernel/events/uprobes.c | 40 +++++++++++++++++-----------------------
1 file changed, 17 insertions(+), 23 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d03962cc96de..ef915f87d27f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -42,8 +42,6 @@ static struct rb_root uprobes_tree = RB_ROOT;
static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
-DEFINE_STATIC_SRCU(uprobes_srcu);
-
#define UPROBES_HASH_SZ 13
/* serialize uprobe->pending_list */
static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
@@ -647,7 +645,7 @@ static void put_uprobe(struct uprobe *uprobe)
delayed_uprobe_remove(uprobe, NULL);
mutex_unlock(&delayed_uprobe_lock);
- call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
+ call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
}
static __always_inline
@@ -702,7 +700,7 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
struct rb_node *node;
unsigned int seq;
- lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
+ lockdep_assert(rcu_read_lock_trace_held());
do {
seq = read_seqcount_begin(&uprobes_seqcount);
@@ -919,8 +917,7 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
bool ret = false;
down_read(&uprobe->consumer_rwsem);
- list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
- srcu_read_lock_held(&uprobes_srcu)) {
+ list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
ret = consumer_filter(uc, mm);
if (ret)
break;
@@ -1132,7 +1129,7 @@ EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
void uprobe_unregister_sync(void)
{
- synchronize_srcu(&uprobes_srcu);
+ synchronize_rcu_tasks_trace();
}
EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
@@ -1216,19 +1213,18 @@ EXPORT_SYMBOL_GPL(uprobe_register);
int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
{
struct uprobe_consumer *con;
- int ret = -ENOENT, srcu_idx;
+ int ret = -ENOENT;
down_write(&uprobe->register_rwsem);
- srcu_idx = srcu_read_lock(&uprobes_srcu);
- list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
- srcu_read_lock_held(&uprobes_srcu)) {
+ rcu_read_lock_trace();
+ list_for_each_entry_rcu(con, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
if (con == uc) {
ret = register_for_each_vma(uprobe, add ? uc : NULL);
break;
}
}
- srcu_read_unlock(&uprobes_srcu, srcu_idx);
+ rcu_read_unlock_trace();
up_write(&uprobe->register_rwsem);
@@ -2105,8 +2101,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
bool need_prep = false; /* prepare return uprobe, when needed */
bool has_consumers = false;
- list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
- srcu_read_lock_held(&uprobes_srcu)) {
+ list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
int rc = 0;
if (uc->handler) {
@@ -2143,15 +2138,14 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
{
struct uprobe *uprobe = ri->uprobe;
struct uprobe_consumer *uc;
- int srcu_idx;
- srcu_idx = srcu_read_lock(&uprobes_srcu);
- list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
- srcu_read_lock_held(&uprobes_srcu)) {
+ rcu_read_lock_trace();
+ list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node,
+ rcu_read_lock_trace_held()) {
if (uc->ret_handler)
uc->ret_handler(uc, ri->func, regs);
}
- srcu_read_unlock(&uprobes_srcu, srcu_idx);
+ rcu_read_unlock_trace();
}
static struct return_instance *find_next_ret_chain(struct return_instance *ri)
@@ -2236,13 +2230,13 @@ static void handle_swbp(struct pt_regs *regs)
{
struct uprobe *uprobe;
unsigned long bp_vaddr;
- int is_swbp, srcu_idx;
+ int is_swbp;
bp_vaddr = uprobe_get_swbp_addr(regs);
if (bp_vaddr == uprobe_get_trampoline_vaddr())
return uprobe_handle_trampoline(regs);
- srcu_idx = srcu_read_lock(&uprobes_srcu);
+ rcu_read_lock_trace();
uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
if (!uprobe) {
@@ -2260,7 +2254,7 @@ static void handle_swbp(struct pt_regs *regs)
*/
instruction_pointer_set(regs, bp_vaddr);
}
- srcu_read_unlock(&uprobes_srcu, srcu_idx);
+ rcu_read_unlock_trace();
return;
}
@@ -2301,7 +2295,7 @@ static void handle_swbp(struct pt_regs *regs)
out:
/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
- srcu_read_unlock(&uprobes_srcu, srcu_idx);
+ rcu_read_unlock_trace();
}
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
2024-07-31 21:42 ` [PATCH 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
@ 2024-08-01 9:35 ` Peter Zijlstra
2024-08-01 16:49 ` Andrii Nakryiko
2024-08-01 18:05 ` Paul E. McKenney
0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2024-08-01 9:35 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, oleg, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck
On Wed, Jul 31, 2024 at 02:42:56PM -0700, Andrii Nakryiko wrote:
> This patch switches uprobes SRCU usage to RCU Tasks Trace flavor, which
> is optimized for more lightweight and quick readers (at the expense of
> slower writers, which for uprobes is a fine trade-off) and has better
> performance and scalability with the number of CPUs.
>
> Similarly to baseline vs SRCU, we've benchmarked SRCU-based
> implementation vs RCU Tasks Trace implementation.
Yes, this one can be the trace flavour, the other one for the retprobes
must be SRCU because it crosses over into userspace. But you've not yet
done that side.
Anyway, I think I can make the SRCU read_{,un}lock() smp_mb()
conditional, much like we have for percpu_rwsem and trace rcu, but I
definitely don't have time to poke at that in the foreseeable future :(
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-07-31 21:42 ` [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
@ 2024-08-01 11:09 ` Jiri Olsa
2024-08-01 16:49 ` Andrii Nakryiko
2024-08-01 22:07 ` Andrii Nakryiko
` (2 subsequent siblings)
3 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-08-01 11:09 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, paulmck
On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote:
SNIP
> static void put_uprobe(struct uprobe *uprobe)
> {
> - if (refcount_dec_and_test(&uprobe->ref)) {
> - /*
> - * If application munmap(exec_vma) before uprobe_unregister()
> - * gets called, we don't get a chance to remove uprobe from
> - * delayed_uprobe_list from remove_breakpoint(). Do it here.
> - */
> - mutex_lock(&delayed_uprobe_lock);
> - delayed_uprobe_remove(uprobe, NULL);
> - mutex_unlock(&delayed_uprobe_lock);
> - kfree(uprobe);
> - }
> + if (!refcount_dec_and_test(&uprobe->ref))
> + return;
> +
> + write_lock(&uprobes_treelock);
> +
> + if (uprobe_is_active(uprobe))
> + rb_erase(&uprobe->rb_node, &uprobes_tree);
> +
> + write_unlock(&uprobes_treelock);
> +
> + /*
> + * If application munmap(exec_vma) before uprobe_unregister()
> + * gets called, we don't get a chance to remove uprobe from
> + * delayed_uprobe_list from remove_breakpoint(). Do it here.
> + */
> + mutex_lock(&delayed_uprobe_lock);
> + delayed_uprobe_remove(uprobe, NULL);
> + mutex_unlock(&delayed_uprobe_lock);
we should do kfree(uprobe) in here, right?
I think this is fixed later on when uprobe_free_rcu is introduced
SNIP
> @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
> if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> return ERR_PTR(-EINVAL);
>
> - retry:
> uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> if (IS_ERR(uprobe))
> return uprobe;
>
> - /*
> - * We can race with uprobe_unregister()->delete_uprobe().
> - * Check uprobe_is_active() and retry if it is false.
> - */
> down_write(&uprobe->register_rwsem);
> - ret = -EAGAIN;
> - if (likely(uprobe_is_active(uprobe))) {
> - consumer_add(uprobe, uc);
> - ret = register_for_each_vma(uprobe, uc);
> - }
> + consumer_add(uprobe, uc);
> + ret = register_for_each_vma(uprobe, uc);
> up_write(&uprobe->register_rwsem);
> - put_uprobe(uprobe);
>
> if (ret) {
> - if (unlikely(ret == -EAGAIN))
> - goto retry;
nice, I like getting rid of this.. so far lgtm ;-)
jirka
> uprobe_unregister(uprobe, uc);
> return ERR_PTR(ret);
> }
> @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> u = rb_entry(t, struct uprobe, rb_node);
> if (u->inode != inode || u->offset < min)
> break;
> + u = try_get_uprobe(u);
> + if (!u) /* uprobe already went away, safe to ignore */
> + continue;
> list_add(&u->pending_list, head);
> - get_uprobe(u);
> }
> for (t = n; (t = rb_next(t)); ) {
> u = rb_entry(t, struct uprobe, rb_node);
> if (u->inode != inode || u->offset > max)
> break;
> + u = try_get_uprobe(u);
> + if (!u) /* uprobe already went away, safe to ignore */
> + continue;
> list_add(&u->pending_list, head);
> - get_uprobe(u);
> }
> }
> read_unlock(&uprobes_treelock);
> @@ -1752,6 +1766,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
> return -ENOMEM;
>
> *n = *o;
> + /*
> + * uprobe's refcnt has to be positive at this point, kept by
> + * utask->return_instances items; return_instances can't be
> + * removed right now, as task is blocked due to duping; so
> + * get_uprobe() is safe to use here.
> + */
> get_uprobe(n->uprobe);
> n->next = NULL;
>
> @@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
> }
> orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
> }
> -
> + /*
> + * uprobe's refcnt is positive, held by caller, so it's safe to
> + * unconditionally bump it one more time here
> + */
> ri->uprobe = get_uprobe(uprobe);
> ri->func = instruction_pointer(regs);
> ri->stack = user_stack_pointer(regs);
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU
2024-07-31 21:42 ` [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
@ 2024-08-01 12:23 ` Liao, Chang
2024-08-01 16:49 ` Andrii Nakryiko
2024-08-05 14:51 ` Oleg Nesterov
1 sibling, 1 reply; 51+ messages in thread
From: Liao, Chang @ 2024-08-01 12:23 UTC (permalink / raw)
To: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck
在 2024/8/1 5:42, Andrii Nakryiko 写道:
> To avoid unnecessarily taking a (brief) refcount on uprobe during
> breakpoint handling in handle_swbp for entry uprobes, make find_uprobe()
> not take refcount, but protect the lifetime of a uprobe instance with
> RCU. This improves scalability, as refcount gets quite expensive due to
> cache line bouncing between multiple CPUs.
>
> Specifically, we utilize our own uprobe-specific SRCU instance for this
> RCU protection. put_uprobe() will delay actual kfree() using call_srcu().
>
> For now, uretprobe and single-stepping handling will still acquire
> refcount as necessary. We'll address these issues in follow up patches
> by making them use SRCU with timeout.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> kernel/events/uprobes.c | 93 ++++++++++++++++++++++++-----------------
> 1 file changed, 55 insertions(+), 38 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 23dde3ec5b09..6d5c3f4b210f 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -41,6 +41,8 @@ static struct rb_root uprobes_tree = RB_ROOT;
>
> static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
>
> +DEFINE_STATIC_SRCU(uprobes_srcu);
> +
> #define UPROBES_HASH_SZ 13
> /* serialize uprobe->pending_list */
> static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
> @@ -59,6 +61,7 @@ struct uprobe {
> struct list_head pending_list;
> struct uprobe_consumer *consumers;
> struct inode *inode; /* Also hold a ref to inode */
> + struct rcu_head rcu;
> loff_t offset;
> loff_t ref_ctr_offset;
> unsigned long flags;
> @@ -612,6 +615,13 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
> return !RB_EMPTY_NODE(&uprobe->rb_node);
> }
>
> +static void uprobe_free_rcu(struct rcu_head *rcu)
> +{
> + struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> +
> + kfree(uprobe);
> +}
> +
> static void put_uprobe(struct uprobe *uprobe)
> {
> if (!refcount_dec_and_test(&uprobe->ref))
> @@ -632,6 +642,8 @@ static void put_uprobe(struct uprobe *uprobe)
> mutex_lock(&delayed_uprobe_lock);
> delayed_uprobe_remove(uprobe, NULL);
> mutex_unlock(&delayed_uprobe_lock);
> +
> + call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
> }
>
> static __always_inline
> @@ -673,33 +685,25 @@ static inline int __uprobe_cmp(struct rb_node *a, const struct rb_node *b)
> return uprobe_cmp(u->inode, u->offset, __node_2_uprobe(b));
> }
>
> -static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> +/*
> + * Assumes being inside RCU protected region.
> + * No refcount is taken on returned uprobe.
> + */
> +static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
> {
> struct __uprobe_key key = {
> .inode = inode,
> .offset = offset,
> };
> - struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
> -
> - if (node)
> - return try_get_uprobe(__node_2_uprobe(node));
> + struct rb_node *node;
>
> - return NULL;
> -}
> -
> -/*
> - * Find a uprobe corresponding to a given inode:offset
> - * Acquires uprobes_treelock
> - */
> -static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> -{
> - struct uprobe *uprobe;
> + lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
>
> read_lock(&uprobes_treelock);
> - uprobe = __find_uprobe(inode, offset);
> + node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
> read_unlock(&uprobes_treelock);
>
> - return uprobe;
> + return node ? __node_2_uprobe(node) : NULL;
> }
>
> /*
> @@ -1073,10 +1077,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> goto free;
> /*
> * We take mmap_lock for writing to avoid the race with
> - * find_active_uprobe() which takes mmap_lock for reading.
> + * find_active_uprobe_rcu() which takes mmap_lock for reading.
> * Thus this install_breakpoint() can not make
> - * is_trap_at_addr() true right after find_uprobe()
> - * returns NULL in find_active_uprobe().
> + * is_trap_at_addr() true right after find_uprobe_rcu()
> + * returns NULL in find_active_uprobe_rcu().
> */
> mmap_write_lock(mm);
> vma = find_vma(mm, info->vaddr);
> @@ -1885,9 +1889,13 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
> return;
> }
>
> + /* we need to bump refcount to store uprobe in utask */
> + if (!try_get_uprobe(uprobe))
> + return;
> +
> ri = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
> if (!ri)
> - return;
> + goto fail;
>
> trampoline_vaddr = uprobe_get_trampoline_vaddr();
> orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
> @@ -1914,11 +1922,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
> }
> orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
> }
> - /*
> - * uprobe's refcnt is positive, held by caller, so it's safe to
> - * unconditionally bump it one more time here
> - */
> - ri->uprobe = get_uprobe(uprobe);
> + ri->uprobe = uprobe;
> ri->func = instruction_pointer(regs);
> ri->stack = user_stack_pointer(regs);
> ri->orig_ret_vaddr = orig_ret_vaddr;
> @@ -1929,8 +1933,9 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
> utask->return_instances = ri;
>
> return;
> - fail:
> +fail:
> kfree(ri);
> + put_uprobe(uprobe);
> }
>
> /* Prepare to single-step probed instruction out of line. */
> @@ -1945,9 +1950,14 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
> if (!utask)
> return -ENOMEM;
>
> + if (!try_get_uprobe(uprobe))
> + return -EINVAL;
> +
> xol_vaddr = xol_get_insn_slot(uprobe);
> - if (!xol_vaddr)
> - return -ENOMEM;
> + if (!xol_vaddr) {
> + err = -ENOMEM;
> + goto err_out;
> + }
>
> utask->xol_vaddr = xol_vaddr;
> utask->vaddr = bp_vaddr;
> @@ -1955,12 +1965,15 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
> err = arch_uprobe_pre_xol(&uprobe->arch, regs);
> if (unlikely(err)) {
> xol_free_insn_slot(current);
> - return err;
> + goto err_out;
> }
>
> utask->active_uprobe = uprobe;
> utask->state = UTASK_SSTEP;
> return 0;
> +err_out:
> + put_uprobe(uprobe);
> + return err;
> }
>
> /*
> @@ -2044,7 +2057,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> return is_trap_insn(&opcode);
> }
>
> -static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
> +/* assumes being inside RCU protected region */
> +static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
> {
> struct mm_struct *mm = current->mm;
> struct uprobe *uprobe = NULL;
> @@ -2057,7 +2071,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
> struct inode *inode = file_inode(vma->vm_file);
> loff_t offset = vaddr_to_offset(vma, bp_vaddr);
>
> - uprobe = find_uprobe(inode, offset);
> + uprobe = find_uprobe_rcu(inode, offset);
> }
>
> if (!uprobe)
> @@ -2201,13 +2215,15 @@ static void handle_swbp(struct pt_regs *regs)
> {
> struct uprobe *uprobe;
> unsigned long bp_vaddr;
> - int is_swbp;
> + int is_swbp, srcu_idx;
>
> bp_vaddr = uprobe_get_swbp_addr(regs);
> if (bp_vaddr == uprobe_get_trampoline_vaddr())
> return uprobe_handle_trampoline(regs);
>
> - uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> + srcu_idx = srcu_read_lock(&uprobes_srcu);
> +
> + uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
> if (!uprobe) {
> if (is_swbp > 0) {
> /* No matching uprobe; signal SIGTRAP. */
> @@ -2223,6 +2239,7 @@ static void handle_swbp(struct pt_regs *regs)
> */
> instruction_pointer_set(regs, bp_vaddr);
> }
> + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> return;
> }
>
> @@ -2258,12 +2275,12 @@ static void handle_swbp(struct pt_regs *regs)
> if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> goto out;
>
> - if (!pre_ssout(uprobe, regs, bp_vaddr))
> - return;
> + if (pre_ssout(uprobe, regs, bp_vaddr))
> + goto out;
>
Regardless of what pre_ssout() returns, it always reaches the label 'out',
so the if block is unnecessary.
> - /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
> out:
> - put_uprobe(uprobe);
> + /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
> + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> }
>
> /*
--
BR
Liao, Chang
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
2024-07-31 21:42 ` [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
@ 2024-08-01 14:27 ` Jiri Olsa
2024-08-01 16:49 ` Andrii Nakryiko
2024-08-05 15:59 ` Oleg Nesterov
1 sibling, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-08-01 14:27 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, paulmck
On Wed, Jul 31, 2024 at 02:42:53PM -0700, Andrii Nakryiko wrote:
SNIP
> static int __copy_insn(struct address_space *mapping, struct file *filp,
> void *insn, int nbytes, loff_t offset)
> {
> @@ -924,7 +901,8 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
> bool ret = false;
>
> down_read(&uprobe->consumer_rwsem);
> - for (uc = uprobe->consumers; uc; uc = uc->next) {
> + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> + srcu_read_lock_held(&uprobes_srcu)) {
> ret = consumer_filter(uc, mm);
> if (ret)
> break;
> @@ -1120,17 +1098,19 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> int err;
>
> down_write(&uprobe->register_rwsem);
> - if (WARN_ON(!consumer_del(uprobe, uc))) {
> - err = -ENOENT;
> - } else {
> - err = register_for_each_vma(uprobe, NULL);
> - /* TODO : cant unregister? schedule a worker thread */
> - WARN(err, "leaking uprobe due to failed unregistration");
> - }
> +
> + list_del_rcu(&uc->cons_node);
hum, so the previous code had a check to verify that the consumer is actually
registered in the uprobe, so it'd survive a wrong argument, while the new
code could likely misbehave?
> + err = register_for_each_vma(uprobe, NULL);
> +
> up_write(&uprobe->register_rwsem);
>
> - if (!err)
> - put_uprobe(uprobe);
> + /* TODO : cant unregister? schedule a worker thread */
> + if (WARN(err, "leaking uprobe due to failed unregistration"))
> + return;
> +
> + put_uprobe(uprobe);
> +
> + synchronize_srcu(&uprobes_srcu);
could you comment on why it's needed in here? there's already potential
call_srcu(&uprobes_srcu, ... ) call in put_uprobe above
thanks,
jirka
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
2024-08-01 9:35 ` Peter Zijlstra
@ 2024-08-01 16:49 ` Andrii Nakryiko
2024-08-01 18:05 ` Paul E. McKenney
1 sibling, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-01 16:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrii Nakryiko, linux-trace-kernel, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck
On Thu, Aug 1, 2024 at 2:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jul 31, 2024 at 02:42:56PM -0700, Andrii Nakryiko wrote:
> > This patch switches uprobes SRCU usage to RCU Tasks Trace flavor, which
> > is optimized for more lightweight and quick readers (at the expense of
> > slower writers, which for uprobes is a fine trade-off) and has better
> > performance and scalability with the number of CPUs.
> >
> > Similarly to baseline vs SRCU, we've benchmarked SRCU-based
> > implementation vs RCU Tasks Trace implementation.
>
> Yes, this one can be the trace flavour, the other one for the retprobes
> must be SRCU because it crosses over into userspace. But you've not yet
> done that side.
Yep, working on it at the moment. I'm trying to avoid task_work but
keep main logic as unaware of parallel timer callback as possible.
Will post patches once I have everything figured out and tested.
And yes, I think I'll stick to SRCU for uretprobes parts, as you said.
>
> Anyway, I think I can make the SRCU read_{,un}lock() smp_mb()
> conditional, much like we have for percpu_rwsem and trace rcu, but I
> definitely don't have time to poke at that in the foreseeable future :(
Who knows, maybe we can convince Paul McKenney to help :) But
regardless, mmap_lock is going to be a much bigger win if we can avoid
it in the hot path, so let's see how far we can get with the
TYPESAFE_BY_RCU approach that's being discussed.
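For context, SLAB_TYPESAFE_BY_RCU means a freed object's memory may be
reused for a new object of the same type before the RCU grace period ends,
so a lookup must take a reference and then re-validate the object's
identity. A minimal generic sketch of that pattern (illustrative names,
not actual uprobes code):

        cache = kmem_cache_create("objs", sizeof(struct obj), 0,
                                  SLAB_TYPESAFE_BY_RCU, NULL);

        rcu_read_lock();
        obj = lookup(key);                      /* may be a recycled object */
        if (obj && refcount_inc_not_zero(&obj->ref)) {
                if (!obj_matches(obj, key)) {   /* re-validate identity */
                        obj_put(obj);
                        obj = NULL;
                }
        }
        rcu_read_unlock();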
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-08-01 11:09 ` Jiri Olsa
@ 2024-08-01 16:49 ` Andrii Nakryiko
0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-01 16:49 UTC (permalink / raw)
To: Jiri Olsa
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, paulmck
On Thu, Aug 1, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote:
>
> SNIP
>
> > static void put_uprobe(struct uprobe *uprobe)
> > {
> > - if (refcount_dec_and_test(&uprobe->ref)) {
> > - /*
> > - * If application munmap(exec_vma) before uprobe_unregister()
> > - * gets called, we don't get a chance to remove uprobe from
> > - * delayed_uprobe_list from remove_breakpoint(). Do it here.
> > - */
> > - mutex_lock(&delayed_uprobe_lock);
> > - delayed_uprobe_remove(uprobe, NULL);
> > - mutex_unlock(&delayed_uprobe_lock);
> > - kfree(uprobe);
> > - }
> > + if (!refcount_dec_and_test(&uprobe->ref))
> > + return;
> > +
> > + write_lock(&uprobes_treelock);
> > +
> > + if (uprobe_is_active(uprobe))
> > + rb_erase(&uprobe->rb_node, &uprobes_tree);
> > +
> > + write_unlock(&uprobes_treelock);
> > +
> > + /*
> > + * If application munmap(exec_vma) before uprobe_unregister()
> > + * gets called, we don't get a chance to remove uprobe from
> > + * delayed_uprobe_list from remove_breakpoint(). Do it here.
> > + */
> > + mutex_lock(&delayed_uprobe_lock);
> > + delayed_uprobe_remove(uprobe, NULL);
> > + mutex_unlock(&delayed_uprobe_lock);
>
> we should do kfree(uprobe) in here, right?
heh, yep, seems like I lost it while rebasing or something, good catch! fixed.
>
> I think this is fixed later on when uprobe_free_rcu is introduced
>
> SNIP
>
> > @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
> > if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> > return ERR_PTR(-EINVAL);
> >
> > - retry:
> > uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> > if (IS_ERR(uprobe))
> > return uprobe;
> >
> > - /*
> > - * We can race with uprobe_unregister()->delete_uprobe().
> > - * Check uprobe_is_active() and retry if it is false.
> > - */
> > down_write(&uprobe->register_rwsem);
> > - ret = -EAGAIN;
> > - if (likely(uprobe_is_active(uprobe))) {
> > - consumer_add(uprobe, uc);
> > - ret = register_for_each_vma(uprobe, uc);
> > - }
> > + consumer_add(uprobe, uc);
> > + ret = register_for_each_vma(uprobe, uc);
> > up_write(&uprobe->register_rwsem);
> > - put_uprobe(uprobe);
> >
> > if (ret) {
> > - if (unlikely(ret == -EAGAIN))
> > - goto retry;
>
> nice, I like getting rid of this.. so far lgtm ;-)
>
> jirka
>
>
> > uprobe_unregister(uprobe, uc);
> > return ERR_PTR(ret);
> > }
> > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> > u = rb_entry(t, struct uprobe, rb_node);
> > if (u->inode != inode || u->offset < min)
> > break;
> > + u = try_get_uprobe(u);
> > + if (!u) /* uprobe already went away, safe to ignore */
> > + continue;
> > list_add(&u->pending_list, head);
> > - get_uprobe(u);
> > }
> > for (t = n; (t = rb_next(t)); ) {
> > u = rb_entry(t, struct uprobe, rb_node);
> > if (u->inode != inode || u->offset > max)
> > break;
> > + u = try_get_uprobe(u);
> > + if (!u) /* uprobe already went away, safe to ignore */
> > + continue;
> > list_add(&u->pending_list, head);
> > - get_uprobe(u);
> > }
> > }
> > read_unlock(&uprobes_treelock);
> > @@ -1752,6 +1766,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
> > return -ENOMEM;
> >
> > *n = *o;
> > + /*
> > + * uprobe's refcnt has to be positive at this point, kept by
> > + * utask->return_instances items; return_instances can't be
> > + * removed right now, as task is blocked due to duping; so
> > + * get_uprobe() is safe to use here.
> > + */
> > get_uprobe(n->uprobe);
> > n->next = NULL;
> >
> > @@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
> > }
> > orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
> > }
> > -
> > + /*
> > + * uprobe's refcnt is positive, held by caller, so it's safe to
> > + * unconditionally bump it one more time here
> > + */
> > ri->uprobe = get_uprobe(uprobe);
> > ri->func = instruction_pointer(regs);
> > ri->stack = user_stack_pointer(regs);
> > --
> > 2.43.0
> >
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU
2024-08-01 12:23 ` Liao, Chang
@ 2024-08-01 16:49 ` Andrii Nakryiko
2024-08-02 1:30 ` Liao, Chang
0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-01 16:49 UTC (permalink / raw)
To: Liao, Chang
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck
On Thu, Aug 1, 2024 at 5:23 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/1 5:42, Andrii Nakryiko 写道:
> > To avoid unnecessarily taking a (brief) refcount on uprobe during
> > breakpoint handling in handle_swbp for entry uprobes, make find_uprobe()
> > not take refcount, but protect the lifetime of a uprobe instance with
> > RCU. This improves scalability, as refcount gets quite expensive due to
> > cache line bouncing between multiple CPUs.
> >
> > Specifically, we utilize our own uprobe-specific SRCU instance for this
> > RCU protection. put_uprobe() will delay actual kfree() using call_srcu().
> >
> > For now, uretprobe and single-stepping handling will still acquire
> > refcount as necessary. We'll address these issues in follow up patches
> > by making them use SRCU with timeout.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> > kernel/events/uprobes.c | 93 ++++++++++++++++++++++++-----------------
> > 1 file changed, 55 insertions(+), 38 deletions(-)
> >
[...]
> >
> > @@ -2258,12 +2275,12 @@ static void handle_swbp(struct pt_regs *regs)
> > if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> > goto out;
> >
> > - if (!pre_ssout(uprobe, regs, bp_vaddr))
> > - return;
> > + if (pre_ssout(uprobe, regs, bp_vaddr))
> > + goto out;
> >
>
> Regardless of what pre_ssout() returns, it always reaches the label 'out',
> so the if block is unnecessary.
yep, I know, but I felt like
if (something is wrong)
goto out;
pattern was important to keep for each possible failing step for consistency.
so unless this is a big deal, I'd keep it as is, as in the future
there might be some other steps after pre_ssout() before returning, so
this is a bit more "composable"
>
>
> > - /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
> > out:
> > - put_uprobe(uprobe);
> > + /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
> > + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> > }
> >
> > /*
>
> --
> BR
> Liao, Chang
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
2024-08-01 14:27 ` Jiri Olsa
@ 2024-08-01 16:49 ` Andrii Nakryiko
0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-01 16:49 UTC (permalink / raw)
To: Jiri Olsa
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, paulmck
On Thu, Aug 1, 2024 at 7:27 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 02:42:53PM -0700, Andrii Nakryiko wrote:
>
> SNIP
>
> > static int __copy_insn(struct address_space *mapping, struct file *filp,
> > void *insn, int nbytes, loff_t offset)
> > {
> > @@ -924,7 +901,8 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
> > bool ret = false;
> >
> > down_read(&uprobe->consumer_rwsem);
> > - for (uc = uprobe->consumers; uc; uc = uc->next) {
> > + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
> > + srcu_read_lock_held(&uprobes_srcu)) {
> > ret = consumer_filter(uc, mm);
> > if (ret)
> > break;
> > @@ -1120,17 +1098,19 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > int err;
> >
> > down_write(&uprobe->register_rwsem);
> > - if (WARN_ON(!consumer_del(uprobe, uc))) {
> > - err = -ENOENT;
> > - } else {
> > - err = register_for_each_vma(uprobe, NULL);
> > - /* TODO : cant unregister? schedule a worker thread */
> > - WARN(err, "leaking uprobe due to failed unregistration");
> > - }
> > +
> > + list_del_rcu(&uc->cons_node);
>
> hum, so previous code had a check to verify that consumer is actually
> registered in the uprobe, so it'd survive wrong argument while the new
> code could likely do things?
correct, passing a consumer that's not really registered to
uprobe_unregister() is a huge violation of the uprobe API contract and it
should never happen (and it doesn't), so it feels like we can drop this
overly cautious and permissive part (we don't protect against passing
wrong pointers, NULLs, etc., right? so why would we protect against a
wrong unregister or, say, a double unregister?)
>
> > + err = register_for_each_vma(uprobe, NULL);
> > +
> > up_write(&uprobe->register_rwsem);
> >
> > - if (!err)
> > - put_uprobe(uprobe);
> > + /* TODO : cant unregister? schedule a worker thread */
> > + if (WARN(err, "leaking uprobe due to failed unregistration"))
> > + return;
> > +
> > + put_uprobe(uprobe);
> > +
> > + synchronize_srcu(&uprobes_srcu);
>
> could you comment on why it's needed in here? there's already potential
> call_srcu(&uprobes_srcu, ... ) call in put_uprobe above
>
yep, I should. This is because we might have handle_swbp() traversing
the consumer list in parallel with unregistration, so it might have
already seen this consumer and be calling its callback. So we need to
wait for the SRCU grace period to make sure there are no in-flight calls
into the consumer's callbacks. If we don't do that, the caller can free
the consumer's memory while handle_swbp() is still using/calling into it.
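To make it concrete, a rough interleaving (illustrative only):

        CPU 0: handle_swbp()                   CPU 1: uprobe_unregister()
        -------------------------------------  ----------------------------------
        srcu_read_lock(&uprobes_srcu)
        list_for_each_entry_srcu(...)
          sees uc, starts calling uc->handler()
                                               list_del_rcu(&uc->cons_node)
                                               register_for_each_vma(uprobe, NULL)
                                               put_uprobe(uprobe)
                                               synchronize_srcu(&uprobes_srcu)
                                                 ... blocks until CPU 0 is done ...
        uc->handler() returns
        srcu_read_unlock(&uprobes_srcu)
                                               synchronize_srcu() returns, and
                                               only now may the caller free uc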
> thanks,
> jirka
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance
2024-08-01 9:35 ` Peter Zijlstra
2024-08-01 16:49 ` Andrii Nakryiko
@ 2024-08-01 18:05 ` Paul E. McKenney
1 sibling, 0 replies; 51+ messages in thread
From: Paul E. McKenney @ 2024-08-01 18:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrii Nakryiko, linux-trace-kernel, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa
On Thu, Aug 01, 2024 at 11:35:05AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2024 at 02:42:56PM -0700, Andrii Nakryiko wrote:
> > This patch switches uprobes SRCU usage to RCU Tasks Trace flavor, which
> > is optimized for more lightweight and quick readers (at the expense of
> > slower writers, which for uprobes is a fine trade-off) and has better
> > performance and scalability with the number of CPUs.
> >
> > Similarly to baseline vs SRCU, we've benchmarked SRCU-based
> > implementation vs RCU Tasks Trace implementation.
>
> Yes, this one can be the trace flavour, the other one for the retprobes
> must be SRCU because it crosses over into userspace. But you've not yet
> done that side.
>
> Anyway, I think I can make the SRCU read_{,un}lock() smp_mb()
> conditional, much like we have for percpu_rwsem and trace rcu, but I
> definitely don't have time to poke at that in the foreseeable future :(
You most certainly can, but all of the approaches that I know of have
sharp edges in one place or another. There were extensive unrecorded
and unminuted discussions of this about five years ago, and I have been
reconstituting those neurons to document what is feasible. None of which
were useful for the use cases back then, some of whose performance requirements
could not be met by unsafe srcu_read_lock() and srcu_read_unlock() with
smp_mb() removed, and others of which really wanted CPU stall warnings.
But it is of course possible that newer use cases might benefit.
Who knows?
I haven't gotten very far, but it is on my list.
Thanx, Paul
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-07-31 21:42 ` [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
2024-08-01 11:09 ` Jiri Olsa
@ 2024-08-01 22:07 ` Andrii Nakryiko
2024-08-02 8:50 ` Oleg Nesterov
2024-08-02 11:11 ` Jiri Olsa
2024-08-05 13:44 ` Oleg Nesterov
3 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-01 22:07 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck
On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Revamp how struct uprobe is refcounted, and thus how its lifetime is
> managed.
>
> Right now, there are a few possible "owners" of uprobe refcount:
> - uprobes_tree RB tree assumes one refcount when uprobe is registered
> and added to the lookup tree;
> - while uprobe is triggered and kernel is handling it in the breakpoint
> handler code, temporary refcount bump is done to keep uprobe from
> being freed;
> - if we have uretprobe requested on a given struct uprobe instance, we
> take another refcount to keep uprobe alive until user space code
> returns from the function and triggers return handler.
>
> The uprobe_tree's extra refcount of 1 is confusing and problematic. No
> matter how many actual consumers are attached, they all share the same
> refcount, and we have an extra logic to drop the "last" (which might not
> really be last) refcount once uprobe's consumer list becomes empty.
>
> This is unconventional and has to be kept in mind as a special case all
> the time. Further, because of this design we have the situations where
> find_uprobe() will find uprobe, bump refcount, return it to the caller,
> but that uprobe will still need uprobe_is_active() check, after which
> the caller is required to drop refcount and try again. This is just too
> many details leaking to the higher level logic.
>
> This patch changes refcounting scheme in such a way as to not have
> uprobes_tree keeping extra refcount for struct uprobe. Instead, each
> uprobe_consumer is assuming its own refcount, which will be dropped
> when consumer is unregistered. Other than that, all the active users of
> uprobe (entry and return uprobe handling code) keeps exactly the same
> refcounting approach.
>
> With the above setup, once uprobe's refcount drops to zero, we need to
> make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
> of course. This, though, races with uprobe entry handling code in
> handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup,
> can race with uprobe being destroyed after refcount drops to zero (e.g.,
> due to uprobe_consumer unregistering). So we add try_get_uprobe(), which
> will attempt to bump refcount, unless it already is zero. Caller needs
> to guarantee that uprobe instance won't be freed in parallel, which is
> the case while we keep uprobes_treelock (for read or write, doesn't
> matter).
>
> Note also, we now don't leak the race between registration and
> unregistration, so we remove the retry logic completely. If
> find_uprobe() returns valid uprobe, it's guaranteed to remain in
> uprobes_tree with properly incremented refcount. The race is handled
> inside __insert_uprobe() and put_uprobe() working together:
> __insert_uprobe() will remove uprobe from RB-tree, if it can't bump
> refcount and will retry to insert the new uprobe instance. put_uprobe()
> won't attempt to remove uprobe from RB-tree, if it's already not there.
> All that is protected by uprobes_treelock, which keeps things simple.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> kernel/events/uprobes.c | 163 +++++++++++++++++++++++-----------------
> 1 file changed, 93 insertions(+), 70 deletions(-)
>
[...]
> @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> int err;
>
> down_write(&uprobe->register_rwsem);
> - if (WARN_ON(!consumer_del(uprobe, uc)))
> + if (WARN_ON(!consumer_del(uprobe, uc))) {
> err = -ENOENT;
> - else
> + } else {
> err = register_for_each_vma(uprobe, NULL);
> -
> - /* TODO : cant unregister? schedule a worker thread */
> - if (!err) {
> - if (!uprobe->consumers)
> - delete_uprobe(uprobe);
> - else
> - err = -EBUSY;
> + /* TODO : cant unregister? schedule a worker thread */
> + WARN(err, "leaking uprobe due to failed unregistration");
Ok, so now that I added this very loud warning if
register_for_each_vma(uprobe, NULL) returns error, it turns out it's
not that unusual for this unregistration to fail. If I run my
uprobe-stress for just a little bit, and then terminate it with ^C, I
get this splat:
[ 1980.854229] leaking uprobe due to failed unregistration
[ 1980.854244] WARNING: CPU: 3 PID: 23013 at
kernel/events/uprobes.c:1123 uprobe_unregister_nosync+0x68/0x80
[ 1980.855356] Modules linked in: aesni_intel(E) crypto_simd(E)
cryptd(E) kvm_intel(E) kvm(E) floppy(E) i2c_piix4(E) i2c_]
[ 1980.856746] CPU: 3 UID: 0 PID: 23013 Comm: exe Tainted: G W
OE 6.11.0-rc1-00032-g308d1f294b79 #129
[ 1980.857407] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 1980.857788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04
[ 1980.858489] RIP: 0010:uprobe_unregister_nosync+0x68/0x80
[ 1980.858826] Code: 6e fb ff ff 4c 89 e7 89 c3 e8 24 e8 e3 ff 85 db
75 0c 5b 48 89 ef 5d 41 5c e9 84 e5 ff ff 48 c7 c7 d0
[ 1980.860052] RSP: 0018:ffffc90002fb7e58 EFLAGS: 00010296
[ 1980.860428] RAX: 000000000000002b RBX: 00000000fffffffc RCX: 0000000000000000
[ 1980.860913] RDX: 0000000000000002 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1980.861379] RBP: ffff88811159ac00 R08: 00000000fffeffff R09: 0000000000000001
[ 1980.861871] R10: 0000000000000000 R11: ffffffff83299920 R12: ffff88811159ac20
[ 1980.862340] R13: ffff88810153c7a0 R14: ffff88810c3fe000 R15: 0000000000000000
[ 1980.862830] FS: 0000000000000000(0000) GS:ffff88881ca00000(0000)
knlGS:0000000000000000
[ 1980.863370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1980.863758] CR2: 00007fa08aea8276 CR3: 000000010f59c005 CR4: 0000000000370ef0
[ 1980.864239] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1980.864708] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1980.865202] Call Trace:
[ 1980.865356] <TASK>
[ 1980.865524] ? __warn+0x80/0x180
[ 1980.865745] ? uprobe_unregister_nosync+0x68/0x80
[ 1980.866074] ? report_bug+0x18d/0x1c0
[ 1980.866326] ? handle_bug+0x3a/0x70
[ 1980.866568] ? exc_invalid_op+0x13/0x60
[ 1980.866836] ? asm_exc_invalid_op+0x16/0x20
[ 1980.867098] ? uprobe_unregister_nosync+0x68/0x80
[ 1980.867390] ? uprobe_unregister_nosync+0x68/0x80
[ 1980.867726] bpf_uprobe_multi_link_release+0x31/0xd0
[ 1980.868044] bpf_link_free+0x54/0xd0
[ 1980.868267] bpf_link_release+0x17/0x20
[ 1980.868542] __fput+0x102/0x2e0
[ 1980.868760] task_work_run+0x55/0xa0
[ 1980.869027] syscall_exit_to_user_mode+0x1dd/0x1f0
[ 1980.869344] do_syscall_64+0x70/0x140
[ 1980.869603] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1980.869923] RIP: 0033:0x7fa08aea82a0
[ 1980.870171] Code: Unable to access opcode bytes at 0x7fa08aea8276.
[ 1980.870587] RSP: 002b:00007ffe838cd030 EFLAGS: 00000202 ORIG_RAX:
000000000000003b
[ 1980.871098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1980.871563] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1980.872055] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1980.872526] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1980.873044] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1980.873568] </TASK>
I traced it a little bit with retsnoop to figure out where this is
coming from, and here we go:
14:53:18.897165 -> 14:53:18.897171 TID/PID 23013/23013 (exe/exe):
FUNCTION CALLS RESULT DURATION
--------------------------- -------- --------
→ uprobe_write_opcode
→ get_user_pages_remote
↔ __get_user_pages [-EINTR] 1.382us
← get_user_pages_remote [-EINTR] 4.889us
← uprobe_write_opcode [-EINTR] 6.908us
entry_SYSCALL_64_after_hwframe+0x76
(entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:130)
do_syscall_64+0x70
(arch/x86/entry/common.c:102)
syscall_exit_to_user_mode+0x1dd
(kernel/entry/common.c:218)
. __syscall_exit_to_user_mode_work
(kernel/entry/common.c:207)
. exit_to_user_mode_prepare
(include/linux/entry-common.h:328)
. exit_to_user_mode_loop
(kernel/entry/common.c:114)
. resume_user_mode_work
(include/linux/resume_user_mode.h:50)
task_work_run+0x55 (kernel/task_work.c:222)
__fput+0x102 (fs/file_table.c:422)
bpf_link_release+0x17
(kernel/bpf/syscall.c:3116)
bpf_link_free+0x54
(kernel/bpf/syscall.c:3067)
bpf_uprobe_multi_link_release+0x31
(kernel/trace/bpf_trace.c:3198)
. bpf_uprobe_unregister
(kernel/trace/bpf_trace.c:3186)
uprobe_unregister_nosync+0x42
(kernel/events/uprobes.c:1120)
register_for_each_vma+0x427
(kernel/events/uprobes.c:1092)
6us [-EINTR] uprobe_write_opcode+0x79
(kernel/events/uprobes.c:478)
. get_user_page_vma_remote
(include/linux/mm.h:2489)
4us [-EINTR] get_user_pages_remote+0x109 (mm/gup.c:2627)
. __get_user_pages_locked (mm/gup.c:1762)
! 1us [-EINTR] __get_user_pages
So, we do uprobe_unregister -> register_for_each_vma ->
remove_breakpoint -> set_orig_insn -> uprobe_write_opcode ->
get_user_page_vma_remote -> get_user_pages_remote ->
__get_user_pages_locked -> __get_user_pages, and I think we then hit the
`if (fatal_signal_pending(current)) return -EINTR;` check.
So, is there something smarter we can do in this case besides leaking
an uprobe (and note, my changes don't change this behavior)?
I can of course just drop the WARN given it's sort of expected now,
but if we can handle that more gracefully it would be good. I don't
think that should block optimization work, but just something to keep
in mind and maybe fix as a follow up.
> }
> up_write(&uprobe->register_rwsem);
>
[...]
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU
2024-08-01 16:49 ` Andrii Nakryiko
@ 2024-08-02 1:30 ` Liao, Chang
0 siblings, 0 replies; 51+ messages in thread
From: Liao, Chang @ 2024-08-02 1:30 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck
在 2024/8/2 0:49, Andrii Nakryiko 写道:
> On Thu, Aug 1, 2024 at 5:23 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> 在 2024/8/1 5:42, Andrii Nakryiko 写道:
>>> To avoid unnecessarily taking a (brief) refcount on uprobe during
>>> breakpoint handling in handle_swbp for entry uprobes, make find_uprobe()
>>> not take refcount, but protect the lifetime of a uprobe instance with
>>> RCU. This improves scalability, as refcount gets quite expensive due to
>>> cache line bouncing between multiple CPUs.
>>>
>>> Specifically, we utilize our own uprobe-specific SRCU instance for this
>>> RCU protection. put_uprobe() will delay actual kfree() using call_srcu().
>>>
>>> For now, uretprobe and single-stepping handling will still acquire
>>> refcount as necessary. We'll address these issues in follow up patches
>>> by making them use SRCU with timeout.
>>>
>>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>>> ---
>>> kernel/events/uprobes.c | 93 ++++++++++++++++++++++++-----------------
>>> 1 file changed, 55 insertions(+), 38 deletions(-)
>>>
>
> [...]
>
>>>
>>> @@ -2258,12 +2275,12 @@ static void handle_swbp(struct pt_regs *regs)
>>> if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>>> goto out;
>>>
>>> - if (!pre_ssout(uprobe, regs, bp_vaddr))
>>> - return;
>>> + if (pre_ssout(uprobe, regs, bp_vaddr))
>>> + goto out;
>>>
>>
>> Regardless of what pre_ssout() returns, it always reaches the label 'out',
>> so the if block is unnecessary.
>
> yep, I know, but I felt like
>
> if (something is wrong)
> goto out;
>
> pattern was important to keep for each possible failing step for consistency.
>
> so unless this is a big deal, I'd keep it as is, as in the future
> there might be some other steps after pre_ssout() before returning, so
> this is a bit more "composable"
>
OK, I would say this conditional-check pattern is likely to be optimized away
by a modern compiler.
Thanks.
>
>>
>>
>>> - /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
>>> out:
>>> - put_uprobe(uprobe);
>>> + /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
>>> + srcu_read_unlock(&uprobes_srcu, srcu_idx);
>>> }
>>>
>>> /*
>>
>> --
>> BR
>> Liao, Chang
--
BR
Liao, Chang
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 6/8] perf/uprobe: split uprobe_unregister()
2024-07-31 21:42 ` [PATCH 6/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
@ 2024-08-02 2:41 ` Liao, Chang
2024-08-02 15:05 ` Andrii Nakryiko
2024-08-07 13:17 ` Oleg Nesterov
1 sibling, 1 reply; 51+ messages in thread
From: Liao, Chang @ 2024-08-02 2:41 UTC (permalink / raw)
To: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat
Cc: bpf, linux-kernel, jolsa, paulmck
在 2024/8/1 5:42, Andrii Nakryiko 写道:
> From: Peter Zijlstra <peterz@infradead.org>
>
> With uprobe_unregister() having grown a synchronize_srcu(), it becomes
> fairly slow to call. Esp. since both users of this API call it in a
> loop.
>
> Peel off the sync_srcu() and do it once, after the loop.
>
> With recent uprobe_register()'s error handling reusing full
> uprobe_unregister() call, we need to be careful about returning to the
> caller before we have a guarantee that partially attached consumer won't
> be called anymore. So add uprobe_unregister_sync() in the error handling
> path. This is an unlikely slow path and this should be totally fine to
> be slow in the case of a failed attach.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> include/linux/uprobes.h | 8 ++++++--
> kernel/events/uprobes.c | 18 ++++++++++++++----
> kernel/trace/bpf_trace.c | 5 ++++-
> kernel/trace/trace_uprobe.c | 6 +++++-
> .../selftests/bpf/bpf_testmod/bpf_testmod.c | 3 ++-
> 5 files changed, 31 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index a1686c1ebcb6..8f1999eb9d9f 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -105,7 +105,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
> -extern void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc);
> +extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
> +extern void uprobe_unregister_sync(void);
[...]
> static inline void
> -uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> +uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
> +{
> +}
> +static inline void uprobes_unregister_sync(void)
*uprobes*_unregister_sync, is it a typo?
> {
> }
> static inline int uprobe_mmap(struct vm_area_struct *vma)
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 3b42fd355256..b0488d356399 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1089,11 +1089,11 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> }
>
> /**
> - * uprobe_unregister - unregister an already registered probe.
> + * uprobe_unregister_nosync - unregister an already registered probe.
> * @uprobe: uprobe to remove
> * @uc: identify which probe if multiple probes are colocated.
> */
> -void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> +void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
> {
> int err;
>
> @@ -1109,10 +1109,14 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> return;
>
> put_uprobe(uprobe);
> +}
> +EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
>
> +void uprobe_unregister_sync(void)
> +{
> synchronize_srcu(&uprobes_srcu);
> }
> -EXPORT_SYMBOL_GPL(uprobe_unregister);
> +EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
>
> /**
> * uprobe_register - register a probe
> @@ -1170,7 +1174,13 @@ struct uprobe *uprobe_register(struct inode *inode,
> up_write(&uprobe->register_rwsem);
>
> if (ret) {
> - uprobe_unregister(uprobe, uc);
> + uprobe_unregister_nosync(uprobe, uc);
> + /*
> + * Registration might have partially succeeded, so we can have
> + * this consumer being called right at this time. We need to
> + * sync here. It's ok, it's unlikely slow path.
> + */
> + uprobe_unregister_sync();
> return ERR_PTR(ret);
> }
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 73c570b5988b..6b632710c98e 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -3184,7 +3184,10 @@ static void bpf_uprobe_unregister(struct bpf_uprobe *uprobes, u32 cnt)
> u32 i;
>
> for (i = 0; i < cnt; i++)
> - uprobe_unregister(uprobes[i].uprobe, &uprobes[i].consumer);
> + uprobe_unregister_nosync(uprobes[i].uprobe, &uprobes[i].consumer);
> +
> + if (cnt)
> + uprobe_unregister_sync();
> }
>
> static void bpf_uprobe_multi_link_release(struct bpf_link *link)
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 7eb79e0a5352..f7443e996b1b 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1097,6 +1097,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
> static void __probe_event_disable(struct trace_probe *tp)
> {
> struct trace_uprobe *tu;
> + bool sync = false;
>
> tu = container_of(tp, struct trace_uprobe, tp);
> WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
> @@ -1105,9 +1106,12 @@ static void __probe_event_disable(struct trace_probe *tp)
> if (!tu->uprobe)
> continue;
>
> - uprobe_unregister(tu->uprobe, &tu->consumer);
> + uprobe_unregister_nosync(tu->uprobe, &tu->consumer);
> + sync = true;
> tu->uprobe = NULL;
> }
> + if (sync)
> + uprobe_unregister_sync();
> }
>
> static int probe_event_enable(struct trace_event_call *call,
> diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> index 73a6b041bcce..928c73cde32e 100644
> --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> @@ -478,7 +478,8 @@ static void testmod_unregister_uprobe(void)
> mutex_lock(&testmod_uprobe_mutex);
>
> if (uprobe.uprobe) {
> - uprobe_unregister(uprobe.uprobe, &uprobe.consumer);
> + uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer);
> + uprobe_unregister_sync();
> uprobe.offset = 0;
> uprobe.uprobe = NULL;
> }
--
BR
Liao, Chang
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-08-01 22:07 ` Andrii Nakryiko
@ 2024-08-02 8:50 ` Oleg Nesterov
2024-08-02 14:58 ` Andrii Nakryiko
0 siblings, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-02 8:50 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On 08/01, Andrii Nakryiko wrote:
>
> > + /* TODO : cant unregister? schedule a worker thread */
> > + WARN(err, "leaking uprobe due to failed unregistration");
> Ok, so now that I added this very loud warning if
> register_for_each_vma(uprobe, NULL) returns error, it turns out it's
> not that unusual for this unregistration to fail.
...
> So, is there something smarter we can do in this case besides leaking
> an uprobe (and note, my changes don't change this behavior)?
Something like schedule_work() which retries register_for_each_vma()...
> I can of course just drop the WARN given it's sort of expected now,
Or at least replace it with pr_warn() or uprobe_warn(), WARN() certainly
makes no sense imo...
> I don't
> think that should block optimization work, but just something to keep
> in mind and maybe fix as a follow up.
Agreed, lets do this separately.
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-07-31 21:42 ` [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
2024-08-01 11:09 ` Jiri Olsa
2024-08-01 22:07 ` Andrii Nakryiko
@ 2024-08-02 11:11 ` Jiri Olsa
2024-08-02 15:03 ` Andrii Nakryiko
2024-08-05 13:44 ` Oleg Nesterov
3 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-08-02 11:11 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, paulmck
On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote:
SNIP
> -/*
> - * There could be threads that have already hit the breakpoint. They
> - * will recheck the current insn and restart if find_uprobe() fails.
> - * See find_active_uprobe().
> - */
> -static void delete_uprobe(struct uprobe *uprobe)
> -{
> - if (WARN_ON(!uprobe_is_active(uprobe)))
> - return;
> -
> - write_lock(&uprobes_treelock);
> - rb_erase(&uprobe->rb_node, &uprobes_tree);
> - write_unlock(&uprobes_treelock);
> - RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> -}
> -
> struct map_info {
> struct map_info *next;
> struct mm_struct *mm;
> @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> int err;
>
> down_write(&uprobe->register_rwsem);
> - if (WARN_ON(!consumer_del(uprobe, uc)))
> + if (WARN_ON(!consumer_del(uprobe, uc))) {
> err = -ENOENT;
> - else
> + } else {
> err = register_for_each_vma(uprobe, NULL);
> -
> - /* TODO : cant unregister? schedule a worker thread */
> - if (!err) {
> - if (!uprobe->consumers)
> - delete_uprobe(uprobe);
ok, so removing this call is why the consumer test is failing, right?
IIUC the previous behaviour was to remove uprobe from the tree
even when there's active uprobe ref for installed uretprobe
so the following scenario will now behave differently:
install uretprobe/consumer-1 on foo
foo {
remove uretprobe/consumer-1 (A)
install uretprobe/consumer-2 on foo (B)
}
before the removal of consumer-1 (A) would remove the uprobe object
from the tree, so the installation of consumer-2 (B) would create a
new uprobe object which would not be triggered at foo return because
it got installed too late (after foo uprobe was triggered)
the behaviour with this patch is that removal of consumer-1 (A) will
not remove the uprobe object (that happens only when we run out of
refs), and the following install of consumer-2 will use the existing
uprobe object so the consumer-2 will be triggered on foo return
uff ;-)
but I think it's better, because we get more hits
jirka
> - else
> - err = -EBUSY;
> + /* TODO : cant unregister? schedule a worker thread */
> + WARN(err, "leaking uprobe due to failed unregistration");
> }
> up_write(&uprobe->register_rwsem);
>
> @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
> if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> return ERR_PTR(-EINVAL);
>
> - retry:
> uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> if (IS_ERR(uprobe))
> return uprobe;
>
> - /*
> - * We can race with uprobe_unregister()->delete_uprobe().
> - * Check uprobe_is_active() and retry if it is false.
> - */
> down_write(&uprobe->register_rwsem);
> - ret = -EAGAIN;
> - if (likely(uprobe_is_active(uprobe))) {
> - consumer_add(uprobe, uc);
> - ret = register_for_each_vma(uprobe, uc);
> - }
> + consumer_add(uprobe, uc);
> + ret = register_for_each_vma(uprobe, uc);
> up_write(&uprobe->register_rwsem);
> - put_uprobe(uprobe);
>
> if (ret) {
> - if (unlikely(ret == -EAGAIN))
> - goto retry;
> uprobe_unregister(uprobe, uc);
> return ERR_PTR(ret);
SNIP
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-08-02 8:50 ` Oleg Nesterov
@ 2024-08-02 14:58 ` Andrii Nakryiko
2024-08-02 22:19 ` Oleg Nesterov
0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-02 14:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On Fri, Aug 2, 2024 at 1:50 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/01, Andrii Nakryiko wrote:
> >
> > > + /* TODO : cant unregister? schedule a worker thread */
> > > + WARN(err, "leaking uprobe due to failed unregistration");
>
> > Ok, so now that I added this very loud warning if
> > register_for_each_vma(uprobe, NULL) returns error, it turns out it's
> > not that unusual for this unregistration to fail.
>
> ...
>
> > So, is there something smarter we can do in this case besides leaking
> > an uprobe (and note, my changes don't change this behavior)?
>
> Something like schedule_work() which retries register_for_each_vma()...
And if that fails again, what do we do? Because I don't think we even
need schedule_work(), we can just keep some list of "pending to be
retried" items and check them after each
uprobe_register()/uprobe_unregister() call. I'm just not clear how we
should handle stubborn cases (but honestly I haven't even tried to
understand all the details about this just yet).
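(Purely for illustration, not a proposal: whatever ends up driving the retry, a worker or
the next register/unregister call, it would look roughly like the sketch below. Every name
in it is made up, locking is hand-waved, and the give-up policy for stubborn cases is
exactly the open question.)

    /*
     * Hypothetical sketch of a deferred retry for failed unregistrations.
     * (Ab)uses uprobe->pending_list purely for the sketch; a real version
     * would need its own list linkage and proper locking around
     * register_for_each_vma().
     */
    static LIST_HEAD(unreg_retry_list);
    static DEFINE_MUTEX(unreg_retry_mutex);

    static void unreg_retry_fn(struct work_struct *work);
    static DECLARE_DELAYED_WORK(unreg_retry_work, unreg_retry_fn);

    static void unreg_retry_fn(struct work_struct *work)
    {
            struct uprobe *uprobe, *tmp;
            bool again = false;

            mutex_lock(&unreg_retry_mutex);
            list_for_each_entry_safe(uprobe, tmp, &unreg_retry_list, pending_list) {
                    /* retry removing the breakpoints we failed to remove before */
                    if (!register_for_each_vma(uprobe, NULL)) {
                            list_del(&uprobe->pending_list);
                            put_uprobe(uprobe);
                    } else {
                            again = true;   /* still failing, try again later */
                    }
            }
            mutex_unlock(&unreg_retry_mutex);

            if (again)
                    schedule_delayed_work(&unreg_retry_work, HZ);
    }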
>
> > I can of course just drop the WARN given it's sort of expected now,
>
> Or at least replace it with pr_warn() or uprobe_warn(), WARN() certainly
> makes no sense imo...
>
ok, makes sense, will change to uprobe_warn()
> > I don't
> > think that should block optimization work, but just something to keep
> > in mind and maybe fix as a follow up.
>
> Agreed, lets do this separately.
>
yep
> Oleg.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-08-02 11:11 ` Jiri Olsa
@ 2024-08-02 15:03 ` Andrii Nakryiko
0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-02 15:03 UTC (permalink / raw)
To: Jiri Olsa
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, paulmck
On Fri, Aug 2, 2024 at 4:11 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote:
>
> SNIP
>
> > -/*
> > - * There could be threads that have already hit the breakpoint. They
> > - * will recheck the current insn and restart if find_uprobe() fails.
> > - * See find_active_uprobe().
> > - */
> > -static void delete_uprobe(struct uprobe *uprobe)
> > -{
> > - if (WARN_ON(!uprobe_is_active(uprobe)))
> > - return;
> > -
> > - write_lock(&uprobes_treelock);
> > - rb_erase(&uprobe->rb_node, &uprobes_tree);
> > - write_unlock(&uprobes_treelock);
> > - RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > -}
> > -
> > struct map_info {
> > struct map_info *next;
> > struct mm_struct *mm;
> > @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > int err;
> >
> > down_write(&uprobe->register_rwsem);
> > - if (WARN_ON(!consumer_del(uprobe, uc)))
> > + if (WARN_ON(!consumer_del(uprobe, uc))) {
> > err = -ENOENT;
> > - else
> > + } else {
> > err = register_for_each_vma(uprobe, NULL);
> > -
> > - /* TODO : cant unregister? schedule a worker thread */
> > - if (!err) {
> > - if (!uprobe->consumers)
> > - delete_uprobe(uprobe);
>
> ok, so removing this call is why the consumer test is failing, right?
>
> IIUC the previous behaviour was to remove uprobe from the tree
> even when there's active uprobe ref for installed uretprobe
>
> so the following scenario will now behave differently:
>
> install uretprobe/consumer-1 on foo
> foo {
> remove uretprobe/consumer-1 (A)
> install uretprobe/consumer-2 on foo (B)
> }
>
> before the removal of consumer-1 (A) would remove the uprobe object
> from the tree, so the installation of consumer-2 (B) would create a
> new uprobe object which would not be triggered at foo return because
> it got installed too late (after foo uprobe was triggered)
>
> the behaviour with this patch is that removal of consumer-1 (A) will
> not remove the uprobe object (that happens only when we run out of
> refs), and the following install of consumer-2 will use the existing
> uprobe object so the consumer-2 will be triggered on foo return
>
> uff ;-)
yep, something like that
>
> but I think it's better, because we get more hits
note, with the next patch set that makes uretprobes SRCU protected
(but with timeout) the behavior becomes a bit time-sensitive :) so I
think we'll have to change your selftest to first attach all the new
uretprobes, then detach all the uretprobes that had to be detached,
and then do a bit more relaxed logic of the sort "if there were some
uretprobes before and after, then we *might* get uretprobe triggered
(but we might not as well, unless the same uretprobe stayed attached
at all times)".
Anyways, something to take care of in the bpf-next tree separately.
All this is very much an implementation detail, so I think we can
change these aspects freely.
>
> jirka
>
> > - else
> > - err = -EBUSY;
> > + /* TODO : cant unregister? schedule a worker thread */
> > + WARN(err, "leaking uprobe due to failed unregistration");
> > }
> > up_write(&uprobe->register_rwsem);
> >
> > @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
> > if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> > return ERR_PTR(-EINVAL);
> >
> > - retry:
> > uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> > if (IS_ERR(uprobe))
> > return uprobe;
> >
> > - /*
> > - * We can race with uprobe_unregister()->delete_uprobe().
> > - * Check uprobe_is_active() and retry if it is false.
> > - */
> > down_write(&uprobe->register_rwsem);
> > - ret = -EAGAIN;
> > - if (likely(uprobe_is_active(uprobe))) {
> > - consumer_add(uprobe, uc);
> > - ret = register_for_each_vma(uprobe, uc);
> > - }
> > + consumer_add(uprobe, uc);
> > + ret = register_for_each_vma(uprobe, uc);
> > up_write(&uprobe->register_rwsem);
> > - put_uprobe(uprobe);
> >
> > if (ret) {
> > - if (unlikely(ret == -EAGAIN))
> > - goto retry;
> > uprobe_unregister(uprobe, uc);
> > return ERR_PTR(ret);
>
> SNIP
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 6/8] perf/uprobe: split uprobe_unregister()
2024-08-02 2:41 ` Liao, Chang
@ 2024-08-02 15:05 ` Andrii Nakryiko
2024-08-05 20:01 ` Andrii Nakryiko
0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-02 15:05 UTC (permalink / raw)
To: Liao, Chang
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck
On Thu, Aug 1, 2024 at 7:41 PM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/1 5:42, Andrii Nakryiko 写道:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > With uprobe_unregister() having grown a synchronize_srcu(), it becomes
> > fairly slow to call. Esp. since both users of this API call it in a
> > loop.
> >
> > Peel off the sync_srcu() and do it once, after the loop.
> >
> > With recent uprobe_register()'s error handling reusing full
> > uprobe_unregister() call, we need to be careful about returning to the
> > caller before we have a guarantee that partially attached consumer won't
> > be called anymore. So add uprobe_unregister_sync() in the error handling
> > path. This is an unlikely slow path and this should be totally fine to
> > be slow in the case of an failed attach.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> > include/linux/uprobes.h | 8 ++++++--
> > kernel/events/uprobes.c | 18 ++++++++++++++----
> > kernel/trace/bpf_trace.c | 5 ++++-
> > kernel/trace/trace_uprobe.c | 6 +++++-
> > .../selftests/bpf/bpf_testmod/bpf_testmod.c | 3 ++-
> > 5 files changed, 31 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index a1686c1ebcb6..8f1999eb9d9f 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -105,7 +105,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> > extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> > extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> > extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
> > -extern void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc);
> > +extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
> > +extern void uprobe_unregister_sync(void);
>
> [...]
>
> > static inline void
> > -uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > +uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > +{
> > +}
> > +static inline void uprobes_unregister_sync(void)
>
> *uprobes*_unregister_sync, is it a typo?
>
I think the idea behind this is that you do a lot of individual uprobe
unregistrations with multiple uprobe_unregister() calls, and then
follow with a single *uprobes*_unregister_sync(), because in general
it is meant to sync multiple uprobes unregistrations.
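In other words, the intended usage pattern is roughly the sketch below (the link array and
its field names are made up for illustration; only the two uprobe_unregister_*() calls come
from this patch):

    /* detach many consumers cheaply, pay for a single grace period once */
    static void detach_all(struct my_uprobe_link *links, int cnt)
    {
            int i;

            for (i = 0; i < cnt; i++)
                    uprobe_unregister_nosync(links[i].uprobe, &links[i].consumer);

            if (cnt)
                    uprobe_unregister_sync();
    }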
> > {
> > }
> > static inline int uprobe_mmap(struct vm_area_struct *vma)
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 3b42fd355256..b0488d356399 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -1089,11 +1089,11 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> > }
> >
> > /**
> > - * uprobe_unregister - unregister an already registered probe.
> > + * uprobe_unregister_nosync - unregister an already registered probe.
> > * @uprobe: uprobe to remove
> > * @uc: identify which probe if multiple probes are colocated.
> > */
> > -void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > +void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > {
> > int err;
> >
> > @@ -1109,10 +1109,14 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > return;
> >
> > put_uprobe(uprobe);
> > +}
> > +EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
> >
> > +void uprobe_unregister_sync(void)
> > +{
> > synchronize_srcu(&uprobes_srcu);
> > }
> > -EXPORT_SYMBOL_GPL(uprobe_unregister);
> > +EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
> >
> > /**
> > * uprobe_register - register a probe
> > @@ -1170,7 +1174,13 @@ struct uprobe *uprobe_register(struct inode *inode,
> > up_write(&uprobe->register_rwsem);
> >
> > if (ret) {
> > - uprobe_unregister(uprobe, uc);
> > + uprobe_unregister_nosync(uprobe, uc);
> > + /*
> > + * Registration might have partially succeeded, so we can have
> > + * this consumer being called right at this time. We need to
> > + * sync here. It's ok, it's unlikely slow path.
> > + */
> > + uprobe_unregister_sync();
> > return ERR_PTR(ret);
> > }
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 73c570b5988b..6b632710c98e 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -3184,7 +3184,10 @@ static void bpf_uprobe_unregister(struct bpf_uprobe *uprobes, u32 cnt)
> > u32 i;
> >
> > for (i = 0; i < cnt; i++)
> > - uprobe_unregister(uprobes[i].uprobe, &uprobes[i].consumer);
> > + uprobe_unregister_nosync(uprobes[i].uprobe, &uprobes[i].consumer);
> > +
> > + if (cnt)
> > + uprobe_unregister_sync();
> > }
> >
> > static void bpf_uprobe_multi_link_release(struct bpf_link *link)
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index 7eb79e0a5352..f7443e996b1b 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1097,6 +1097,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
> > static void __probe_event_disable(struct trace_probe *tp)
> > {
> > struct trace_uprobe *tu;
> > + bool sync = false;
> >
> > tu = container_of(tp, struct trace_uprobe, tp);
> > WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
> > @@ -1105,9 +1106,12 @@ static void __probe_event_disable(struct trace_probe *tp)
> > if (!tu->uprobe)
> > continue;
> >
> > - uprobe_unregister(tu->uprobe, &tu->consumer);
> > + uprobe_unregister_nosync(tu->uprobe, &tu->consumer);
> > + sync = true;
> > tu->uprobe = NULL;
> > }
> > + if (sync)
> > + uprobe_unregister_sync();
> > }
> >
> > static int probe_event_enable(struct trace_event_call *call,
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > index 73a6b041bcce..928c73cde32e 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > @@ -478,7 +478,8 @@ static void testmod_unregister_uprobe(void)
> > mutex_lock(&testmod_uprobe_mutex);
> >
> > if (uprobe.uprobe) {
> > - uprobe_unregister(uprobe.uprobe, &uprobe.consumer);
> > + uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer);
> > + uprobe_unregister_sync();
> > uprobe.offset = 0;
> > uprobe.uprobe = NULL;
> > }
>
> --
> BR
> Liao, Chang
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-08-02 14:58 ` Andrii Nakryiko
@ 2024-08-02 22:19 ` Oleg Nesterov
0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-02 22:19 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On 08/02, Andrii Nakryiko wrote:
>
> On Fri, Aug 2, 2024 at 1:50 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 08/01, Andrii Nakryiko wrote:
> > >
> > > > + /* TODO : cant unregister? schedule a worker thread */
> > > > + WARN(err, "leaking uprobe due to failed unregistration");
> >
> > > Ok, so now that I added this very loud warning if
> > > register_for_each_vma(uprobe, NULL) returns error, it turns out it's
> > > not that unusual for this unregistration to fail.
> >
> > ...
> >
> > > So, is there something smarter we can do in this case besides leaking
> > > an uprobe (and note, my changes don't change this behavior)?
> >
> > Something like schedule_work() which retries register_for_each_vma()...
>
> And if that fails again, what do we do?
try again after some timeout ;)
> Because I don't think we even
> need schedule_work(), we can just keep some list of "pending to be
> retried" items and check them after each
> uprobe_register()/uprobe_unregister() call.
Agreed, we need a list of "pending to be retried", but rightly or not
I think this should be done from work_struct->func.
Lets discuss this later? We seem to agree this is a known problem, and
this has nothing to do with your optimizations.
> I'm just not clear how we
> should handle stubborn cases (but honestly I haven't even tried to
> understand all the details about this just yet).
Same here.
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-07-31 21:42 ` [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
` (2 preceding siblings ...)
2024-08-02 11:11 ` Jiri Olsa
@ 2024-08-05 13:44 ` Oleg Nesterov
2024-08-05 17:29 ` Andrii Nakryiko
3 siblings, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-05 13:44 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck
On 07/31, Andrii Nakryiko wrote:
>
> @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> uprobe->ref_ctr_offset = ref_ctr_offset;
> init_rwsem(&uprobe->register_rwsem);
> init_rwsem(&uprobe->consumer_rwsem);
> + RB_CLEAR_NODE(&uprobe->rb_node);
I guess RB_CLEAR_NODE() is not necessary?
> @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> u = rb_entry(t, struct uprobe, rb_node);
> if (u->inode != inode || u->offset < min)
> break;
> + u = try_get_uprobe(u);
> + if (!u) /* uprobe already went away, safe to ignore */
> + continue;
> list_add(&u->pending_list, head);
cosmetic nit, feel to ignore, but to me
if (try_get_uprobe(u))
list_add(&u->pending_list, head);
looks more readable.
Other than the lack of kfree() in put_uprobe() and WARN() in _unregister()
the patch looks good to me.
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU
2024-07-31 21:42 ` [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
2024-08-01 12:23 ` Liao, Chang
@ 2024-08-05 14:51 ` Oleg Nesterov
2024-08-05 17:31 ` Andrii Nakryiko
1 sibling, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-05 14:51 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck
LGTM, just a few notes...
On 07/31, Andrii Nakryiko wrote:
>
> @@ -59,6 +61,7 @@ struct uprobe {
> struct list_head pending_list;
> struct uprobe_consumer *consumers;
> struct inode *inode; /* Also hold a ref to inode */
> + struct rcu_head rcu;
you can probably put the new member into the union with, say, rb_node.
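Something along these lines, i.e. reuse the rb_node storage once the node has been erased
from the tree (layout sketch only, the comments are mine):

    struct uprobe {
            union {
                    struct rb_node  rb_node;        /* node in the rb tree */
                    struct rcu_head rcu;            /* used only after rb_erase() */
            };
            /* ... the rest of the fields ... */
    };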
> @@ -1945,9 +1950,14 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
> if (!utask)
> return -ENOMEM;
>
> + if (!try_get_uprobe(uprobe))
> + return -EINVAL;
> +
a bit off-topic right now, but it seems that we can simply kill
utask->active_uprobe. We can turn it into "bool has_active_uprobe"
and copy uprobe->arch into uprobe_task. Lets discuss this later.
> @@ -2201,13 +2215,15 @@ static void handle_swbp(struct pt_regs *regs)
> {
> struct uprobe *uprobe;
> unsigned long bp_vaddr;
> - int is_swbp;
> + int is_swbp, srcu_idx;
>
> bp_vaddr = uprobe_get_swbp_addr(regs);
> if (bp_vaddr == uprobe_get_trampoline_vaddr())
> return uprobe_handle_trampoline(regs);
>
> - uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> + srcu_idx = srcu_read_lock(&uprobes_srcu);
> +
> + uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
> if (!uprobe) {
> if (is_swbp > 0) {
> /* No matching uprobe; signal SIGTRAP. */
> @@ -2223,6 +2239,7 @@ static void handle_swbp(struct pt_regs *regs)
> */
> instruction_pointer_set(regs, bp_vaddr);
> }
> + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> return;
Why not
goto out;
?
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
2024-07-31 21:42 ` [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
2024-08-01 14:27 ` Jiri Olsa
@ 2024-08-05 15:59 ` Oleg Nesterov
2024-08-05 17:31 ` Andrii Nakryiko
1 sibling, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-05 15:59 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck
On 07/31, Andrii Nakryiko wrote:
>
> @@ -1120,17 +1098,19 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> int err;
>
> down_write(&uprobe->register_rwsem);
> - if (WARN_ON(!consumer_del(uprobe, uc))) {
> - err = -ENOENT;
OK, I agree, this should never happen.
But if you remove this check, then
> int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
> {
> struct uprobe_consumer *con;
> - int ret = -ENOENT;
> + int ret = -ENOENT, srcu_idx;
>
> down_write(&uprobe->register_rwsem);
> - for (con = uprobe->consumers; con && con != uc ; con = con->next)
> - ;
> - if (con)
> - ret = register_for_each_vma(uprobe, add ? uc : NULL);
> +
> + srcu_idx = srcu_read_lock(&uprobes_srcu);
> + list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
> + srcu_read_lock_held(&uprobes_srcu)) {
> + if (con == uc) {
> + ret = register_for_each_vma(uprobe, add ? uc : NULL);
> + break;
> + }
> + }
we can probably remove the similar check above?
I mean, why do we need the list_for_each_entry_srcu() above? Is it possible
that uprobe_apply(uprobe, uc) is called when "uc" is not on the ->consumers
list?
At first glance I see no problems in this patch... but you know, my eyes are
already blurring, I'll continue tomorrow and read this patch again.
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-08-05 13:44 ` Oleg Nesterov
@ 2024-08-05 17:29 ` Andrii Nakryiko
2024-08-06 10:45 ` Oleg Nesterov
0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-05 17:29 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On Mon, Aug 5, 2024 at 6:44 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/31, Andrii Nakryiko wrote:
> >
> > @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> > uprobe->ref_ctr_offset = ref_ctr_offset;
> > init_rwsem(&uprobe->register_rwsem);
> > init_rwsem(&uprobe->consumer_rwsem);
> > + RB_CLEAR_NODE(&uprobe->rb_node);
>
> I guess RB_CLEAR_NODE() is not necessary?
I definitely needed that with my batch API changes, but it might be
that I don't need it anymore. But I'm a bit hesitant to remove it,
because if we ever get put_uprobe() on an uprobe that hasn't been
inserted into RB-tree yet, this will cause a hard to understand crash.
RB_CLEAR_NODE() in __insert_uprobe() is critical to have, this one is
kind of optional (but still feels right to initialize the field
properly).
Let me know if you feel strongly about this, though.
>
> > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> > u = rb_entry(t, struct uprobe, rb_node);
> > if (u->inode != inode || u->offset < min)
> > break;
> > + u = try_get_uprobe(u);
> > + if (!u) /* uprobe already went away, safe to ignore */
> > + continue;
> > list_add(&u->pending_list, head);
>
> cosmetic nit, feel to ignore, but to me
>
> if (try_get_uprobe(u))
> list_add(&u->pending_list, head);
>
> looks more readable.
It's not my code base to enforce my preferences, but I'll at least
explain why I disagree. To me, something like `if (some condition)
<break/continue>;` is a very clear indication that this item (or even
the rest of items in case of break) won't be processed anymore.
While
if (some inverted condition)
<do some something useful>
<might be some more code>
... is a pattern that requires double-checking that we really are not
going to use that item later on.
So I'll invert this just to not be a PITA, but I disagree :)
>
> Other than the lack of kfree() in put_uprobe() and WARN() in _unregister()
> the patch looks good to me.
yep, fixed that locally already. Thanks for the review!
>
> Oleg.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU
2024-08-05 14:51 ` Oleg Nesterov
@ 2024-08-05 17:31 ` Andrii Nakryiko
0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-05 17:31 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On Mon, Aug 5, 2024 at 7:52 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> LGTM, just a few notes...
>
> On 07/31, Andrii Nakryiko wrote:
> >
> > @@ -59,6 +61,7 @@ struct uprobe {
> > struct list_head pending_list;
> > struct uprobe_consumer *consumers;
> > struct inode *inode; /* Also hold a ref to inode */
> > + struct rcu_head rcu;
>
> you can probably put the new member into the union with, say, rb_node.
yep, good point, will do
>
> > @@ -1945,9 +1950,14 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
> > if (!utask)
> > return -ENOMEM;
> >
> > + if (!try_get_uprobe(uprobe))
> > + return -EINVAL;
> > +
>
> a bit off-topic right now, but it seems that we can simply kill
> utask->active_uprobe. We can turn it into "bool has_active_uprobe"
> and copy uprobe->arch into uprobe_task. Lets discuss this later.
I'm going to change this active_uprobe thing to be either refcounted
or SRCU-protected (but with timeout), so I'll need a bit more
structure around this. Let's see how that lands and if we still can
get rid of it, we can discuss.
>
> > @@ -2201,13 +2215,15 @@ static void handle_swbp(struct pt_regs *regs)
> > {
> > struct uprobe *uprobe;
> > unsigned long bp_vaddr;
> > - int is_swbp;
> > + int is_swbp, srcu_idx;
> >
> > bp_vaddr = uprobe_get_swbp_addr(regs);
> > if (bp_vaddr == uprobe_get_trampoline_vaddr())
> > return uprobe_handle_trampoline(regs);
> >
> > - uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> > + srcu_idx = srcu_read_lock(&uprobes_srcu);
> > +
> > + uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
> > if (!uprobe) {
> > if (is_swbp > 0) {
> > /* No matching uprobe; signal SIGTRAP. */
> > @@ -2223,6 +2239,7 @@ static void handle_swbp(struct pt_regs *regs)
> > */
> > instruction_pointer_set(regs, bp_vaddr);
> > }
> > + srcu_read_unlock(&uprobes_srcu, srcu_idx);
> > return;
>
> Why not
> goto out;
>
> ?
>
Good point, can be goto out, will change.
> Oleg.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
2024-08-05 15:59 ` Oleg Nesterov
@ 2024-08-05 17:31 ` Andrii Nakryiko
2024-08-06 10:54 ` Oleg Nesterov
0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-05 17:31 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On Mon, Aug 5, 2024 at 8:59 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/31, Andrii Nakryiko wrote:
> >
> > @@ -1120,17 +1098,19 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > int err;
> >
> > down_write(&uprobe->register_rwsem);
> > - if (WARN_ON(!consumer_del(uprobe, uc))) {
> > - err = -ENOENT;
>
> OK, I agree, this should never happen.
>
> But if you remove this check, then
>
> > int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
> > {
> > struct uprobe_consumer *con;
> > - int ret = -ENOENT;
> > + int ret = -ENOENT, srcu_idx;
> >
> > down_write(&uprobe->register_rwsem);
> > - for (con = uprobe->consumers; con && con != uc ; con = con->next)
> > - ;
> > - if (con)
> > - ret = register_for_each_vma(uprobe, add ? uc : NULL);
> > +
> > + srcu_idx = srcu_read_lock(&uprobes_srcu);
> > + list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
> > + srcu_read_lock_held(&uprobes_srcu)) {
> > + if (con == uc) {
> > + ret = register_for_each_vma(uprobe, add ? uc : NULL);
> > + break;
> > + }
> > + }
>
> we can probably remove the similar check above?
>
> I mean, why do we need the list_for_each_entry_srcu() above? Is it possible
> that uprobe_apply(uprobe, uc) is called when "uc" is not on the ->consumers
> list?
Tbh, I just don't completely understand how (and why) uprobe_apply()
is used from kernel/trace/trace_uprobe.c, so I wanted to preserve the
logic exactly. I still don't see when this consumer is added before
uprobe_apply()... Exposing uprobe_apply() seems like a huge API
violation to me and I'd rather get rid of its users. But one step at a
time.
>
> At first glance I see no problems in this patch... but you know, my eyes are
> already blurring, I'll continue tomorrow and read this patch again.
>
> Oleg.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 6/8] perf/uprobe: split uprobe_unregister()
2024-08-02 15:05 ` Andrii Nakryiko
@ 2024-08-05 20:01 ` Andrii Nakryiko
2024-08-06 1:50 ` Liao, Chang
0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-05 20:01 UTC (permalink / raw)
To: Liao, Chang
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck
On Fri, Aug 2, 2024 at 8:05 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Aug 1, 2024 at 7:41 PM Liao, Chang <liaochang1@huawei.com> wrote:
> >
> >
> >
> > 在 2024/8/1 5:42, Andrii Nakryiko 写道:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > >
> > > With uprobe_unregister() having grown a synchronize_srcu(), it becomes
> > > fairly slow to call. Esp. since both users of this API call it in a
> > > loop.
> > >
> > > Peel off the sync_srcu() and do it once, after the loop.
> > >
> > > With recent uprobe_register()'s error handling reusing full
> > > uprobe_unregister() call, we need to be careful about returning to the
> > > caller before we have a guarantee that partially attached consumer won't
> > > be called anymore. So add uprobe_unregister_sync() in the error handling
> > > path. This is an unlikely slow path and this should be totally fine to
> > > be slow in the case of an failed attach.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > > include/linux/uprobes.h | 8 ++++++--
> > > kernel/events/uprobes.c | 18 ++++++++++++++----
> > > kernel/trace/bpf_trace.c | 5 ++++-
> > > kernel/trace/trace_uprobe.c | 6 +++++-
> > > .../selftests/bpf/bpf_testmod/bpf_testmod.c | 3 ++-
> > > 5 files changed, 31 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index a1686c1ebcb6..8f1999eb9d9f 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -105,7 +105,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> > > extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> > > extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> > > extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
> > > -extern void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc);
> > > +extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
> > > +extern void uprobe_unregister_sync(void);
> >
> > [...]
> >
> > > static inline void
> > > -uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > > +uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > > +{
> > > +}
> > > +static inline void uprobes_unregister_sync(void)
> >
> > *uprobes*_unregister_sync, is it a typo?
> >
>
> I think the idea behind this is that you do a lot of individual uprobe
> unregistrations with multiple uprobe_unregister() calls, and then
> follow with a single *uprobes*_unregister_sync(), because in general
> it is meant to sync multiple uprobes unregistrations.
Ah, I think you were trying to say that only the static inline
implementation here is called uprobes_unregister_sync(), while all the
other ones are uprobe_unregister_sync(). I fixed it up and kept it as
the singular uprobe_unregister_sync().
>
> > > {
> > > }
> > > static inline int uprobe_mmap(struct vm_area_struct *vma)
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 3b42fd355256..b0488d356399 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
[...]
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 6/8] perf/uprobe: split uprobe_unregister()
2024-08-05 20:01 ` Andrii Nakryiko
@ 2024-08-06 1:50 ` Liao, Chang
0 siblings, 0 replies; 51+ messages in thread
From: Liao, Chang @ 2024-08-06 1:50 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck
在 2024/8/6 4:01, Andrii Nakryiko 写道:
> On Fri, Aug 2, 2024 at 8:05 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>>
>> On Thu, Aug 1, 2024 at 7:41 PM Liao, Chang <liaochang1@huawei.com> wrote:
>>>
>>>
>>>
>>> 在 2024/8/1 5:42, Andrii Nakryiko 写道:
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> With uprobe_unregister() having grown a synchronize_srcu(), it becomes
>>>> fairly slow to call. Esp. since both users of this API call it in a
>>>> loop.
>>>>
>>>> Peel off the sync_srcu() and do it once, after the loop.
>>>>
>>>> With recent uprobe_register()'s error handling reusing full
>>>> uprobe_unregister() call, we need to be careful about returning to the
>>>> caller before we have a guarantee that partially attached consumer won't
>>>> be called anymore. So add uprobe_unregister_sync() in the error handling
>>>> path. This is an unlikely slow path and this should be totally fine to
>>>> be slow in the case of an failed attach.
>>>>
>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>> Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
>>>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>>>> ---
>>>> include/linux/uprobes.h | 8 ++++++--
>>>> kernel/events/uprobes.c | 18 ++++++++++++++----
>>>> kernel/trace/bpf_trace.c | 5 ++++-
>>>> kernel/trace/trace_uprobe.c | 6 +++++-
>>>> .../selftests/bpf/bpf_testmod/bpf_testmod.c | 3 ++-
>>>> 5 files changed, 31 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
>>>> index a1686c1ebcb6..8f1999eb9d9f 100644
>>>> --- a/include/linux/uprobes.h
>>>> +++ b/include/linux/uprobes.h
>>>> @@ -105,7 +105,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
>>>> extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
>>>> extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
>>>> extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
>>>> -extern void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc);
>>>> +extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
>>>> +extern void uprobe_unregister_sync(void);
>>>
>>> [...]
>>>
>>>> static inline void
>>>> -uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>>>> +uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc)
>>>> +{
>>>> +}
>>>> +static inline void uprobes_unregister_sync(void)
>>>
>>> *uprobes*_unregister_sync, is it a typo?
>>>
>>
>> I think the idea behind this is that you do a lot of individual uprobe
>> unregistrations with multiple uprobe_unregister() calls, and then
>> follow with a single *uprobes*_unregister_sync(), because in general
>> it is meant to sync multiple uprobes unregistrations.
>
> Ah, I think you were trying to say that only static inline
> implementation here is called uprobes_unregister_sync, while all the
> other ones are uprobe_unregister_sync(). I fixed it up, kept it as
> singular uprobe_unregister_sync().
>
Yes, that's exactly what I meant :)
>>
>>>> {
>>>> }
>>>> static inline int uprobe_mmap(struct vm_area_struct *vma)
>>>> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
>>>> index 3b42fd355256..b0488d356399 100644
>>>> --- a/kernel/events/uprobes.c
>>>> +++ b/kernel/events/uprobes.c
>
> [...]
--
BR
Liao, Chang
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management
2024-08-05 17:29 ` Andrii Nakryiko
@ 2024-08-06 10:45 ` Oleg Nesterov
0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-06 10:45 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On 08/05, Andrii Nakryiko wrote:
>
> On Mon, Aug 5, 2024 at 6:44 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 07/31, Andrii Nakryiko wrote:
> > >
> > > @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> > > uprobe->ref_ctr_offset = ref_ctr_offset;
> > > init_rwsem(&uprobe->register_rwsem);
> > > init_rwsem(&uprobe->consumer_rwsem);
> > > + RB_CLEAR_NODE(&uprobe->rb_node);
> >
> > I guess RB_CLEAR_NODE() is not necessary?
>
> I definitely needed that with my batch API changes, but it might be
> that I don't need it anymore. But I'm a bit hesitant to remove it,
OK, lets keep it, it doesn't hurt. It just wasn't clear to me why you
added this initialization in this patch.
> > > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> > > u = rb_entry(t, struct uprobe, rb_node);
> > > if (u->inode != inode || u->offset < min)
> > > break;
> > > + u = try_get_uprobe(u);
> > > + if (!u) /* uprobe already went away, safe to ignore */
> > > + continue;
> > > list_add(&u->pending_list, head);
> >
> > cosmetic nit, feel to ignore, but to me
> >
> > if (try_get_uprobe(u))
> > list_add(&u->pending_list, head);
> >
> > looks more readable.
>
> It's not my code base to enforce my preferences, but I'll at least
> explain why I disagree. To me, something like `if (some condition)
> <break/continue>;` is a very clear indication that this item (or even
> the rest of items in case of break) won't be processed anymore.
>
> While
>
> if (some inverted condition)
> <do some something useful>
> <might be some more code>
OK, I won't insist. To me the most confusing part is
u = try_get_uprobe(u);
if (!u)
...
If you read this code for the 1st time (or you are trying to recall it
after 10 years ;) it looks as if try_get_uprobe() can return another uprobe.
> So I'll invert this just to not be PITA, but I disagree :)
If you disagree, then don't change it ;)
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection
2024-08-05 17:31 ` Andrii Nakryiko
@ 2024-08-06 10:54 ` Oleg Nesterov
0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-06 10:54 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On 08/05, Andrii Nakryiko wrote:
>
> On Mon, Aug 5, 2024 at 8:59 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
> > > {
> > > struct uprobe_consumer *con;
> > > - int ret = -ENOENT;
> > > + int ret = -ENOENT, srcu_idx;
> > >
> > > down_write(&uprobe->register_rwsem);
> > > - for (con = uprobe->consumers; con && con != uc ; con = con->next)
> > > - ;
> > > - if (con)
> > > - ret = register_for_each_vma(uprobe, add ? uc : NULL);
> > > +
> > > + srcu_idx = srcu_read_lock(&uprobes_srcu);
> > > + list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
> > > + srcu_read_lock_held(&uprobes_srcu)) {
> > > + if (con == uc) {
> > > + ret = register_for_each_vma(uprobe, add ? uc : NULL);
> > > + break;
> > > + }
> > > + }
> >
> > we can probably remove the similar check above?
> >
> > I mean, why do we need the list_for_each_entry_srcu() above? Is it possible
> > that uprobe_apply(uprobe, uc) is called when "uc" is not on the ->consumers
> > list?
>
> Tbh, I just don't completely understand how (and why) uprobe_apply()
> is used from kernel/trace/trace_uprobe.c, so I wanted to preserve the
> logic exactly. I still don't see when this consumer is added before
> uprobe_apply()... Exposing uprobe_apply() seems like a huge API
> violation to me and I'd rather get rid of its users. But one step at a
> time.
Agreed. Unlike uprobe_unregister(), uprobe_apply() doesn't WARN() or
even explain this check, lets preserve the current logic for now.
And just in case... I am also not sure that the con == NULL case is not
possible with the current code. The recent discussions forced me to recall
some bits in uprobe.c, but not in trace_uprobe.c ;)
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 6/8] perf/uprobe: split uprobe_unregister()
2024-07-31 21:42 ` [PATCH 6/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
2024-08-02 2:41 ` Liao, Chang
@ 2024-08-07 13:17 ` Oleg Nesterov
2024-08-07 15:24 ` Andrii Nakryiko
1 sibling, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-07 13:17 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck
I guess you know this, but just in case...
On 07/31, Andrii Nakryiko wrote:
>
> --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> @@ -478,7 +478,8 @@ static void testmod_unregister_uprobe(void)
> mutex_lock(&testmod_uprobe_mutex);
>
> if (uprobe.uprobe) {
> - uprobe_unregister(uprobe.uprobe, &uprobe.consumer);
> + uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer);
> + uprobe_unregister_sync();
> uprobe.offset = 0;
> uprobe.uprobe = NULL;
this chunk has trivial conflicts with tip perf/core
db61e6a4eee5a selftests/bpf: fix uprobe.path leak in bpf_testmod
adds path_put(&uprobe.path) here
3c83a9ad0295e make uprobe_register() return struct uprobe *
removes the "uprobe.offset = 0;" line.
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] uprobes: RCU-protected hot path optimizations
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
` (7 preceding siblings ...)
2024-07-31 21:42 ` [PATCH 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
@ 2024-08-07 13:29 ` Oleg Nesterov
2024-08-07 15:24 ` Andrii Nakryiko
8 siblings, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-07 13:29 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck
On 07/31, Andrii Nakryiko wrote:
>
> Andrii Nakryiko (6):
> uprobes: revamp uprobe refcounting and lifetime management
> uprobes: protected uprobe lifetime with SRCU
> uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
> uprobes: travers uprobe's consumer list locklessly under SRCU
> protection
> uprobes: perform lockless SRCU-protected uprobes_tree lookup
> uprobes: switch to RCU Tasks Trace flavor for better performance
>
> Peter Zijlstra (2):
> rbtree: provide rb_find_rcu() / rb_find_add_rcu()
> perf/uprobe: split uprobe_unregister()
I see nothing wrong in 1-7. LGTM.
But since you are going to send the new version, I'd like to apply V2
and then try to re-check the resulting code.
As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] uprobes: RCU-protected hot path optimizations
2024-08-07 13:29 ` [PATCH 0/8] uprobes: RCU-protected hot path optimizations Oleg Nesterov
@ 2024-08-07 15:24 ` Andrii Nakryiko
2024-08-07 17:11 ` Oleg Nesterov
0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-07 15:24 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On Wed, Aug 7, 2024 at 6:30 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/31, Andrii Nakryiko wrote:
> >
> > Andrii Nakryiko (6):
> > uprobes: revamp uprobe refcounting and lifetime management
> > uprobes: protected uprobe lifetime with SRCU
> > uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
> > uprobes: travers uprobe's consumer list locklessly under SRCU
> > protection
> > uprobes: perform lockless SRCU-protected uprobes_tree lookup
> > uprobes: switch to RCU Tasks Trace flavor for better performance
> >
> > Peter Zijlstra (2):
> > rbtree: provide rb_find_rcu() / rb_find_add_rcu()
> > perf/uprobe: split uprobe_unregister()
>
> I see nothing wrong in 1-7. LGTM.
>
So, I don't know how it slipped the first time, because I tested
overnight with uprobe-stress, perhaps I adjusted some parameters (more
threads or different ratio of threads, not sure by now), but it turns
out that lockless RB-tree traversal actually crashes after a few
minutes of running uprobe-stress. I'll post details separately later
today, but I suspect that rb_find_rcu() and rb_find_add_rcu() is not
enough to make it safe.
Hopefully someone can help me figure this out.
> But since you are going to send the new version, I'd like to apply V2
> and then try to re-check the resulting code.
Yes, I was waiting for more of Peter's comments, but I guess I'll just
send a v2 today. I'll probably include the SRCU+timeout logic for
return_instances, and maybe lockless VMA parts as well. Those won't be
ready for landing yet, but it's probably useful to see everything
end-to-end.
Given the hiccup with lockless uprobes_tree lookup, we shouldn't land
that change yet, but the first 4 patches hopefully are in good enough
shape to apply and reduce the amount of patches that need to be
juggled around.
>
> As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)
Honestly curious, why the preference?
>
> Oleg.
>
BTW, while you are here :) What can you say about
current->sighand->siglock use in handle_singlestep()? Is there any way
to avoid that (or at least not have to take it every single time a
single-stepped uprobe is triggered?). This turned out to be a huge
issue for single-stepped uprobes when scaling to multiple CPUs even
with all the other things (all the SRCU and lockless VMA lookup) taken
care of.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 6/8] perf/uprobe: split uprobe_unregister()
2024-08-07 13:17 ` Oleg Nesterov
@ 2024-08-07 15:24 ` Andrii Nakryiko
0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-07 15:24 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On Wed, Aug 7, 2024 at 6:17 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> I guess you know this, but just in case...
>
> On 07/31, Andrii Nakryiko wrote:
> >
> > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > @@ -478,7 +478,8 @@ static void testmod_unregister_uprobe(void)
> > mutex_lock(&testmod_uprobe_mutex);
> >
> > if (uprobe.uprobe) {
> > - uprobe_unregister(uprobe.uprobe, &uprobe.consumer);
> > + uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer);
> > + uprobe_unregister_sync();
> > uprobe.offset = 0;
> > uprobe.uprobe = NULL;
>
> this chunk has trivial conflicts with tip perf/core
>
> db61e6a4eee5a selftests/bpf: fix uprobe.path leak in bpf_testmod
> adds path_put(&uprobe.path) here
>
> 3c83a9ad0295e make uprobe_register() return struct uprobe *
> removes the "uprobe.offset = 0;" line.
>
Yep, I'll rebase and adjust everything as needed.
> Oleg.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] uprobes: RCU-protected hot path optimizations
2024-08-07 15:24 ` Andrii Nakryiko
@ 2024-08-07 17:11 ` Oleg Nesterov
2024-08-07 17:31 ` Andrii Nakryiko
0 siblings, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-07 17:11 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On 08/07, Andrii Nakryiko wrote:
>
> Yes, I was waiting for more of Peter's comments, but I guess I'll just
> send a v2 today.
OK,
> I'll probably include the SRCU+timeout logic for
> return_instances, and maybe lockless VMA parts as well.
Well, feel free to do what you think is right, but perhaps it would be
better to push this series first? At least 1-4.
As for lockless VMA. To me this needs more discussions. I didn't read
your conversation with Peter and Suren carefully, but I too have some
concerns. Most probably I am wrong, and until I saw this thread I didn't
even know that vm_area_free() uses call_rcu() if CONFIG_PER_VMA_LOCK,
but still.
> > As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)
>
> Honestly curious, why the preference?
Well, you can safely ignore me, but since you have asked ;)
I understand what SRCU does, and years ago I even understood (I hope)
the implementation. More or less the same for rcu_tasks. But as for
the _trace flavour, I simply fail to understand its semantics.
> BTW, while you are here :) What can you say about
> current->sighand->siglock use in handle_singlestep()?
It should die, and this looks simple. I disagree with the patches
from Liao, see the
https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com/
thread, but I agree with the intent.
IMO, we need a simple "bool restore_sigpending" in uprobe_task, it will make the
necessary changes really simple.
(To clarify. In fact I think that a new TIF_ or even PF_ flag makes more sense,
afaics it can have more users. But I don't think that uprobes can provide enough
justification for that right now)
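Roughly something like the sketch below (the field name, where exactly it gets set, and the
interaction with the UTASK_SSTEP_DENY_SIGNAL state are all just assumptions here, not code
from any posted patch):

    /* in struct uprobe_task: remember that we suppressed a pending signal */
    bool    restore_sigpending;

    /* where uprobe_deny_signal() clears TIF_SIGPENDING today: */
    utask->restore_sigpending = true;
    clear_tsk_thread_flag(t, TIF_SIGPENDING);

    /*
     * in handle_singlestep(), instead of taking sighand->siglock just
     * to recalc_sigpending():
     */
    if (utask->restore_sigpending) {
            utask->restore_sigpending = false;
            set_thread_flag(TIF_SIGPENDING);
    }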
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
2024-07-31 21:42 ` [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
@ 2024-08-07 17:14 ` Andrii Nakryiko
2024-08-08 10:04 ` Oleg Nesterov
2024-08-08 14:29 ` Oleg Nesterov
2024-08-08 13:40 ` Oleg Nesterov
1 sibling, 2 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-07 17:14 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck
On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Another big bottleneck to scalablity is uprobe_treelock that's taken in
> a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
> take advantage of that and make uprobes_tree RB-tree look up lockless.
>
> To make RB-tree RCU-protected lockless lookup correct, we need to take
> into account that such RB-tree lookup can return false negatives if there
> are parallel RB-tree modifications (rotations) going on. We use seqcount
> lock to detect whether RB-tree changed, and if we find nothing while
> RB-tree got modified inbetween, we just retry. If uprobe was found, then
> it's guaranteed to be a correct lookup.
>
> With all the lock-avoiding changes done, we get a pretty decent
> improvement in performance and scalability of uprobes with number of
> CPUs, even though we are still nowhere near linear scalability. This is
> due to SRCU not really scaling very well with number of CPUs on
> a particular hardware that was used for testing (80-core Intel Xeon Gold
> 6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock, which is
> currently taken to resolve interrupt address to inode+offset and then
> uprobe instance. And, of course, uretprobes still need similar RCU to
> avoid refcount in the hot path, which will be addressed in the follow up
> patches.
>
> Nevertheless, the improvement is good. We used BPF selftest-based
> uprobe-nop and uretprobe-nop benchmarks to get the below numbers,
> varying number of CPUs on which uprobes and uretprobes are triggered.
>
> BASELINE
> ========
> uprobe-nop ( 1 cpus): 3.032 ± 0.023M/s ( 3.032M/s/cpu)
> uprobe-nop ( 2 cpus): 3.452 ± 0.005M/s ( 1.726M/s/cpu)
> uprobe-nop ( 4 cpus): 3.663 ± 0.005M/s ( 0.916M/s/cpu)
> uprobe-nop ( 8 cpus): 3.718 ± 0.038M/s ( 0.465M/s/cpu)
> uprobe-nop (16 cpus): 3.344 ± 0.008M/s ( 0.209M/s/cpu)
> uprobe-nop (32 cpus): 2.288 ± 0.021M/s ( 0.071M/s/cpu)
> uprobe-nop (64 cpus): 3.205 ± 0.004M/s ( 0.050M/s/cpu)
>
> uretprobe-nop ( 1 cpus): 1.979 ± 0.005M/s ( 1.979M/s/cpu)
> uretprobe-nop ( 2 cpus): 2.361 ± 0.005M/s ( 1.180M/s/cpu)
> uretprobe-nop ( 4 cpus): 2.309 ± 0.002M/s ( 0.577M/s/cpu)
> uretprobe-nop ( 8 cpus): 2.253 ± 0.001M/s ( 0.282M/s/cpu)
> uretprobe-nop (16 cpus): 2.007 ± 0.000M/s ( 0.125M/s/cpu)
> uretprobe-nop (32 cpus): 1.624 ± 0.003M/s ( 0.051M/s/cpu)
> uretprobe-nop (64 cpus): 2.149 ± 0.001M/s ( 0.034M/s/cpu)
>
> SRCU CHANGES
> ============
> uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu)
> uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu)
> uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu)
> uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu)
> uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu)
> uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu)
> uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu)
>
> uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu)
> uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu)
> uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu)
> uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu)
> uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu)
> uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu)
> uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu)
>
> Peak throughput increased from 3.7 mln/s (uprobe triggerings) up to about
> 8 mln/s. For uretprobes it's a bit more modest, with a bump from 2.4 mln/s
> to 5 mln/s.
>
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> kernel/events/uprobes.c | 30 ++++++++++++++++++++++++------
> 1 file changed, 24 insertions(+), 6 deletions(-)
>
Ok, so it seems like rb_find_rcu() and rb_find_add_rcu() are not
enough or are buggy. I managed to more or less reliably start
reproducing a crash, which was bisected to exactly this change. My
wild guess is that we'd need an rb_erase_rcu() variant or something,
because what I'm seeing is a corrupted rb_root node while performing
lockless rb_find_rcu() operation.
You can find the debugging info below (which Gmail will butcher here) in
[0], for convenience.
So, I got the below crash when running `sudo ./uprobe-stress -a20 -f3
-m5 -t4` on a 16-core QEMU VM.
[ 179.375551] BUG: unable to handle page fault for address: 0000441f0f660097
[ 179.376612] #PF: supervisor read access in kernel mode
[ 179.377334] #PF: error_code(0x0000) - not-present page
[ 179.378037] PGD 0 P4D 0
[ 179.378391] Oops: Oops: 0000 [#1] PREEMPT SMP
[ 179.378992] CPU: 5 UID: 0 PID: 2292 Comm: uprobe-stress Tainted: G
E 6.11.0-rc1-00025-g6f8e0d8d5b55-dirty #181
[ 179.380514] Tainted: [E]=UNSIGNED_MODULE
[ 179.381022] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 179.382475] RIP: 0010:uprobe_notify_resume+0x3db/0xc40
[ 179.383148] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35
bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85
ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d
98 00
[ 179.385639] RSP: 0000:ffffc90000a93e78 EFLAGS: 00010202
[ 179.386338] RAX: 74c2394808478d48 RBX: ffff888105619800 RCX: 0000000000004118
[ 179.387480] RDX: ffff888109c18eb0 RSI: 00000000000a6130 RDI: ffff888105619800
[ 179.388677] RBP: 0000441f0f660007 R08: ffff8881098f1300 R09: 0000000000000000
[ 179.389729] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888105680fc0
[ 179.390694] R13: 0000000000000000 R14: ffff888105681068 R15: ffff888105681000
[ 179.391717] FS: 00007f10690006c0(0000) GS:ffff88881fb40000(0000)
knlGS:0000000000000000
[ 179.392800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 179.393582] CR2: 0000441f0f660097 CR3: 00000001049bb004 CR4: 0000000000370ef0
[ 179.394536] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 179.395485] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 179.396464] Call Trace:
[ 179.396796] <TASK>
[ 179.397133] ? __die+0x1f/0x60
[ 179.397640] ? page_fault_oops+0x14c/0x440
[ 179.398316] ? do_user_addr_fault+0x5f/0x6c0
[ 179.398899] ? kvm_read_and_reset_apf_flags+0x3c/0x50
[ 179.399730] ? exc_page_fault+0x66/0x130
[ 179.400361] ? asm_exc_page_fault+0x22/0x30
[ 179.400912] ? uprobe_notify_resume+0x3db/0xc40
[ 179.401500] ? uprobe_notify_resume+0x11a/0xc40
[ 179.402089] ? arch_uprobe_exception_notify+0x39/0x40
[ 179.402736] ? notifier_call_chain+0x55/0xc0
[ 179.403293] irqentry_exit_to_user_mode+0x98/0x140
[ 179.403917] asm_exc_int3+0x35/0x40
[ 179.404394] RIP: 0033:0x404119
[ 179.404831] Code: 8b 45 fc 89 c7 e8 43 09 00 00 83 c0 01 c9 c3 55
48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 29 09 00 00 83 c0 01
c9 c3 cc <48> 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 0f 09 00 00
83 c0
[ 179.407350] RSP: 002b:00007f1068fffab8 EFLAGS: 00000206
[ 179.408083] RAX: 0000000000404118 RBX: 00007f1069000cdc RCX: 000000000000000a
[ 179.409020] RDX: 0000000000000012 RSI: 0000000000000064 RDI: 0000000000000012
[ 179.409965] RBP: 00007f1068fffae0 R08: 00007f106a84403c R09: 00007f106a8440a0
[ 179.410864] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffff80
[ 179.411768] R13: 0000000000000016 R14: 00007ffc3dc8f070 R15: 00007f1068800000
[ 179.412672] </TASK>
[ 179.412965] Modules linked in: aesni_intel(E) crypto_simd(E)
cryptd(E) kvm_intel(E) kvm(E) i2c_piix4(E) i2c_smbus(E) i2c_core(E)
i6300esb(E) crc32c_intel(E) floppy(E) pcspkr(E) button(E) serio_raw(E)
[ 179.415227] CR2: 0000441f0f660097
[ 179.415683] ---[ end trace 0000000000000000 ]---
Note RBP: 0000441f0f660007 and RIP: 0010:uprobe_notify_resume+0x3db/0xc40
Decoding:
[ 179.417075] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb
36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed
74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98
00
All code
========
0: c1 e1 0c shl $0xc,%ecx
3: 48 2b 08 sub (%rax),%rcx
6: 48 8b 44 24 08 mov 0x8(%rsp),%rax
b: 48 01 c1 add %rax,%rcx
e: 8b 35 bb 36 0d 03 mov 0x30d36bb(%rip),%esi # 0x30d36cf
14: 40 f6 c6 01 test $0x1,%sil
18: 0f 85 be 00 00 00 jne 0xdc
1e: 48 8b 2d ba 36 0d 03 mov 0x30d36ba(%rip),%rbp # 0x30d36df
25: 48 85 ed test %rbp,%rbp
28: 74 29 je 0x53
2a:* 48 8b 85 90 00 00 00 mov 0x90(%rbp),%rax <-- trapping instruction
31: 48 39 c2 cmp %rax,%rdx
34: 72 2d jb 0x63
36: 48 39 d0 cmp %rdx,%rax
39: 72 0f jb 0x4a
3b: 48 rex.W
3c: 3b .byte 0x3b
3d: 8d .byte 0x8d
3e: 98 cwtl
...
Let's also look at uprobe_notify_resume+0x3db:
(gdb) list *(uprobe_notify_resume+0x3db)
0xffffffff81242f5b is in uprobe_notify_resume
(/data/users/andriin/linux/kernel/events/uprobes.c:662).
657
658 static __always_inline
659 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
660 const struct uprobe *r)
661 {
662 if (l_inode < r->inode)
663 return -1;
664
665 if (l_inode > r->inode)
666 return 1;
So this is most probably when accessing uprobe through r->inode.
Looking at uprobe_notify_resume disassembly:
0xffffffff81242f25 <+933>: mov 0xa8(%rcx),%rdx
0xffffffff81242f2c <+940>: mov 0x78(%rax),%rcx
0xffffffff81242f30 <+944>: shl $0xc,%rcx
0xffffffff81242f34 <+948>: sub (%rax),%rcx
0xffffffff81242f37 <+951>: mov 0x8(%rsp),%rax
0xffffffff81242f3c <+956>: add %rax,%rcx
0xffffffff81242f3f <+959>: mov 0x30d36bb(%rip),%esi #
0xffffffff84316600 <uprobes_seqcount>
0xffffffff81242f45 <+965>: test $0x1,%sil
0xffffffff81242f49 <+969>: jne 0xffffffff8124300d <uprobe_notify_resume+1165>
0xffffffff81242f4f <+975>: mov 0x30d36ba(%rip),%rbp #
0xffffffff84316610 <uprobes_tree>
0xffffffff81242f56 <+982>: test %rbp,%rbp
0xffffffff81242f59 <+985>: je 0xffffffff81242f84 <uprobe_notify_resume+1028>
0xffffffff81242f5b <+987>: mov 0x90(%rbp),%rax
0xffffffff81242f62 <+994>: cmp %rax,%rdx
0xffffffff81242f65 <+997>: jb 0xffffffff81242f94 <uprobe_notify_resume+1044>
0xffffffff81242f67 <+999>: cmp %rdx,%rax
0xffffffff81242f6a <+1002>: jb 0xffffffff81242f7b <uprobe_notify_resume+1019>
0xffffffff81242f6c <+1004>: cmp 0x98(%rbp),%rcx
0xffffffff81242f73 <+1011>: jl 0xffffffff81242f94 <uprobe_notify_resume+1044>
0xffffffff81242f75 <+1013>: jle 0xffffffff81242d0e <uprobe_notify_resume+398>
0xffffffff81242f7b <+1019>: mov 0x8(%rbp),%rbp
0xffffffff81242f7f <+1023>: test %rbp,%rbp
Note how in mov 0x90(%rbp),%rax %rbp is coming from
mov 0x30d36ba(%rip),%rbp # 0xffffffff84316610 <uprobes_tree>
So. This is the inlined uprobe_cmp() on uprobes_tree.rb_node (i.e., the
root of the RB-tree). And we know that %rbp was set to 0x0000441f0f660007,
which doesn't look like a valid kernel address to me. 0x90 is
offsetof(struct uprobe, inode), and offsetof(struct uprobe, rb_node)
== 0, which matches the faulting address 0x0000441f0f660097
(%rbp + 0x90). So we load the rb_node/uprobe pointer from uprobes_tree
(struct rb_root), get 0x0000441f0f660007, and then r->inode crashes the
kernel because r looks like a user-space address.
Note that we never modify RB tree root (or any node) directly
anywhere. We only mutate it with rb_find_add_rcu() and rb_erase(),
both under uprobes_treelock.
So, any ideas how we can end up with "corrupted" root on lockless
lookup with rb_find_rcu()? This seems to be the very first lockless
RB-tree lookup use case in the tree, so perhaps it's not that safe
after all (certainly rb_erase() is non-trivial enough to not be
"obviously correct" w.r.t. RCU, no)?
[0] https://gist.github.com/anakryiko/df31ab75f25544af93cd41273056ee88
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index b0488d356399..d03962cc96de 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
>
> static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>
> DEFINE_STATIC_SRCU(uprobes_srcu);
>
> @@ -629,8 +630,11 @@ static void put_uprobe(struct uprobe *uprobe)
>
> write_lock(&uprobes_treelock);
>
> - if (uprobe_is_active(uprobe))
> + if (uprobe_is_active(uprobe)) {
> + write_seqcount_begin(&uprobes_seqcount);
> rb_erase(&uprobe->rb_node, &uprobes_tree);
> + write_seqcount_end(&uprobes_seqcount);
> + }
>
> write_unlock(&uprobes_treelock);
>
> @@ -696,14 +700,26 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
> .offset = offset,
> };
> struct rb_node *node;
> + unsigned int seq;
>
> lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
>
> - read_lock(&uprobes_treelock);
> - node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
> - read_unlock(&uprobes_treelock);
> + do {
> + seq = read_seqcount_begin(&uprobes_seqcount);
> + node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
> + /*
> + * Lockless RB-tree lookups can result only in false negatives.
> + * If the element is found, it is correct and can be returned
> + * under RCU protection. If we find nothing, we need to
> + * validate that seqcount didn't change. If it did, we have to
> + * try again as we might have missed the element (false
> + * negative). If seqcount is unchanged, search truly failed.
> + */
> + if (node)
> + return __node_2_uprobe(node);
> + } while (read_seqcount_retry(&uprobes_seqcount, seq));
>
> - return node ? __node_2_uprobe(node) : NULL;
> + return NULL;
> }
>
> /*
> @@ -725,7 +741,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
> {
> struct rb_node *node;
> again:
> - node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
> + node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
> if (node) {
> struct uprobe *u = __node_2_uprobe(node);
>
> @@ -750,7 +766,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> struct uprobe *u;
>
> write_lock(&uprobes_treelock);
> + write_seqcount_begin(&uprobes_seqcount);
> u = __insert_uprobe(uprobe);
> + write_seqcount_end(&uprobes_seqcount);
> write_unlock(&uprobes_treelock);
>
> return u;
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] uprobes: RCU-protected hot path optimizations
2024-08-07 17:11 ` Oleg Nesterov
@ 2024-08-07 17:31 ` Andrii Nakryiko
2024-08-07 18:24 ` Oleg Nesterov
2024-08-08 7:51 ` Liao, Chang
0 siblings, 2 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-07 17:31 UTC (permalink / raw)
To: Oleg Nesterov, Liao Chang
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On Wed, Aug 7, 2024 at 10:11 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/07, Andrii Nakryiko wrote:
> >
> > Yes, I was waiting for more of Peter's comments, but I guess I'll just
> > send a v2 today.
>
> OK,
>
> > I'll probably include the SRCU+timeout logic for
> > return_instances, and maybe lockless VMA parts as well.
>
> Well, feel free to do what you think right, but perhaps it would be
> better to push this series first? at least 1-4.
Ok, I can send those first 4 patches first and hopefully we can land
them soon and move to the next part. I just also wrote up details
about that crash in rb_find_rcu().
>
> As for lockless VMA. To me this needs more discussions. I didn't read
We are still discussing, feel free to join the conversation.
> your conversation with Peter and Suren carefully, but I too have some
> concerns. Most probably I am wrong, and until I saw this thread I didn't
> even know that vm_area_free() uses call_rcu() if CONFIG_PER_VMA_LOCK,
> but still.
>
> > > As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)
> >
> > Honestly curious, why the preference?
>
> Well, you can safely ignore me, but since you have asked ;)
>
> I understand what SRCU does, and years ago I even understood (I hope)
> the implementation. More or less the same for rcu_tasks. But as for
> the _trace flavour, I simply fail to understand its semantics.
Ok, I won't try to repeat Paul's explanations. If you are curious you
can find them in comments to my previous batch register/unregister API
patches.
>
> > BTW, while you are here :) What can you say about
> > current->sighand->siglock use in handle_singlestep()?
>
> It should die, and this looks simple. I disagree with the patches
> from Liao, see the
> https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com/
> thread, but I agree with the intent.
I wasn't aware of this patch, thanks for mentioning it. Strange that
me or at least bpf@vger.kernel.org wasn't CC'ed.
Liao, please cc bpf@ mailing list for future patches like that.
>
> IMO, we need a simple "bool restore_sigpending" in uprobe_task, it will make the
> necessary changes really simple.
The simpler the better, I can't comment on correctness as I don't
understand the logic well enough. Are you going to send a patch with
your bool flag proposal?
>
> (To clarify. In fact I think that a new TIF_ or even PF_ flag makes more sense,
> afaics it can have more users. But I don't think that uprobes can provide enough
> justification for that right now)
>
> Oleg.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] uprobes: RCU-protected hot path optimizations
2024-08-07 17:31 ` Andrii Nakryiko
@ 2024-08-07 18:24 ` Oleg Nesterov
2024-08-08 7:51 ` Liao, Chang
1 sibling, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-07 18:24 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Liao Chang, Andrii Nakryiko, linux-trace-kernel, peterz, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck
On 08/07, Andrii Nakryiko wrote:
>
> Are you going to send a patch with
> your bool flag proposal?
No, I will wait for the next version from Liao.
If he agrees with my comments. If not, we will continue the discussion in that thread.
Note also another patch from Liao which removes ->siglock from uprobe_deny_signal(),
https://lore.kernel.org/all/20240731095017.1560516-1-liaochang1@huawei.com/
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] uprobes: RCU-protected hot path optimizations
2024-08-07 17:31 ` Andrii Nakryiko
2024-08-07 18:24 ` Oleg Nesterov
@ 2024-08-08 7:51 ` Liao, Chang
1 sibling, 0 replies; 51+ messages in thread
From: Liao, Chang @ 2024-08-08 7:51 UTC (permalink / raw)
To: Andrii Nakryiko, Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
在 2024/8/8 1:31, Andrii Nakryiko 写道:
> On Wed, Aug 7, 2024 at 10:11 AM Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> On 08/07, Andrii Nakryiko wrote:
>>>
>>> Yes, I was waiting for more of Peter's comments, but I guess I'll just
>>> send a v2 today.
>>
>> OK,
>>
>>> I'll probably include the SRCU+timeout logic for
>>> return_instances, and maybe lockless VMA parts as well.
>>
>> Well, feel free to do what you think right, but perhaps it would be
>> better to push this series first? at least 1-4.
>
> Ok, I can send those first 4 patches first and hopefully we can land
> them soon and move to the next part. I just also wrote up details
> about that crash in rb_find_rcu().
>
>>
>> As for lockless VMA. To me this needs more discussions. I didn't read
>
> We are still discussing, feel free to join the conversation.
>
>> your conversation with Peter and Suren carefully, but I too have some
>> concerns. Most probably I am wrong, and until I saw this thread I didn't
>> even know that vm_area_free() uses call_rcu() if CONFIG_PER_VMA_LOCK,
>> but still.
>>
>>>> As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)
>>>
>>> Honestly curious, why the preference?
>>
>> Well, you can safely ignore me, but since you have asked ;)
>>
>> I understand what SRCU does, and years ago I even understood (I hope)
>> the implementation. More or less the same for rcu_tasks. But as for
>> the _trace flavour, I simply fail to understand its semantics.
>
> Ok, I won't try to repeat Paul's explanations. If you are curious you
> can find them in comments to my previous batch register/unregister API
> patches.
>
>>
>>> BTW, while you are here :) What can you say about
>>> current->sighand->siglock use in handle_singlestep()?
>>
>> It should die, and this looks simple. I disagree with the patches
>> from Liao, see the
>> https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com/
>> thread, but I agree with the intent.
>
> I wasn't aware of this patch, thanks for mentioning it. Strange that
> me or at least bpf@vger.kernel.org wasn't CC'ed.
>
> Liao, please cc bpf@ mailing list for future patches like that.
OK, sorry about that.
>
>>
>> IMO, we need a simple "bool restore_sigpending" in uprobe_task, it will make the
>> necessary changes really simple.
>
[...]
>>
>> (To clarify. In fact I think that a new TIF_ or even PF_ flag makes more sense,
>> afaics it can have more users. But I don't think that uprobes can provide enough
>> justification for that right now)
I also faced the same choice when Oleg suggested adding a new flag to track the
denied signal. Since I haven't encountered scenarios outside of uprobes that would
deny a signal, I'm not confident introducing a new TIF_ flag without fully
understanding the potential impacts.
>>
>> Oleg.
>>
--
BR
Liao, Chang
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
2024-08-07 17:14 ` Andrii Nakryiko
@ 2024-08-08 10:04 ` Oleg Nesterov
2024-08-08 14:29 ` Oleg Nesterov
1 sibling, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-08 10:04 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On 08/07, Andrii Nakryiko wrote:
>
> Ok, so it seems like rb_find_rcu() and rb_find_add_rcu() are not
> enough or are buggy. I managed to more or less reliably start
> reproducing a crash, which was bisected to exactly this change. My
> wild guess is that we'd need an rb_erase_rcu() variant or something,
And then I think it is not safe to put uprobe->rb_node and uprobe->rcu
in the union, sorry...
Did you get the crash with this change?
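For readers following along, the layout in question is roughly this (a sketch
assuming the series overlaps the two fields in a union; the exact declaration
in the patches may differ):

	struct uprobe {
		/* ... */
		union {
			struct rb_node	rb_node;	/* while linked in uprobes_tree */
			struct rcu_head	rcu;		/* reused once handed to call_srcu() */
		};
		/* ... */
	};

	/*
	 * A lockless rb_find_rcu() walker may still be traversing a node that
	 * was just rb_erase()'d.  call_srcu() writes the rcu_head immediately
	 * (to queue the callback), so the same bytes the walker reads as child
	 * pointers can get scribbled over before the grace period ends -- which
	 * would produce exactly the kind of garbage pointer seen in the crash.
	 */

Keeping rb_node and rcu as separate (non-overlapping) fields avoids this, at
the cost of a slightly larger struct uprobe.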
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
2024-07-31 21:42 ` [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
2024-08-07 17:14 ` Andrii Nakryiko
@ 2024-08-08 13:40 ` Oleg Nesterov
2024-08-10 14:00 ` Oleg Nesterov
1 sibling, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-08 13:40 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck
On 07/31, Andrii Nakryiko wrote:
>
> static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
Just noticed... Why seqcount_rwlock_t?
find_uprobe_rcu() doesn't use read_seqbegin_or_lock(),
seqcount_t should work just fine.
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
2024-08-07 17:14 ` Andrii Nakryiko
2024-08-08 10:04 ` Oleg Nesterov
@ 2024-08-08 14:29 ` Oleg Nesterov
2024-08-08 17:00 ` Andrii Nakryiko
1 sibling, 1 reply; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-08 14:29 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On 08/07, Andrii Nakryiko wrote:
>
> So, any ideas how we can end up with "corrupted" root on lockless
> lookup with rb_find_rcu()?
I certainly can't help ;) I know ABSOLUTELY NOTHING about rb or any
other tree.
But,
> This seems to be the very first lockless
> RB-tree lookup use case in the tree,
Well, latch_tree_find() is supposed to be rcu-safe afaics, and
__lt_erase() is just rb_erase(). So it is not the 1st use case.
See also the "Notes on lockless lookups" comment in lib/rbtree.c.
So it seems that rb_erase() is supposed to be rcu-safe. However
it uses __rb_change_child(), not __rb_change_child_rcu().
Not that I think this can explain the problem, and on x86
__smp_store_release() is just WRITE_ONCE, but looks confusing...
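Roughly, the distinction being pointed out is what a concurrent lockless reader
observes when a child/root pointer is re-linked. A simplified contrast (these
are not the kernel's actual helpers from rbtree_augmented.h, just an
illustration; kernel context, <linux/rbtree.h> and <linux/rcupdate.h>):

	static inline void change_child_plain(struct rb_node **link,
					      struct rb_node *new)
	{
		WRITE_ONCE(*link, new);		/* plain store: no ordering vs. node init */
	}

	static inline void change_child_rcu(struct rb_node **link,
					    struct rb_node *new)
	{
		rcu_assign_pointer(*link, new);	/* release: node contents visible first */
	}

As noted, on x86 both end up as ordinary stores, so this difference alone would
not explain the crash on the test machine, though it would matter on weakly
ordered architectures.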
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
2024-08-08 14:29 ` Oleg Nesterov
@ 2024-08-08 17:00 ` Andrii Nakryiko
0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-08-08 17:00 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck
On Thu, Aug 8, 2024 at 7:29 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/07, Andrii Nakryiko wrote:
> >
> > So, any ideas how we can end up with "corrupted" root on lockless
> > lookup with rb_find_rcu()?
>
> I certainly can't help ;) I know ABSOLUTELY NOTHING about rb or any
> other tree.
>
> But,
>
> > This seems to be the very first lockless
> > RB-tree lookup use case in the tree,
>
> Well, latch_tree_find() is supposed to be rcu-safe afaics, and
> __lt_erase() is just rb_erase(). So it is not the 1st use case.
>
> See also the "Notes on lockless lookups" comment in lib/rbtree.c.
>
> So it seems that rb_erase() is supposed to be rcu-safe. However
> it uses __rb_change_child(), not __rb_change_child_rcu().
>
While trying to mitigate the crash locally I noticed
__rb_change_child() and manually changed all the cases to
__rb_change_child_rcu(). That didn't help :) But I think your guess
about sharing rcu and rb_node is the right one, so hopefully that will
solve the issue.
> Not that I think this can explain the problem, and on x86
> __smp_store_release() is just WRITE_ONCE, but looks confusing...
>
> Oleg.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
2024-08-08 13:40 ` Oleg Nesterov
@ 2024-08-10 14:00 ` Oleg Nesterov
0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2024-08-10 14:00 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck
On 08/08, Oleg Nesterov wrote:
>
> On 07/31, Andrii Nakryiko wrote:
> >
> > static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>
> Just noticed... Why seqcount_rwlock_t?
>
> find_uprobe_rcu() doesn't use read_seqbegin_or_lock(),
> seqcount_t should work just fine.
Please ignore... I forgot that seqcount_t is not CONFIG_PREEMPT_RT-friendly.
Hmm. __seqprop_preemptible() returns 0; this doesn't look right... Never mind.
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2024-08-10 14:00 UTC | newest]
Thread overview: 51+ messages
2024-07-31 21:42 [PATCH 0/8] uprobes: RCU-protected hot path optimizations Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 1/8] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 2/8] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
2024-08-01 11:09 ` Jiri Olsa
2024-08-01 16:49 ` Andrii Nakryiko
2024-08-01 22:07 ` Andrii Nakryiko
2024-08-02 8:50 ` Oleg Nesterov
2024-08-02 14:58 ` Andrii Nakryiko
2024-08-02 22:19 ` Oleg Nesterov
2024-08-02 11:11 ` Jiri Olsa
2024-08-02 15:03 ` Andrii Nakryiko
2024-08-05 13:44 ` Oleg Nesterov
2024-08-05 17:29 ` Andrii Nakryiko
2024-08-06 10:45 ` Oleg Nesterov
2024-07-31 21:42 ` [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
2024-08-01 12:23 ` Liao, Chang
2024-08-01 16:49 ` Andrii Nakryiko
2024-08-02 1:30 ` Liao, Chang
2024-08-05 14:51 ` Oleg Nesterov
2024-08-05 17:31 ` Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 4/8] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 5/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
2024-08-01 14:27 ` Jiri Olsa
2024-08-01 16:49 ` Andrii Nakryiko
2024-08-05 15:59 ` Oleg Nesterov
2024-08-05 17:31 ` Andrii Nakryiko
2024-08-06 10:54 ` Oleg Nesterov
2024-07-31 21:42 ` [PATCH 6/8] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
2024-08-02 2:41 ` Liao, Chang
2024-08-02 15:05 ` Andrii Nakryiko
2024-08-05 20:01 ` Andrii Nakryiko
2024-08-06 1:50 ` Liao, Chang
2024-08-07 13:17 ` Oleg Nesterov
2024-08-07 15:24 ` Andrii Nakryiko
2024-07-31 21:42 ` [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
2024-08-07 17:14 ` Andrii Nakryiko
2024-08-08 10:04 ` Oleg Nesterov
2024-08-08 14:29 ` Oleg Nesterov
2024-08-08 17:00 ` Andrii Nakryiko
2024-08-08 13:40 ` Oleg Nesterov
2024-08-10 14:00 ` Oleg Nesterov
2024-07-31 21:42 ` [PATCH 8/8] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
2024-08-01 9:35 ` Peter Zijlstra
2024-08-01 16:49 ` Andrii Nakryiko
2024-08-01 18:05 ` Paul E. McKenney
2024-08-07 13:29 ` [PATCH 0/8] uprobes: RCU-protected hot path optimizations Oleg Nesterov
2024-08-07 15:24 ` Andrii Nakryiko
2024-08-07 17:11 ` Oleg Nesterov
2024-08-07 17:31 ` Andrii Nakryiko
2024-08-07 18:24 ` Oleg Nesterov
2024-08-08 7:51 ` Liao, Chang