linux-trace-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/11] perf/uprobe: Optimize uprobes
@ 2024-07-11 11:02 Peter Zijlstra
  2024-07-11 11:02 ` [PATCH v2 01/11] perf/uprobe: Re-indent labels Peter Zijlstra
                   ` (11 more replies)
  0 siblings, 12 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

Hi!

These patches implement the (S)RCU based proposal to optimize uprobes.

On my c^Htrusty old IVB-EP -- where each of the 40 CPUs calls 'func' in a
tight loop:

  perf probe -x ./uprobes test=func
  perf stat -ae probe_uprobe:test  -- sleep 1

  perf probe -x ./uprobes test=func%return
  perf stat -ae probe_uprobe:test__return -- sleep 1
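
For reference, a minimal sketch of the './uprobes' test binary assumed
above (a hypothetical reconstruction -- the actual test source wasn't
posted). Every CPU spins calling func() so each iteration traps into
the probe:

  /* build: gcc -O2 -pthread uprobes.c -o uprobes */
  #include <pthread.h>
  #include <unistd.h>

  __attribute__((noinline)) void func(void)
  {
  	asm volatile("");	/* keep the call from being elided */
  }

  static void *loop(void *arg)
  {
  	for (;;)
  		func();
  	return NULL;
  }

  int main(void)
  {
  	long i, n = sysconf(_SC_NPROCESSORS_ONLN);
  	pthread_t tid;

  	for (i = 1; i < n; i++)
  		pthread_create(&tid, NULL, loop, NULL);
  	loop(NULL);	/* main thread loops too */
  	return 0;
  }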

PRE:

  4,038,804      probe_uprobe:test
  2,356,275      probe_uprobe:test__return

POST:

  7,216,579      probe_uprobe:test
  6,744,786      probe_uprobe:test__return

(copy-paste FTW; I didn't do new numbers because the fast paths didn't change --
 and a quick test run shows similar numbers)

Patches also available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes


Changes since last time:
 - better split with intermediate inc_not_zero()
 - fix UPROBE_HANDLER_REMOVE
 - restored the lost rcu_assign_pointer()
 - avoid lockdep for uretprobe_srcu
 - add missing put_uprobe() -> srcu_read_unlock() conversion
 - actually initialize return_instance::has_ref
 - a few comments
 - things I don't remember



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 01/11] perf/uprobe: Re-indent labels
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-11 11:58   ` Jiri Olsa
  2024-07-11 11:02 ` [PATCH v2 02/11] perf/uprobe: Remove spurious whitespace Peter Zijlstra
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

Remove the silly label indenting.

 s/^\ \([[:alnum:]]*\):$/\1:/g

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/uprobes.c |   26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -205,7 +205,7 @@ static int __replace_page(struct vm_area
 	folio_put(old_folio);
 
 	err = 0;
- unlock:
+unlock:
 	mmu_notifier_invalidate_range_end(&range);
 	folio_unlock(old_folio);
 	return err;
@@ -857,7 +857,7 @@ static int prepare_uprobe(struct uprobe
 	smp_wmb(); /* pairs with the smp_rmb() in handle_swbp() */
 	set_bit(UPROBE_COPY_INSN, &uprobe->flags);
 
- out:
+out:
 	up_write(&uprobe->consumer_rwsem);
 
 	return ret;
@@ -965,7 +965,7 @@ build_map_info(struct address_space *map
 	struct map_info *info;
 	int more = 0;
 
- again:
+again:
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		if (!valid_vma(vma, is_register))
@@ -1019,7 +1019,7 @@ build_map_info(struct address_space *map
 	} while (--more);
 
 	goto again;
- out:
+out:
 	while (prev)
 		prev = free_map_info(prev);
 	return curr;
@@ -1068,13 +1068,13 @@ register_for_each_vma(struct uprobe *upr
 				err |= remove_breakpoint(uprobe, mm, info->vaddr);
 		}
 
- unlock:
+unlock:
 		mmap_write_unlock(mm);
- free:
+free:
 		mmput(mm);
 		info = free_map_info(info);
 	}
- out:
+out:
 	percpu_up_write(&dup_mmap_sem);
 	return err;
 }
@@ -1159,7 +1159,7 @@ static int __uprobe_register(struct inod
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return -EINVAL;
 
- retry:
+retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (!uprobe)
 		return -ENOMEM;
@@ -1468,7 +1468,7 @@ static int xol_add_vma(struct mm_struct
 	ret = 0;
 	/* pairs with get_xol_area() */
 	smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
- fail:
+fail:
 	mmap_write_unlock(mm);
 
 	return ret;
@@ -1512,7 +1512,7 @@ static struct xol_area *__create_xol_are
 	kfree(area->bitmap);
  free_area:
 	kfree(area);
- out:
+out:
 	return NULL;
 }
 
@@ -1915,7 +1915,7 @@ static void prepare_uretprobe(struct upr
 	utask->return_instances = ri;
 
 	return;
- fail:
+fail:
 	kfree(ri);
 }
 
@@ -2031,7 +2031,7 @@ static int is_trap_at_addr(struct mm_str
 
 	copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 	put_page(page);
- out:
+out:
 	/* This needs to return true for any variant of the trap insn */
 	return is_trap_insn(&opcode);
 }
@@ -2159,7 +2159,7 @@ static void handle_trampoline(struct pt_
 	utask->return_instances = ri;
 	return;
 
- sigill:
+sigill:
 	uprobe_warn(current, "handle uretprobe, sending SIGILL.");
 	force_sig(SIGILL);
 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 02/11] perf/uprobe: Remove spurious whitespace
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
  2024-07-11 11:02 ` [PATCH v2 01/11] perf/uprobe: Re-indent labels Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-11 11:02 ` [PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu() Peter Zijlstra
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/uprobes.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -67,7 +67,7 @@ struct uprobe {
 	 * The generic code assumes that it has two members of unknown type
 	 * owned by the arch-specific code:
 	 *
-	 * 	insn -	copy_insn() saves the original instruction here for
+	 *	insn -	copy_insn() saves the original instruction here for
 	 *		arch_uprobe_analyze_insn().
 	 *
 	 *	ixol -	potentially modified instruction to execute out of
@@ -95,18 +95,18 @@ static LIST_HEAD(delayed_uprobe_list);
  * allocated.
  */
 struct xol_area {
-	wait_queue_head_t 		wq;		/* if all slots are busy */
-	atomic_t 			slot_count;	/* number of in-use slots */
-	unsigned long 			*bitmap;	/* 0 = free slot */
+	wait_queue_head_t		wq;		/* if all slots are busy */
+	atomic_t			slot_count;	/* number of in-use slots */
+	unsigned long			*bitmap;	/* 0 = free slot */
 
 	struct vm_special_mapping	xol_mapping;
-	struct page 			*pages[2];
+	struct page			*pages[2];
 	/*
 	 * We keep the vma's vm_start rather than a pointer to the vma
 	 * itself.  The probed process or a naughty kernel module could make
 	 * the vma go away, and we must handle that reasonably gracefully.
 	 */
-	unsigned long 			vaddr;		/* Page(s) of instruction slots */
+	unsigned long			vaddr;		/* Page(s) of instruction slots */
 };
 
 /*



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu()
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
  2024-07-11 11:02 ` [PATCH v2 01/11] perf/uprobe: Re-indent labels Peter Zijlstra
  2024-07-11 11:02 ` [PATCH v2 02/11] perf/uprobe: Remove spurious whitespace Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-12 20:23   ` Andrii Nakryiko
  2024-07-11 11:02 ` [PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe() Peter Zijlstra
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

Much like latch_tree, add two RCU methods for the regular RB-tree,
which can be used in conjunction with a seqcount to provide lockless
lookups.
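
A minimal usage sketch of the intended pairing (names hypothetical;
patch 4 adds the real uprobes variant): readers retry lockless lookups
under a seqcount so that false-negative misses are validated, while
writers bump the seqcount under the write lock.

  /* my_cmp_key()/my_cmp() are the usual lookup/insert comparators */
  static struct rb_root my_tree = RB_ROOT;
  static DEFINE_RWLOCK(my_lock);
  static seqcount_rwlock_t my_seq = SEQCNT_RWLOCK_ZERO(my_seq, &my_lock);

  static struct rb_node *my_lookup(const void *key)
  {
  	struct rb_node *node;
  	unsigned int seq;

  	guard(rcu)();
  	do {
  		seq = read_seqcount_begin(&my_seq);
  		node = rb_find_rcu(key, &my_tree, my_cmp_key);
  		if (node)
  			return node;	/* hits are always good */
  	} while (read_seqcount_retry(&my_seq, seq));

  	return NULL;			/* a validated miss */
  }

  static struct rb_node *my_insert(struct rb_node *node)
  {
  	struct rb_node *old;

  	write_lock(&my_lock);
  	write_seqcount_begin(&my_seq);
  	old = rb_find_add_rcu(node, &my_tree, my_cmp);
  	write_seqcount_end(&my_seq);
  	write_unlock(&my_lock);
  	return old;
  }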

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 include/linux/rbtree.h |   67 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -245,6 +245,42 @@ rb_find_add(struct rb_node *node, struct
 }
 
 /**
+ * rb_find_add_rcu() - find equivalent @node in @tree, or add @node
+ * @node: node to look-for / insert
+ * @tree: tree to search / modify
+ * @cmp: operator defining the node order
+ *
+ * Adds a Store-Release for link_node.
+ *
+ * Returns the rb_node matching @node, or NULL when no match is found and @node
+ * is inserted.
+ */
+static __always_inline struct rb_node *
+rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
+		int (*cmp)(struct rb_node *, const struct rb_node *))
+{
+	struct rb_node **link = &tree->rb_node;
+	struct rb_node *parent = NULL;
+	int c;
+
+	while (*link) {
+		parent = *link;
+		c = cmp(node, parent);
+
+		if (c < 0)
+			link = &parent->rb_left;
+		else if (c > 0)
+			link = &parent->rb_right;
+		else
+			return parent;
+	}
+
+	rb_link_node_rcu(node, parent, link);
+	rb_insert_color(node, tree);
+	return NULL;
+}
+
+/**
  * rb_find() - find @key in tree @tree
  * @key: key to match
  * @tree: tree to search
@@ -268,6 +304,37 @@ rb_find(const void *key, const struct rb
 		else
 			return node;
 	}
+
+	return NULL;
+}
+
+/**
+ * rb_find_rcu() - find @key in tree @tree
+ * @key: key to match
+ * @tree: tree to search
+ * @cmp: operator defining the node order
+ *
+ * Notably, tree descent vs concurrent tree rotations is unsound and can result
+ * in false-negatives.
+ *
+ * Returns the rb_node matching @key or NULL.
+ */
+static __always_inline struct rb_node *
+rb_find_rcu(const void *key, const struct rb_root *tree,
+	    int (*cmp)(const void *key, const struct rb_node *))
+{
+	struct rb_node *node = tree->rb_node;
+
+	while (node) {
+		int c = cmp(key, node);
+
+		if (c < 0)
+			node = rcu_dereference_raw(node->rb_left);
+		else if (c > 0)
+			node = rcu_dereference_raw(node->rb_right);
+		else
+			return node;
+	}
 
 	return NULL;
 }



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe()
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
                   ` (2 preceding siblings ...)
  2024-07-11 11:02 ` [PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu() Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-11 13:59   ` Masami Hiramatsu
  2024-07-11 11:02 ` [PATCH v2 05/11] perf/uprobe: Simplify UPROBE_HANDLER_REMOVE logic Peter Zijlstra
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

With handle_swbp() triggering concurrently on (all) CPUs, tree_lock
becomes a bottleneck. Avoid treelock by doing RCU lookups of the
uprobe.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/uprobes.c |   49 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem)
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
 	refcount_t		ref;
+	struct rcu_head		rcu;
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
@@ -587,12 +589,25 @@ set_orig_insn(struct arch_uprobe *auprob
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
+static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
+{
+	if (refcount_inc_not_zero(&uprobe->ref))
+		return uprobe;
+	return NULL;
+}
+
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
 	refcount_inc(&uprobe->ref);
 	return uprobe;
 }
 
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+	kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
 	if (refcount_dec_and_test(&uprobe->ref)) {
@@ -604,7 +619,7 @@ static void put_uprobe(struct uprobe *up
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
+		call_rcu(&uprobe->rcu, uprobe_free_rcu);
 	}
 }
 
@@ -653,10 +668,10 @@ static struct uprobe *__find_uprobe(stru
 		.inode = inode,
 		.offset = offset,
 	};
-	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
+	struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		return try_get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
@@ -667,20 +682,32 @@ static struct uprobe *__find_uprobe(stru
  */
 static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
-	struct uprobe *uprobe;
+	unsigned int seq;
 
-	read_lock(&uprobes_treelock);
-	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	guard(rcu)();
 
-	return uprobe;
+	do {
+		seq = read_seqcount_begin(&uprobes_seqcount);
+		struct uprobe *uprobe = __find_uprobe(inode, offset);
+		if (uprobe) {
+			/*
+			 * Lockless RB-tree lookups are prone to false-negatives.
+			 * If they find something, it's good. If they do not find,
+			 * it needs to be validated.
+			 */
+			return uprobe;
+		}
+	} while (read_seqcount_retry(&uprobes_seqcount, seq));
+
+	/* Really didn't find anything. */
+	return NULL;
 }
 
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
 
-	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
+	node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
 	if (node)
 		return get_uprobe(__node_2_uprobe(node));
 
@@ -702,7 +729,9 @@ static struct uprobe *insert_uprobe(stru
 	struct uprobe *u;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	u = __insert_uprobe(uprobe);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 
 	return u;
@@ -936,7 +965,9 @@ static void delete_uprobe(struct uprobe
 		return;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
 	put_uprobe(uprobe);



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 05/11] perf/uprobe: Simplify UPROBE_HANDLER_REMOVE logic
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
                   ` (3 preceding siblings ...)
  2024-07-11 11:02 ` [PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe() Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-11 11:02 ` [PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list Peter Zijlstra
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

Specifically, get rid of the uprobe->consumers re-load, which isn't
sound under RCU: the list head can change between loads, so the
re-loaded value need not correspond to the consumers that were
actually iterated.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/uprobes.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2101,6 +2101,7 @@ static void handler_chain(struct uprobe
 	struct uprobe_consumer *uc;
 	int remove = UPROBE_HANDLER_REMOVE;
 	bool need_prep = false; /* prepare return uprobe, when needed */
+	bool had_handler = false;
 
 	down_read(&uprobe->register_rwsem);
 	for (uc = uprobe->consumers; uc; uc = uc->next) {
@@ -2115,16 +2116,26 @@ static void handler_chain(struct uprobe
 		if (uc->ret_handler)
 			need_prep = true;
 
+		/*
+		 * A single handler that does not mask out REMOVE, means the
+		 * probe stays.
+		 */
+		had_handler = true;
 		remove &= rc;
 	}
 
+	/*
+	 * If there were no handlers called, nobody asked for it to be removed
+	 * but also nobody got to mask the value. Fix it up.
+	 */
+	if (!had_handler)
+		remove = 0;
+
 	if (need_prep && !remove)
 		prepare_uretprobe(uprobe, regs); /* put bp at return */
 
-	if (remove && uprobe->consumers) {
-		WARN_ON(!uprobe_is_active(uprobe));
+	if (remove)
 		unapply_uprobe(uprobe, current->mm);
-	}
 	up_read(&uprobe->register_rwsem);
 }
 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
                   ` (4 preceding siblings ...)
  2024-07-11 11:02 ` [PATCH v2 05/11] perf/uprobe: Simplify UPROBE_HANDLER_REMOVE logic Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-12 21:06   ` Andrii Nakryiko
  2024-07-11 11:02 ` [PATCH v2 07/11] perf/uprobe: Split uprobe_unregister() Peter Zijlstra
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

With handle_swbp() hitting concurrently on (all) CPUs the
uprobe->register_rwsem can get very contended. Add an SRCU instance to
cover the consumer list and consumer lifetime.

Since the consumers are externally embedded structures, unregister will
have to suffer a synchronize_srcu().

A notable complication is the UPROBE_HANDLER_REMOVE logic, which can
race against uprobe_register() such that it might want to remove a
freshly installed handler that didn't get called. In order to close
this hole, a seqcount is added. With that, the removal path can tell
if anything changed and bail out of the removal.
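
The hypothetical interleaving this closes:

  CPU0 (handler_chain)                 CPU1 (uprobe_register)
  seq = raw_read_seqcount_begin();
  iterate consumers -> REMOVE
                                       raw_write_seqcount_begin();
                                       consumer_add(uc); /* wants probe */
                                       register_for_each_vma();
                                       raw_write_seqcount_end();
  guard(rwsem_read)(&register_rwsem);
  read_seqcount_retry() -> true        /* changed; skip the removal */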

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/uprobes.c |   60 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 10 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -26,6 +26,7 @@
 #include <linux/task_work.h>
 #include <linux/shmem_fs.h>
 #include <linux/khugepaged.h>
+#include <linux/srcu.h>
 
 #include <linux/uprobes.h>
 
@@ -49,6 +50,11 @@ static struct mutex uprobes_mmap_mutex[U
 
 DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
+/*
+ * Covers uprobe->consumers lifetime.
+ */
+DEFINE_STATIC_SRCU(uprobes_srcu);
+
 /* Have a copy of original instruction */
 #define UPROBE_COPY_INSN	0
 
@@ -57,6 +63,7 @@ struct uprobe {
 	refcount_t		ref;
 	struct rcu_head		rcu;
 	struct rw_semaphore	register_rwsem;
+	seqcount_t		register_seq;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
 	struct uprobe_consumer	*consumers;
@@ -760,6 +767,7 @@ static struct uprobe *alloc_uprobe(struc
 	uprobe->offset = offset;
 	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
+	seqcount_init(&uprobe->register_seq);
 	init_rwsem(&uprobe->consumer_rwsem);
 
 	/* add to uprobes_tree, sorted on inode:offset */
@@ -782,8 +790,8 @@ static struct uprobe *alloc_uprobe(struc
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	down_write(&uprobe->consumer_rwsem);
-	uc->next = uprobe->consumers;
-	uprobe->consumers = uc;
+	WRITE_ONCE(uc->next, uprobe->consumers);
+	rcu_assign_pointer(uprobe->consumers, uc);
 	up_write(&uprobe->consumer_rwsem);
 }
 
@@ -800,7 +808,7 @@ static bool consumer_del(struct uprobe *
 	down_write(&uprobe->consumer_rwsem);
 	for (con = &uprobe->consumers; *con; con = &(*con)->next) {
 		if (*con == uc) {
-			*con = uc->next;
+			WRITE_ONCE(*con, uc->next);
 			ret = true;
 			break;
 		}
@@ -1139,9 +1147,13 @@ void uprobe_unregister(struct inode *ino
 		return;
 
 	down_write(&uprobe->register_rwsem);
+	raw_write_seqcount_begin(&uprobe->register_seq);
 	__uprobe_unregister(uprobe, uc);
+	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
 	put_uprobe(uprobe);
+
+	synchronize_srcu(&uprobes_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
@@ -1204,10 +1216,12 @@ static int __uprobe_register(struct inod
 	down_write(&uprobe->register_rwsem);
 	ret = -EAGAIN;
 	if (likely(uprobe_is_active(uprobe))) {
+		raw_write_seqcount_begin(&uprobe->register_seq);
 		consumer_add(uprobe, uc);
 		ret = register_for_each_vma(uprobe, uc);
 		if (ret)
 			__uprobe_unregister(uprobe, uc);
+		raw_write_seqcount_end(&uprobe->register_seq);
 	}
 	up_write(&uprobe->register_rwsem);
 	put_uprobe(uprobe);
@@ -1250,10 +1264,12 @@ int uprobe_apply(struct inode *inode, lo
 		return ret;
 
 	down_write(&uprobe->register_rwsem);
+	raw_write_seqcount_begin(&uprobe->register_seq);
 	for (con = uprobe->consumers; con && con != uc ; con = con->next)
 		;
 	if (con)
 		ret = register_for_each_vma(uprobe, add ? uc : NULL);
+	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
 	put_uprobe(uprobe);
 
@@ -2096,15 +2112,23 @@ static struct uprobe *find_active_uprobe
 	return uprobe;
 }
 
+#define for_each_consumer_rcu(pos, head) \
+	for (pos = rcu_dereference_raw(head); pos; \
+	     pos = rcu_dereference_raw(pos->next))
+
 static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 {
 	struct uprobe_consumer *uc;
 	int remove = UPROBE_HANDLER_REMOVE;
 	bool need_prep = false; /* prepare return uprobe, when needed */
 	bool had_handler = false;
+	unsigned int seq;
 
-	down_read(&uprobe->register_rwsem);
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+	guard(srcu)(&uprobes_srcu);
+
+	seq = raw_read_seqcount_begin(&uprobe->register_seq);
+
+	for_each_consumer_rcu(uc, uprobe->consumers) {
 		int rc = 0;
 
 		if (uc->handler) {
@@ -2134,9 +2158,25 @@ static void handler_chain(struct uprobe
 	if (need_prep && !remove)
 		prepare_uretprobe(uprobe, regs); /* put bp at return */
 
-	if (remove)
+	if (remove) {
+		/*
+		 * Removing uprobes is a slow path, after all, the more probes
+		 * you remove, the less probe hits you get.
+		 *
+		 * This needs to serialize against uprobe_register(), such that
+		 * if the above RCU iteration missed a new handler that
+		 * would've liked to keep the probe, we don't go uninstall the
+		 * probe after it already ran register_for_each_vma().
+		 *
+		 * The rwsem ensures exclusivity against uprobe_register()
+		 * while the seqcount will avoid the removal if anything has
+		 * changed since we started.
+		 */
+		guard(rwsem_read)(&uprobe->register_rwsem);
+		if (read_seqcount_retry(&uprobe->register_seq, seq))
+			return;
 		unapply_uprobe(uprobe, current->mm);
-	up_read(&uprobe->register_rwsem);
+	}
 }
 
 static void
@@ -2145,12 +2185,12 @@ handle_uretprobe_chain(struct return_ins
 	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
 
-	down_read(&uprobe->register_rwsem);
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+	guard(srcu)(&uprobes_srcu);
+
+	for_each_consumer_rcu(uc, uprobe->consumers) {
 		if (uc->ret_handler)
 			uc->ret_handler(uc, ri->func, regs);
 	}
-	up_read(&uprobe->register_rwsem);
 }
 
 static struct return_instance *find_next_ret_chain(struct return_instance *ri)



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 07/11] perf/uprobe: Split uprobe_unregister()
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
                   ` (5 preceding siblings ...)
  2024-07-11 11:02 ` [PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-12 21:10   ` Andrii Nakryiko
  2024-07-11 11:02 ` [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU Peter Zijlstra
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

With uprobe_unregister() having grown a synchronize_srcu(), it becomes
fairly slow to call. Esp. since both users of this API call it in a
loop.

Peel off the sync_srcu() and do it once, after the loop.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 include/linux/uprobes.h     |    8 ++++++--
 kernel/events/uprobes.c     |    8 ++++++--
 kernel/trace/bpf_trace.c    |    6 ++++--
 kernel/trace/trace_uprobe.c |    6 +++++-
 4 files changed, 21 insertions(+), 7 deletions(-)

--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -113,7 +113,8 @@ extern int uprobe_write_opcode(struct ar
 extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
 extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern void uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern void uprobe_unregister_sync(void);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -163,7 +164,10 @@ uprobe_apply(struct inode *inode, loff_t
 	return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+{
+}
+static inline void uprobe_unregister_sync(void)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1138,7 +1138,7 @@ __uprobe_unregister(struct uprobe *uprob
  * @offset: offset from the start of the file.
  * @uc: identify which probe if multiple probes are colocated.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+void uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
 {
 	struct uprobe *uprobe;
 
@@ -1152,10 +1152,14 @@ void uprobe_unregister(struct inode *ino
 	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
 	put_uprobe(uprobe);
+}
+EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
 
+void uprobe_unregister_sync(void)
+{
 	synchronize_srcu(&uprobes_srcu);
 }
-EXPORT_SYMBOL_GPL(uprobe_unregister);
+EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
 
 /*
  * __uprobe_register - register a probe
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3181,9 +3181,11 @@ static void bpf_uprobe_unregister(struct
 	u32 i;
 
 	for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
-				  &uprobes[i].consumer);
+		uprobe_unregister_nosync(d_real_inode(path->dentry), uprobes[i].offset,
+					 &uprobes[i].consumer);
 	}
+	if (cnt)
+		uprobe_unregister_sync();
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1104,6 +1104,7 @@ static int trace_uprobe_enable(struct tr
 static void __probe_event_disable(struct trace_probe *tp)
 {
 	struct trace_uprobe *tu;
+	bool sync = false;
 
 	tu = container_of(tp, struct trace_uprobe, tp);
 	WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
@@ -1112,9 +1113,12 @@ static void __probe_event_disable(struct
 		if (!tu->inode)
 			continue;
 
-		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
+		uprobe_unregister_nosync(tu->inode, tu->offset, &tu->consumer);
+		sync = true;
 		tu->inode = NULL;
 	}
+	if (sync)
+		uprobe_unregister_sync();
 }
 
 static int probe_event_enable(struct trace_event_call *call,



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
                   ` (6 preceding siblings ...)
  2024-07-11 11:02 ` [PATCH v2 07/11] perf/uprobe: Split uprobe_unregister() Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-11 14:03   ` Jiri Olsa
  2024-07-12 21:21   ` Andrii Nakryiko
  2024-07-11 11:02 ` [PATCH v2 09/11] srcu: Add __srcu_clone_read_lock() Peter Zijlstra
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

With handle_swbp() hitting concurrently on (all) CPUs, potentially on
the same uprobe, the uprobe->refcount can get *very* hot. Move the
struct uprobe lifetime into uprobes_srcu such that it covers both the
uprobe and the uprobe->consumers list.

With this, handle_swbp() can use a single large SRCU critical section
to avoid taking a refcount on the uprobe for its duration.

Notably, the single-step and uretprobe paths need a reference that
leaves handle_swbp() and will, for now, still use ->refcount.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/uprobes.c |   68 ++++++++++++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 27 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -51,7 +51,7 @@ static struct mutex uprobes_mmap_mutex[U
 DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
 /*
- * Covers uprobe->consumers lifetime.
+ * Covers uprobe->consumers lifetime as well as struct uprobe.
  */
 DEFINE_STATIC_SRCU(uprobes_srcu);
 
@@ -626,7 +626,7 @@ static void put_uprobe(struct uprobe *up
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		call_rcu(&uprobe->rcu, uprobe_free_rcu);
+		call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
 	}
 }
 
@@ -678,7 +678,7 @@ static struct uprobe *__find_uprobe(stru
 	struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return try_get_uprobe(__node_2_uprobe(node));
+		return __node_2_uprobe(node);
 
 	return NULL;
 }
@@ -691,7 +691,7 @@ static struct uprobe *find_uprobe(struct
 {
 	unsigned int seq;
 
-	guard(rcu)();
+	lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
 
 	do {
 		seq = read_seqcount_begin(&uprobes_seqcount);
@@ -1142,6 +1142,8 @@ void uprobe_unregister_nosync(struct ino
 {
 	struct uprobe *uprobe;
 
+	guard(srcu)(&uprobes_srcu);
+
 	uprobe = find_uprobe(inode, offset);
 	if (WARN_ON(!uprobe))
 		return;
@@ -1151,7 +1153,6 @@ void uprobe_unregister_nosync(struct ino
 	__uprobe_unregister(uprobe, uc);
 	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
 
@@ -1263,6 +1264,8 @@ int uprobe_apply(struct inode *inode, lo
 	struct uprobe_consumer *con;
 	int ret = -ENOENT;
 
+	guard(srcu)(&uprobes_srcu);
+
 	uprobe = find_uprobe(inode, offset);
 	if (WARN_ON(!uprobe))
 		return ret;
@@ -1275,7 +1278,6 @@ int uprobe_apply(struct inode *inode, lo
 		ret = register_for_each_vma(uprobe, add ? uc : NULL);
 	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 
 	return ret;
 }
@@ -1929,10 +1931,14 @@ static void prepare_uretprobe(struct upr
 	if (!ri)
 		return;
 
+	ri->uprobe = try_get_uprobe(uprobe);
+	if (!ri->uprobe)
+		goto err_mem;
+
 	trampoline_vaddr = get_trampoline_vaddr();
 	orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
 	if (orig_ret_vaddr == -1)
-		goto fail;
+		goto err_uprobe;
 
 	/* drop the entries invalidated by longjmp() */
 	chained = (orig_ret_vaddr == trampoline_vaddr);
@@ -1950,12 +1956,11 @@ static void prepare_uretprobe(struct upr
 			 * attack from user-space.
 			 */
 			uprobe_warn(current, "handle tail call");
-			goto fail;
+			goto err_uprobe;
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
 
-	ri->uprobe = get_uprobe(uprobe);
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
@@ -1966,7 +1971,10 @@ static void prepare_uretprobe(struct upr
 	utask->return_instances = ri;
 
 	return;
-fail:
+
+err_uprobe:
+	put_uprobe(ri->uprobe);
+err_mem:
 	kfree(ri);
 }
 
@@ -1982,22 +1990,31 @@ pre_ssout(struct uprobe *uprobe, struct
 	if (!utask)
 		return -ENOMEM;
 
+	utask->active_uprobe = try_get_uprobe(uprobe);
+	if (!utask->active_uprobe)
+		return -ESRCH;
+
 	xol_vaddr = xol_get_insn_slot(uprobe);
-	if (!xol_vaddr)
-		return -ENOMEM;
+	if (!xol_vaddr) {
+		err = -ENOMEM;
+		goto err_uprobe;
+	}
 
 	utask->xol_vaddr = xol_vaddr;
 	utask->vaddr = bp_vaddr;
 
 	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
-	if (unlikely(err)) {
-		xol_free_insn_slot(current);
-		return err;
-	}
+	if (unlikely(err))
+		goto err_xol;
 
-	utask->active_uprobe = uprobe;
 	utask->state = UTASK_SSTEP;
 	return 0;
+
+err_xol:
+	xol_free_insn_slot(current);
+err_uprobe:
+	put_uprobe(utask->active_uprobe);
+	return err;
 }
 
 /*
@@ -2128,7 +2145,7 @@ static void handler_chain(struct uprobe
 	bool had_handler = false;
 	unsigned int seq;
 
-	guard(srcu)(&uprobes_srcu);
+	lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
 
 	seq = raw_read_seqcount_begin(&uprobe->register_seq);
 
@@ -2276,6 +2293,8 @@ static void handle_swbp(struct pt_regs *
 	if (bp_vaddr == get_trampoline_vaddr())
 		return handle_trampoline(regs);
 
+	guard(srcu)(&uprobes_srcu);
+
 	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
 	if (!uprobe) {
 		if (is_swbp > 0) {
@@ -2304,7 +2323,7 @@ static void handle_swbp(struct pt_regs *
 	 * new and not-yet-analyzed uprobe at the same address, restart.
 	 */
 	if (unlikely(!test_bit(UPROBE_COPY_INSN, &uprobe->flags)))
-		goto out;
+		return;
 
 	/*
 	 * Pairs with the smp_wmb() in prepare_uprobe().
@@ -2317,22 +2336,17 @@ static void handle_swbp(struct pt_regs *
 
 	/* Tracing handlers use ->utask to communicate with fetch methods */
 	if (!get_utask())
-		goto out;
+		return;
 
 	if (arch_uprobe_ignore(&uprobe->arch, regs))
-		goto out;
+		return;
 
 	handler_chain(uprobe, regs);
 
 	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
-		goto out;
-
-	if (!pre_ssout(uprobe, regs, bp_vaddr))
 		return;
 
-	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
-out:
-	put_uprobe(uprobe);
+	pre_ssout(uprobe, regs, bp_vaddr);
 }
 
 /*



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 09/11] srcu: Add __srcu_clone_read_lock()
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
                   ` (7 preceding siblings ...)
  2024-07-11 11:02 ` [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-11 11:02 ` [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU Peter Zijlstra
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

In order to support carrying an srcu_read_lock() section across fork,
where both the parent and the child process will do srcu_read_unlock(),
the extra decrement must be accounted for with an extra increment at
fork time.
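
Schematically (a sketch, not from the patch):

  parent: idx = __srcu_read_lock(ssp);		/* count[idx]++ */
  fork(): __srcu_clone_read_lock(ssp, idx);	/* count[idx]++ for the child */
  parent: __srcu_read_unlock(ssp, idx);		/* count[idx]-- */
  child:  __srcu_read_unlock(ssp, idx);		/* count[idx]-- */

Without the clone, the child's unlock would drop a count that was never
taken and the grace period accounting would underflow.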

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/srcu.h     |    1 +
 include/linux/srcutiny.h |   10 ++++++++++
 kernel/rcu/srcutree.c    |    5 +++++
 3 files changed, 16 insertions(+)

--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -55,6 +55,7 @@ void call_srcu(struct srcu_struct *ssp,
 		void (*func)(struct rcu_head *head));
 void cleanup_srcu_struct(struct srcu_struct *ssp);
 int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp);
+void __srcu_clone_read_lock(struct srcu_struct *ssp, int idx);
 void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
 void synchronize_srcu(struct srcu_struct *ssp);
 unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp);
--- a/include/linux/srcutiny.h
+++ b/include/linux/srcutiny.h
@@ -71,6 +71,16 @@ static inline int __srcu_read_lock(struc
 	return idx;
 }
 
+static inline void __srcu_clone_read_lock(struct srcu_struct *ssp, int idx)
+{
+	int newval;
+
+	preempt_disable();  // Needed for PREEMPT_AUTO
+	newval = READ_ONCE(ssp->srcu_lock_nesting[idx]) + 1;
+	WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval);
+	preempt_enable();
+}
+
 static inline void synchronize_srcu_expedited(struct srcu_struct *ssp)
 {
 	synchronize_srcu(ssp);
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -720,6 +720,11 @@ int __srcu_read_lock(struct srcu_struct
 }
 EXPORT_SYMBOL_GPL(__srcu_read_lock);
 
+void __srcu_clone_read_lock(struct srcu_struct *ssp, int idx)
+{
+	this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter);
+}
+
 /*
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
                   ` (8 preceding siblings ...)
  2024-07-11 11:02 ` [PATCH v2 09/11] srcu: Add __srcu_clone_read_lock() Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-11 16:06   ` Oleg Nesterov
  2024-07-12 21:28   ` Andrii Nakryiko
  2024-07-11 11:02 ` [PATCH v2 11/11] perf/uprobe: Add uretprobe timer Peter Zijlstra
  2024-07-12  4:57 ` [PATCH v2 00/11] perf/uprobe: Optimize uprobes Andrii Nakryiko
  11 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

Both single-step and uretprobes take a refcount on struct uprobe in
handle_swbp() in order to ensure struct uprobe stays extant until a
next trap.

Since uprobe_unregister() only cares about the uprobe_consumer
life-time, and these intra-trap sections can be arbitrarily large,
create a second SRCU domain to cover these.

Notably, a uretprobe with a registered return_instance that never
triggers -- because userspace -- will currently pin the
return_instance and related uprobe until the task dies. With this
conversion to SRCU, this behaviour will inhibit freeing of all uprobes.
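
The resulting free path thus chains both grace periods (as implemented
below):

  put_uprobe()					/* ref drops to 0 */
    call_srcu(&uprobes_srcu, ..., uprobe_free_stage1);
      /* uprobes_srcu GP: handler/consumer readers are done */
      call_srcu(&uretprobes_srcu, ..., uprobe_free_stage2);
        /* uretprobes_srcu GP: single-step/uretprobe readers are done */
        kfree(uprobe);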

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/uprobes.h |    2 +
 kernel/events/uprobes.c |   60 +++++++++++++++++++++++-------------------------
 2 files changed, 31 insertions(+), 31 deletions(-)

--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -78,6 +78,7 @@ struct uprobe_task {
 
 	struct return_instance		*return_instances;
 	unsigned int			depth;
+	unsigned int			active_srcu_idx;
 };
 
 struct return_instance {
@@ -86,6 +87,7 @@ struct return_instance {
 	unsigned long		stack;		/* stack pointer */
 	unsigned long		orig_ret_vaddr; /* original return address */
 	bool			chained;	/* true, if instance is nested */
+	int			srcu_idx;
 
 	struct return_instance	*next;		/* keep as stack */
 };
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -54,6 +54,15 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem)
  * Covers uprobe->consumers lifetime as well as struct uprobe.
  */
 DEFINE_STATIC_SRCU(uprobes_srcu);
+/*
+ * Covers return_instance->uprobe and utask->active_uprobe. Separate from
+ * uprobes_srcu because uprobe_unregister() doesn't need to wait for this
+ * and these lifetimes can be fairly long.
+ *
+ * Notably, these sections span userspace and as such use
+ * __srcu_read_{,un}lock() to elide lockdep.
+ */
+DEFINE_STATIC_SRCU(uretprobes_srcu);
 
 /* Have a copy of original instruction */
 #define UPROBE_COPY_INSN	0
@@ -596,25 +605,24 @@ set_orig_insn(struct arch_uprobe *auprob
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
-static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
-{
-	if (refcount_inc_not_zero(&uprobe->ref))
-		return uprobe;
-	return NULL;
-}
-
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
 	refcount_inc(&uprobe->ref);
 	return uprobe;
 }
 
-static void uprobe_free_rcu(struct rcu_head *rcu)
+static void uprobe_free_stage2(struct rcu_head *rcu)
 {
 	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
 	kfree(uprobe);
 }
 
+static void uprobe_free_stage1(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+	call_srcu(&uretprobes_srcu, &uprobe->rcu, uprobe_free_stage2);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
 	if (refcount_dec_and_test(&uprobe->ref)) {
@@ -626,7 +634,7 @@ static void put_uprobe(struct uprobe *up
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
+		call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_stage1);
 	}
 }
 
@@ -1753,7 +1761,7 @@ unsigned long uprobe_get_trap_addr(struc
 static struct return_instance *free_ret_instance(struct return_instance *ri)
 {
 	struct return_instance *next = ri->next;
-	put_uprobe(ri->uprobe);
+	__srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
 	kfree(ri);
 	return next;
 }
@@ -1771,7 +1779,7 @@ void uprobe_free_utask(struct task_struc
 		return;
 
 	if (utask->active_uprobe)
-		put_uprobe(utask->active_uprobe);
+		__srcu_read_unlock(&uretprobes_srcu, utask->active_srcu_idx);
 
 	ri = utask->return_instances;
 	while (ri)
@@ -1814,7 +1822,7 @@ static int dup_utask(struct task_struct
 			return -ENOMEM;
 
 		*n = *o;
-		get_uprobe(n->uprobe);
+		__srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
 		n->next = NULL;
 
 		*p = n;
@@ -1931,14 +1939,10 @@ static void prepare_uretprobe(struct upr
 	if (!ri)
 		return;
 
-	ri->uprobe = try_get_uprobe(uprobe);
-	if (!ri->uprobe)
-		goto err_mem;
-
 	trampoline_vaddr = get_trampoline_vaddr();
 	orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
 	if (orig_ret_vaddr == -1)
-		goto err_uprobe;
+		goto err_mem;
 
 	/* drop the entries invalidated by longjmp() */
 	chained = (orig_ret_vaddr == trampoline_vaddr);
@@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct upr
 			 * attack from user-space.
 			 */
 			uprobe_warn(current, "handle tail call");
-			goto err_uprobe;
+			goto err_mem;
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
 
+	ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu);
+	ri->uprobe = uprobe;
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
@@ -1972,8 +1978,6 @@ static void prepare_uretprobe(struct upr
 
 	return;
 
-err_uprobe:
-	put_uprobe(ri->uprobe);
 err_mem:
 	kfree(ri);
 }
@@ -1990,15 +1994,9 @@ pre_ssout(struct uprobe *uprobe, struct
 	if (!utask)
 		return -ENOMEM;
 
-	utask->active_uprobe = try_get_uprobe(uprobe);
-	if (!utask->active_uprobe)
-		return -ESRCH;
-
 	xol_vaddr = xol_get_insn_slot(uprobe);
-	if (!xol_vaddr) {
-		err = -ENOMEM;
-		goto err_uprobe;
-	}
+	if (!xol_vaddr)
+		return -ENOMEM;
 
 	utask->xol_vaddr = xol_vaddr;
 	utask->vaddr = bp_vaddr;
@@ -2007,13 +2005,13 @@ pre_ssout(struct uprobe *uprobe, struct
 	if (unlikely(err))
 		goto err_xol;
 
+	utask->active_srcu_idx = __srcu_read_lock(&uretprobes_srcu);
+	utask->active_uprobe = uprobe;
 	utask->state = UTASK_SSTEP;
 	return 0;
 
 err_xol:
 	xol_free_insn_slot(current);
-err_uprobe:
-	put_uprobe(utask->active_uprobe);
 	return err;
 }
 
@@ -2366,7 +2364,7 @@ static void handle_singlestep(struct upr
 	else
 		WARN_ON_ONCE(1);
 
-	put_uprobe(uprobe);
+	__srcu_read_unlock(&uretprobes_srcu, utask->active_srcu_idx);
 	utask->active_uprobe = NULL;
 	utask->state = UTASK_RUNNING;
 	xol_free_insn_slot(current);



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
                   ` (9 preceding siblings ...)
  2024-07-11 11:02 ` [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU Peter Zijlstra
@ 2024-07-11 11:02 ` Peter Zijlstra
  2024-07-11 13:19   ` Oleg Nesterov
  2024-07-12 21:43   ` Andrii Nakryiko
  2024-07-12  4:57 ` [PATCH v2 00/11] perf/uprobe: Optimize uprobes Andrii Nakryiko
  11 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 11:02 UTC (permalink / raw)
  To: mingo, andrii, oleg
  Cc: linux-kernel, linux-trace-kernel, peterz, rostedt, mhiramat,
	jolsa, clm, paulmck

In order to put a bound on the uretprobes_srcu critical section, add a
timer to uprobe_task. Upon every RI being added or removed, the timer is
pushed forward to now + 1s. If the timer were ever to fire, it would
convert the SRCU 'reference' to a refcount reference if possible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/uprobes.h |    8 +++++
 kernel/events/uprobes.c |   67 ++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 69 insertions(+), 6 deletions(-)

--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -15,6 +15,7 @@
 #include <linux/rbtree.h>
 #include <linux/types.h>
 #include <linux/wait.h>
+#include <linux/timer.h>
 
 struct vm_area_struct;
 struct mm_struct;
@@ -79,6 +80,10 @@ struct uprobe_task {
 	struct return_instance		*return_instances;
 	unsigned int			depth;
 	unsigned int			active_srcu_idx;
+
+	struct timer_list		ri_timer;
+	struct callback_head		ri_task_work;
+	struct task_struct		*task;
 };
 
 struct return_instance {
@@ -86,7 +91,8 @@ struct return_instance {
 	unsigned long		func;
 	unsigned long		stack;		/* stack pointer */
 	unsigned long		orig_ret_vaddr; /* original return address */
-	bool			chained;	/* true, if instance is nested */
+	u8			chained;	/* true, if instance is nested */
+	u8			has_ref;
 	int			srcu_idx;
 
 	struct return_instance	*next;		/* keep as stack */
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1761,7 +1761,12 @@ unsigned long uprobe_get_trap_addr(struc
 static struct return_instance *free_ret_instance(struct return_instance *ri)
 {
 	struct return_instance *next = ri->next;
-	__srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
+	if (ri->uprobe) {
+		if (ri->has_ref)
+			put_uprobe(ri->uprobe);
+		else
+			__srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
+	}
 	kfree(ri);
 	return next;
 }
@@ -1785,11 +1790,48 @@ void uprobe_free_utask(struct task_struc
 	while (ri)
 		ri = free_ret_instance(ri);
 
+	timer_delete_sync(&utask->ri_timer);
+	task_work_cancel(utask->task, &utask->ri_task_work);
 	xol_free_insn_slot(t);
 	kfree(utask);
 	t->utask = NULL;
 }
 
+static void return_instance_task_work(struct callback_head *head)
+{
+	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
+	struct return_instance *ri;
+
+	for (ri = utask->return_instances; ri; ri = ri->next) {
+		if (!ri->uprobe)
+			continue;
+		if (ri->has_ref)
+			continue;
+		if (refcount_inc_not_zero(&ri->uprobe->ref))
+			ri->has_ref = true;
+		else
+			ri->uprobe = NULL;
+		__srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
+	}
+}
+
+static void return_instance_timer(struct timer_list *timer)
+{
+	struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
+	task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
+}
+
+static struct uprobe_task *alloc_utask(struct task_struct *task)
+{
+	struct uprobe_task *utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+	if (!utask)
+		return NULL;
+	timer_setup(&utask->ri_timer, return_instance_timer, 0);
+	init_task_work(&utask->ri_task_work, return_instance_task_work);
+	utask->task = task;
+	return utask;
+}
+
 /*
  * Allocate a uprobe_task object for the task if necessary.
  * Called when the thread hits a breakpoint.
@@ -1801,7 +1843,7 @@ void uprobe_free_utask(struct task_struc
 static struct uprobe_task *get_utask(void)
 {
 	if (!current->utask)
-		current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+		current->utask = alloc_utask(current);
 	return current->utask;
 }
 
@@ -1810,7 +1852,7 @@ static int dup_utask(struct task_struct
 	struct uprobe_task *n_utask;
 	struct return_instance **p, *o, *n;
 
-	n_utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+	n_utask = alloc_utask(t);
 	if (!n_utask)
 		return -ENOMEM;
 	t->utask = n_utask;
@@ -1822,13 +1864,20 @@ static int dup_utask(struct task_struct
 			return -ENOMEM;
 
 		*n = *o;
-		__srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
+		if (n->uprobe) {
+			if (n->has_ref)
+				get_uprobe(n->uprobe);
+			else
+				__srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
+		}
 		n->next = NULL;
 
 		*p = n;
 		p = &n->next;
 		n_utask->depth++;
 	}
+	if (n_utask->return_instances)
+		mod_timer(&n_utask->ri_timer, jiffies + HZ);
 
 	return 0;
 }
@@ -1967,6 +2016,7 @@ static void prepare_uretprobe(struct upr
 
 	ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu);
 	ri->uprobe = uprobe;
+	ri->has_ref = 0;
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
@@ -1976,6 +2026,8 @@ static void prepare_uretprobe(struct upr
 	ri->next = utask->return_instances;
 	utask->return_instances = ri;
 
+	mod_timer(&utask->ri_timer, jiffies + HZ);
+
 	return;
 
 err_mem:
@@ -2204,6 +2256,9 @@ handle_uretprobe_chain(struct return_ins
 	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
 
+	if (!uprobe)
+		return;
+
 	guard(srcu)(&uprobes_srcu);
 
 	for_each_consumer_rcu(uc, uprobe->consumers) {
@@ -2250,8 +2305,10 @@ static void handle_trampoline(struct pt_
 
 		instruction_pointer_set(regs, ri->orig_ret_vaddr);
 		do {
-			if (valid)
+			if (valid) {
 				handle_uretprobe_chain(ri, regs);
+				mod_timer(&utask->ri_timer, jiffies + HZ);
+			}
 			ri = free_ret_instance(ri);
 			utask->depth--;
 		} while (ri != next);



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/11] perf/uprobe: Re-indent labels
  2024-07-11 11:02 ` [PATCH v2 01/11] perf/uprobe: Re-indent labels Peter Zijlstra
@ 2024-07-11 11:58   ` Jiri Olsa
  2024-07-11 12:07     ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Jiri Olsa @ 2024-07-11 11:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, clm, paulmck

On Thu, Jul 11, 2024 at 01:02:36PM +0200, Peter Zijlstra wrote:

SNIP

> @@ -1159,7 +1159,7 @@ static int __uprobe_register(struct inod
>  	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
>  		return -EINVAL;
>  
> - retry:
> +retry:
>  	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
>  	if (!uprobe)
>  		return -ENOMEM;
> @@ -1468,7 +1468,7 @@ static int xol_add_vma(struct mm_struct
>  	ret = 0;
>  	/* pairs with get_xol_area() */
>  	smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
> - fail:
> +fail:
>  	mmap_write_unlock(mm);
>  
>  	return ret;
> @@ -1512,7 +1512,7 @@ static struct xol_area *__create_xol_are
>  	kfree(area->bitmap);
>   free_area:

hi,
missed this one and another one few lines before that ;-)

jirka

>  	kfree(area);
> - out:
> +out:
>  	return NULL;
>  }
>  
> @@ -1915,7 +1915,7 @@ static void prepare_uretprobe(struct upr
>  	utask->return_instances = ri;
>  
>  	return;
> - fail:
> +fail:
>  	kfree(ri);
>  }
>  
> @@ -2031,7 +2031,7 @@ static int is_trap_at_addr(struct mm_str
>  
>  	copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
>  	put_page(page);
> - out:
> +out:
>  	/* This needs to return true for any variant of the trap insn */
>  	return is_trap_insn(&opcode);
>  }
> @@ -2159,7 +2159,7 @@ static void handle_trampoline(struct pt_
>  	utask->return_instances = ri;
>  	return;
>  
> - sigill:
> +sigill:
>  	uprobe_warn(current, "handle uretprobe, sending SIGILL.");
>  	force_sig(SIGILL);
>  
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/11] perf/uprobe: Re-indent labels
  2024-07-11 11:58   ` Jiri Olsa
@ 2024-07-11 12:07     ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 12:07 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, clm, paulmck

On Thu, Jul 11, 2024 at 01:58:04PM +0200, Jiri Olsa wrote:
> On Thu, Jul 11, 2024 at 01:02:36PM +0200, Peter Zijlstra wrote:
> 
> SNIP
> 
> > @@ -1159,7 +1159,7 @@ static int __uprobe_register(struct inod
> >  	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> >  		return -EINVAL;
> >  
> > - retry:
> > +retry:
> >  	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> >  	if (!uprobe)
> >  		return -ENOMEM;
> > @@ -1468,7 +1468,7 @@ static int xol_add_vma(struct mm_struct
> >  	ret = 0;
> >  	/* pairs with get_xol_area() */
> >  	smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
> > - fail:
> > +fail:
> >  	mmap_write_unlock(mm);
> >  
> >  	return ret;
> > @@ -1512,7 +1512,7 @@ static struct xol_area *__create_xol_are
> >  	kfree(area->bitmap);
> >   free_area:
> 
> hi,
> missed this one and another one few lines before that ;-)

Bah, _ isn't in [[:alnum:]]. I'll go fix.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
  2024-07-11 11:02 ` [PATCH v2 11/11] perf/uprobe: Add uretprobe timer Peter Zijlstra
@ 2024-07-11 13:19   ` Oleg Nesterov
  2024-07-11 15:00     ` Peter Zijlstra
  2024-07-12 21:43   ` Andrii Nakryiko
  1 sibling, 1 reply; 45+ messages in thread
From: Oleg Nesterov @ 2024-07-11 13:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

Not sure I read this patch correctly, but at first glance it looks
suspicious..

On 07/11, Peter Zijlstra wrote:
>
> +static void return_instance_timer(struct timer_list *timer)
> +{
> +	struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
> +	task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
> +}

What if utask->task sleeps in TASK_STOPPED/TASK_TRACED state before
return from the ret-probed function?

In this case it won't react to TWA_SIGNAL until debugger or SIGCONT
wakes it up.

---------------------------------------------------------------------------
And it seems that even task_work_add() itself is not safe...

Suppose we have 2 ret-probed functions

	void f2() { ... }
	void f1() { ...; f2(); }

A task T calls f1(), hits the bp, and calls prepare_uretprobe() which does

	mod_timer(&utask->ri_timer, jiffies + HZ);

Then later it calls f2() and the pending timer expires after it enters the
kernel, but before the next prepare_uretprobe() -> mod_timer().

In this case ri_task_work is already queued and the timer is pending again.

Now. Even if T goes to the exit_to_user_mode_loop() path immediately, in
theory nothing can guarantee that it will call get_signal/task_work_run
in less than 1 second; it can be preempted.

But T can sleep in xol_take_insn_slot() before return from handle_swbp(),
and this is not so theoretical.

Oleg.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe()
  2024-07-11 11:02 ` [PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe() Peter Zijlstra
@ 2024-07-11 13:59   ` Masami Hiramatsu
  0 siblings, 0 replies; 45+ messages in thread
From: Masami Hiramatsu @ 2024-07-11 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, 11 Jul 2024 13:02:39 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> With handle_swbp() triggering concurrently on (all) CPUs, tree_lock
> becomes a bottleneck. Avoid treelock by doing RCU lookups of the
> uprobe.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks,

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/events/uprobes.c |   49 +++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 40 insertions(+), 9 deletions(-)
> 
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_
>  #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
>  
>  static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>  
>  #define UPROBES_HASH_SZ	13
>  /* serialize uprobe->pending_list */
> @@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem)
>  struct uprobe {
>  	struct rb_node		rb_node;	/* node in the rb tree */
>  	refcount_t		ref;
> +	struct rcu_head		rcu;
>  	struct rw_semaphore	register_rwsem;
>  	struct rw_semaphore	consumer_rwsem;
>  	struct list_head	pending_list;
> @@ -587,12 +589,25 @@ set_orig_insn(struct arch_uprobe *auprob
>  			*(uprobe_opcode_t *)&auprobe->insn);
>  }
>  
> +static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
> +{
> +	if (refcount_inc_not_zero(&uprobe->ref))
> +		return uprobe;
> +	return NULL;
> +}
> +
>  static struct uprobe *get_uprobe(struct uprobe *uprobe)
>  {
>  	refcount_inc(&uprobe->ref);
>  	return uprobe;
>  }
>  
> +static void uprobe_free_rcu(struct rcu_head *rcu)
> +{
> +	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> +	kfree(uprobe);
> +}
> +
>  static void put_uprobe(struct uprobe *uprobe)
>  {
>  	if (refcount_dec_and_test(&uprobe->ref)) {
> @@ -604,7 +619,7 @@ static void put_uprobe(struct uprobe *up
>  		mutex_lock(&delayed_uprobe_lock);
>  		delayed_uprobe_remove(uprobe, NULL);
>  		mutex_unlock(&delayed_uprobe_lock);
> -		kfree(uprobe);
> +		call_rcu(&uprobe->rcu, uprobe_free_rcu);
>  	}
>  }
>  
> @@ -653,10 +668,10 @@ static struct uprobe *__find_uprobe(stru
>  		.inode = inode,
>  		.offset = offset,
>  	};
> -	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
> +	struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
>  
>  	if (node)
> -		return get_uprobe(__node_2_uprobe(node));
> +		return try_get_uprobe(__node_2_uprobe(node));
>  
>  	return NULL;
>  }
> @@ -667,20 +682,32 @@ static struct uprobe *__find_uprobe(stru
>   */
>  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
>  {
> -	struct uprobe *uprobe;
> +	unsigned int seq;
>  
> -	read_lock(&uprobes_treelock);
> -	uprobe = __find_uprobe(inode, offset);
> -	read_unlock(&uprobes_treelock);
> +	guard(rcu)();
>  
> -	return uprobe;
> +	do {
> +		seq = read_seqcount_begin(&uprobes_seqcount);
> +		struct uprobe *uprobe = __find_uprobe(inode, offset);
> +		if (uprobe) {
> +			/*
> +			 * Lockless RB-tree lookups are prone to false-negatives.
> +			 * If they find something, it's good. If they do not find,
> +			 * it needs to be validated.
> +			 */
> +			return uprobe;
> +		}
> +	} while (read_seqcount_retry(&uprobes_seqcount, seq));
> +
> +	/* Really didn't find anything. */
> +	return NULL;
>  }
>  
>  static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
>  {
>  	struct rb_node *node;
>  
> -	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
> +	node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
>  	if (node)
>  		return get_uprobe(__node_2_uprobe(node));
>  
> @@ -702,7 +729,9 @@ static struct uprobe *insert_uprobe(stru
>  	struct uprobe *u;
>  
>  	write_lock(&uprobes_treelock);
> +	write_seqcount_begin(&uprobes_seqcount);
>  	u = __insert_uprobe(uprobe);
> +	write_seqcount_end(&uprobes_seqcount);
>  	write_unlock(&uprobes_treelock);
>  
>  	return u;
> @@ -936,7 +965,9 @@ static void delete_uprobe(struct uprobe
>  		return;
>  
>  	write_lock(&uprobes_treelock);
> +	write_seqcount_begin(&uprobes_seqcount);
>  	rb_erase(&uprobe->rb_node, &uprobes_tree);
> +	write_seqcount_end(&uprobes_seqcount);
>  	write_unlock(&uprobes_treelock);
>  	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
>  	put_uprobe(uprobe);
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU
  2024-07-11 11:02 ` [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU Peter Zijlstra
@ 2024-07-11 14:03   ` Jiri Olsa
  2024-07-12 21:21   ` Andrii Nakryiko
  1 sibling, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2024-07-11 14:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, clm, paulmck

On Thu, Jul 11, 2024 at 01:02:43PM +0200, Peter Zijlstra wrote:

SNIP

>  	/* Tracing handlers use ->utask to communicate with fetch methods */
>  	if (!get_utask())
> -		goto out;
> +		return;
>  
>  	if (arch_uprobe_ignore(&uprobe->arch, regs))
> -		goto out;
> +		return;
>  
>  	handler_chain(uprobe, regs);
>  
>  	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> -		goto out;
> -
> -	if (!pre_ssout(uprobe, regs, bp_vaddr))
>  		return;
>  
> -	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
> -out:
> -	put_uprobe(uprobe);
> +	pre_ssout(uprobe, regs, bp_vaddr);

pre_ssout could now return void
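
(A sketch of that suggestion -- modulo also converting the internal
error returns, since the sole caller now ignores the value:

  -static int
  +static void
   pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
)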

jirka

>  }
>  
>  /*
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
  2024-07-11 13:19   ` Oleg Nesterov
@ 2024-07-11 15:00     ` Peter Zijlstra
  2024-07-11 15:55       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 15:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, Jul 11, 2024 at 03:19:19PM +0200, Oleg Nesterov wrote:
> Not sure I read this patch correctly, but at first glance it looks
> suspicious..
> 
> On 07/11, Peter Zijlstra wrote:
> >
> > +static void return_instance_timer(struct timer_list *timer)
> > +{
> > +	struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
> > +	task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
> > +}
> 
> What if utask->task sleeps in TASK_STOPPED/TASK_TRACED state before
> return from the ret-probed function?
> 
> In this case it won't react to TWA_SIGNAL until debugger or SIGCONT
> wakes it up.

Or FROZEN etc.. Yeah.

> ---------------------------------------------------------------------------
> And it seems that even task_work_add() itself is not safe...
> 
> Suppose we have 2 ret-probed functions
> 
> 	void f2() { ... }
> 	void f1() { ...; f2(); }
> 
> A task T calls f1(), hits the bp, and calls prepare_uretprobe() which does
> 
> 	mod_timer(&utask->ri_timer, jiffies + HZ);
> 
> Then later it calls f2() and the pending timer expires after it enters the
> kernel, but before the next prepare_uretprobe() -> mod_timer().
> 
> In this case ri_task_work is already queued and the timer is pending again.

You're saying we can hit a double enqueue, right? Yeah, that's a
problem. But that can be fairly easily rectified.

> Now. Even if T goes to the exit_to_user_mode_loop() path immediately, in
> theory nothing can guarantee that it will call get_signal/task_work_run
> in less than 1 second, it can be preempted.
> 
> But T can sleep in xol_take_insn_slot() before return from handle_swbp(),
> and this is not so theoretical.

So the assumption is that kernel code makes forward progress. If we get
preempted, we'll get scheduled again. If the machine is so overloaded
this takes more than a second, stretching the SRCU period is the least
of your problems.

Same with sleeps, it'll get a wakeup.

The only thing that is out of our control is userspace. And yes, I had
not considered STOPPED/TRACED/FROZEN.

So the reason I did that task_work is that the return_instance list
is strictly current, so a random timer cannot safely poke at it. And
barring those pesky states, it does as desired.

Let me ponder that a little, I *can* make it work, but all 'solutions'
I've come up with so far are really rather vile.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
  2024-07-11 15:00     ` Peter Zijlstra
@ 2024-07-11 15:55       ` Peter Zijlstra
  2024-07-11 16:06         ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 15:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, Jul 11, 2024 at 05:00:54PM +0200, Peter Zijlstra wrote:

> Let me ponder that a little, I *can* make it work, but all 'solutions'
> I've come up with so far are really rather vile.

This is the least horrible solution I could come up with...

---
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -83,6 +83,7 @@ struct uprobe_task {
 
 	struct timer_list		ri_timer;
 	struct callback_head		ri_task_work;
+	bool				ri_work_pending;
 	struct task_struct		*task;
 };
 
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1797,9 +1797,8 @@ void uprobe_free_utask(struct task_struc
 	t->utask = NULL;
 }
 
-static void return_instance_task_work(struct callback_head *head)
+static void __return_instance_put(struct uprobe_task *utask)
 {
-	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
 	struct return_instance *ri;
 
 	for (ri = utask->return_instances; ri; ri = ri->next) {
@@ -1815,9 +1814,43 @@ static void return_instance_task_work(st
 	}
 }
 
+static void return_instance_task_work(struct callback_head *head)
+{
+	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
+	utask->ri_work_pending = false;
+	__return_instance_put(utask);
+}
+
+static int return_instance_blocked(struct task_struct *p, void *arg)
+{
+	unsigned int state = READ_ONCE(p->__state);
+
+	if (state == TASK_RUNNING || state == TASK_WAKING)
+		return 0;
+
+	smp_rmb();
+	if (p->on_rq)
+		return 0;
+
+	/*
+	 * Per __task_needs_rq_locked() we now have: !p->on_cpu and only hold
+	 * p->pi_lock, and can consider the task fully blocked.
+	 */
+
+	__return_instance_put(p->utask);
+	return 1;
+}
+
 static void return_instance_timer(struct timer_list *timer)
 {
 	struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
+	if (utask->ri_work_pending)
+		return;
+
+	if (task_call_func(utask->task, return_instance_blocked, NULL))
+		return;
+
+	utask->ri_work_pending = true;
 	task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
 }
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
  2024-07-11 15:55       ` Peter Zijlstra
@ 2024-07-11 16:06         ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 16:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, Jul 11, 2024 at 05:55:42PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2024 at 05:00:54PM +0200, Peter Zijlstra wrote:
> 
> > Let me ponder that a little, I *can* make it work, but all 'solutions'
> > I've come up with so far are really rather vile.
> 
> This is the least horrible solution I could come up with...
> 
> ---
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -83,6 +83,7 @@ struct uprobe_task {
>  
>  	struct timer_list		ri_timer;
>  	struct callback_head		ri_task_work;
> +	bool				ri_work_pending;
>  	struct task_struct		*task;
>  };
>  
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1797,9 +1797,8 @@ void uprobe_free_utask(struct task_struc
>  	t->utask = NULL;
>  }
>  
> -static void return_instance_task_work(struct callback_head *head)
> +static void __return_instance_put(struct uprobe_task *utask)
>  {
> -	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
>  	struct return_instance *ri;
>  
>  	for (ri = utask->return_instances; ri; ri = ri->next) {
> @@ -1815,9 +1814,43 @@ static void return_instance_task_work(st
>  	}
>  }
>  
> +static void return_instance_task_work(struct callback_head *head)
> +{
> +	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
> +	utask->ri_work_pending = false;
> +	__return_instance_put(utask);
> +}
> +
> +static int return_instance_blocked(struct task_struct *p, void *arg)
> +{
> +	unsigned int state = READ_ONCE(p->__state);
> +
> +	if (state == TASK_RUNNING || state == TASK_WAKING)
> +		return 0;
> +
> +	smp_rmb();
> +	if (p->on_rq)
> +		return 0;
> +
> +	/*
> +	 * Per __task_needs_rq_locked() we now have: !p->on_cpu and only hold
> +	 * p->pi_lock, and can consider the task fully blocked.
> +	 */
> +
> +	__return_instance_put(p->utask);

PREEMPT_RT might not like this though, doing the full RI iteration under
a raw_spinlock_t...

I just can't think of anything saner just now. Oh well, let me go make
dinner, perhaps something will come to me.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU
  2024-07-11 11:02 ` [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU Peter Zijlstra
@ 2024-07-11 16:06   ` Oleg Nesterov
  2024-07-11 18:42     ` Peter Zijlstra
  2024-07-12 21:28   ` Andrii Nakryiko
  1 sibling, 1 reply; 45+ messages in thread
From: Oleg Nesterov @ 2024-07-11 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

I'll try to actually apply the whole series and read the code tomorrow.
Right now I can't understand this change... Just one question for now.

On 07/11, Peter Zijlstra wrote:
>
> @@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct upr
>  			 * attack from user-space.
>  			 */
>  			uprobe_warn(current, "handle tail call");
> -			goto err_uprobe;
> +			goto err_mem;
>  		}
>  		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
>  	}
>
> +	ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu);
> +	ri->uprobe = uprobe;

It seems that, if we race with _unregister, this __srcu_read_lock()
can happen after call_srcu(uprobes_srcu, uprobe, uprobe_free_stage1)
was already called...

In this case read_lock "has no effect" in that uprobe_free_stage1()
can run before free_ret_instance() does srcu_read_unlock(ri->srcu_idx).

Perhaps it is fine, uprobe_free_stage1() does another call_srcu(),
but somehow I got lost.

Could you re-check this logic? Most probably I missed something, but still...

Oleg.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU
  2024-07-11 16:06   ` Oleg Nesterov
@ 2024-07-11 18:42     ` Peter Zijlstra
  2024-07-12 10:26       ` Oleg Nesterov
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-11 18:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, Jul 11, 2024 at 06:06:53PM +0200, Oleg Nesterov wrote:
> I'll try to actually apply the whole series and read the code tomorrow.
> Right now I can't understand this change... Just one question for now.
> 
> On 07/11, Peter Zijlstra wrote:
> >
> > @@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct upr
> >  			 * attack from user-space.
> >  			 */
> >  			uprobe_warn(current, "handle tail call");
> > -			goto err_uprobe;
> > +			goto err_mem;
> >  		}
> >  		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
> >  	}
> >
> > +	ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu);
> > +	ri->uprobe = uprobe;
> 
> It seems that, if we race with _unregister, this __srcu_read_lock()
> can happen after call_srcu(uprobes_srcu, uprobe, uprobe_free_stage1)
> was already called...
> 
> In this case read_lock "has no effect" in that uprobe_free_stage1()
> can run before free_ret_instance() does srcu_read_unlock(ri->srcu_idx).
> 
> Perhaps it is fine, uprobe_free_stage1() does another call_srcu(),
> but somehow I got lost.
> 
> Could you re-check this logic? Most probably I missed something, but still...


  handle_swbp()
    guard(srcu)(&uprobes_srcu);
    handle_chain();
      prepare_uretprobe()
        __srcu_read_lock(&uretprobes_srcu);


vs

  uprobe_free_stage2
    kfree(uprobe)

  uprobe_free_stage1
    call_srcu(&uretprobes_srcu, &uprobe->rcu, uprobe_free_stage2);

  put_uprobe()
    if (refcount_dec_and_test)
      call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_stage1);
  

So my thinking was that since we take uretprobes_srcu *inside* uprobes_srcu,
this reference must be visible before we execute stage1, and as such
stage2 cannot complete prematurely.
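
Spelled out as a timeline (illustrative only, one possible
interleaving; names as in the patch):

  T (task)                              unregister/free
  --------                              ---------------
  srcu_read_lock(&uprobes_srcu)
                                        put_uprobe()
                                          call_srcu(&uprobes_srcu, .., stage1)
  __srcu_read_lock(&uretprobes_srcu)
  srcu_read_unlock(&uprobes_srcu)
                                        /* uprobes_srcu GP complete */
                                        stage1:
                                          call_srcu(&uretprobes_srcu, .., stage2)
  ... return to userspace ...
  __srcu_read_unlock(&uretprobes_srcu)
                                        /* uretprobes_srcu GP complete */
                                        stage2: kfree(uprobe)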


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
  2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
                   ` (10 preceding siblings ...)
  2024-07-11 11:02 ` [PATCH v2 11/11] perf/uprobe: Add uretprobe timer Peter Zijlstra
@ 2024-07-12  4:57 ` Andrii Nakryiko
  2024-07-12  9:13   ` Peter Zijlstra
                     ` (2 more replies)
  11 siblings, 3 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-12  4:57 UTC (permalink / raw)
  To: Peter Zijlstra, oleg
  Cc: mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi!
>
> These patches implement the (S)RCU based proposal to optimize uprobes.
>
> On my c^Htrusty old IVB-EP -- where each (of the 40) CPU calls 'func' in a
> tight loop:
>
>   perf probe -x ./uprobes test=func
>   perf stat -ae probe_uprobe:test  -- sleep 1
>
>   perf probe -x ./uprobes test=func%return
>   perf stat -ae probe_uprobe:test__return -- sleep 1
>
> PRE:
>
>   4,038,804      probe_uprobe:test
>   2,356,275      probe_uprobe:test__return
>
> POST:
>
>   7,216,579      probe_uprobe:test
>   6,744,786      probe_uprobe:test__return
>
> (copy-paste FTW, I didn't do new numbers because the fast paths didn't change --
>  and quick test run shows similar numbers)
>
> Patches also available here:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes
>
>
> Changes since last time:
>  - better split with intermediate inc_not_zero()
>  - fix UPROBE_HANDLER_REMOVE
>  - restored the lost rcu_assign_pointer()
>  - avoid lockdep for uretprobe_srcu
>  - add missing put_uprobe() -> srcu_read_unlock() conversion
>  - actually initialize return_instance::has_ref
>  - a few comments
>  - things I don't remember
>
>

Hey Peter!

Thanks for the v2, I plan to look at it more thoroughly tomorrow. But
meanwhile I spent a good chunk of today writing an uprobes
stress-test, so we can validate that we are not regressing anything
(yes, I don't trust lockless code and people in general ;)

Anyways, if you'd like to use it, it's at [0]. All you should need to
build and run it is:

  $ cd examples/c
  $ make -j$(nproc) uprobe-stress
  $ sudo ./uprobe-stress -tN -aM -mP -fR


N, M, P, and R are the numbers of threads dedicated to each of the four
functions of the stress test: triggering user space functions (N),
attaching/detaching various random subsets of uprobes (M), mmap()ing
parts of the executable with uprobes (P), and forking the process and
triggering uprobes for a little bit (R). The idea is to test various
timings and interleavings of uprobe-related logic.

You should only need not-too-old Clang to build everything (Clang 12+
should work, I believe). But do let me know if you run into troubles.

I did run this stress test for a little while on current
bpf-next/master with no issues detected (yay!).

But then I also ran it on Linux built from perf/uprobes branch (these
patches), and after a few seconds I see that there is no more
attachment/detachment happening. Eventually I got splats, which you
can see in [1]. I used `sudo ./uprobe-stress -a10 -t5 -m5 -f3` command
to run it inside my QEMU image.

So there is still something off, hopefully this will help to debug and
hammer out any remaining kinks. Thanks!

  [0] https://github.com/libbpf/libbpf-bootstrap/commit/2f88cef90f9728ec8c7bee7bd48fdbcf197806c3
  [1] https://gist.github.com/anakryiko/f761690addf7aa5f08caec95fda9ef1a

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
  2024-07-12  4:57 ` [PATCH v2 00/11] perf/uprobe: Optimize uprobes Andrii Nakryiko
@ 2024-07-12  9:13   ` Peter Zijlstra
  2024-07-12 13:10   ` Peter Zijlstra
  2024-07-15 14:45   ` Peter Zijlstra
  2 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-12  9:13 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: oleg, mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, Jul 11, 2024 at 09:57:44PM -0700, Andrii Nakryiko wrote:

> You should only need not-too-old Clang to build everything (Clang 12+
> should work, I believe). But do let me know if you run into troubles.

A quick look at the thing shows me it's full of BPF gunk :/

Which means, I probably also need a kernel with BPF and BTF and all that
other gunk on -- which I don't have on any of my machines.

Esp. the BTF stuff is a force kill on all my builds as it completely
destroys build times.

> But then I also ran it on Linux built from perf/uprobes branch (these
> patches), and after a few seconds I see that there is no more
> attachment/detachment happening. Eventually I got splats, which you
> can see in [1]. I used `sudo ./uprobe-stress -a10 -t5 -m5 -f3` command
> to run it inside my QEMU image.

OK, I'll see if I can make it work.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU
  2024-07-11 18:42     ` Peter Zijlstra
@ 2024-07-12 10:26       ` Oleg Nesterov
  0 siblings, 0 replies; 45+ messages in thread
From: Oleg Nesterov @ 2024-07-12 10:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On 07/11, Peter Zijlstra wrote:
>
>   uprobe_free_stage1
>     call_srcu(&uretprobes_srcu, &uprobe->rcu, uprobe_free_stage2);
>
>   put_uprobe()
>     if (refcount_dec_and_test)
>       call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_stage1);
>
>
> So my thinking was that since we take uretprobes_srcu *inside* uprobes_srcu,

Ah, indeed! Somehow I misread the change in put_uprobe() as if it
uses call_srcu(&uretprobes_srcu, ...).

Thanks,

Oleg.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
  2024-07-12  4:57 ` [PATCH v2 00/11] perf/uprobe: Optimize uprobes Andrii Nakryiko
  2024-07-12  9:13   ` Peter Zijlstra
@ 2024-07-12 13:10   ` Peter Zijlstra
  2024-07-12 15:29     ` Andrii Nakryiko
  2024-07-15 14:45   ` Peter Zijlstra
  2 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-12 13:10 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: oleg, mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, Jul 11, 2024 at 09:57:44PM -0700, Andrii Nakryiko wrote:

> Anyways, if you'd like to use it, it's at [0]. All you should need to
> build and run it is:
> 
>   $ cd examples/c
>   $ make -j$(nproc) uprobe-stress
>   $ sudo ./uprobe-stress -tN -aM -mP -fR

>   [0] https://github.com/libbpf/libbpf-bootstrap/commit/2f88cef90f9728ec8c7bee7bd48fdbcf197806c3

So, I cannot clone that [0] URL; I had to click around github shite for a
while to figure out wtf to clone and where your commit lives. Turns out
it is:

	$ git clone https://github.com/libbpf/libbpf-bootstrap.git
	$ cd libbpf-bootstrap
	$ git checkout uprobe-stress

But then I do the above and I'm greeted with:

root@ivb-ep:/usr/src/libbpf-bootstrap/examples/c# make -j40
  MKDIR    .output
  MKDIR    .output/libbpf
  MKDIR    bpftool
  LIB      libbpf.a
  BPFTOOL  bpftool/bootstrap/bpftool
make[1]: *** /usr/src/libbpf-bootstrap/libbpf/src: No such file or directory.  Stop.
make: *** [Makefile:87: /usr/src/libbpf-bootstrap/examples/c/.output/libbpf.a] Error 2
make: *** Waiting for unfinished jobs....
make[1]: *** /usr/src/libbpf-bootstrap/bpftool/src: No such file or directory.  Stop.
make: *** [Makefile:95: /usr/src/libbpf-bootstrap/examples/c/.output/bpftool/bootstrap/bpftool] Error 2


Now what ?!?

BPF is ever such unusable shite :/ It's very near as bad as qemu.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
  2024-07-12 13:10   ` Peter Zijlstra
@ 2024-07-12 15:29     ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-12 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: oleg, mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Fri, Jul 12, 2024 at 6:10 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jul 11, 2024 at 09:57:44PM -0700, Andrii Nakryiko wrote:
>
> > Anyways, if you'd like to use it, it's at [0]. All you should need to
> > build and run it is:
> >
> >   $ cd examples/c
> >   $ make -j$(nproc) uprobe-stress
> >   $ sudo ./uprobe-stress -tN -aM -mP -fR
>
> >   [0] https://github.com/libbpf/libbpf-bootstrap/commit/2f88cef90f9728ec8c7bee7bd48fdbcf197806c3
>
> So, I cannot clone that [0] URL I had to click around github shite for a
> while to figure out wtf to clone and where your commit lives, turns out
> it is:

Sorry, my bad, it's just "normal" Github stuff, but yes, I also find
it quite confusing, so I should have been more explicit that this is
the uprobe-stress branch in the libbpf/libbpf-bootstrap Github repo.

>
>         $ git clone https://github.com/libbpf/libbpf-bootstrap.git
>         $ cd libbpf-bootstrap
>         $ git checkout uprobe-stress

Yes, sorry, as I said, I should have been more thorough in my
instructions. You did a great job figuring all the above out; the next
step is to make sure all the git submodules are checked out. So, in
addition to the above, you are just missing git submodule
initialization:

$ git submodule update --init --recursive
$ cd examples/c
$ make -j$(nproc) uprobe-stress

NOTE: Don't run just `make`, because it will build all the examples,
which have extra dependencies, and we don't want to go over that with
you :) Stick to `make uprobe-stress` to build just an uprobe-stress
binary.

The only extra thing that you might need (if you don't have that
already on your build machine) is development versions of libelf and
zlib, as dependencies of libbpf (that would be elfutils-libelf-devel
and zlib-devel packages in Fedora, I think). libbpf-bootstrap is
trying to be as self-contained and dependency-free as possible (which
is also why it uses git submodules).

You were worried about BTF. This tool doesn't need BTF and it should
work without it enabled in kernel config (but if you do want BTF
generation to succeed, I think the only kernel build requirement for
that is an up-to-date pahole from the dwarves package).

As for the kernel config, I don't think you need anything BPF-specific beyond:

CONFIG_BPF=y
CONFIG_BPF_EVENTS=y
CONFIG_BPF_SYSCALL=y

But just in case, we maintain a list of kernel configuration options
that *all BPF selftests* require (see [0]), so worst case just append
that to your config (but really, the above three are probably all you
need,
assuming you have all the non-BPF perf/tracing/uprobe configs
enabled).

  [0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/config

>
> But then I do the above and I'm greeted with:
>
> root@ivb-ep:/usr/src/libbpf-bootstrap/examples/c# make -j40
>   MKDIR    .output
>   MKDIR    .output/libbpf
>   MKDIR    bpftool
>   LIB      libbpf.a
>   BPFTOOL  bpftool/bootstrap/bpftool
> make[1]: *** /usr/src/libbpf-bootstrap/libbpf/src: No such file or directory.  Stop.
> make: *** [Makefile:87: /usr/src/libbpf-bootstrap/examples/c/.output/libbpf.a] Error 2
> make: *** Waiting for unfinished jobs....
> make[1]: *** /usr/src/libbpf-bootstrap/bpftool/src: No such file or directory.  Stop.
> make: *** [Makefile:95: /usr/src/libbpf-bootstrap/examples/c/.output/bpftool/bootstrap/bpftool] Error 2
>
>
> Now what ?!?
>
> BPF is ever such unusable shite :/ It's very near as bad as qemu.
>

Sorry, not going into discussions about this :) I do agree the initial
setup is not straightforward, and this libbpf-bootstrap repo is
actually an attempt to let users start quickly by doing a minimal and
more-or-less straightforward BPF setup (which is why I built
uprobe-stress on top of libbpf-bootstrap setup instead of BPF
selftests; that would be even more painful process for you which I
didn't want to force you through, see above about
selftests/bpf/config).

But just keep in mind that using BPF here isn't some sort of random
choice just because I tend to work mostly with BPF. BPF is the only
interface to multi-uprobe attachment, and it's a lightweight and
low-overhead way to know whether uprobes are triggered (I do
memory-mapped sharing of per-cpu counters between BPF/kernel and user
space, so no extra syscall overhead, blocking, or unnecessary memory
ordering is added). And in general, whether you like it or not, most
people would never care to use uprobes if not for BPF as an interface
into that.

I'm not trying to convert you into BPF or anything like that, but this
has to use BPF for end-to-end testing.

But you are close, please don't give up!

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu()
  2024-07-11 11:02 ` [PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu() Peter Zijlstra
@ 2024-07-12 20:23   ` Andrii Nakryiko
  2024-07-15 11:21     ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-12 20:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Much like latch_tree, add two RCU methods for the regular RB-tree,
> which can be used in conjunction with a seqcount to provide lockless
> lookups.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  include/linux/rbtree.h |   67 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 67 insertions(+)
>
> --- a/include/linux/rbtree.h
> +++ b/include/linux/rbtree.h
> @@ -245,6 +245,42 @@ rb_find_add(struct rb_node *node, struct
>  }
>
>  /**
> + * rb_find_add_rcu() - find equivalent @node in @tree, or add @node
> + * @node: node to look-for / insert
> + * @tree: tree to search / modify
> + * @cmp: operator defining the node order
> + *
> + * Adds a Store-Release for link_node.
> + *
> + * Returns the rb_node matching @node, or NULL when no match is found and @node
> + * is inserted.
> + */
> +static __always_inline struct rb_node *
> +rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
> +               int (*cmp)(struct rb_node *, const struct rb_node *))

I don't get the point of the RCU version of rb_find_add as RCU itself
doesn't provide enough protection for modification of the tree, right?
So in uprobes code you do rb_find_add_rcu() under uprobes_treelock +
uprobes_seqcount locks. Wouldn't it be just as fine to do plain
non-RCU rb_find_add() in that case? After all, you do plain rb_erase
under the same set of locks.

So what's the point of this one?

> +{
> +       struct rb_node **link = &tree->rb_node;
> +       struct rb_node *parent = NULL;
> +       int c;
> +
> +       while (*link) {
> +               parent = *link;
> +               c = cmp(node, parent);
> +
> +               if (c < 0)
> +                       link = &parent->rb_left;
> +               else if (c > 0)
> +                       link = &parent->rb_right;
> +               else
> +                       return parent;
> +       }
> +
> +       rb_link_node_rcu(node, parent, link);
> +       rb_insert_color(node, tree);
> +       return NULL;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list
  2024-07-11 11:02 ` [PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list Peter Zijlstra
@ 2024-07-12 21:06   ` Andrii Nakryiko
  2024-07-15 11:25     ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-12 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

+ bpf@vger, please cc bpf ML for the next revision, these changes are
very relevant there as well, thanks

On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> With handle_swbp() hitting concurrently on (all) CPUs the
> uprobe->register_rwsem can get very contended. Add an SRCU instance to
> cover the consumer list and consumer lifetime.
>
> > Since the consumers are externally embedded structures, unregister will
> have to suffer a synchronize_srcu().
>
> > A notable complication is the UPROBE_HANDLER_REMOVE logic which can
> race against uprobe_register() such that it might want to remove a
> > freshly installed handler that didn't get called. In order to close
> this hole, a seqcount is added. With that, the removal path can tell
> if anything changed and bail out of the removal.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/events/uprobes.c |   60 ++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 50 insertions(+), 10 deletions(-)
>

[...]

> @@ -800,7 +808,7 @@ static bool consumer_del(struct uprobe *
>         down_write(&uprobe->consumer_rwsem);
>         for (con = &uprobe->consumers; *con; con = &(*con)->next) {
>                 if (*con == uc) {
> -                       *con = uc->next;
> +                       WRITE_ONCE(*con, uc->next);

I have a dumb and mechanical question.

Above in consumer_add() you are doing WRITE_ONCE() for uc->next
assignment, but rcu_assign_pointer() for uprobe->consumers. Here, you
are doing WRITE_ONCE() for the same operation, if it so happens that
uc == *con == uprobe->consumers. So is rcu_assign_pointer() necessary
in consumer_add()? If yes, we should have it here as well, no? And if
not, why bother with it in consumer_add()?
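
(For reference, the semantic difference: rcu_assign_pointer() implies a
store-release, so a reader that observes the new pointer also observes
the pointee's earlier initialization, while WRITE_ONCE() only prevents
store tearing. A minimal publish-side sketch, handler name made up:

	uc->handler = my_handler;		/* init fields first */
	uc->next    = uprobe->consumers;
	/* release-publish: readers pair with rcu_dereference() */
	rcu_assign_pointer(uprobe->consumers, uc);

Unlinking only skips an already-published node, so no new memory has to
become visible there.)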

>                         ret = true;
>                         break;
>                 }
> @@ -1139,9 +1147,13 @@ void uprobe_unregister(struct inode *ino
>                 return;
>
>         down_write(&uprobe->register_rwsem);
> +       raw_write_seqcount_begin(&uprobe->register_seq);
>         __uprobe_unregister(uprobe, uc);
> +       raw_write_seqcount_end(&uprobe->register_seq);
>         up_write(&uprobe->register_rwsem);
>         put_uprobe(uprobe);
> +
> +       synchronize_srcu(&uprobes_srcu);
>  }
>  EXPORT_SYMBOL_GPL(uprobe_unregister);

[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 07/11] perf/uprobe: Split uprobe_unregister()
  2024-07-11 11:02 ` [PATCH v2 07/11] perf/uprobe: Split uprobe_unregister() Peter Zijlstra
@ 2024-07-12 21:10   ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-12 21:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

+ bpf

On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> With uprobe_unregister() having grown a synchronize_srcu(), it becomes
> fairly slow to call. Esp. since both users of this API call it in a
> loop.
>
> Peel off the sync_srcu() and do it once, after the loop.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  include/linux/uprobes.h     |    8 ++++++--
>  kernel/events/uprobes.c     |    8 ++++++--
>  kernel/trace/bpf_trace.c    |    6 ++++--
>  kernel/trace/trace_uprobe.c |    6 +++++-
>  4 files changed, 21 insertions(+), 7 deletions(-)
>

BPF side of things looks good:

Acked-by: Andrii Nakryiko <andrii@kernel.org>


> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -113,7 +113,8 @@ extern int uprobe_write_opcode(struct ar
>  extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
>  extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
> -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> +extern void uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> +extern void uprobe_unregister_sync(void);
>  extern int uprobe_mmap(struct vm_area_struct *vma);
>  extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
>  extern void uprobe_start_dup_mmap(void);
> @@ -163,7 +164,10 @@ uprobe_apply(struct inode *inode, loff_t
>         return -ENOSYS;
>  }
>  static inline void
> -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> +uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> +{
> +}
> +static inline void uprobe_unregister_sync(void)
>  {
>  }
>  static inline int uprobe_mmap(struct vm_area_struct *vma)
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1138,7 +1138,7 @@ __uprobe_unregister(struct uprobe *uprob
>   * @offset: offset from the start of the file.
>   * @uc: identify which probe if multiple probes are colocated.
>   */
> -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> +void uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
>  {
>         struct uprobe *uprobe;
>
> @@ -1152,10 +1152,14 @@ void uprobe_unregister(struct inode *ino
>         raw_write_seqcount_end(&uprobe->register_seq);
>         up_write(&uprobe->register_rwsem);
>         put_uprobe(uprobe);
> +}
> +EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
>
> +void uprobe_unregister_sync(void)
> +{
>         synchronize_srcu(&uprobes_srcu);
>  }
> -EXPORT_SYMBOL_GPL(uprobe_unregister);
> +EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
>
>  /*
>   * __uprobe_register - register a probe
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -3181,9 +3181,11 @@ static void bpf_uprobe_unregister(struct
>         u32 i;
>
>         for (i = 0; i < cnt; i++) {
> -               uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
> -                                 &uprobes[i].consumer);
> +               uprobe_unregister_nosync(d_real_inode(path->dentry), uprobes[i].offset,
> +                                        &uprobes[i].consumer);
>         }
> +       if (cnt)
> +               uprobe_unregister_sync();
>  }
>
>  static void bpf_uprobe_multi_link_release(struct bpf_link *link)
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1104,6 +1104,7 @@ static int trace_uprobe_enable(struct tr
>  static void __probe_event_disable(struct trace_probe *tp)
>  {
>         struct trace_uprobe *tu;
> +       bool sync = false;
>
>         tu = container_of(tp, struct trace_uprobe, tp);
>         WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
> @@ -1112,9 +1113,12 @@ static void __probe_event_disable(struct
>                 if (!tu->inode)
>                         continue;
>
> -               uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> +               uprobe_unregister_nosync(tu->inode, tu->offset, &tu->consumer);
> +               sync = true;
>                 tu->inode = NULL;
>         }
> +       if (sync)
> +               uprobe_unregister_sync();
>  }
>
>  static int probe_event_enable(struct trace_event_call *call,
>
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU
  2024-07-11 11:02 ` [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU Peter Zijlstra
  2024-07-11 14:03   ` Jiri Olsa
@ 2024-07-12 21:21   ` Andrii Nakryiko
  1 sibling, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-12 21:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

+ bpf

On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> With handle_swbp() hitting concurrently on (all) CPUs, potentially on
> the same uprobe, the uprobe->refcount can get *very* hot. Move the
> struct uprobe lifetime into uprobes_srcu such that it covers both the
> uprobe and the uprobe->consumers list.
>
> With this, handle_swbp() can use a single large SRCU critical section
> > to avoid taking a refcount on the uprobe for its duration.
>
> Notably, the single-step and uretprobe paths need a reference that
> leaves handle_swbp() and will, for now, still use ->refcount.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/events/uprobes.c |   68 ++++++++++++++++++++++++++++--------------------
>  1 file changed, 41 insertions(+), 27 deletions(-)
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -51,7 +51,7 @@ static struct mutex uprobes_mmap_mutex[U
>  DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>

[...]

> @@ -1982,22 +1990,31 @@ pre_ssout(struct uprobe *uprobe, struct
>         if (!utask)
>                 return -ENOMEM;
>
> +       utask->active_uprobe = try_get_uprobe(uprobe);
> +       if (!utask->active_uprobe)
> +               return -ESRCH;
> +
>         xol_vaddr = xol_get_insn_slot(uprobe);
> -       if (!xol_vaddr)
> -               return -ENOMEM;
> +       if (!xol_vaddr) {
> +               err = -ENOMEM;
> +               goto err_uprobe;
> +       }
>
>         utask->xol_vaddr = xol_vaddr;
>         utask->vaddr = bp_vaddr;
>
>         err = arch_uprobe_pre_xol(&uprobe->arch, regs);
> -       if (unlikely(err)) {
> -               xol_free_insn_slot(current);

let's keep this here, because you later remove err_uprobe part and
err_xol is only jumped to from here; it's better to just drop err_xol
and err_uprobe altogether and keep the original xol_free_insn_slot()
here.

> -               return err;
> -       }
> +       if (unlikely(err))
> +               goto err_xol;
>
> -       utask->active_uprobe = uprobe;
>         utask->state = UTASK_SSTEP;
>         return 0;
> +
> +err_xol:
> +       xol_free_insn_slot(current);
> +err_uprobe:
> +       put_uprobe(utask->active_uprobe);

utask->active_uprobe = NULL;

let's not leave garbage in utask (even if you remove this later anyways)

> +       return err;
>  }
>
>  /*

[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU
  2024-07-11 11:02 ` [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU Peter Zijlstra
  2024-07-11 16:06   ` Oleg Nesterov
@ 2024-07-12 21:28   ` Andrii Nakryiko
  2024-07-15 11:59     ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-12 21:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

+ bpf

On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Both single-step and uretprobes take a refcount on struct uprobe in
> handle_swbp() in order to ensure struct uprobe stays extant until a
> next trap.
>
> Since uprobe_unregister() only cares about the uprobe_consumer
> life-time, and these intra-trap sections can be arbitrarily large,
> create a second SRCU domain to cover these.
>
> Notably, a uretprobe with a registered return_instance that never
> triggers -- because userspace -- will currently pin the
> return_instance and related uprobe until the task dies. With this
> > conversion to SRCU, this behaviour will inhibit freeing of all uprobes.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/uprobes.h |    2 +
>  kernel/events/uprobes.c |   60 +++++++++++++++++++++++-------------------------
>  2 files changed, 31 insertions(+), 31 deletions(-)
>
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -78,6 +78,7 @@ struct uprobe_task {
>
>         struct return_instance          *return_instances;
>         unsigned int                    depth;
> +       unsigned int                    active_srcu_idx;
>  };
>
>  struct return_instance {
> @@ -86,6 +87,7 @@ struct return_instance {
>         unsigned long           stack;          /* stack pointer */
>         unsigned long           orig_ret_vaddr; /* original return address */
>         bool                    chained;        /* true, if instance is nested */
> +       int                     srcu_idx;
>
>         struct return_instance  *next;          /* keep as stack */
>  };
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -54,6 +54,15 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem)
>   * Covers uprobe->consumers lifetime as well as struct uprobe.
>   */
>  DEFINE_STATIC_SRCU(uprobes_srcu);
> +/*
> + * Covers return_instance->uprobe and utask->active_uprobe. Separate from
> > + * uprobes_srcu because uprobe_unregister() doesn't need to wait for this
> + * and these lifetimes can be fairly long.
> + *
> + * Notably, these sections span userspace and as such use
> + * __srcu_read_{,un}lock() to elide lockdep.
> + */
> +DEFINE_STATIC_SRCU(uretprobes_srcu);
>
>  /* Have a copy of original instruction */
>  #define UPROBE_COPY_INSN       0
> @@ -596,25 +605,24 @@ set_orig_insn(struct arch_uprobe *auprob
>                         *(uprobe_opcode_t *)&auprobe->insn);
>  }
>
> -static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
> -{
> -       if (refcount_inc_not_zero(&uprobe->ref))
> -               return uprobe;
> -       return NULL;
> -}
> -
>  static struct uprobe *get_uprobe(struct uprobe *uprobe)
>  {
>         refcount_inc(&uprobe->ref);
>         return uprobe;
>  }
>
> -static void uprobe_free_rcu(struct rcu_head *rcu)
> +static void uprobe_free_stage2(struct rcu_head *rcu)
>  {
>         struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
>         kfree(uprobe);
>  }
>
> +static void uprobe_free_stage1(struct rcu_head *rcu)
> +{
> +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> +       call_srcu(&uretprobes_srcu, &uprobe->rcu, uprobe_free_stage2);
> +}
> +
>  static void put_uprobe(struct uprobe *uprobe)
>  {
>         if (refcount_dec_and_test(&uprobe->ref)) {
> @@ -626,7 +634,7 @@ static void put_uprobe(struct uprobe *up
>                 mutex_lock(&delayed_uprobe_lock);
>                 delayed_uprobe_remove(uprobe, NULL);
>                 mutex_unlock(&delayed_uprobe_lock);
> -               call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
> +               call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_stage1);
>         }
>  }
>
> @@ -1753,7 +1761,7 @@ unsigned long uprobe_get_trap_addr(struc
>  static struct return_instance *free_ret_instance(struct return_instance *ri)
>  {
>         struct return_instance *next = ri->next;
> -       put_uprobe(ri->uprobe);
> +       __srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
>         kfree(ri);
>         return next;
>  }
> @@ -1771,7 +1779,7 @@ void uprobe_free_utask(struct task_struc
>                 return;
>
>         if (utask->active_uprobe)
> -               put_uprobe(utask->active_uprobe);
> +               __srcu_read_unlock(&uretprobes_srcu, utask->active_srcu_idx);
>
>         ri = utask->return_instances;
>         while (ri)
> @@ -1814,7 +1822,7 @@ static int dup_utask(struct task_struct
>                         return -ENOMEM;
>
>                 *n = *o;
> -               get_uprobe(n->uprobe);
> +               __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);

do we need to add this __srcu_clone_read_lock hack just to avoid
taking a refcount in dup_utask (i.e., on process fork)? This is not
that frequent and performance-sensitive case, so it seems like it
should be fine to take refcount and avoid doing srcu_read_unlock() in
a new process. Just like the case with long-running uretprobes where
you convert SRCU lock into refcount.
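
Roughly (illustrative; this assumes a per-instance flag -- spelled
'has_ref' here, which only shows up in the next patch -- telling
free_ret_instance() to put_uprobe() instead of srcu_read_unlock()):

	*n = *o;
	if (n->uprobe) {
		get_uprobe(n->uprobe);	/* plain refcount on fork */
		n->has_ref = true;	/* child releases via put_uprobe() */
	}
	n->next = NULL;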

>                 n->next = NULL;
>
>                 *p = n;
> @@ -1931,14 +1939,10 @@ static void prepare_uretprobe(struct upr
>         if (!ri)
>                 return;
>
> -       ri->uprobe = try_get_uprobe(uprobe);
> -       if (!ri->uprobe)
> -               goto err_mem;
> -
>         trampoline_vaddr = get_trampoline_vaddr();
>         orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
>         if (orig_ret_vaddr == -1)
> -               goto err_uprobe;
> +               goto err_mem;
>
>         /* drop the entries invalidated by longjmp() */
>         chained = (orig_ret_vaddr == trampoline_vaddr);
> @@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct upr
>                          * attack from user-space.
>                          */
>                         uprobe_warn(current, "handle tail call");
> -                       goto err_uprobe;
> +                       goto err_mem;
>                 }
>                 orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
>         }
>
> +       ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu);
> +       ri->uprobe = uprobe;
>         ri->func = instruction_pointer(regs);
>         ri->stack = user_stack_pointer(regs);
>         ri->orig_ret_vaddr = orig_ret_vaddr;
> @@ -1972,8 +1978,6 @@ static void prepare_uretprobe(struct upr
>
>         return;
>
> -err_uprobe:
> -       put_uprobe(ri->uprobe);
>  err_mem:
>         kfree(ri);
>  }
> @@ -1990,15 +1994,9 @@ pre_ssout(struct uprobe *uprobe, struct
>         if (!utask)
>                 return -ENOMEM;
>
> -       utask->active_uprobe = try_get_uprobe(uprobe);
> -       if (!utask->active_uprobe)
> -               return -ESRCH;
> -
>         xol_vaddr = xol_get_insn_slot(uprobe);
> -       if (!xol_vaddr) {
> -               err = -ENOMEM;
> -               goto err_uprobe;
> -       }
> +       if (!xol_vaddr)
> +               return -ENOMEM;
>
>         utask->xol_vaddr = xol_vaddr;
>         utask->vaddr = bp_vaddr;
> @@ -2007,13 +2005,13 @@ pre_ssout(struct uprobe *uprobe, struct
>         if (unlikely(err))
>                 goto err_xol;
>
> +       utask->active_srcu_idx = __srcu_read_lock(&uretprobes_srcu);
> +       utask->active_uprobe = uprobe;
>         utask->state = UTASK_SSTEP;
>         return 0;
>
>  err_xol:
>         xol_free_insn_slot(current);
> -err_uprobe:
> -       put_uprobe(utask->active_uprobe);
>         return err;
>  }
>
> @@ -2366,7 +2364,7 @@ static void handle_singlestep(struct upr
>         else
>                 WARN_ON_ONCE(1);
>
> -       put_uprobe(uprobe);
> +       __srcu_read_unlock(&uretprobes_srcu, utask->active_srcu_idx);
>         utask->active_uprobe = NULL;
>         utask->state = UTASK_RUNNING;
>         xol_free_insn_slot(current);
>
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
  2024-07-11 11:02 ` [PATCH v2 11/11] perf/uprobe: Add uretprobe timer Peter Zijlstra
  2024-07-11 13:19   ` Oleg Nesterov
@ 2024-07-12 21:43   ` Andrii Nakryiko
  2024-07-15 11:41     ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-12 21:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

+ bpf

On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> In order to put a bound on the uretprobe_srcu critical section, add a
> timer to uprobe_task. Upon every RI added or removed the timer is
> pushed forward to now + 1s. If the timer were ever to fire, it would
> convert the SRCU 'reference' to a refcount reference if possible.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/uprobes.h |    8 +++++
>  kernel/events/uprobes.c |   67 ++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 69 insertions(+), 6 deletions(-)
>
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -15,6 +15,7 @@
>  #include <linux/rbtree.h>
>  #include <linux/types.h>
>  #include <linux/wait.h>
> +#include <linux/timer.h>
>
>  struct vm_area_struct;
>  struct mm_struct;
> @@ -79,6 +80,10 @@ struct uprobe_task {
>         struct return_instance          *return_instances;
>         unsigned int                    depth;
>         unsigned int                    active_srcu_idx;
> +
> +       struct timer_list               ri_timer;
> +       struct callback_head            ri_task_work;
> +       struct task_struct              *task;
>  };
>
>  struct return_instance {
> @@ -86,7 +91,8 @@ struct return_instance {
>         unsigned long           func;
>         unsigned long           stack;          /* stack pointer */
>         unsigned long           orig_ret_vaddr; /* original return address */
> -       bool                    chained;        /* true, if instance is nested */
> +       u8                      chained;        /* true, if instance is nested */
> +       u8                      has_ref;

Why bool -> u8 switch? You don't touch chained, so why change its
type? And for has_ref you interchangeably use 0 and true for the same
field. Let's stick to bool as there is nothing wrong with it?

>         int                     srcu_idx;
>
>         struct return_instance  *next;          /* keep as stack */

[...]

> @@ -1822,13 +1864,20 @@ static int dup_utask(struct task_struct
>                         return -ENOMEM;
>
>                 *n = *o;
> -               __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
> +               if (n->uprobe) {
> +                       if (n->has_ref)
> +                               get_uprobe(n->uprobe);
> +                       else
> +                               __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
> +               }
>                 n->next = NULL;
>
>                 *p = n;
>                 p = &n->next;
>                 n_utask->depth++;
>         }
> +       if (n_utask->return_instances)
> +               mod_timer(&n_utask->ri_timer, jiffies + HZ);

let's add #define for HZ, so it's adjusted in just one place (instead
of 3 as it is right now)

Also, we can have up to 64 levels of uretprobe nesting, so,
technically, the user can cause a delay of 64 seconds in total. Maybe
let's use something smaller than a full second? After all, if the
user-space function has high latency, then this refcount congestion is
much less of a problem. I'd set it to something like 50-100 ms for
starters.
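
Something like (name is made up):

	#define RI_TIMER_PERIOD		(HZ / 10)	/* ~100ms */

	mod_timer(&utask->ri_timer, jiffies + RI_TIMER_PERIOD);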

>
>         return 0;
>  }
> @@ -1967,6 +2016,7 @@ static void prepare_uretprobe(struct upr
>
>         ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu);
>         ri->uprobe = uprobe;
> +       ri->has_ref = 0;
>         ri->func = instruction_pointer(regs);
>         ri->stack = user_stack_pointer(regs);
>         ri->orig_ret_vaddr = orig_ret_vaddr;
> @@ -1976,6 +2026,8 @@ static void prepare_uretprobe(struct upr
>         ri->next = utask->return_instances;
>         utask->return_instances = ri;
>
> +       mod_timer(&utask->ri_timer, jiffies + HZ);
> +
>         return;
>
>  err_mem:
> @@ -2204,6 +2256,9 @@ handle_uretprobe_chain(struct return_ins
>         struct uprobe *uprobe = ri->uprobe;
>         struct uprobe_consumer *uc;
>
> +       if (!uprobe)
> +               return;
> +
>         guard(srcu)(&uprobes_srcu);
>
>         for_each_consumer_rcu(uc, uprobe->consumers) {
> @@ -2250,8 +2305,10 @@ static void handle_trampoline(struct pt_
>
>                 instruction_pointer_set(regs, ri->orig_ret_vaddr);
>                 do {
> -                       if (valid)
> +                       if (valid) {
>                                 handle_uretprobe_chain(ri, regs);
> +                               mod_timer(&utask->ri_timer, jiffies + HZ);
> +                       }
>                         ri = free_ret_instance(ri);
>                         utask->depth--;
>                 } while (ri != next);
>
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu()
  2024-07-12 20:23   ` Andrii Nakryiko
@ 2024-07-15 11:21     ` Peter Zijlstra
  2024-07-15 17:13       ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-15 11:21 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Fri, Jul 12, 2024 at 01:23:43PM -0700, Andrii Nakryiko wrote:
> On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Much like latch_tree, add two RCU methods for the regular RB-tree,
> > which can be used in conjunction with a seqcount to provide lockless
> > lookups.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> >  include/linux/rbtree.h |   67 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 67 insertions(+)
> >
> > --- a/include/linux/rbtree.h
> > +++ b/include/linux/rbtree.h
> > @@ -245,6 +245,42 @@ rb_find_add(struct rb_node *node, struct
> >  }
> >
> >  /**
> > + * rb_find_add_rcu() - find equivalent @node in @tree, or add @node
> > + * @node: node to look-for / insert
> > + * @tree: tree to search / modify
> > + * @cmp: operator defining the node order
> > + *
> > + * Adds a Store-Release for link_node.
> > + *
> > + * Returns the rb_node matching @node, or NULL when no match is found and @node
> > + * is inserted.
> > + */
> > +static __always_inline struct rb_node *
> > +rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
> > +               int (*cmp)(struct rb_node *, const struct rb_node *))
> 
> I don't get the point of the RCU version of rb_find_add as RCU itself
> doesn't provide enough protection for modification of the tree, right?
> So in uprobes code you do rb_find_add_rcu() under uprobes_treelock +
> uprobes_seqcount locks. Wouldn't it be just as fine to do plain
> non-RCU rb_find_add() in that case? After all, you do plain rb_erase
> under the same set of locks.
> 
> So what's the point of this one?

The store-release when adding it to the tree. Without that it becomes
possible to find the entry while the entry itself is incomplete.

Eg. something like:

 entry->foo = A;
 rb_find_add(&entry->node, &my_tree, my_cmp);

vs

 rcu_read_lock();
 entry = rb_find_rcu(...);
 assert(entry->foo == A);

might fail, because there is nothing ordering the foo store and the
rb-node add.
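
With the _rcu variants from this patch the pairing becomes (same
sketch, fixed):

 entry->foo = A;
 rb_find_add_rcu(&entry->node, &my_tree, my_cmp); /* store-release */

vs

 rcu_read_lock();
 entry = rb_find_rcu(...); /* dependent load, sees a complete entry */
 assert(entry->foo == A);
 rcu_read_unlock();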



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list
  2024-07-12 21:06   ` Andrii Nakryiko
@ 2024-07-15 11:25     ` Peter Zijlstra
  2024-07-15 17:30       ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-15 11:25 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

On Fri, Jul 12, 2024 at 02:06:08PM -0700, Andrii Nakryiko wrote:
> + bpf@vger, please cc bpf ML for the next revision, these changes are
> very relevant there as well, thanks
> 
> On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > With handle_swbp() hitting concurrently on (all) CPUs the
> > uprobe->register_rwsem can get very contended. Add an SRCU instance to
> > cover the consumer list and consumer lifetime.
> >
> > Since the consumers are externally embedded structures, unregister will
> > have to suffer a synchronize_srcu().
> >
> > A notable complication is the UPROBE_HANDLER_REMOVE logic which can
> > race against uprobe_register() such that it might want to remove a
> > freshly installed handler that didn't get called. In order to close
> > this hole, a seqcount is added. With that, the removal path can tell
> > if anything changed and bail out of the removal.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/events/uprobes.c |   60 ++++++++++++++++++++++++++++++++++++++++--------
> >  1 file changed, 50 insertions(+), 10 deletions(-)
> >
> 
> [...]
> 
> > @@ -800,7 +808,7 @@ static bool consumer_del(struct uprobe *
> >         down_write(&uprobe->consumer_rwsem);
> >         for (con = &uprobe->consumers; *con; con = &(*con)->next) {
> >                 if (*con == uc) {
> > -                       *con = uc->next;
> > +                       WRITE_ONCE(*con, uc->next);
> 
> I have a dumb and mechanical question.
> 
> Above in consumer_add() you are doing WRITE_ONCE() for uc->next
> assignment, but rcu_assign_pointer() for uprobe->consumers. Here, you
> are doing WRITE_ONCE() for the same operation, if it so happens that
> uc == *con == uprobe->consumers. So is rcu_assign_pointer() necessary
> in consumer_add()? If yes, we should have it here as well, no? And if
> not, why bother with it in consumer_add()?

add is a publish and needs to ensure all stores to the object are
ordered before the object is linked in. Note that rcu_assign_pointer()
is basically a fancy way of writing smp_store_release().

del otoh does not publish, it removes and doesn't need ordering.

It does however need to ensure the pointer write itself isn't torn. That
is, without the WRITE_ONCE() the compiler is free to do byte stores in
order to update the 8 byte pointer value (assuming 64bit). This is
pretty dumb, but very much permitted by C and also utterly fatal in the
case where an RCU iteration comes by and reads a half-written pointer
value.
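
Schematically (a sketch of the intent, not the literal uprobes code):

 /* add: publish; order the object's stores before it is reachable */
 WRITE_ONCE(uc->next, uprobe->consumers);
 rcu_assign_pointer(uprobe->consumers, uc); /* smp_store_release() */

 /* del: unpublish; no ordering needed, just a single untorn store */
 WRITE_ONCE(*con, uc->next);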

> >                         ret = true;
> >                         break;
> >                 }
> > @@ -1139,9 +1147,13 @@ void uprobe_unregister(struct inode *ino
> >                 return;
> >
> >         down_write(&uprobe->register_rwsem);
> > +       raw_write_seqcount_begin(&uprobe->register_seq);
> >         __uprobe_unregister(uprobe, uc);
> > +       raw_write_seqcount_end(&uprobe->register_seq);
> >         up_write(&uprobe->register_rwsem);
> >         put_uprobe(uprobe);
> > +
> > +       synchronize_srcu(&uprobes_srcu);
> >  }
> >  EXPORT_SYMBOL_GPL(uprobe_unregister);
> 
> [...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
  2024-07-12 21:43   ` Andrii Nakryiko
@ 2024-07-15 11:41     ` Peter Zijlstra
  2024-07-15 17:34       ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-15 11:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

On Fri, Jul 12, 2024 at 02:43:52PM -0700, Andrii Nakryiko wrote:
> + bpf
> 
> On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > In order to put a bound on the uretprobe_srcu critical section, add a
> > timer to uprobe_task. Upon every RI added or removed the timer is
> > pushed forward to now + 1s. If the timer were ever to fire, it would
> > convert the SRCU 'reference' to a refcount reference if possible.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  include/linux/uprobes.h |    8 +++++
> >  kernel/events/uprobes.c |   67 ++++++++++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 69 insertions(+), 6 deletions(-)
> >
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -15,6 +15,7 @@
> >  #include <linux/rbtree.h>
> >  #include <linux/types.h>
> >  #include <linux/wait.h>
> > +#include <linux/timer.h>
> >
> >  struct vm_area_struct;
> >  struct mm_struct;
> > @@ -79,6 +80,10 @@ struct uprobe_task {
> >         struct return_instance          *return_instances;
> >         unsigned int                    depth;
> >         unsigned int                    active_srcu_idx;
> > +
> > +       struct timer_list               ri_timer;
> > +       struct callback_head            ri_task_work;
> > +       struct task_struct              *task;
> >  };
> >
> >  struct return_instance {
> > @@ -86,7 +91,8 @@ struct return_instance {
> >         unsigned long           func;
> >         unsigned long           stack;          /* stack pointer */
> >         unsigned long           orig_ret_vaddr; /* original return address */
> > -       bool                    chained;        /* true, if instance is nested */
> > +       u8                      chained;        /* true, if instance is nested */
> > +       u8                      has_ref;
> 
> Why bool -> u8 switch? You don't touch chained, so why change its
> type? And for has_ref you interchangeably use 0 and true for the same
> field. Let's stick to bool as there is nothing wrong with it?

sizeof(_Bool) is implementation defined. It is 1 for x86_64, but there
are platforms where it ends up being 4 (some PowerPC ABIs, among others).
I didn't want to grow this structure for no reason.
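
(If we wanted to keep bool and just catch the odd ABIs at build time,
something like this would do -- hypothetical, not part of the patch:

 BUILD_BUG_ON(sizeof(bool) != 1);

but using u8 avoids the question entirely.)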

> >         int                     srcu_idx;
> >
> >         struct return_instance  *next;          /* keep as stack */
> 
> [...]
> 
> > @@ -1822,13 +1864,20 @@ static int dup_utask(struct task_struct
> >                         return -ENOMEM;
> >
> >                 *n = *o;
> > -               __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
> > +               if (n->uprobe) {
> > +                       if (n->has_ref)
> > +                               get_uprobe(n->uprobe);
> > +                       else
> > +                               __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
> > +               }
> >                 n->next = NULL;
> >
> >                 *p = n;
> >                 p = &n->next;
> >                 n_utask->depth++;
> >         }
> > +       if (n_utask->return_instances)
> > +               mod_timer(&n_utask->ri_timer, jiffies + HZ);
> 
> let's add a #define for this HZ timeout, so it's adjusted in just one
> place (instead of the 3 places it appears right now)

Can do, I suppose.

> Also, we can have up to 64 levels of uretprobe nesting, so,
> technically, the user can cause a delay of 64 seconds in total. Maybe
> let's use something smaller than a full second? After all, if the
> user-space function has high latency, then this refcount congestion is
> much less of a problem. I'd set it to something like 50-100 ms for
> starters.

Before you know it we'll have a sysctl :/ But sure, we can do something
shorter.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU
  2024-07-12 21:28   ` Andrii Nakryiko
@ 2024-07-15 11:59     ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-15 11:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

On Fri, Jul 12, 2024 at 02:28:13PM -0700, Andrii Nakryiko wrote:

> > @@ -1814,7 +1822,7 @@ static int dup_utask(struct task_struct
> >                         return -ENOMEM;
> >
> >                 *n = *o;
> > -               get_uprobe(n->uprobe);
> > +               __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
> 
> do we need to add this __srcu_clone_read_lock hack just to avoid
> taking a refcount in dup_utask (i.e., on process fork)? This is not
> such a frequent and performance-sensitive case, so it seems like it
> should be fine to take a refcount and avoid doing srcu_read_unlock() in
> a new process. Just like the case with long-running uretprobes where
> you convert the SRCU lock into a refcount.

Yes, I suppose that is now possible too. But it makes the patches harder
to split. Let me ponder that after I get it to pass your stress thing.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
  2024-07-12  4:57 ` [PATCH v2 00/11] perf/uprobe: Optimize uprobes Andrii Nakryiko
  2024-07-12  9:13   ` Peter Zijlstra
  2024-07-12 13:10   ` Peter Zijlstra
@ 2024-07-15 14:45   ` Peter Zijlstra
  2024-07-15 17:10     ` Andrii Nakryiko
  2 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2024-07-15 14:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: oleg, mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Thu, Jul 11, 2024 at 09:57:44PM -0700, Andrii Nakryiko wrote:

> But then I also ran it on Linux built from perf/uprobes branch (these
> patches), and after a few seconds I see that there is no more
> attachment/detachment happening. Eventually I got splats, which you
> can see in [1]. I used `sudo ./uprobe-stress -a10 -t5 -m5 -f3` command
> to run it inside my QEMU image.

So them git voodoo incantations did work and I got it built. I'm running
that exact same line above (minus the sudo, because test box only has a
root account I think) on real hardware.

I'm now ~100 periods in and wondering what 'eventually' means...

Also, this is a 2 socket, 10 core per socket, 2 threads per core
ivybridge thing, are those parameters sufficient?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
  2024-07-15 14:45   ` Peter Zijlstra
@ 2024-07-15 17:10     ` Andrii Nakryiko
  2024-07-15 18:10       ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-15 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: oleg, mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Mon, Jul 15, 2024 at 7:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jul 11, 2024 at 09:57:44PM -0700, Andrii Nakryiko wrote:
>
> > But then I also ran it on Linux built from perf/uprobes branch (these
> > patches), and after a few seconds I see that there is no more
> > attachment/detachment happening. Eventually I got splats, which you
> > can see in [1]. I used `sudo ./uprobe-stress -a10 -t5 -m5 -f3` command
> > to run it inside my QEMU image.
>
> So them git voodoo incantations did work and I got it built. I'm running
> that exact same line above (minus the sudo, because test box only has a
> root account I think) on real hardware.
>
> I'm now ~100 periods in and wondering what 'eventually' means...

So I was running in a qemu set up with 16 cores on top of bare metal's
80 core CPU (Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz). I just tried
it again, and I can reproduce it within first few periods:

WORKING HARD!..

PERIOD #1 STATS:
FUNC CALLS               919632
UPROBE HITS              706351
URETPROBE HITS           641679
ATTACHED LINKS              951
ATTACHED UPROBES           2421
ATTACHED URETPROBES        2343
MMAP CALLS                33533
FORKS CALLS                 241

PERIOD #2 STATS:
FUNC CALLS                11444
UPROBE HITS               14320
URETPROBE HITS             9896
ATTACHED LINKS               26
ATTACHED UPROBES             75
ATTACHED URETPROBES          61
MMAP CALLS                39093
FORKS CALLS                  14

PERIOD #3 STATS:
FUNC CALLS                  230
UPROBE HITS                 152
URETPROBE HITS              145
ATTACHED LINKS                2
ATTACHED UPROBES              2
ATTACHED URETPROBES           2
MMAP CALLS                39121
FORKS CALLS                   0

PERIOD #4 STATS:
FUNC CALLS                    0
UPROBE HITS                   0
URETPROBE HITS                0
ATTACHED LINKS                0
ATTACHED UPROBES              0
ATTACHED URETPROBES           0
MMAP CALLS                39010
FORKS CALLS                   0

You can see in the second period all the numbers drop and by period #4
(which is about 20 seconds in) anything but mmap()ing stops. When I
said "eventually" I meant about a minute tops, however long it takes
to do soft lockup detection, 23 seconds this time.

So it should be very fast.

Note that I'm running with debug kernel configuration (see [0] for
full kernel config), here are debug-related settings, in case that
makes a difference:

$ cat ~/linux-build/default/.config | rg -i debug | rg -v '^#'
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_BLK_DEBUG_FS=y
CONFIG_PNP_DEBUG_MESSAGES=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC79XX_DEBUG_MASK=0
CONFIG_SCSI_MVSAS_DEBUG=y
CONFIG_DM_DEBUG=y
CONFIG_MLX4_DEBUG=y
CONFIG_USB_SERIAL_DEBUG=m
CONFIG_INFINIBAND_MTHCA_DEBUG=y
CONFIG_INFINIBAND_IPOIB_DEBUG=y
CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y
CONFIG_CIFS_DEBUG=y
CONFIG_DLM_DEBUG=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF4=y
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
CONFIG_ARCH_HAS_DEBUG_WX=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_IRQFLAGS=y
CONFIG_X86_DEBUG_FPU=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y

  [0] https://gist.github.com/anakryiko/97a023a95b30fb0fe607ff743433e64b

>
> Also, this is a 2 socket, 10 core per socket, 2 threads per core
> ivybridge thing, are those parameters sufficient?

Should be, I guess? It might be VM vs bare metal differences, though.
I'll try to run this on bare metal with more production-like kernel
configuration to see if I can still trigger this. Will let you know
the results when I get them.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu()
  2024-07-15 11:21     ` Peter Zijlstra
@ 2024-07-15 17:13       ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-15 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Mon, Jul 15, 2024 at 4:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 12, 2024 at 01:23:43PM -0700, Andrii Nakryiko wrote:
> > On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Much like latch_tree, add two RCU methods for the regular RB-tree,
> > > which can be used in conjunction with a seqcount to provide lockless
> > > lookups.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > > ---
> > >  include/linux/rbtree.h |   67 +++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 67 insertions(+)
> > >
> > > --- a/include/linux/rbtree.h
> > > +++ b/include/linux/rbtree.h
> > > @@ -245,6 +245,42 @@ rb_find_add(struct rb_node *node, struct
> > >  }
> > >
> > >  /**
> > > + * rb_find_add_rcu() - find equivalent @node in @tree, or add @node
> > > + * @node: node to look-for / insert
> > > + * @tree: tree to search / modify
> > > + * @cmp: operator defining the node order
> > > + *
> > > + * Adds a Store-Release for link_node.
> > > + *
> > > + * Returns the rb_node matching @node, or NULL when no match is found and @node
> > > + * is inserted.
> > > + */
> > > +static __always_inline struct rb_node *
> > > +rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
> > > +               int (*cmp)(struct rb_node *, const struct rb_node *))
> >
> > I don't get the point of the RCU version of rb_find_add as RCU itself
> > doesn't provide enough protection for modification of the tree, right?
> > So in uprobes code you do rb_find_add_rcu() under uprobes_treelock +
> > uprobes_seqcount locks. Wouldn't it be just as fine to do plain
> > non-RCU rb_find_add() in that case? After all, you do plain rb_erase
> > under the same set of locks.
> >
> > So what's the point of this one?
>
> The store-release when adding it to the tree. Without that it becomes
> possible to find the entry while the entry itself is incomplete.
>
> Eg. something like:
>
>  entry->foo = A;
>  rb_find_add(&entry->node, &my_tree, my_cmp);
>
> vs
>
>  rcu_read_lock();
>  entry = rb_find_rcu(...);
>  assert(entry->foo == A);
>
> might fail, because there is nothing ordering the foo store and the
> rb-node add.
>
>

Ah, I see, thanks for the explanation. That's what "Adds a
Store-Release for link_node." in the comment means.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list
  2024-07-15 11:25     ` Peter Zijlstra
@ 2024-07-15 17:30       ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-15 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

On Mon, Jul 15, 2024 at 4:25 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 12, 2024 at 02:06:08PM -0700, Andrii Nakryiko wrote:
> > + bpf@vger, please cc bpf ML for the next revision, these changes are
> > very relevant there as well, thanks
> >
> > On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > With handle_swbp() hitting concurrently on (all) CPUs the
> > > uprobe->register_rwsem can get very contended. Add an SRCU instance to
> > > cover the consumer list and consumer lifetime.
> > >
> > > Since the consumers are externally embedded structures, unregister will
> > > have to suffer a synchronize_srcu().
> > >
> > > A notable complication is the UPROBE_HANDLER_REMOVE logic which can
> > > race against uprobe_register() such that it might want to remove a
> > > freshly installed handler that didn't get called. In order to close
> > > this hole, a seqcount is added. With that, the removal path can tell
> > > if anything changed and bail out of the removal.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  kernel/events/uprobes.c |   60 ++++++++++++++++++++++++++++++++++++++++--------
> > >  1 file changed, 50 insertions(+), 10 deletions(-)
> > >
> >
> > [...]
> >
> > > @@ -800,7 +808,7 @@ static bool consumer_del(struct uprobe *
> > >         down_write(&uprobe->consumer_rwsem);
> > >         for (con = &uprobe->consumers; *con; con = &(*con)->next) {
> > >                 if (*con == uc) {
> > > -                       *con = uc->next;
> > > +                       WRITE_ONCE(*con, uc->next);
> >
> > I have a dumb and mechanical question.
> >
> > Above in consumer_add() you are doing WRITE_ONCE() for uc->next
> > assignment, but rcu_assign_pointer() for uprobe->consumers. Here, you
> > are doing WRITE_ONCE() for the same operation, if it so happens that
> > uc == *con == uprobe->consumers. So is rcu_assign_pointer() necessary
> > in consumer_add()? If yes, we should have it here as well, no? And if
> > not, why bother with it in consumer_add()?
>
> add is a publish and needs to ensure all stores to the object are
> ordered before the object is linked in. Note that rcu_assign_pointer()
> is basically a fancy way of writing smp_store_release().
>
> del otoh does not publish, it removes and doesn't need ordering.
>
> It does however need to ensure the pointer write itself isn't torn. That
> is, without the WRITE_ONCE() the compiler is free to do byte stores in
> order to update the 8 byte pointer value (assuming 64bit). This is
> pretty dumb, but very much permitted by C and also utterly fatal in the
> case where an RCU iteration comes by and reads a half-written pointer
> value.
>

Thanks for elaborating, very helpful! It's basically the same idea as
with rb_find_add_rcu(), got it.

> > >                         ret = true;
> > >                         break;
> > >                 }
> > > @@ -1139,9 +1147,13 @@ void uprobe_unregister(struct inode *ino
> > >                 return;
> > >
> > >         down_write(&uprobe->register_rwsem);
> > > +       raw_write_seqcount_begin(&uprobe->register_seq);
> > >         __uprobe_unregister(uprobe, uc);
> > > +       raw_write_seqcount_end(&uprobe->register_seq);
> > >         up_write(&uprobe->register_rwsem);
> > >         put_uprobe(uprobe);
> > > +
> > > +       synchronize_srcu(&uprobes_srcu);
> > >  }
> > >  EXPORT_SYMBOL_GPL(uprobe_unregister);
> >
> > [...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
  2024-07-15 11:41     ` Peter Zijlstra
@ 2024-07-15 17:34       ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-15 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, andrii, oleg, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

On Mon, Jul 15, 2024 at 4:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 12, 2024 at 02:43:52PM -0700, Andrii Nakryiko wrote:
> > + bpf
> >
> > On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > In order to put a bound on the uretprobe_srcu critical section, add a
> > > timer to uprobe_task. Upon every RI added or removed the timer is
> > > pushed forward to now + 1s. If the timer were ever to fire, it would
> > > convert the SRCU 'reference' to a refcount reference if possible.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  include/linux/uprobes.h |    8 +++++
> > >  kernel/events/uprobes.c |   67 ++++++++++++++++++++++++++++++++++++++++++++----
> > >  2 files changed, 69 insertions(+), 6 deletions(-)
> > >
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/rbtree.h>
> > >  #include <linux/types.h>
> > >  #include <linux/wait.h>
> > > +#include <linux/timer.h>
> > >
> > >  struct vm_area_struct;
> > >  struct mm_struct;
> > > @@ -79,6 +80,10 @@ struct uprobe_task {
> > >         struct return_instance          *return_instances;
> > >         unsigned int                    depth;
> > >         unsigned int                    active_srcu_idx;
> > > +
> > > +       struct timer_list               ri_timer;
> > > +       struct callback_head            ri_task_work;
> > > +       struct task_struct              *task;
> > >  };
> > >
> > >  struct return_instance {
> > > @@ -86,7 +91,8 @@ struct return_instance {
> > >         unsigned long           func;
> > >         unsigned long           stack;          /* stack pointer */
> > >         unsigned long           orig_ret_vaddr; /* original return address */
> > > -       bool                    chained;        /* true, if instance is nested */
> > > +       u8                      chained;        /* true, if instance is nested */
> > > +       u8                      has_ref;
> >
> > Why bool -> u8 switch? You don't touch chained, so why change its
> > type? And for has_ref you interchangeably use 0 and true for the same
> > field. Let's stick to bool as there is nothing wrong with it?
>
> sizeof(_Bool) is implementation defined. It is 1 for x86_64, but there
> are platforms where it ends up being 4 (some PowerPC ABIs, among others).
> I didn't want to grow this structure for no reason.

There are tons of bools in the kernel, surprised that we (kernel
makefiles) don't do anything on PowerPC to keep it consistent with the
rest of the world. Oh well, it just kind of looks off when there is a
mix of 0 and true used for the same field.

>
> > >         int                     srcu_idx;
> > >
> > >         struct return_instance  *next;          /* keep as stack */
> >
> > [...]
> >
> > > @@ -1822,13 +1864,20 @@ static int dup_utask(struct task_struct
> > >                         return -ENOMEM;
> > >
> > >                 *n = *o;
> > > -               __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
> > > +               if (n->uprobe) {
> > > +                       if (n->has_ref)
> > > +                               get_uprobe(n->uprobe);
> > > +                       else
> > > +                               __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
> > > +               }
> > >                 n->next = NULL;
> > >
> > >                 *p = n;
> > >                 p = &n->next;
> > >                 n_utask->depth++;
> > >         }
> > > +       if (n_utask->return_instances)
> > > +               mod_timer(&n_utask->ri_timer, jiffies + HZ);
> >
> > let's add a #define for this HZ timeout, so it's adjusted in just one
> > place (instead of the 3 places it appears right now)
>
> Can do, I suppose.

thanks!

>
> > Also, we can have up to 64 levels of uretprobe nesting, so,
> > technically, the user can cause a delay of 64 seconds in total. Maybe
> > let's use something smaller than a full second? After all, if the
> > user-space function has high latency, then this refcount congestion is
> > much less of a problem. I'd set it to something like 50-100 ms for
> > starters.
>
> Before you know it we'll have a sysctl :/ But sure, we can do something
> shorter.

:) let's hope we won't need sysctl (I don't think we will, FWIW)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
  2024-07-15 17:10     ` Andrii Nakryiko
@ 2024-07-15 18:10       ` Andrii Nakryiko
  2024-07-19 18:42         ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-15 18:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: oleg, mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck

On Mon, Jul 15, 2024 at 10:10 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 15, 2024 at 7:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Jul 11, 2024 at 09:57:44PM -0700, Andrii Nakryiko wrote:
> >
> > > But then I also ran it on Linux built from perf/uprobes branch (these
> > > patches), and after a few seconds I see that there is no more
> > > attachment/detachment happening. Eventually I got splats, which you
> > > can see in [1]. I used `sudo ./uprobe-stress -a10 -t5 -m5 -f3` command
> > > to run it inside my QEMU image.
> >
> > So them git voodoo incantations did work and I got it built. I'm running
> > that exact same line above (minus the sudo, because test box only has a
> > root account I think) on real hardware.
> >
> > I'm now ~100 periods in and wondering what 'eventually' means...
>
> So I was running in a qemu set up with 16 cores on top of bare metal's
> 80 core CPU (Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz). I just tried
> it again, and I can reproduce it within first few periods:
>
> WORKING HARD!..
>
> PERIOD #1 STATS:
> FUNC CALLS               919632
> UPROBE HITS              706351
> URETPROBE HITS           641679
> ATTACHED LINKS              951
> ATTACHED UPROBES           2421
> ATTACHED URETPROBES        2343
> MMAP CALLS                33533
> FORKS CALLS                 241
>
> PERIOD #2 STATS:
> FUNC CALLS                11444
> UPROBE HITS               14320
> URETPROBE HITS             9896
> ATTACHED LINKS               26
> ATTACHED UPROBES             75
> ATTACHED URETPROBES          61
> MMAP CALLS                39093
> FORKS CALLS                  14
>
> PERIOD #3 STATS:
> FUNC CALLS                  230
> UPROBE HITS                 152
> URETPROBE HITS              145
> ATTACHED LINKS                2
> ATTACHED UPROBES              2
> ATTACHED URETPROBES           2
> MMAP CALLS                39121
> FORKS CALLS                   0
>
> PERIOD #4 STATS:
> FUNC CALLS                    0
> UPROBE HITS                   0
> URETPROBE HITS                0
> ATTACHED LINKS                0
> ATTACHED UPROBES              0
> ATTACHED URETPROBES           0
> MMAP CALLS                39010
> FORKS CALLS                   0
>
> You can see in the second period all the numbers drop and by period #4
> (which is about 20 seconds in) anything but mmap()ing stops. When I
> said "eventually" I meant about a minute tops, however long it takes
> to do soft lockup detection, 23 seconds this time.
>
> So it should be very fast.
>
> Note that I'm running with debug kernel configuration (see [0] for
> full kernel config), here are debug-related settings, in case that
> makes a difference:
>
> $ cat ~/linux-build/default/.config | rg -i debug | rg -v '^#'
> CONFIG_X86_DEBUGCTLMSR=y
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_BLK_DEBUG_FS=y
> CONFIG_PNP_DEBUG_MESSAGES=y
> CONFIG_AIC7XXX_DEBUG_MASK=0
> CONFIG_AIC79XX_DEBUG_MASK=0
> CONFIG_SCSI_MVSAS_DEBUG=y
> CONFIG_DM_DEBUG=y
> CONFIG_MLX4_DEBUG=y
> CONFIG_USB_SERIAL_DEBUG=m
> CONFIG_INFINIBAND_MTHCA_DEBUG=y
> CONFIG_INFINIBAND_IPOIB_DEBUG=y
> CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y
> CONFIG_CIFS_DEBUG=y
> CONFIG_DLM_DEBUG=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_INFO_DWARF4=y
> CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
> CONFIG_DEBUG_INFO_BTF=y
> CONFIG_DEBUG_INFO_BTF_MODULES=y
> CONFIG_DEBUG_FS=y
> CONFIG_DEBUG_FS_ALLOW_ALL=y
> CONFIG_ARCH_HAS_DEBUG_WX=y
> CONFIG_HAVE_DEBUG_KMEMLEAK=y
> CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
> CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
> CONFIG_SCHED_DEBUG=y
> CONFIG_DEBUG_PREEMPT=y
> CONFIG_LOCK_DEBUGGING_SUPPORT=y
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> CONFIG_DEBUG_RWSEMS=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_DEBUG_LOCKDEP=y
> CONFIG_DEBUG_ATOMIC_SLEEP=y
> CONFIG_DEBUG_IRQFLAGS=y
> CONFIG_X86_DEBUG_FPU=y
> CONFIG_FAULT_INJECTION_DEBUG_FS=y
>
>   [0] https://gist.github.com/anakryiko/97a023a95b30fb0fe607ff743433e64b
>
> >
> > Also, this is a 2 socket, 10 core per socket, 2 threads per core
> > ivybridge thing, are those parameters sufficient?
>
> Should be, I guess? It might be VM vs bare metal differences, though.
> I'll try to run this on bare metal with more production-like kernel
> configuration to see if I can still trigger this. Will let you know
> the results when I get them.

Ok, so I ran it on a bare metal host with a production config. I didn't
really bother to specify parameters (so just one thread for
everything, the default):

# ./uprobe-stress
WORKING HARD!..

PERIOD #1 STATS:
FUNC CALLS              2959843
UPROBE HITS             1001312
URETPROBE HITS                0
ATTACHED LINKS                6
ATTACHED UPROBES             28
ATTACHED URETPROBES           0
MMAP CALLS                 8143
FORKS CALLS                 301

PERIOD #2 STATS:
FUNC CALLS                    0
UPROBE HITS              822826
URETPROBE HITS                0
ATTACHED LINKS                0
ATTACHED UPROBES              0
ATTACHED URETPROBES           0
MMAP CALLS                 8006
FORKS CALLS                 270

PERIOD #3 STATS:
FUNC CALLS                    0
UPROBE HITS              889534
URETPROBE HITS                0
ATTACHED LINKS                0
ATTACHED UPROBES              0
ATTACHED URETPROBES           0
MMAP CALLS                 8004
FORKS CALLS                 288

PERIOD #4 STATS:
FUNC CALLS                    0
UPROBE HITS              886506
URETPROBE HITS                0
ATTACHED LINKS                0
ATTACHED UPROBES              0
ATTACHED URETPROBES           0
MMAP CALLS                 8120
FORKS CALLS                 285

PERIOD #5 STATS:
FUNC CALLS                    0
UPROBE HITS              804556
URETPROBE HITS                0
ATTACHED LINKS                0
ATTACHED UPROBES              0
ATTACHED URETPROBES           0
MMAP CALLS                 7131
FORKS CALLS                 263
^C
EXITING...

Message from syslogd@kerneltest003.10.atn6.facebook.com at Jul 15 11:06:33 ...
 kernel:[ 2194.334618] watchdog: BUG: soft lockup - CPU#71 stuck for
48s! [uprobe-stress:69900]

It was weird from the very first period (no uretprobes, a small number
of attachments). And sure enough (gmail will reformat the splat below
into garbage, so [0] has it with the original formatting).

  [0] https://gist.github.com/anakryiko/3e3ddcccc5ea3ca70ce90b5491485fdc

I also keep getting:

Message from syslogd@kerneltest003.10.atn6.facebook.com at Jul 15 11:09:41 ...
 kernel:[ 2382.334088] watchdog: BUG: soft lockup - CPU#71 stuck for
223s! [uprobe-stress:69900]

so it's not just a temporary slowdown


[ 2166.893057] rcu: INFO: rcu_sched self-detected stall on CPU
[ 2166.904199] rcu:     71-....: (20999 ticks this GP)
idle=2c84/1/0x4000000000000000 softirq=30158/30158 fqs=8110
[ 2166.923810] rcu:              hardirqs   softirqs   csw/system
[ 2166.934939] rcu:      number:        0        183            0
[ 2166.946064] rcu:     cputime:       60          0        10438
==> 10549(ms)
[ 2166.959969] rcu:     (t=21065 jiffies g=369217 q=207850 ncpus=80)
[ 2166.971619] CPU: 71 PID: 69900 Comm: uprobe-stress Tainted: G S
     E      6.10.0-rc7-00071-g9423ae8ef6ff #62
[ 2166.992275] Hardware name: Quanta Tioga Pass Single Side
01-0032211004/Tioga Pass Single Side, BIOS F08_3A24 05/13/2020
[ 2167.013804] RIP: 0010:uprobe_notify_resume+0x622/0xe20
[ 2167.024064] Code: 8d 9d c0 00 00 00 48 89 df 4c 89 e6 e8 d7 f9 ff
ff 84 c0 0f 85 c6 06 00 00 48 89 5c 24 20 41 8b 6d 58 40 f6 c5 01 74
23 f3 90 <eb> f2 83 7c 24 18 00 48 8b 44 24 10 0f 8e 71 01 00 00 bf 05
00 00
[ 2167.061543] RSP: 0000:ffffc9004a49fe78 EFLAGS: 00000202
[ 2167.071973] RAX: 0000000000000000 RBX: ffff88a11d307fc0 RCX: ffff88a120752c40
[ 2167.086223] RDX: 00000000000042ec RSI: ffffc9004a49ff58 RDI: ffff88a11d307fc0
[ 2167.100472] RBP: 0000000000000003 R08: ffff88a12516e500 R09: ffff88a12516f208
[ 2167.114717] R10: 00000000004042ec R11: 000000000000000f R12: ffffc9004a49ff58
[ 2167.128967] R13: ffff88a11d307f00 R14: 00000000004042ec R15: ffff88a09042e000
[ 2167.143213] FS:  00007fd252000640(0000) GS:ffff88bfffbc0000(0000)
knlGS:0000000000000000
[ 2167.159368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2167.170843] CR2: 00007fd244000b60 CR3: 000000209090b001 CR4: 00000000007706f0
[ 2167.185091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2167.199340] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2167.213586] PKRU: 55555554
[ 2167.218994] Call Trace:
[ 2167.223883]  <IRQ>
[ 2167.227905]  ? rcu_dump_cpu_stacks+0x77/0xd0
[ 2167.236433]  ? print_cpu_stall+0x150/0x2a0
[ 2167.244615]  ? rcu_sched_clock_irq+0x319/0x490
[ 2167.253487]  ? update_process_times+0x71/0xa0
[ 2167.262191]  ? tick_nohz_handler+0xc0/0x100
[ 2167.270544]  ? tick_setup_sched_timer+0x170/0x170
[ 2167.279937]  ? __hrtimer_run_queues+0xe3/0x250
[ 2167.288815]  ? hrtimer_interrupt+0xf0/0x390
[ 2167.297168]  ? __sysvec_apic_timer_interrupt+0x47/0x110
[ 2167.307602]  ? sysvec_apic_timer_interrupt+0x68/0x80
[ 2167.317519]  </IRQ>
[ 2167.321710]  <TASK>
[ 2167.325905]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 2167.336517]  ? uprobe_notify_resume+0x622/0xe20
[ 2167.345565]  ? uprobe_notify_resume+0x609/0xe20
[ 2167.354612]  ? __se_sys_futex+0xf3/0x180
[ 2167.362445]  ? arch_uprobe_exception_notify+0x29/0x40
[ 2167.372533]  ? notify_die+0x51/0xb0
[ 2167.379503]  irqentry_exit_to_user_mode+0x7f/0xd0
[ 2167.388896]  asm_exc_int3+0x35/0x40
[ 2167.395862] RIP: 0033:0x4042ec
[ 2167.401966] Code: fc 8b 45 fc 89 c7 e8 6f 07 00 00 83 c0 01 c9 c3
cc 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 55 07 00 00 83 c0
01 c9 c3 <cc> 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 3b 07 00
00 83
[ 2167.439439] RSP: 002b:00007fd251fff8a8 EFLAGS: 00000206
[ 2167.449874] RAX: 00000000004042ec RBX: 00007fd252000640 RCX: 000000000000001c
[ 2167.464122] RDX: 0000000000000033 RSI: 0000000000000064 RDI: 0000000000000033
[ 2167.478368] RBP: 00007fd251fff8d0 R08: 00007fd2523fa234 R09: 00007fd2523fa280
[ 2167.492617] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd252000640
[ 2167.506866] R13: 0000000000000016 R14: 00007fd252289930 R15: 0000000000000000
[ 2167.521117]  </TASK>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
  2024-07-15 18:10       ` Andrii Nakryiko
@ 2024-07-19 18:42         ` Andrii Nakryiko
  2024-07-27  0:18           ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-19 18:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: oleg, mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

On Mon, Jul 15, 2024 at 11:10 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 15, 2024 at 10:10 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jul 15, 2024 at 7:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Jul 11, 2024 at 09:57:44PM -0700, Andrii Nakryiko wrote:
> > >
> > > > But then I also ran it on Linux built from perf/uprobes branch (these
> > > > patches), and after a few seconds I see that there is no more
> > > > attachment/detachment happening. Eventually I got splats, which you
> > > > can see in [1]. I used `sudo ./uprobe-stress -a10 -t5 -m5 -f3` command
> > > > to run it inside my QEMU image.
> > >
> > > So them git voodoo incantations did work and I got it built. I'm running
> > > that exact same line above (minus the sudo, because test box only has a
> > > root account I think) on real hardware.
> > >
> > > I'm now ~100 periods in and wondering what 'eventually' means...
> >
> > So I was running in a qemu set up with 16 cores on top of bare metal's
> > 80 core CPU (Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz). I just tried
> > it again, and I can reproduce it within first few periods:
> >
> > WORKING HARD!..
> >
> > PERIOD #1 STATS:
> > FUNC CALLS               919632
> > UPROBE HITS              706351
> > URETPROBE HITS           641679
> > ATTACHED LINKS              951
> > ATTACHED UPROBES           2421
> > ATTACHED URETPROBES        2343
> > MMAP CALLS                33533
> > FORKS CALLS                 241
> >
> > PERIOD #2 STATS:
> > FUNC CALLS                11444
> > UPROBE HITS               14320
> > URETPROBE HITS             9896
> > ATTACHED LINKS               26
> > ATTACHED UPROBES             75
> > ATTACHED URETPROBES          61
> > MMAP CALLS                39093
> > FORKS CALLS                  14
> >
> > PERIOD #3 STATS:
> > FUNC CALLS                  230
> > UPROBE HITS                 152
> > URETPROBE HITS              145
> > ATTACHED LINKS                2
> > ATTACHED UPROBES              2
> > ATTACHED URETPROBES           2
> > MMAP CALLS                39121
> > FORKS CALLS                   0
> >
> > PERIOD #4 STATS:
> > FUNC CALLS                    0
> > UPROBE HITS                   0
> > URETPROBE HITS                0
> > ATTACHED LINKS                0
> > ATTACHED UPROBES              0
> > ATTACHED URETPROBES           0
> > MMAP CALLS                39010
> > FORKS CALLS                   0
> >
> > You can see in the second period all the numbers drop and by period #4
> > (which is about 20 seconds in) anything but mmap()ing stops. When I
> > said "eventually" I meant about a minute tops, however long it takes
> > to do soft lockup detection, 23 seconds this time.
> >
> > So it should be very fast.
> >
> > Note that I'm running with debug kernel configuration (see [0] for
> > full kernel config), here are debug-related settings, in case that
> > makes a difference:
> >
> > $ cat ~/linux-build/default/.config | rg -i debug | rg -v '^#'
> > CONFIG_X86_DEBUGCTLMSR=y
> > CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> > CONFIG_BLK_DEBUG_FS=y
> > CONFIG_PNP_DEBUG_MESSAGES=y
> > CONFIG_AIC7XXX_DEBUG_MASK=0
> > CONFIG_AIC79XX_DEBUG_MASK=0
> > CONFIG_SCSI_MVSAS_DEBUG=y
> > CONFIG_DM_DEBUG=y
> > CONFIG_MLX4_DEBUG=y
> > CONFIG_USB_SERIAL_DEBUG=m
> > CONFIG_INFINIBAND_MTHCA_DEBUG=y
> > CONFIG_INFINIBAND_IPOIB_DEBUG=y
> > CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y
> > CONFIG_CIFS_DEBUG=y
> > CONFIG_DLM_DEBUG=y
> > CONFIG_DEBUG_BUGVERBOSE=y
> > CONFIG_DEBUG_KERNEL=y
> > CONFIG_DEBUG_INFO=y
> > CONFIG_DEBUG_INFO_DWARF4=y
> > CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
> > CONFIG_DEBUG_INFO_BTF=y
> > CONFIG_DEBUG_INFO_BTF_MODULES=y
> > CONFIG_DEBUG_FS=y
> > CONFIG_DEBUG_FS_ALLOW_ALL=y
> > CONFIG_ARCH_HAS_DEBUG_WX=y
> > CONFIG_HAVE_DEBUG_KMEMLEAK=y
> > CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
> > CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
> > CONFIG_SCHED_DEBUG=y
> > CONFIG_DEBUG_PREEMPT=y
> > CONFIG_LOCK_DEBUGGING_SUPPORT=y
> > CONFIG_DEBUG_RT_MUTEXES=y
> > CONFIG_DEBUG_SPINLOCK=y
> > CONFIG_DEBUG_MUTEXES=y
> > CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> > CONFIG_DEBUG_RWSEMS=y
> > CONFIG_DEBUG_LOCK_ALLOC=y
> > CONFIG_DEBUG_LOCKDEP=y
> > CONFIG_DEBUG_ATOMIC_SLEEP=y
> > CONFIG_DEBUG_IRQFLAGS=y
> > CONFIG_X86_DEBUG_FPU=y
> > CONFIG_FAULT_INJECTION_DEBUG_FS=y
> >
> >   [0] https://gist.github.com/anakryiko/97a023a95b30fb0fe607ff743433e64b
> >
> > >
> > > Also, this is a 2 socket, 10 core per socket, 2 threads per core
> > > ivybridge thing, are those parameters sufficient?
> >
> > Should be, I guess? It might be VM vs bare metal differences, though.
> > I'll try to run this on bare metal with more production-like kernel
> > configuration to see if I can still trigger this. Will let you know
> > the results when I get them.
>
> Ok, so I ran it on a bare metal host with a production config. I didn't
> really bother to specify parameters (so just one thread for
> everything, the default):
>
> # ./uprobe-stress
> WORKING HARD!..
>
> PERIOD #1 STATS:
> FUNC CALLS              2959843
> UPROBE HITS             1001312
> URETPROBE HITS                0
> ATTACHED LINKS                6
> ATTACHED UPROBES             28
> ATTACHED URETPROBES           0
> MMAP CALLS                 8143
> FORKS CALLS                 301
>
> PERIOD #2 STATS:
> FUNC CALLS                    0
> UPROBE HITS              822826
> URETPROBE HITS                0
> ATTACHED LINKS                0
> ATTACHED UPROBES              0
> ATTACHED URETPROBES           0
> MMAP CALLS                 8006
> FORKS CALLS                 270
>
> PERIOD #3 STATS:
> FUNC CALLS                    0
> UPROBE HITS              889534
> URETPROBE HITS                0
> ATTACHED LINKS                0
> ATTACHED UPROBES              0
> ATTACHED URETPROBES           0
> MMAP CALLS                 8004
> FORKS CALLS                 288
>
> PERIOD #4 STATS:
> FUNC CALLS                    0
> UPROBE HITS              886506
> URETPROBE HITS                0
> ATTACHED LINKS                0
> ATTACHED UPROBES              0
> ATTACHED URETPROBES           0
> MMAP CALLS                 8120
> FORKS CALLS                 285
>
> PERIOD #5 STATS:
> FUNC CALLS                    0
> UPROBE HITS              804556
> URETPROBE HITS                0
> ATTACHED LINKS                0
> ATTACHED UPROBES              0
> ATTACHED URETPROBES           0
> MMAP CALLS                 7131
> FORKS CALLS                 263
> ^C
> EXITING...
>
> Message from syslogd@kerneltest003.10.atn6.facebook.com at Jul 15 11:06:33 ...
>  kernel:[ 2194.334618] watchdog: BUG: soft lockup - CPU#71 stuck for
> 48s! [uprobe-stress:69900]
>
> It was weird from the very first period (no uretprobes, a small number
> of attachments). And sure enough (gmail will reformat the splat below
> into garbage, so [0] has it with the original formatting).
>
>   [0] https://gist.github.com/anakryiko/3e3ddcccc5ea3ca70ce90b5491485fdc
>
> I also keep getting:
>
> Message from syslogd@kerneltest003.10.atn6.facebook.com at Jul 15 11:09:41 ...
>  kernel:[ 2382.334088] watchdog: BUG: soft lockup - CPU#71 stuck for
> 223s! [uprobe-stress:69900]
>
> so it's not just a temporary slowdown
>
>
> [ 2166.893057] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 2166.904199] rcu:     71-....: (20999 ticks this GP)
> idle=2c84/1/0x4000000000000000 softirq=30158/30158 fqs=8110
> [ 2166.923810] rcu:              hardirqs   softirqs   csw/system
> [ 2166.934939] rcu:      number:        0        183            0
> [ 2166.946064] rcu:     cputime:       60          0        10438
> ==> 10549(ms)
> [ 2166.959969] rcu:     (t=21065 jiffies g=369217 q=207850 ncpus=80)
> [ 2166.971619] CPU: 71 PID: 69900 Comm: uprobe-stress Tainted: G S
>      E      6.10.0-rc7-00071-g9423ae8ef6ff #62
> [ 2166.992275] Hardware name: Quanta Tioga Pass Single Side
> 01-0032211004/Tioga Pass Single Side, BIOS F08_3A24 05/13/2020
> [ 2167.013804] RIP: 0010:uprobe_notify_resume+0x622/0xe20
> [ 2167.024064] Code: 8d 9d c0 00 00 00 48 89 df 4c 89 e6 e8 d7 f9 ff
> ff 84 c0 0f 85 c6 06 00 00 48 89 5c 24 20 41 8b 6d 58 40 f6 c5 01 74
> 23 f3 90 <eb> f2 83 7c 24 18 00 48 8b 44 24 10 0f 8e 71 01 00 00 bf 05
> 00 00
> [ 2167.061543] RSP: 0000:ffffc9004a49fe78 EFLAGS: 00000202
> [ 2167.071973] RAX: 0000000000000000 RBX: ffff88a11d307fc0 RCX: ffff88a120752c40
> [ 2167.086223] RDX: 00000000000042ec RSI: ffffc9004a49ff58 RDI: ffff88a11d307fc0
> [ 2167.100472] RBP: 0000000000000003 R08: ffff88a12516e500 R09: ffff88a12516f208
> [ 2167.114717] R10: 00000000004042ec R11: 000000000000000f R12: ffffc9004a49ff58
> [ 2167.128967] R13: ffff88a11d307f00 R14: 00000000004042ec R15: ffff88a09042e000
> [ 2167.143213] FS:  00007fd252000640(0000) GS:ffff88bfffbc0000(0000)
> knlGS:0000000000000000
> [ 2167.159368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2167.170843] CR2: 00007fd244000b60 CR3: 000000209090b001 CR4: 00000000007706f0
> [ 2167.185091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2167.199340] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2167.213586] PKRU: 55555554
> [ 2167.218994] Call Trace:
> [ 2167.223883]  <IRQ>
> [ 2167.227905]  ? rcu_dump_cpu_stacks+0x77/0xd0
> [ 2167.236433]  ? print_cpu_stall+0x150/0x2a0
> [ 2167.244615]  ? rcu_sched_clock_irq+0x319/0x490
> [ 2167.253487]  ? update_process_times+0x71/0xa0
> [ 2167.262191]  ? tick_nohz_handler+0xc0/0x100
> [ 2167.270544]  ? tick_setup_sched_timer+0x170/0x170
> [ 2167.279937]  ? __hrtimer_run_queues+0xe3/0x250
> [ 2167.288815]  ? hrtimer_interrupt+0xf0/0x390
> [ 2167.297168]  ? __sysvec_apic_timer_interrupt+0x47/0x110
> [ 2167.307602]  ? sysvec_apic_timer_interrupt+0x68/0x80
> [ 2167.317519]  </IRQ>
> [ 2167.321710]  <TASK>
> [ 2167.325905]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [ 2167.336517]  ? uprobe_notify_resume+0x622/0xe20
> [ 2167.345565]  ? uprobe_notify_resume+0x609/0xe20
> [ 2167.354612]  ? __se_sys_futex+0xf3/0x180
> [ 2167.362445]  ? arch_uprobe_exception_notify+0x29/0x40
> [ 2167.372533]  ? notify_die+0x51/0xb0
> [ 2167.379503]  irqentry_exit_to_user_mode+0x7f/0xd0
> [ 2167.388896]  asm_exc_int3+0x35/0x40
> [ 2167.395862] RIP: 0033:0x4042ec
> [ 2167.401966] Code: fc 8b 45 fc 89 c7 e8 6f 07 00 00 83 c0 01 c9 c3
> cc 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 55 07 00 00 83 c0
> 01 c9 c3 <cc> 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 3b 07 00
> 00 83
> [ 2167.439439] RSP: 002b:00007fd251fff8a8 EFLAGS: 00000206
> [ 2167.449874] RAX: 00000000004042ec RBX: 00007fd252000640 RCX: 000000000000001c
> [ 2167.464122] RDX: 0000000000000033 RSI: 0000000000000064 RDI: 0000000000000033
> [ 2167.478368] RBP: 00007fd251fff8d0 R08: 00007fd2523fa234 R09: 00007fd2523fa280
> [ 2167.492617] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd252000640
> [ 2167.506866] R13: 0000000000000016 R14: 00007fd252289930 R15: 0000000000000000
> [ 2167.521117]  </TASK>

Peter,

Did you manage to reproduce this?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
  2024-07-19 18:42         ` Andrii Nakryiko
@ 2024-07-27  0:18           ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2024-07-27  0:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: oleg, mingo, andrii, linux-kernel, linux-trace-kernel, rostedt,
	mhiramat, jolsa, clm, paulmck, bpf

On Fri, Jul 19, 2024 at 11:42 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 15, 2024 at 11:10 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jul 15, 2024 at 10:10 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Mon, Jul 15, 2024 at 7:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Thu, Jul 11, 2024 at 09:57:44PM -0700, Andrii Nakryiko wrote:
> > > >
> > > > > But then I also ran it on Linux built from perf/uprobes branch (these
> > > > > patches), and after a few seconds I see that there is no more
> > > > > attachment/detachment happening. Eventually I got splats, which you
> > > > > can see in [1]. I used `sudo ./uprobe-stress -a10 -t5 -m5 -f3` command
> > > > > to run it inside my QEMU image.
> > > >
> > > > So them git voodoo incantations did work and I got it built. I'm running
> > > > that exact same line above (minus the sudo, because test box only has a
> > > > root account I think) on real hardware.
> > > >
> > > > I'm now ~100 periods in and wondering what 'eventually' means...
> > >
> > > So I was running in a qemu set up with 16 cores on top of bare metal's
> > > 80 core CPU (Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz). I just tried
> > > it again, and I can reproduce it within first few periods:
> > >
> > > WORKING HARD!..
> > >
> > > PERIOD #1 STATS:
> > > FUNC CALLS               919632
> > > UPROBE HITS              706351
> > > URETPROBE HITS           641679
> > > ATTACHED LINKS              951
> > > ATTACHED UPROBES           2421
> > > ATTACHED URETPROBES        2343
> > > MMAP CALLS                33533
> > > FORKS CALLS                 241
> > >
> > > PERIOD #2 STATS:
> > > FUNC CALLS                11444
> > > UPROBE HITS               14320
> > > URETPROBE HITS             9896
> > > ATTACHED LINKS               26
> > > ATTACHED UPROBES             75
> > > ATTACHED URETPROBES          61
> > > MMAP CALLS                39093
> > > FORKS CALLS                  14
> > >
> > > PERIOD #3 STATS:
> > > FUNC CALLS                  230
> > > UPROBE HITS                 152
> > > URETPROBE HITS              145
> > > ATTACHED LINKS                2
> > > ATTACHED UPROBES              2
> > > ATTACHED URETPROBES           2
> > > MMAP CALLS                39121
> > > FORKS CALLS                   0
> > >
> > > PERIOD #4 STATS:
> > > FUNC CALLS                    0
> > > UPROBE HITS                   0
> > > URETPROBE HITS                0
> > > ATTACHED LINKS                0
> > > ATTACHED UPROBES              0
> > > ATTACHED URETPROBES           0
> > > MMAP CALLS                39010
> > > FORKS CALLS                   0
> > >
> > > You can see in the second period all the numbers drop and by period #4
> > > (which is about 20 seconds in) anything but mmap()ing stops. When I
> > > said "eventually" I meant about a minute tops, however long it takes
> > > to do soft lockup detection, 23 seconds this time.
> > >
> > > So it should be very fast.
> > >
> > > Note that I'm running with debug kernel configuration (see [0] for
> > > full kernel config), here are debug-related settings, in case that
> > > makes a difference:
> > >
> > > $ cat ~/linux-build/default/.config | rg -i debug | rg -v '^#'
> > > CONFIG_X86_DEBUGCTLMSR=y
> > > CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> > > CONFIG_BLK_DEBUG_FS=y
> > > CONFIG_PNP_DEBUG_MESSAGES=y
> > > CONFIG_AIC7XXX_DEBUG_MASK=0
> > > CONFIG_AIC79XX_DEBUG_MASK=0
> > > CONFIG_SCSI_MVSAS_DEBUG=y
> > > CONFIG_DM_DEBUG=y
> > > CONFIG_MLX4_DEBUG=y
> > > CONFIG_USB_SERIAL_DEBUG=m
> > > CONFIG_INFINIBAND_MTHCA_DEBUG=y
> > > CONFIG_INFINIBAND_IPOIB_DEBUG=y
> > > CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y
> > > CONFIG_CIFS_DEBUG=y
> > > CONFIG_DLM_DEBUG=y
> > > CONFIG_DEBUG_BUGVERBOSE=y
> > > CONFIG_DEBUG_KERNEL=y
> > > CONFIG_DEBUG_INFO=y
> > > CONFIG_DEBUG_INFO_DWARF4=y
> > > CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
> > > CONFIG_DEBUG_INFO_BTF=y
> > > CONFIG_DEBUG_INFO_BTF_MODULES=y
> > > CONFIG_DEBUG_FS=y
> > > CONFIG_DEBUG_FS_ALLOW_ALL=y
> > > CONFIG_ARCH_HAS_DEBUG_WX=y
> > > CONFIG_HAVE_DEBUG_KMEMLEAK=y
> > > CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
> > > CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
> > > CONFIG_SCHED_DEBUG=y
> > > CONFIG_DEBUG_PREEMPT=y
> > > CONFIG_LOCK_DEBUGGING_SUPPORT=y
> > > CONFIG_DEBUG_RT_MUTEXES=y
> > > CONFIG_DEBUG_SPINLOCK=y
> > > CONFIG_DEBUG_MUTEXES=y
> > > CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> > > CONFIG_DEBUG_RWSEMS=y
> > > CONFIG_DEBUG_LOCK_ALLOC=y
> > > CONFIG_DEBUG_LOCKDEP=y
> > > CONFIG_DEBUG_ATOMIC_SLEEP=y
> > > CONFIG_DEBUG_IRQFLAGS=y
> > > CONFIG_X86_DEBUG_FPU=y
> > > CONFIG_FAULT_INJECTION_DEBUG_FS=y
> > >
> > >   [0] https://gist.github.com/anakryiko/97a023a95b30fb0fe607ff743433e64b
> > >
> > > >
> > > > Also, this is a 2 socket, 10 core per socket, 2 threads per core
> > > > ivybridge thing, are those parameters sufficient?
> > >
> > > Should be, I guess? It might be VM vs bare metal differences, though.
> > > I'll try to run this on bare metal with more production-like kernel
> > > configuration to see if I can still trigger this. Will let you know
> > > the results when I get them.
> >
> > Ok, so I ran it on a bare metal host with a production config. I didn't
> > really bother to specify parameters (so just one thread for
> > everything, the default):
> >
> > # ./uprobe-stress
> > WORKING HARD!..
> >
> > PERIOD #1 STATS:
> > FUNC CALLS              2959843
> > UPROBE HITS             1001312
> > URETPROBE HITS                0
> > ATTACHED LINKS                6
> > ATTACHED UPROBES             28
> > ATTACHED URETPROBES           0
> > MMAP CALLS                 8143
> > FORKS CALLS                 301
> >
> > PERIOD #2 STATS:
> > FUNC CALLS                    0
> > UPROBE HITS              822826
> > URETPROBE HITS                0
> > ATTACHED LINKS                0
> > ATTACHED UPROBES              0
> > ATTACHED URETPROBES           0
> > MMAP CALLS                 8006
> > FORKS CALLS                 270
> >
> > PERIOD #3 STATS:
> > FUNC CALLS                    0
> > UPROBE HITS              889534
> > URETPROBE HITS                0
> > ATTACHED LINKS                0
> > ATTACHED UPROBES              0
> > ATTACHED URETPROBES           0
> > MMAP CALLS                 8004
> > FORKS CALLS                 288
> >
> > PERIOD #4 STATS:
> > FUNC CALLS                    0
> > UPROBE HITS              886506
> > URETPROBE HITS                0
> > ATTACHED LINKS                0
> > ATTACHED UPROBES              0
> > ATTACHED URETPROBES           0
> > MMAP CALLS                 8120
> > FORKS CALLS                 285
> >
> > PERIOD #5 STATS:
> > FUNC CALLS                    0
> > UPROBE HITS              804556
> > URETPROBE HITS                0
> > ATTACHED LINKS                0
> > ATTACHED UPROBES              0
> > ATTACHED URETPROBES           0
> > MMAP CALLS                 7131
> > FORKS CALLS                 263
> > ^C
> > EXITING...
> >
> > Message from syslogd@kerneltest003.10.atn6.facebook.com at Jul 15 11:06:33 ...
> >  kernel:[ 2194.334618] watchdog: BUG: soft lockup - CPU#71 stuck for 48s! [uprobe-stress:69900]
> >
> > It was already behaving oddly in the very first period (no uretprobe
> > hits, a small number of attachments). And sure enough, the splat
> > followed (gmail will mangle the formatting below into garbage, so [0]
> > has it with the original formatting).
> >
> >   [0] https://gist.github.com/anakryiko/3e3ddcccc5ea3ca70ce90b5491485fdc
> >
> > I also keep getting:
> >
> > Message from syslogd@kerneltest003.10.atn6.facebook.com at Jul 15 11:09:41 ...
> >  kernel:[ 2382.334088] watchdog: BUG: soft lockup - CPU#71 stuck for 223s! [uprobe-stress:69900]
> >
> > so it's not just a temporary slowdown.
> >
> >
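> > The splat below is RCU's own stall detector rather than the soft
> > lockup watchdog; it fires after rcu_cpu_stall_timeout, 21 seconds by
> > default, which lines up with the t=21065 jiffies in the report
> > (assuming HZ=1000). On a default build the timeout can be read back
> > via:
> >
> >   $ cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> >   21
> >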
> > [ 2166.893057] rcu: INFO: rcu_sched self-detected stall on CPU
> > [ 2166.904199] rcu:     71-....: (20999 ticks this GP) idle=2c84/1/0x4000000000000000 softirq=30158/30158 fqs=8110
> > [ 2166.923810] rcu:              hardirqs   softirqs   csw/system
> > [ 2166.934939] rcu:      number:        0        183            0
> > [ 2166.946064] rcu:     cputime:       60          0        10438   ==> 10549(ms)
> > [ 2166.959969] rcu:     (t=21065 jiffies g=369217 q=207850 ncpus=80)
> > [ 2166.971619] CPU: 71 PID: 69900 Comm: uprobe-stress Tainted: G S      E      6.10.0-rc7-00071-g9423ae8ef6ff #62
> > [ 2166.992275] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A24 05/13/2020
> > [ 2167.013804] RIP: 0010:uprobe_notify_resume+0x622/0xe20
> > [ 2167.024064] Code: 8d 9d c0 00 00 00 48 89 df 4c 89 e6 e8 d7 f9 ff ff 84 c0 0f 85 c6 06 00 00 48 89 5c 24 20 41 8b 6d 58 40 f6 c5 01 74 23 f3 90 <eb> f2 83 7c 24 18 00 48 8b 44 24 10 0f 8e 71 01 00 00 bf 05 00 00
> > [ 2167.061543] RSP: 0000:ffffc9004a49fe78 EFLAGS: 00000202
> > [ 2167.071973] RAX: 0000000000000000 RBX: ffff88a11d307fc0 RCX: ffff88a120752c40
> > [ 2167.086223] RDX: 00000000000042ec RSI: ffffc9004a49ff58 RDI: ffff88a11d307fc0
> > [ 2167.100472] RBP: 0000000000000003 R08: ffff88a12516e500 R09: ffff88a12516f208
> > [ 2167.114717] R10: 00000000004042ec R11: 000000000000000f R12: ffffc9004a49ff58
> > [ 2167.128967] R13: ffff88a11d307f00 R14: 00000000004042ec R15: ffff88a09042e000
> > [ 2167.143213] FS:  00007fd252000640(0000) GS:ffff88bfffbc0000(0000) knlGS:0000000000000000
> > [ 2167.159368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 2167.170843] CR2: 00007fd244000b60 CR3: 000000209090b001 CR4: 00000000007706f0
> > [ 2167.185091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 2167.199340] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 2167.213586] PKRU: 55555554
> > [ 2167.218994] Call Trace:
> > [ 2167.223883]  <IRQ>
> > [ 2167.227905]  ? rcu_dump_cpu_stacks+0x77/0xd0
> > [ 2167.236433]  ? print_cpu_stall+0x150/0x2a0
> > [ 2167.244615]  ? rcu_sched_clock_irq+0x319/0x490
> > [ 2167.253487]  ? update_process_times+0x71/0xa0
> > [ 2167.262191]  ? tick_nohz_handler+0xc0/0x100
> > [ 2167.270544]  ? tick_setup_sched_timer+0x170/0x170
> > [ 2167.279937]  ? __hrtimer_run_queues+0xe3/0x250
> > [ 2167.288815]  ? hrtimer_interrupt+0xf0/0x390
> > [ 2167.297168]  ? __sysvec_apic_timer_interrupt+0x47/0x110
> > [ 2167.307602]  ? sysvec_apic_timer_interrupt+0x68/0x80
> > [ 2167.317519]  </IRQ>
> > [ 2167.321710]  <TASK>
> > [ 2167.325905]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [ 2167.336517]  ? uprobe_notify_resume+0x622/0xe20
> > [ 2167.345565]  ? uprobe_notify_resume+0x609/0xe20
> > [ 2167.354612]  ? __se_sys_futex+0xf3/0x180
> > [ 2167.362445]  ? arch_uprobe_exception_notify+0x29/0x40
> > [ 2167.372533]  ? notify_die+0x51/0xb0
> > [ 2167.379503]  irqentry_exit_to_user_mode+0x7f/0xd0
> > [ 2167.388896]  asm_exc_int3+0x35/0x40
> > [ 2167.395862] RIP: 0033:0x4042ec
> > [ 2167.401966] Code: fc 8b 45 fc 89 c7 e8 6f 07 00 00 83 c0 01 c9 c3 cc 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 55 07 00 00 83 c0 01 c9 c3 <cc> 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 3b 07 00 00 83
> > [ 2167.439439] RSP: 002b:00007fd251fff8a8 EFLAGS: 00000206
> > [ 2167.449874] RAX: 00000000004042ec RBX: 00007fd252000640 RCX: 000000000000001c
> > [ 2167.464122] RDX: 0000000000000033 RSI: 0000000000000064 RDI: 0000000000000033
> > [ 2167.478368] RBP: 00007fd251fff8d0 R08: 00007fd2523fa234 R09: 00007fd2523fa280
> > [ 2167.492617] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd252000640
> > [ 2167.506866] R13: 0000000000000016 R14: 00007fd252289930 R15: 0000000000000000
> > [ 2167.521117]  </TASK>
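> >
> > FWIW, my (unverified) read of the Code: bytes around the faulting
> > kernel RIP is a flag-polling spin loop, which would explain a CPU
> > staying pinned for minutes:
> >
> >   41 8b 6d 58     mov    0x58(%r13),%ebp   # reload a flags word each pass
> >   40 f6 c5 01     test   $0x1,%bpl         # test bit 0
> >   74 23           je     +0x23             # bit clear -> leave the loop
> >   f3 90           pause
> >   eb f2           jmp    -0xe              # faulting insn: back to the mov
> >
> > i.e. uprobe_notify_resume() appears to be busy-waiting on bit 0 of a
> > flags field (at offset 0x58 off R13) that never changes. The
> > user-space RIP sitting on an int3 (the <cc> at 0x4042ec) is just the
> > interrupted uprobe hit itself.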
>
> Peter,
>
> Did you manage to reproduce this?

Ping.


end of thread

Thread overview: 45+ messages
2024-07-11 11:02 [PATCH v2 00/11] perf/uprobe: Optimize uprobes Peter Zijlstra
2024-07-11 11:02 ` [PATCH v2 01/11] perf/uprobe: Re-indent labels Peter Zijlstra
2024-07-11 11:58   ` Jiri Olsa
2024-07-11 12:07     ` Peter Zijlstra
2024-07-11 11:02 ` [PATCH v2 02/11] perf/uprobe: Remove spurious whitespace Peter Zijlstra
2024-07-11 11:02 ` [PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu() Peter Zijlstra
2024-07-12 20:23   ` Andrii Nakryiko
2024-07-15 11:21     ` Peter Zijlstra
2024-07-15 17:13       ` Andrii Nakryiko
2024-07-11 11:02 ` [PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe() Peter Zijlstra
2024-07-11 13:59   ` Masami Hiramatsu
2024-07-11 11:02 ` [PATCH v2 05/11] perf/uprobe: Simplify UPROBE_HANDLER_REMOVE logic Peter Zijlstra
2024-07-11 11:02 ` [PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list Peter Zijlstra
2024-07-12 21:06   ` Andrii Nakryiko
2024-07-15 11:25     ` Peter Zijlstra
2024-07-15 17:30       ` Andrii Nakryiko
2024-07-11 11:02 ` [PATCH v2 07/11] perf/uprobe: Split uprobe_unregister() Peter Zijlstra
2024-07-12 21:10   ` Andrii Nakryiko
2024-07-11 11:02 ` [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU Peter Zijlstra
2024-07-11 14:03   ` Jiri Olsa
2024-07-12 21:21   ` Andrii Nakryiko
2024-07-11 11:02 ` [PATCH v2 09/11] srcu: Add __srcu_clone_read_lock() Peter Zijlstra
2024-07-11 11:02 ` [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU Peter Zijlstra
2024-07-11 16:06   ` Oleg Nesterov
2024-07-11 18:42     ` Peter Zijlstra
2024-07-12 10:26       ` Oleg Nesterov
2024-07-12 21:28   ` Andrii Nakryiko
2024-07-15 11:59     ` Peter Zijlstra
2024-07-11 11:02 ` [PATCH v2 11/11] perf/uprobe: Add uretprobe timer Peter Zijlstra
2024-07-11 13:19   ` Oleg Nesterov
2024-07-11 15:00     ` Peter Zijlstra
2024-07-11 15:55       ` Peter Zijlstra
2024-07-11 16:06         ` Peter Zijlstra
2024-07-12 21:43   ` Andrii Nakryiko
2024-07-15 11:41     ` Peter Zijlstra
2024-07-15 17:34       ` Andrii Nakryiko
2024-07-12  4:57 ` [PATCH v2 00/11] perf/uprobe: Optimize uprobes Andrii Nakryiko
2024-07-12  9:13   ` Peter Zijlstra
2024-07-12 13:10   ` Peter Zijlstra
2024-07-12 15:29     ` Andrii Nakryiko
2024-07-15 14:45   ` Peter Zijlstra
2024-07-15 17:10     ` Andrii Nakryiko
2024-07-15 18:10       ` Andrii Nakryiko
2024-07-19 18:42         ` Andrii Nakryiko
2024-07-27  0:18           ` Andrii Nakryiko
