[PATCH 0/3] SRCU-protected uretprobes hot path

linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] SRCU-protected uretprobes hot path
@ 2024-09-09 22:49 Andrii Nakryiko
  2024-09-09 22:49 ` [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context Andrii Nakryiko
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2024-09-09 22:49 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck,
	Andrii Nakryiko

Recently landed changes make uprobe entry hot code path makes use of RCU Tasks
Trace to avoid touching uprobe refcount, which at high frequency of uprobe
triggering leads to excessive cache line bouncing and limited scalability with
increased number of CPUs that simultaneously execute uprobe handlers.

This patch set adds return uprobe (uretprobe) side of this, this time
utilizing SRCU for the same reasons. Given the time between entry uprobe
activation (at which point uretprobe code hijacks user-space stack to get
activated on user function return) and uretprobe activation can be arbitrarily
long and is completely under control of user code, we need to protect
ourselves from too long or unbounded SRCU grace periods.

To that end we keep SRCU protection only for a limited time, and if user space
code takes longer to return, pending uretprobe instances are "downgraded" to
refcounted ones. This gives us best scalability and performance for
high-frequency uretprobes, and keeps upper bound on SRCU grace period duration
for low frequency uretprobes.

There are a bunch of synchronization issues between timer callback running in
IRQ handler and current thread executing uretprobe handlers, which is
abstracted away behind "hybrid lifetime uprobe" (hprobe) wrapper around uprobe
instance itself. See patch #1 for all the details.

rfc->v1:
  - made put_uprobe() work in any context, not just user context (Oleg);
  - changed to unconditional mod_timer() usage to avoid races (Oleg).
  - I kept single-stepped uprobe changes, as they have a simple use of all the
    hprobe functionality developed in patch #1.

Andrii Nakryiko (3):
  uprobes: allow put_uprobe() from non-sleepable softirq context
  uprobes: SRCU-protect uretprobe lifetime (with timeout)
  uprobes: implement SRCU-protected lifetime for single-stepped uprobe

 include/linux/uprobes.h |  53 +++++-
 kernel/events/uprobes.c | 370 +++++++++++++++++++++++++++++++++-------
 2 files changed, 353 insertions(+), 70 deletions(-)

-- 
2.43.5

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context
  2024-09-09 22:49 [PATCH 0/3] SRCU-protected uretprobes hot path Andrii Nakryiko
@ 2024-09-09 22:49 ` Andrii Nakryiko
  2024-09-10  2:51   ` Alexei Starovoitov
  2024-09-15 14:49   ` Oleg Nesterov
  2024-09-09 22:49 ` [PATCH 2/3] uprobes: SRCU-protect uretprobe lifetime (with timeout) Andrii Nakryiko
  2024-09-09 22:49 ` [PATCH 3/3] uprobes: implement SRCU-protected lifetime for single-stepped uprobe Andrii Nakryiko
  2 siblings, 2 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2024-09-09 22:49 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck,
	Andrii Nakryiko

Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which
makes it unsuitable to be called from more restricted context like softirq.

Let's make put_uprobe() agnostic to the context in which it is called,
and use work queue to defer the mutex-protected clean up steps.

To avoid unnecessarily increasing the size of struct uprobe, we colocate
work_struct in parallel with rb_node and rcu, both of which are unused
by the time we get to schedule clean up work.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a2e6a57f79f2..377bd524bc8b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -27,6 +27,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/khugepaged.h>
 #include <linux/rcupdate_trace.h>
+#include <linux/workqueue.h>
 
 #include <linux/uprobes.h>
 
@@ -54,14 +55,20 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 #define UPROBE_COPY_INSN	0
 
 struct uprobe {
-	struct rb_node		rb_node;	/* node in the rb tree */
+	union {
+		struct {
+			struct rb_node		rb_node;	/* node in the rb tree */
+			struct rcu_head		rcu;
+		};
+		/* work is used only during freeing, rcu and rb_node are unused at that point */
+		struct work_struct work;
+	};
 	refcount_t		ref;
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
 	struct list_head	consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
-	struct rcu_head		rcu;
 	loff_t			offset;
 	loff_t			ref_ctr_offset;
 	unsigned long		flags;
@@ -620,11 +627,28 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
 	return !RB_EMPTY_NODE(&uprobe->rb_node);
 }
 
+static void uprobe_free_deferred(struct work_struct *work)
+{
+	struct uprobe *uprobe = container_of(work, struct uprobe, work);
+
+	/*
+	 * If application munmap(exec_vma) before uprobe_unregister()
+	 * gets called, we don't get a chance to remove uprobe from
+	 * delayed_uprobe_list from remove_breakpoint(). Do it here.
+	 */
+	mutex_lock(&delayed_uprobe_lock);
+	delayed_uprobe_remove(uprobe, NULL);
+	mutex_unlock(&delayed_uprobe_lock);
+
+	kfree(uprobe);
+}
+
 static void uprobe_free_rcu(struct rcu_head *rcu)
 {
 	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
 
-	kfree(uprobe);
+	INIT_WORK(&uprobe->work, uprobe_free_deferred);
+	schedule_work(&uprobe->work);
 }
 
 static void put_uprobe(struct uprobe *uprobe)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/3] uprobes: SRCU-protect uretprobe lifetime (with timeout)
  2024-09-09 22:49 [PATCH 0/3] SRCU-protected uretprobes hot path Andrii Nakryiko
  2024-09-09 22:49 ` [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context Andrii Nakryiko
@ 2024-09-09 22:49 ` Andrii Nakryiko
  2024-09-09 22:49 ` [PATCH 3/3] uprobes: implement SRCU-protected lifetime for single-stepped uprobe Andrii Nakryiko
  2 siblings, 0 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2024-09-09 22:49 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck,
	Andrii Nakryiko

Avoid taking refcount on uprobe in prepare_uretprobe(), instead take
uretprobe-specific SRCU lock and keep it active as kernel transfers
control back to user space.

Given we can't rely on user space returning from traced function within
reasonable time period, we need to make sure not to keep SRCU lock
active for too long, though. To that effect, we employ a timer callback
which is meant to terminate SRCU lock region after predefined timeout
(currently set to 100ms), and instead transfer underlying struct
uprobe's lifetime protection to refcounting.

This fallback to less scalable refcounting after 100ms is a fine
tradeoff from uretprobe's scalability and performance perspective,
because uretprobing *long running* user functions inherently doesn't run
into scalability issues (there is just not enough frequency to cause
noticeable issues with either performance or scalability).

The overall trick is in ensuring synchronization between current thread
and timer's callback fired on some other thread. To cope with that with
minimal logic complications, we add hprobe wrapper which is used to
contain all the racy and synchronization related issues behind a small
number of basic helpers: hprobe_expire() for "downgrading" uprobe from
SRCU-protected state to refcounted state, and a hprobe_consume()
+ hprobe_finalize() pair of single-use consuming helpers. Other than
that whatever current thread's logic is there stays the same, as timer
thread cannot modify return_instance state (or add new/remove old
return_instances). It only takes care of SRCU unlock and uprobe
refcounting, which is hidden from the higher-level uretprobe handling
logic.

We use atomic xchg() in hprobe_consume(), which is called from
performance critical handle_uretprobe_chain() function run in the
current context. When uncontended, this xchg() doesn't seem to hurt
performance as there are no other competing CPUs fighting for the same
cache line. We also mark struct return_instance as ____cacheline_aligned
to ensure no false sharing can happen.

Another technical moment, we need to make sure that the list of return
instances can be safely traversed under RCU from timer callback, so we
delay return_instance freeing with kfree_rcu() and make sure that list
modifications use RCU-aware operations.

Also, given SRCU lock survives transition from kernel to user space and
back we need to use lower-level __srcu_read_lock() and
__srcu_read_unlock() to avoid lockdep complaining.

Just to give an impression of a kind of performance improvements this
change brings, below are benchmarking results with and without these
SRCU changes, assuming other uprobe optimizations (mainly RCU Tasks
Trace for entry uprobes, lockless RB-tree lookup, and lockless VMA to
uprobe lookup) are left intact:

WITHOUT SRCU for uretprobes
===========================
uretprobe-nop         ( 1 cpus):    2.197 ± 0.002M/s  (  2.197M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.325 ± 0.001M/s  (  1.662M/s/cpu)
uretprobe-nop         ( 3 cpus):    4.129 ± 0.002M/s  (  1.376M/s/cpu)
uretprobe-nop         ( 4 cpus):    6.180 ± 0.003M/s  (  1.545M/s/cpu)
uretprobe-nop         ( 8 cpus):    7.323 ± 0.005M/s  (  0.915M/s/cpu)
uretprobe-nop         (16 cpus):    6.943 ± 0.005M/s  (  0.434M/s/cpu)
uretprobe-nop         (32 cpus):    5.931 ± 0.014M/s  (  0.185M/s/cpu)
uretprobe-nop         (64 cpus):    5.145 ± 0.003M/s  (  0.080M/s/cpu)
uretprobe-nop         (80 cpus):    4.925 ± 0.005M/s  (  0.062M/s/cpu)

WITH SRCU for uretprobes
========================
uretprobe-nop         ( 1 cpus):    1.968 ± 0.001M/s  (  1.968M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.739 ± 0.003M/s  (  1.869M/s/cpu)
uretprobe-nop         ( 3 cpus):    5.616 ± 0.003M/s  (  1.872M/s/cpu)
uretprobe-nop         ( 4 cpus):    7.286 ± 0.002M/s  (  1.822M/s/cpu)
uretprobe-nop         ( 8 cpus):   13.657 ± 0.007M/s  (  1.707M/s/cpu)
uretprobe-nop         (32 cpus):   45.305 ± 0.066M/s  (  1.416M/s/cpu)
uretprobe-nop         (64 cpus):   42.390 ± 0.922M/s  (  0.662M/s/cpu)
uretprobe-nop         (80 cpus):   47.554 ± 2.411M/s  (  0.594M/s/cpu)

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h |  49 ++++++-
 kernel/events/uprobes.c | 285 ++++++++++++++++++++++++++++++++++------
 2 files changed, 292 insertions(+), 42 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 2b294bf1881f..1b194c51d4d3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -15,6 +15,7 @@
 #include <linux/rbtree.h>
 #include <linux/types.h>
 #include <linux/wait.h>
+#include <linux/timer.h>
 
 struct uprobe;
 struct vm_area_struct;
@@ -56,6 +57,45 @@ enum uprobe_task_state {
 	UTASK_SSTEP_TRAPPED,
 };
 
+/*
+ * Hybrid lifetime uprobe. Represents a uprobe instance that could be either
+ * SRCU protected (with SRCU protection eventually potentially timing out),
+ * refcounted using uprobe->ref, or there could be no valid uprobe (NULL).
+ *
+ * hprobe's internal state is setup such that background timer thread can
+ * atomically "downgrade" temporarily RCU-protected uprobe into refcounted one
+ * (or no uprobe, if refcounting failed).
+ *
+ * *stable* pointer always point to the uprobe (or could be NULL if there is
+ * was no valid underlying uprobe to begin with).
+ *
+ * *leased* pointer is the key to achieving race-free atomic lifetime state
+ * transition and can have three possible states:
+ *   - either the same non-NULL value as *stable*, in which case uprobe is
+ *     SRCU-protected;
+ *   - NULL, in which case uprobe (if there is any) is refcounted;
+ *   - special __UPROBE_DEAD value, which represents an uprobe that was SRCU
+ *     protected initially, but SRCU period timed out and we attempted to
+ *     convert it to refcounted, but refcount_inc_not_zero() failed, because
+ *     uprobe effectively went away (the last consumer unsubscribed). In this
+ *     case it's important to know that *stable* pointer (which still has
+ *     non-NULL uprobe pointer) shouldn't be used, because lifetime of
+ *     underlying uprobe is not guaranteed anymore. __UPROBE_DEAD is just an
+ *     internal marker and is handled transparently by hprobe_fetch() helper.
+ *
+ * When uprobe is SRCU-protected, we also record srcu_idx value, necessary for
+ * SRCU unlocking.
+ *
+ * See hprobe_expire() and hprobe_fetch() for details of race-free uprobe
+ * state transitioning details. It all hinges on atomic xchg() over *leaded*
+ * pointer. *stable* pointer, once initially set, is not modified concurrently.
+ */
+struct hprobe {
+	struct uprobe *stable;
+	struct uprobe *leased;
+	int srcu_idx;
+};
+
 /*
  * uprobe_task: Metadata of a task while it singlesteps.
  */
@@ -75,6 +115,7 @@ struct uprobe_task {
 	};
 
 	struct uprobe			*active_uprobe;
+	struct timer_list		ri_timer;
 	unsigned long			xol_vaddr;
 
 	struct arch_uprobe              *auprobe;
@@ -84,14 +125,18 @@ struct uprobe_task {
 };
 
 struct return_instance {
-	struct uprobe		*uprobe;
 	unsigned long		func;
 	unsigned long		stack;		/* stack pointer */
 	unsigned long		orig_ret_vaddr; /* original return address */
 	bool			chained;	/* true, if instance is nested */
 
 	struct return_instance	*next;		/* keep as stack */
-};
+
+	union {
+		struct hprobe		hprobe;
+		struct rcu_head		rcu;
+	};
+} ____cacheline_aligned;
 
 enum rp_check {
 	RP_CHECK_CALL,
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 377bd524bc8b..b047e68499d5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -28,6 +28,7 @@
 #include <linux/khugepaged.h>
 #include <linux/rcupdate_trace.h>
 #include <linux/workqueue.h>
+#include <linux/srcu.h>
 
 #include <linux/uprobes.h>
 
@@ -51,6 +52,9 @@ static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
 
 DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
+/* Covers return_instance's uprobe lifetime. */
+DEFINE_STATIC_SRCU(uretprobes_srcu);
+
 /* Have a copy of original instruction */
 #define UPROBE_COPY_INSN	0
 
@@ -602,13 +606,6 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
-/* uprobe should have guaranteed positive refcount */
-static struct uprobe *get_uprobe(struct uprobe *uprobe)
-{
-	refcount_inc(&uprobe->ref);
-	return uprobe;
-}
-
 /*
  * uprobe should have guaranteed lifetime, which can be either of:
  *   - caller already has refcount taken (and wants an extra one);
@@ -643,7 +640,7 @@ static void uprobe_free_deferred(struct work_struct *work)
 	kfree(uprobe);
 }
 
-static void uprobe_free_rcu(struct rcu_head *rcu)
+static void uprobe_free_rcu_tasks_trace(struct rcu_head *rcu)
 {
 	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
 
@@ -651,6 +648,13 @@ static void uprobe_free_rcu(struct rcu_head *rcu)
 	schedule_work(&uprobe->work);
 }
 
+static void uprobe_free_srcu(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+
+	call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu_tasks_trace);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
 	if (!refcount_dec_and_test(&uprobe->ref))
@@ -675,7 +679,146 @@ static void put_uprobe(struct uprobe *uprobe)
 	delayed_uprobe_remove(uprobe, NULL);
 	mutex_unlock(&delayed_uprobe_lock);
 
-	call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
+	/* start srcu -> rcu_tasks_trace -> kfree chain */
+	call_srcu(&uretprobes_srcu, &uprobe->rcu, uprobe_free_srcu);
+}
+
+/*
+ * Special marker pointer for when ri_timer() expired, unlocking RCU, but
+ * failed to acquire refcount on uprobe (because it doesn't have any
+ * associated consumer anymore, for example). In such case it's important for
+ * hprobe_consume() to return NULL uprobe, instead of "stable" uprobe pointer,
+ * as that one isn't protected by either refcount nor RCU region now.
+ */
+#define __UPROBE_DEAD ((struct uprobe *)(-0xdead))
+
+#define RI_TIMER_PERIOD (HZ/10) /* 100 ms */
+
+/* Initialize hprobe as SRCU-protected "leased" uprobe */
+static void hprobe_init_leased(struct hprobe *hprobe, struct uprobe *uprobe, int srcu_idx)
+{
+	hprobe->srcu_idx = srcu_idx;
+	hprobe->stable = uprobe;
+	hprobe->leased = uprobe;
+}
+
+/* Initialize hprobe as refcounted ("stable") uprobe (uprobe can be NULL). */
+static void hprobe_init_stable(struct hprobe *hprobe, struct uprobe *uprobe)
+{
+	hprobe->srcu_idx = -1;
+	hprobe->stable = uprobe;
+	hprobe->leased = NULL;
+}
+
+/*
+ * hprobe_consume() fetches hprobe's underlying uprobe and detects whether
+ * uprobe is still SRCU protected, or is refcounted. hprobe_consume() can be
+ * used only once for a given hprobe.
+ *
+ * Caller has to perform SRCU unlock if under_rcu is set to true;
+ * otherwise, either properly refcounted uprobe is returned or NULL.
+ */
+static inline struct uprobe *hprobe_consume(struct hprobe *hprobe, bool *under_rcu)
+{
+	struct uprobe *uprobe;
+
+	uprobe = xchg(&hprobe->leased, NULL);
+	if (uprobe) {
+		if (unlikely(uprobe == __UPROBE_DEAD)) {
+			*under_rcu = false;
+			return NULL;
+		}
+
+		*under_rcu = true;
+		return uprobe;
+	}
+
+	*under_rcu = false;
+	return hprobe->stable;
+}
+
+/*
+ * Reset hprobe state and, if under_rcu is true, release SRCU lock.
+ * hprobe_finalize() can only be used from current context after
+ * hprobe_consume() call (which determines uprobe and under_rcu value).
+ */
+static void hprobe_finalize(struct hprobe *hprobe, struct uprobe *uprobe, bool under_rcu)
+{
+	if (under_rcu)
+		__srcu_read_unlock(&uretprobes_srcu, hprobe->srcu_idx);
+	else if (uprobe)
+		put_uprobe(uprobe);
+	/* prevent free_ret_instance() from double-putting uprobe */
+	hprobe->stable = NULL;
+}
+
+/*
+ * Attempt to switch (atomically) uprobe from being RCU protected ("leased")
+ * to refcounted ("stable") state. Competes with hprobe_consume(), only one of
+ * them can win the race to perform SRCU unlocking. Whoever wins must perform
+ * SRCU unlock.
+ *
+ * Returns underlying valid uprobe or NULL, if there was no underlying uprobe
+ * to begin with or we failed to bump its refcount and it's going away.
+ *
+ * Returned non-NULL uprobe can be still safely used within an ongoing SRCU
+ * locked region. It's not guaranteed that returned uprobe has a positive
+ * refcount, so caller has to attempt try_get_uprobe(), if it needs to use
+ * returned uprobe instance beyond ongoing SRCU lock region. See dup_utask().
+ */
+static struct uprobe* hprobe_expire(struct hprobe *hprobe)
+{
+	struct uprobe *uprobe;
+
+	/*
+	 * return_instance's hprobe is protected by RCU.
+	 * Underlying uprobe is itself protected from reuse by SRCU.
+	 */
+	lockdep_assert(rcu_read_lock_held() && srcu_read_lock_held(&uretprobes_srcu));
+
+	/*
+	 * Leased pointer can only be NULL, __UPROBE_DEAD, or some valid uprobe
+	 * pointer. This pointer can only be updated to NULL or __UPROBE_DEAD,
+	 * not any other valid uprobe pointer. So it's safe to fetch it with
+	 * READ_ONCE() and try to refcount it, if it's not NULL or __UPROBE_DEAD.
+	 */
+	uprobe = data_race(READ_ONCE(hprobe->leased));
+	if (!uprobe || uprobe == __UPROBE_DEAD)
+		return NULL;
+
+	if (!try_get_uprobe(uprobe)) {
+		/*
+		 * hprobe_consume() might have xchg()'ed to NULL already,
+		 * in which case we shouldn't set __UPROBE_DEAD.
+		 */
+		cmpxchg(&hprobe->leased, uprobe, __UPROBE_DEAD);
+		return NULL;
+	}
+
+	/*
+	 * Even if hprobe_consume() won and unlocked SRCU, we still have
+	 * a guarantee that uprobe won't be freed (and thus won't be reused)
+	 * because out caller maintains its own SRCU locked region.
+	 * So cmpxchg() below is well-formed.
+	 */
+	if (cmpxchg(&hprobe->leased, uprobe, NULL)) {
+		/*
+		 * At this point uprobe is properly refcounted, so it's safe
+		 * to end its original SRCU locked region.
+		 */
+		__srcu_read_unlock(&uretprobes_srcu, hprobe->srcu_idx);
+		return uprobe;
+	}
+
+	/* We lost the race, undo our refcount bump. It can drop to zero. */
+	put_uprobe(uprobe);
+
+	/*
+	 * We return underlying uprobe nevertheless because it's still valid
+	 * until the end of current SRCU locked region, and can be used to
+	 * try_get_uprobe(). This is used in dup_utask().
+	 */
+	return uprobe;
 }
 
 static __always_inline
@@ -1180,6 +1323,7 @@ void uprobe_unregister_sync(void)
 	 * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
 	 */
 	synchronize_rcu_tasks_trace();
+	synchronize_srcu(&uretprobes_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
 
@@ -1760,11 +1904,18 @@ unsigned long uprobe_get_trap_addr(struct pt_regs *regs)
 	return instruction_pointer(regs);
 }
 
-static struct return_instance *free_ret_instance(struct return_instance *ri)
+static struct return_instance *free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
 {
 	struct return_instance *next = ri->next;
-	put_uprobe(ri->uprobe);
-	kfree(ri);
+	struct uprobe *uprobe;
+	bool under_rcu;
+
+	if (cleanup_hprobe) {
+		uprobe = hprobe_consume(&ri->hprobe, &under_rcu);
+		hprobe_finalize(&ri->hprobe, uprobe, under_rcu);
+	}
+
+	kfree_rcu(ri, rcu);
 	return next;
 }
 
@@ -1780,18 +1931,51 @@ void uprobe_free_utask(struct task_struct *t)
 	if (!utask)
 		return;
 
+	timer_delete_sync(&utask->ri_timer);
+
 	if (utask->active_uprobe)
 		put_uprobe(utask->active_uprobe);
 
 	ri = utask->return_instances;
 	while (ri)
-		ri = free_ret_instance(ri);
+		ri = free_ret_instance(ri, true /* cleanup_hprobe */);
 
 	xol_free_insn_slot(t);
 	kfree(utask);
 	t->utask = NULL;
 }
 
+#define for_each_ret_instance_rcu(pos, head) \
+	for (pos = rcu_dereference_raw(head); pos; pos = rcu_dereference_raw(pos->next))
+
+static void ri_timer(struct timer_list *timer)
+{
+	struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
+	struct return_instance *ri;
+
+	/* SRCU protects uprobe from reuse for the cmpxchg() inside hprobe_expire(). */
+	guard(srcu)(&uretprobes_srcu);
+	/* RCU protects return_instance from freeing. */
+	guard(rcu)();
+
+	for_each_ret_instance_rcu(ri, utask->return_instances) {
+		hprobe_expire(&ri->hprobe);
+	}
+}
+
+static struct uprobe_task *alloc_utask(void)
+{
+	struct uprobe_task *utask;
+
+	utask = kzalloc(sizeof(*utask), GFP_KERNEL);
+	if (!utask)
+		return NULL;
+
+	timer_setup(&utask->ri_timer, ri_timer, 0);
+
+	return utask;
+}
+
 /*
  * Allocate a uprobe_task object for the task if necessary.
  * Called when the thread hits a breakpoint.
@@ -1803,7 +1987,7 @@ void uprobe_free_utask(struct task_struct *t)
 static struct uprobe_task *get_utask(void)
 {
 	if (!current->utask)
-		current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+		current->utask = alloc_utask();
 	return current->utask;
 }
 
@@ -1811,12 +1995,16 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 {
 	struct uprobe_task *n_utask;
 	struct return_instance **p, *o, *n;
+	struct uprobe *uprobe;
 
-	n_utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+	n_utask = alloc_utask();
 	if (!n_utask)
 		return -ENOMEM;
 	t->utask = n_utask;
 
+	/* protect uprobes from freeing, we'll need try_get_uprobe() them */
+	guard(srcu)(&uretprobes_srcu);
+
 	p = &n_utask->return_instances;
 	for (o = o_utask->return_instances; o; o = o->next) {
 		n = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
@@ -1824,17 +2012,24 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 			return -ENOMEM;
 
 		*n = *o;
+
+		/* see hprobe_expire() comments */
+		uprobe = hprobe_expire(&o->hprobe);
+		if (uprobe) /* refcount bump for new utask */
+			uprobe = try_get_uprobe(uprobe);
+
 		/*
-		 * uprobe's refcnt has to be positive at this point, kept by
-		 * utask->return_instances items; return_instances can't be
-		 * removed right now, as task is blocked due to duping; so
-		 * get_uprobe() is safe to use here.
+		 * New utask will have stable properly refcounted uprobe or NULL.
+		 * Even if we failed to get refcounted uprobe, we still need
+		 * to preserve full set of return_instances for proper
+		 * uretprobe handling and nesting in forked task.
 		 */
-		get_uprobe(n->uprobe);
-		n->next = NULL;
+		hprobe_init_stable(&n->hprobe, uprobe);
 
-		*p = n;
+		n->next = NULL;
+		rcu_assign_pointer(*p, n);
 		p = &n->next;
+
 		n_utask->depth++;
 	}
 
@@ -1910,10 +2105,10 @@ static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
 	enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
 
 	while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
-		ri = free_ret_instance(ri);
+		ri = free_ret_instance(ri, true /* cleanup_hprobe */);
 		utask->depth--;
 	}
-	utask->return_instances = ri;
+	rcu_assign_pointer(utask->return_instances, ri);
 }
 
 static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
@@ -1922,6 +2117,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 	struct uprobe_task *utask;
 	unsigned long orig_ret_vaddr, trampoline_vaddr;
 	bool chained;
+	int srcu_idx;
 
 	if (!get_xol_area())
 		return;
@@ -1937,10 +2133,6 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		return;
 	}
 
-	/* we need to bump refcount to store uprobe in utask */
-	if (!try_get_uprobe(uprobe))
-		return;
-
 	ri = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
 	if (!ri)
 		goto fail;
@@ -1970,20 +2162,26 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
-	ri->uprobe = uprobe;
+
+	/* __srcu_read_lock() because SRCU lock survives switch to user space */
+	srcu_idx = __srcu_read_lock(&uretprobes_srcu);
+
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
 	ri->chained = chained;
 
 	utask->depth++;
+
+	hprobe_init_leased(&ri->hprobe, uprobe, srcu_idx);
 	ri->next = utask->return_instances;
-	utask->return_instances = ri;
+	rcu_assign_pointer(utask->return_instances, ri);
+
+	mod_timer(&utask->ri_timer, jiffies + RI_TIMER_PERIOD);
 
 	return;
 fail:
 	kfree(ri);
-	put_uprobe(uprobe);
 }
 
 /* Prepare to single-step probed instruction out of line. */
@@ -2178,11 +2376,14 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 }
 
 static void
-handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
+handle_uretprobe_chain(struct return_instance *ri, struct uprobe *uprobe, struct pt_regs *regs)
 {
-	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
 
+	/* all consumers unsubscribed meanwhile */
+	if (unlikely(!uprobe))
+		return;
+
 	rcu_read_lock_trace();
 	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		if (uc->ret_handler)
@@ -2207,7 +2408,8 @@ void uprobe_handle_trampoline(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
 	struct return_instance *ri, *next;
-	bool valid;
+	struct uprobe *uprobe;
+	bool valid, under_rcu;
 
 	utask = current->utask;
 	if (!utask)
@@ -2237,21 +2439,24 @@ void uprobe_handle_trampoline(struct pt_regs *regs)
 			 * trampoline addresses on the stack are replaced with correct
 			 * original return addresses
 			 */
-			utask->return_instances = ri->next;
+			rcu_assign_pointer(utask->return_instances, ri->next);
+
+			uprobe = hprobe_consume(&ri->hprobe, &under_rcu);
 			if (valid)
-				handle_uretprobe_chain(ri, regs);
-			ri = free_ret_instance(ri);
+				handle_uretprobe_chain(ri, uprobe, regs);
+			hprobe_finalize(&ri->hprobe, uprobe, under_rcu);
+
+			/* We already took care of hprobe, no need to waste more time on that. */
+			ri = free_ret_instance(ri, false /* !cleanup_hprobe */);
 			utask->depth--;
 		} while (ri != next);
 	} while (!valid);
 
-	utask->return_instances = ri;
 	return;
 
- sigill:
+sigill:
 	uprobe_warn(current, "handle uretprobe, sending SIGILL.");
 	force_sig(SIGILL);
-
 }
 
 bool __weak arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/3] uprobes: implement SRCU-protected lifetime for single-stepped uprobe
  2024-09-09 22:49 [PATCH 0/3] SRCU-protected uretprobes hot path Andrii Nakryiko
  2024-09-09 22:49 ` [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context Andrii Nakryiko
  2024-09-09 22:49 ` [PATCH 2/3] uprobes: SRCU-protect uretprobe lifetime (with timeout) Andrii Nakryiko
@ 2024-09-09 22:49 ` Andrii Nakryiko
  2024-09-15 14:51   ` Oleg Nesterov
  2 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2024-09-09 22:49 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, oleg
  Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck,
	Andrii Nakryiko

Similarly to how we SRCU-protect uprobe instance (and avoid refcounting
it unnecessarily) when waiting for return probe hit, use hprobe approach
to do the same with single-stepped uprobe. Same hprobe_* primitives are
used. We also reuse ri_timer() callback to expire both pending
single-step uprobe and return instances.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h |  4 +--
 kernel/events/uprobes.c | 55 ++++++++++++++++++++++++-----------------
 2 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 1b194c51d4d3..31cf7306cdf6 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -114,7 +114,7 @@ struct uprobe_task {
 		};
 	};
 
-	struct uprobe			*active_uprobe;
+	struct hprobe			active_hprobe;
 	struct timer_list		ri_timer;
 	unsigned long			xol_vaddr;
 
@@ -122,7 +122,7 @@ struct uprobe_task {
 
 	struct return_instance		*return_instances;
 	unsigned int			depth;
-};
+} ____cacheline_aligned;
 
 struct return_instance {
 	unsigned long		func;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index b047e68499d5..d9ab5e0dd9dd 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1894,11 +1894,16 @@ unsigned long __weak uprobe_get_swbp_addr(struct pt_regs *regs)
 	return instruction_pointer(regs) - UPROBE_SWBP_INSN_SIZE;
 }
 
+static bool utask_has_pending_sstep_uprobe(struct uprobe_task *utask)
+{
+	return utask->active_hprobe.stable != NULL;
+}
+
 unsigned long uprobe_get_trap_addr(struct pt_regs *regs)
 {
 	struct uprobe_task *utask = current->utask;
 
-	if (unlikely(utask && utask->active_uprobe))
+	if (unlikely(utask && utask_has_pending_sstep_uprobe(utask)))
 		return utask->vaddr;
 
 	return instruction_pointer(regs);
@@ -1927,14 +1932,17 @@ void uprobe_free_utask(struct task_struct *t)
 {
 	struct uprobe_task *utask = t->utask;
 	struct return_instance *ri;
+	struct uprobe *uprobe;
+	bool under_rcu;
 
 	if (!utask)
 		return;
 
 	timer_delete_sync(&utask->ri_timer);
 
-	if (utask->active_uprobe)
-		put_uprobe(utask->active_uprobe);
+	/* clean up pending single-stepped uprobe */
+	uprobe = hprobe_consume(&utask->active_hprobe, &under_rcu);
+	hprobe_finalize(&utask->active_hprobe, uprobe, under_rcu);
 
 	ri = utask->return_instances;
 	while (ri)
@@ -1958,6 +1966,8 @@ static void ri_timer(struct timer_list *timer)
 	/* RCU protects return_instance from freeing. */
 	guard(rcu)();
 
+	hprobe_expire(&utask->active_hprobe);
+
 	for_each_ret_instance_rcu(ri, utask->return_instances) {
 		hprobe_expire(&ri->hprobe);
 	}
@@ -2190,20 +2200,15 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
 {
 	struct uprobe_task *utask;
 	unsigned long xol_vaddr;
-	int err;
+	int err, srcu_idx;
 
 	utask = get_utask();
 	if (!utask)
 		return -ENOMEM;
 
-	if (!try_get_uprobe(uprobe))
-		return -EINVAL;
-
 	xol_vaddr = xol_get_insn_slot(uprobe);
-	if (!xol_vaddr) {
-		err = -ENOMEM;
-		goto err_out;
-	}
+	if (!xol_vaddr)
+		return -ENOMEM;
 
 	utask->xol_vaddr = xol_vaddr;
 	utask->vaddr = bp_vaddr;
@@ -2211,15 +2216,17 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
 	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
 	if (unlikely(err)) {
 		xol_free_insn_slot(current);
-		goto err_out;
+		return err;
 	}
 
-	utask->active_uprobe = uprobe;
+	srcu_idx = __srcu_read_lock(&uretprobes_srcu);
+
+	hprobe_init_leased(&utask->active_hprobe, uprobe, srcu_idx);
 	utask->state = UTASK_SSTEP;
+
+	mod_timer(&utask->ri_timer, jiffies + RI_TIMER_PERIOD);
+
 	return 0;
-err_out:
-	put_uprobe(uprobe);
-	return err;
 }
 
 /*
@@ -2236,7 +2243,7 @@ bool uprobe_deny_signal(void)
 	struct task_struct *t = current;
 	struct uprobe_task *utask = t->utask;
 
-	if (likely(!utask || !utask->active_uprobe))
+	if (likely(!utask || !utask_has_pending_sstep_uprobe(utask)))
 		return false;
 
 	WARN_ON_ONCE(utask->state != UTASK_SSTEP);
@@ -2553,8 +2560,10 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 {
 	struct uprobe *uprobe;
 	int err = 0;
+	bool under_rcu;
+
+	uprobe = hprobe_consume(&utask->active_hprobe, &under_rcu);
 
-	uprobe = utask->active_uprobe;
 	if (utask->state == UTASK_SSTEP_ACK)
 		err = arch_uprobe_post_xol(&uprobe->arch, regs);
 	else if (utask->state == UTASK_SSTEP_TRAPPED)
@@ -2562,8 +2571,8 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 	else
 		WARN_ON_ONCE(1);
 
-	put_uprobe(uprobe);
-	utask->active_uprobe = NULL;
+	hprobe_finalize(&utask->active_hprobe, uprobe, under_rcu);
+
 	utask->state = UTASK_RUNNING;
 	xol_free_insn_slot(current);
 
@@ -2580,7 +2589,7 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 /*
  * On breakpoint hit, breakpoint notifier sets the TIF_UPROBE flag and
  * allows the thread to return from interrupt. After that handle_swbp()
- * sets utask->active_uprobe.
+ * sets utask->active_hprobe.
  *
  * On singlestep exception, singlestep notifier sets the TIF_UPROBE flag
  * and allows the thread to return from interrupt.
@@ -2595,7 +2604,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 	clear_thread_flag(TIF_UPROBE);
 
 	utask = current->utask;
-	if (utask && utask->active_uprobe)
+	if (utask && utask_has_pending_sstep_uprobe(utask))
 		handle_singlestep(utask, regs);
 	else
 		handle_swbp(regs);
@@ -2626,7 +2635,7 @@ int uprobe_post_sstep_notifier(struct pt_regs *regs)
 {
 	struct uprobe_task *utask = current->utask;
 
-	if (!current->mm || !utask || !utask->active_uprobe)
+	if (!current->mm || !utask || !utask_has_pending_sstep_uprobe(utask))
 		/* task is currently not uprobed */
 		return 0;
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context
  2024-09-09 22:49 ` [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context Andrii Nakryiko
@ 2024-09-10  2:51   ` Alexei Starovoitov
  2024-09-10  5:13     ` Andrii Nakryiko
  2024-09-15 14:49   ` Oleg Nesterov
  1 sibling, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2024-09-10  2:51 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, Peter Zijlstra, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, bpf, LKML, Jiri Olsa, Paul E. McKenney

On Mon, Sep 9, 2024 at 3:49 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which
> makes it unsuitable to be called from more restricted context like softirq.
>
> Let's make put_uprobe() agnostic to the context in which it is called,
> and use work queue to defer the mutex-protected clean up steps.
>
> To avoid unnecessarily increasing the size of struct uprobe, we colocate
> work_struct in parallel with rb_node and rcu, both of which are unused
> by the time we get to schedule clean up work.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 30 +++++++++++++++++++++++++++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index a2e6a57f79f2..377bd524bc8b 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -27,6 +27,7 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/khugepaged.h>
>  #include <linux/rcupdate_trace.h>
> +#include <linux/workqueue.h>
>
>  #include <linux/uprobes.h>
>
> @@ -54,14 +55,20 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>  #define UPROBE_COPY_INSN       0
>
>  struct uprobe {
> -       struct rb_node          rb_node;        /* node in the rb tree */
> +       union {
> +               struct {
> +                       struct rb_node          rb_node;        /* node in the rb tree */
> +                       struct rcu_head         rcu;
> +               };
> +               /* work is used only during freeing, rcu and rb_node are unused at that point */
> +               struct work_struct work;
> +       };
>         refcount_t              ref;
>         struct rw_semaphore     register_rwsem;
>         struct rw_semaphore     consumer_rwsem;
>         struct list_head        pending_list;
>         struct list_head        consumers;
>         struct inode            *inode;         /* Also hold a ref to inode */
> -       struct rcu_head         rcu;
>         loff_t                  offset;
>         loff_t                  ref_ctr_offset;
>         unsigned long           flags;
> @@ -620,11 +627,28 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
>         return !RB_EMPTY_NODE(&uprobe->rb_node);
>  }
>
> +static void uprobe_free_deferred(struct work_struct *work)
> +{
> +       struct uprobe *uprobe = container_of(work, struct uprobe, work);
> +
> +       /*
> +        * If application munmap(exec_vma) before uprobe_unregister()
> +        * gets called, we don't get a chance to remove uprobe from
> +        * delayed_uprobe_list from remove_breakpoint(). Do it here.
> +        */
> +       mutex_lock(&delayed_uprobe_lock);
> +       delayed_uprobe_remove(uprobe, NULL);
> +       mutex_unlock(&delayed_uprobe_lock);
> +
> +       kfree(uprobe);
> +}
> +
>  static void uprobe_free_rcu(struct rcu_head *rcu)
>  {
>         struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
>
> -       kfree(uprobe);
> +       INIT_WORK(&uprobe->work, uprobe_free_deferred);
> +       schedule_work(&uprobe->work);
>  }
>
>  static void put_uprobe(struct uprobe *uprobe)

It seems put_uprobe hunk was lost, since the patch is not doing
what commit log describes.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context
  2024-09-10  2:51   ` Alexei Starovoitov
@ 2024-09-10  5:13     ` Andrii Nakryiko
  2024-09-10 15:56       ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2024-09-10  5:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, linux-trace-kernel, Peter Zijlstra,
	Oleg Nesterov, Steven Rostedt, Masami Hiramatsu, bpf, LKML,
	Jiri Olsa, Paul E. McKenney

On Mon, Sep 9, 2024 at 7:52 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Sep 9, 2024 at 3:49 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which
> > makes it unsuitable to be called from more restricted context like softirq.
> >
> > Let's make put_uprobe() agnostic to the context in which it is called,
> > and use work queue to defer the mutex-protected clean up steps.
> >
> > To avoid unnecessarily increasing the size of struct uprobe, we colocate
> > work_struct in parallel with rb_node and rcu, both of which are unused
> > by the time we get to schedule clean up work.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  kernel/events/uprobes.c | 30 +++++++++++++++++++++++++++---
> >  1 file changed, 27 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index a2e6a57f79f2..377bd524bc8b 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/shmem_fs.h>
> >  #include <linux/khugepaged.h>
> >  #include <linux/rcupdate_trace.h>
> > +#include <linux/workqueue.h>
> >
> >  #include <linux/uprobes.h>
> >
> > @@ -54,14 +55,20 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
> >  #define UPROBE_COPY_INSN       0
> >
> >  struct uprobe {
> > -       struct rb_node          rb_node;        /* node in the rb tree */
> > +       union {
> > +               struct {
> > +                       struct rb_node          rb_node;        /* node in the rb tree */
> > +                       struct rcu_head         rcu;
> > +               };
> > +               /* work is used only during freeing, rcu and rb_node are unused at that point */
> > +               struct work_struct work;
> > +       };
> >         refcount_t              ref;
> >         struct rw_semaphore     register_rwsem;
> >         struct rw_semaphore     consumer_rwsem;
> >         struct list_head        pending_list;
> >         struct list_head        consumers;
> >         struct inode            *inode;         /* Also hold a ref to inode */
> > -       struct rcu_head         rcu;
> >         loff_t                  offset;
> >         loff_t                  ref_ctr_offset;
> >         unsigned long           flags;
> > @@ -620,11 +627,28 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
> >         return !RB_EMPTY_NODE(&uprobe->rb_node);
> >  }
> >
> > +static void uprobe_free_deferred(struct work_struct *work)
> > +{
> > +       struct uprobe *uprobe = container_of(work, struct uprobe, work);
> > +
> > +       /*
> > +        * If application munmap(exec_vma) before uprobe_unregister()
> > +        * gets called, we don't get a chance to remove uprobe from
> > +        * delayed_uprobe_list from remove_breakpoint(). Do it here.
> > +        */
> > +       mutex_lock(&delayed_uprobe_lock);
> > +       delayed_uprobe_remove(uprobe, NULL);
> > +       mutex_unlock(&delayed_uprobe_lock);
> > +
> > +       kfree(uprobe);
> > +}
> > +
> >  static void uprobe_free_rcu(struct rcu_head *rcu)
> >  {
> >         struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> >
> > -       kfree(uprobe);
> > +       INIT_WORK(&uprobe->work, uprobe_free_deferred);
> > +       schedule_work(&uprobe->work);
> >  }
> >
> >  static void put_uprobe(struct uprobe *uprobe)
>
> It seems put_uprobe hunk was lost, since the patch is not doing
> what commit log describes.


Hmm, put_uprobe() has:

call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);

at the end (see [0], which added that), so we do schedule_work() in
RCU callback, similarly to what we do with bpf_map freeing in the BPF
subsystem.

This patch set is based on the latest tip/perf/core (and also assuming
the RCU Tasks Trace patch that mysteriously disappeared is actually
there, hopefully it will just as magically be restored).

  [0] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=perf/core&id=8617408f7a01e94ce1f73e40a7704530e5dfb25c

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context
  2024-09-10  5:13     ` Andrii Nakryiko
@ 2024-09-10 15:56       ` Alexei Starovoitov
  2024-09-10 17:46         ` Andrii Nakryiko
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2024-09-10 15:56 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, Peter Zijlstra,
	Oleg Nesterov, Steven Rostedt, Masami Hiramatsu, bpf, LKML,
	Jiri Olsa, Paul E. McKenney

On Mon, Sep 9, 2024 at 10:13 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Sep 9, 2024 at 7:52 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Sep 9, 2024 at 3:49 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which
> > > makes it unsuitable to be called from more restricted context like softirq.
> > >
> > > Let's make put_uprobe() agnostic to the context in which it is called,
> > > and use work queue to defer the mutex-protected clean up steps.
> > >
> > > To avoid unnecessarily increasing the size of struct uprobe, we colocate
> > > work_struct in parallel with rb_node and rcu, both of which are unused
> > > by the time we get to schedule clean up work.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  kernel/events/uprobes.c | 30 +++++++++++++++++++++++++++---
> > >  1 file changed, 27 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index a2e6a57f79f2..377bd524bc8b 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -27,6 +27,7 @@
> > >  #include <linux/shmem_fs.h>
> > >  #include <linux/khugepaged.h>
> > >  #include <linux/rcupdate_trace.h>
> > > +#include <linux/workqueue.h>
> > >
> > >  #include <linux/uprobes.h>
> > >
> > > @@ -54,14 +55,20 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
> > >  #define UPROBE_COPY_INSN       0
> > >
> > >  struct uprobe {
> > > -       struct rb_node          rb_node;        /* node in the rb tree */
> > > +       union {
> > > +               struct {
> > > +                       struct rb_node          rb_node;        /* node in the rb tree */
> > > +                       struct rcu_head         rcu;
> > > +               };
> > > +               /* work is used only during freeing, rcu and rb_node are unused at that point */
> > > +               struct work_struct work;
> > > +       };
> > >         refcount_t              ref;
> > >         struct rw_semaphore     register_rwsem;
> > >         struct rw_semaphore     consumer_rwsem;
> > >         struct list_head        pending_list;
> > >         struct list_head        consumers;
> > >         struct inode            *inode;         /* Also hold a ref to inode */
> > > -       struct rcu_head         rcu;
> > >         loff_t                  offset;
> > >         loff_t                  ref_ctr_offset;
> > >         unsigned long           flags;
> > > @@ -620,11 +627,28 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
> > >         return !RB_EMPTY_NODE(&uprobe->rb_node);
> > >  }
> > >
> > > +static void uprobe_free_deferred(struct work_struct *work)
> > > +{
> > > +       struct uprobe *uprobe = container_of(work, struct uprobe, work);
> > > +
> > > +       /*
> > > +        * If application munmap(exec_vma) before uprobe_unregister()
> > > +        * gets called, we don't get a chance to remove uprobe from
> > > +        * delayed_uprobe_list from remove_breakpoint(). Do it here.
> > > +        */
> > > +       mutex_lock(&delayed_uprobe_lock);
> > > +       delayed_uprobe_remove(uprobe, NULL);
> > > +       mutex_unlock(&delayed_uprobe_lock);
> > > +
> > > +       kfree(uprobe);
> > > +}
> > > +
> > >  static void uprobe_free_rcu(struct rcu_head *rcu)
> > >  {
> > >         struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > >
> > > -       kfree(uprobe);
> > > +       INIT_WORK(&uprobe->work, uprobe_free_deferred);
> > > +       schedule_work(&uprobe->work);
> > >  }
> > >
> > >  static void put_uprobe(struct uprobe *uprobe)
> >
> > It seems put_uprobe hunk was lost, since the patch is not doing
> > what commit log describes.
>
>
> Hmm, put_uprobe() has:
>
> call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
>
> at the end (see [0], which added that), so we do schedule_work() in
> RCU callback, similarly to what we do with bpf_map freeing in the BPF
> subsystem.
>
> This patch set is based on the latest tip/perf/core (and also assuming
> the RCU Tasks Trace patch that mysteriously disappeared is actually
> there, hopefully it will just as magically be restored).
>
>   [0] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=perf/core&id=8617408f7a01e94ce1f73e40a7704530e5dfb25c

I'm still not following.
put_uprobe() did delayed_uprobe_remove() and this patch is doing it again.

The commit log implies that mutex+delayed_uprobe_remove should be removed
from put_uprobe(), but that's not what the patch is doing.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context
  2024-09-10 15:56       ` Alexei Starovoitov
@ 2024-09-10 17:46         ` Andrii Nakryiko
  0 siblings, 0 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2024-09-10 17:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, linux-trace-kernel, Peter Zijlstra,
	Oleg Nesterov, Steven Rostedt, Masami Hiramatsu, bpf, LKML,
	Jiri Olsa, Paul E. McKenney

On Tue, Sep 10, 2024 at 8:56 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Sep 9, 2024 at 10:13 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Sep 9, 2024 at 7:52 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Sep 9, 2024 at 3:49 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which
> > > > makes it unsuitable to be called from more restricted context like softirq.
> > > >
> > > > Let's make put_uprobe() agnostic to the context in which it is called,
> > > > and use work queue to defer the mutex-protected clean up steps.
> > > >
> > > > To avoid unnecessarily increasing the size of struct uprobe, we colocate
> > > > work_struct in parallel with rb_node and rcu, both of which are unused
> > > > by the time we get to schedule clean up work.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > >  kernel/events/uprobes.c | 30 +++++++++++++++++++++++++++---
> > > >  1 file changed, 27 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index a2e6a57f79f2..377bd524bc8b 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -27,6 +27,7 @@
> > > >  #include <linux/shmem_fs.h>
> > > >  #include <linux/khugepaged.h>
> > > >  #include <linux/rcupdate_trace.h>
> > > > +#include <linux/workqueue.h>
> > > >
> > > >  #include <linux/uprobes.h>
> > > >
> > > > @@ -54,14 +55,20 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
> > > >  #define UPROBE_COPY_INSN       0
> > > >
> > > >  struct uprobe {
> > > > -       struct rb_node          rb_node;        /* node in the rb tree */
> > > > +       union {
> > > > +               struct {
> > > > +                       struct rb_node          rb_node;        /* node in the rb tree */
> > > > +                       struct rcu_head         rcu;
> > > > +               };
> > > > +               /* work is used only during freeing, rcu and rb_node are unused at that point */
> > > > +               struct work_struct work;
> > > > +       };
> > > >         refcount_t              ref;
> > > >         struct rw_semaphore     register_rwsem;
> > > >         struct rw_semaphore     consumer_rwsem;
> > > >         struct list_head        pending_list;
> > > >         struct list_head        consumers;
> > > >         struct inode            *inode;         /* Also hold a ref to inode */
> > > > -       struct rcu_head         rcu;
> > > >         loff_t                  offset;
> > > >         loff_t                  ref_ctr_offset;
> > > >         unsigned long           flags;
> > > > @@ -620,11 +627,28 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
> > > >         return !RB_EMPTY_NODE(&uprobe->rb_node);
> > > >  }
> > > >
> > > > +static void uprobe_free_deferred(struct work_struct *work)
> > > > +{
> > > > +       struct uprobe *uprobe = container_of(work, struct uprobe, work);
> > > > +
> > > > +       /*
> > > > +        * If application munmap(exec_vma) before uprobe_unregister()
> > > > +        * gets called, we don't get a chance to remove uprobe from
> > > > +        * delayed_uprobe_list from remove_breakpoint(). Do it here.
> > > > +        */
> > > > +       mutex_lock(&delayed_uprobe_lock);
> > > > +       delayed_uprobe_remove(uprobe, NULL);
> > > > +       mutex_unlock(&delayed_uprobe_lock);
> > > > +
> > > > +       kfree(uprobe);
> > > > +}
> > > > +
> > > >  static void uprobe_free_rcu(struct rcu_head *rcu)
> > > >  {
> > > >         struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > > >
> > > > -       kfree(uprobe);
> > > > +       INIT_WORK(&uprobe->work, uprobe_free_deferred);
> > > > +       schedule_work(&uprobe->work);
> > > >  }
> > > >
> > > >  static void put_uprobe(struct uprobe *uprobe)
> > >
> > > It seems put_uprobe hunk was lost, since the patch is not doing
> > > what commit log describes.
> >
> >
> > Hmm, put_uprobe() has:
> >
> > call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
> >
> > at the end (see [0], which added that), so we do schedule_work() in
> > RCU callback, similarly to what we do with bpf_map freeing in the BPF
> > subsystem.
> >
> > This patch set is based on the latest tip/perf/core (and also assuming
> > the RCU Tasks Trace patch that mysteriously disappeared is actually
> > there, hopefully it will just as magically be restored).
> >
> >   [0] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=perf/core&id=8617408f7a01e94ce1f73e40a7704530e5dfb25c
>
> I'm still not following.
> put_uprobe() did delayed_uprobe_remove() and this patch is doing it again.
>
> The commit log implies that mutex+delayed_uprobe_remove should be removed
> from put_uprobe(), but that's not what the patch is doing.

Ah, that part. You are right, it's my bad. I copied that lock to a
work queue callback, but didn't remove it from the put_uprobe(). Will
fix it in v2, thanks for catching!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context
  2024-09-09 22:49 ` [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context Andrii Nakryiko
  2024-09-10  2:51   ` Alexei Starovoitov
@ 2024-09-15 14:49   ` Oleg Nesterov
  2024-09-17  8:19     ` Andrii Nakryiko
  1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2024-09-15 14:49 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
	jolsa, paulmck

On 09/09, Andrii Nakryiko wrote:
>
> Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which
> makes it unsuitable to be called from more restricted context like softirq.
>
> Let's make put_uprobe() agnostic to the context in which it is called,
> and use work queue to defer the mutex-protected clean up steps.

...

> +static void uprobe_free_deferred(struct work_struct *work)
> +{
> +	struct uprobe *uprobe = container_of(work, struct uprobe, work);
> +
> +	/*
> +	 * If application munmap(exec_vma) before uprobe_unregister()
> +	 * gets called, we don't get a chance to remove uprobe from
> +	 * delayed_uprobe_list from remove_breakpoint(). Do it here.
> +	 */
> +	mutex_lock(&delayed_uprobe_lock);
> +	delayed_uprobe_remove(uprobe, NULL);
> +	mutex_unlock(&delayed_uprobe_lock);
> +
> +	kfree(uprobe);
> +}
> +
>  static void uprobe_free_rcu(struct rcu_head *rcu)
>  {
>  	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
>
> -	kfree(uprobe);
> +	INIT_WORK(&uprobe->work, uprobe_free_deferred);
> +	schedule_work(&uprobe->work);
>  }

This is still wrong afaics...

If put_uprobe() can be called from softirq (after the next patch), then
put_uprobe() and all other users of uprobes_treelock should use
write_lock_bh/read_lock_bh to avoid the deadlock.

To be honest... I simply can't force myself to even try to read 2/3 ;) I'll
try to do this later, but I am sure I will never like it, sorry.

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] uprobes: implement SRCU-protected lifetime for single-stepped uprobe
  2024-09-09 22:49 ` [PATCH 3/3] uprobes: implement SRCU-protected lifetime for single-stepped uprobe Andrii Nakryiko
@ 2024-09-15 14:51   ` Oleg Nesterov
  2024-09-17  8:20     ` Andrii Nakryiko
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2024-09-15 14:51 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
	jolsa, paulmck

On 09/09, Andrii Nakryiko wrote:
>
> Similarly to how we SRCU-protect uprobe instance (and avoid refcounting
> it unnecessarily) when waiting for return probe hit, use hprobe approach
> to do the same with single-stepped uprobe. Same hprobe_* primitives are
> used. We also reuse ri_timer() callback to expire both pending
> single-step uprobe and return instances.

Well, I still think it would be better (and much simpler) to simply kill
utask->active_uprobe, iirc I even sent the RFC patch...

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context
  2024-09-15 14:49   ` Oleg Nesterov
@ 2024-09-17  8:19     ` Andrii Nakryiko
  2024-10-04 20:18       ` Andrii Nakryiko
  0 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2024-09-17  8:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
	bpf, linux-kernel, jolsa, paulmck

On Sun, Sep 15, 2024 at 4:49 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 09/09, Andrii Nakryiko wrote:
> >
> > Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which
> > makes it unsuitable to be called from more restricted context like softirq.
> >
> > Let's make put_uprobe() agnostic to the context in which it is called,
> > and use work queue to defer the mutex-protected clean up steps.
>
> ...
>
> > +static void uprobe_free_deferred(struct work_struct *work)
> > +{
> > +     struct uprobe *uprobe = container_of(work, struct uprobe, work);
> > +
> > +     /*
> > +      * If application munmap(exec_vma) before uprobe_unregister()
> > +      * gets called, we don't get a chance to remove uprobe from
> > +      * delayed_uprobe_list from remove_breakpoint(). Do it here.
> > +      */
> > +     mutex_lock(&delayed_uprobe_lock);
> > +     delayed_uprobe_remove(uprobe, NULL);
> > +     mutex_unlock(&delayed_uprobe_lock);
> > +
> > +     kfree(uprobe);
> > +}
> > +
> >  static void uprobe_free_rcu(struct rcu_head *rcu)
> >  {
> >       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> >
> > -     kfree(uprobe);
> > +     INIT_WORK(&uprobe->work, uprobe_free_deferred);
> > +     schedule_work(&uprobe->work);
> >  }
>
> This is still wrong afaics...
>
> If put_uprobe() can be called from softirq (after the next patch), then
> put_uprobe() and all other users of uprobes_treelock should use
> write_lock_bh/read_lock_bh to avoid the deadlock.

Ok, I see the problem, that's unfortunate.

I see three ways to handle that:

1) keep put_uprobe() as is, and instead do schedule_work() from the
timer thread to postpone put_uprobe(). (but I'm not a big fan of this)
2) move uprobes_treelock part of put_uprobe() into rcu callback, I
think it has no bearing on correctness, uprobe_is_active() is there
already to handle races between putting uprobe and removing it from
uprobes_tree (I prefer this one over #1 )
3) you might like this the most ;) I think I can simplify
hprobes_expire() from patch #2 to not need put_uprobe() at all, if I
protect uprobe lifetime with non-sleepable
rcu_read_lock()/rcu_read_unlock() and perform try_get_uprobe() as the
very last step after cmpxchg() succeeded.

I'm leaning towards #3, but #2 seems fine to me as well.

>
> To be honest... I simply can't force myself to even try to read 2/3 ;) I'll
> try to do this later, but I am sure I will never like it, sorry.

This might sound rude, but the goal here is not to make you like it :)
The goal is to improve performance with minimal complexity. And I'm
very open to any alternative proposals as to how to make uretprobes
RCU-protected to avoid refcounting in the hot path.

I think #3 proposal above will make it a bit more palatable (but there
is still locklessness, cmpxchg, etc, I see no way around that,
unfortunately).

>
> Oleg.
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] uprobes: implement SRCU-protected lifetime for single-stepped uprobe
  2024-09-15 14:51   ` Oleg Nesterov
@ 2024-09-17  8:20     ` Andrii Nakryiko
  0 siblings, 0 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2024-09-17  8:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
	bpf, linux-kernel, jolsa, paulmck

On Sun, Sep 15, 2024 at 4:51 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 09/09, Andrii Nakryiko wrote:
> >
> > Similarly to how we SRCU-protect uprobe instance (and avoid refcounting
> > it unnecessarily) when waiting for return probe hit, use hprobe approach
> > to do the same with single-stepped uprobe. Same hprobe_* primitives are
> > used. We also reuse ri_timer() callback to expire both pending
> > single-step uprobe and return instances.
>
> Well, I still think it would be better (and much simpler) to simply kill
> utask->active_uprobe, iirc I even sent the RFC patch...
>

let's do it, please send non-RFC patches and get them landed!

> Oleg.
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context
  2024-09-17  8:19     ` Andrii Nakryiko
@ 2024-10-04 20:18       ` Andrii Nakryiko
  0 siblings, 0 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2024-10-04 20:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
	bpf, linux-kernel, jolsa, paulmck

On Tue, Sep 17, 2024 at 1:19 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sun, Sep 15, 2024 at 4:49 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 09/09, Andrii Nakryiko wrote:
> > >
> > > Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which
> > > makes it unsuitable to be called from more restricted context like softirq.
> > >
> > > Let's make put_uprobe() agnostic to the context in which it is called,
> > > and use work queue to defer the mutex-protected clean up steps.
> >
> > ...
> >
> > > +static void uprobe_free_deferred(struct work_struct *work)
> > > +{
> > > +     struct uprobe *uprobe = container_of(work, struct uprobe, work);
> > > +
> > > +     /*
> > > +      * If application munmap(exec_vma) before uprobe_unregister()
> > > +      * gets called, we don't get a chance to remove uprobe from
> > > +      * delayed_uprobe_list from remove_breakpoint(). Do it here.
> > > +      */
> > > +     mutex_lock(&delayed_uprobe_lock);
> > > +     delayed_uprobe_remove(uprobe, NULL);
> > > +     mutex_unlock(&delayed_uprobe_lock);
> > > +
> > > +     kfree(uprobe);
> > > +}
> > > +
> > >  static void uprobe_free_rcu(struct rcu_head *rcu)
> > >  {
> > >       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > >
> > > -     kfree(uprobe);
> > > +     INIT_WORK(&uprobe->work, uprobe_free_deferred);
> > > +     schedule_work(&uprobe->work);
> > >  }
> >
> > This is still wrong afaics...
> >
> > If put_uprobe() can be called from softirq (after the next patch), then
> > put_uprobe() and all other users of uprobes_treelock should use
> > write_lock_bh/read_lock_bh to avoid the deadlock.
>
> Ok, I see the problem, that's unfortunate.
>
> I see three ways to handle that:
>
> 1) keep put_uprobe() as is, and instead do schedule_work() from the
> timer thread to postpone put_uprobe(). (but I'm not a big fan of this)
> 2) move uprobes_treelock part of put_uprobe() into rcu callback, I
> think it has no bearing on correctness, uprobe_is_active() is there
> already to handle races between putting uprobe and removing it from
> uprobes_tree (I prefer this one over #1 )
> 3) you might like this the most ;) I think I can simplify
> hprobes_expire() from patch #2 to not need put_uprobe() at all, if I
> protect uprobe lifetime with non-sleepable
> rcu_read_lock()/rcu_read_unlock() and perform try_get_uprobe() as the
> very last step after cmpxchg() succeeded.
>
> I'm leaning towards #3, but #2 seems fine to me as well.

Ok, so just a short update. I don't think #3 works, I do need
try_get_uprobe() before I know for sure that cmpxchg() succeeds. Which
means I'd need a compensating put_uprobe() if cmpxchg() fails. So for
put_uprobe(), I just made it do all the locking in deferred work
callback (which is #2 above), which I think resolved the issue you
pointed out with potential deadlock and removes any limitations on
put_uprobe().

Also, I rewrote the hprobe_consume() and hprobe_expire() in terms of
an explicit state machine with 4 possible states (LEASED, STABLE,
GONE, CONSUMED), which I think makes the logic a bit more
straightforward to follow. Hopefully that will make the change more
palatable for you. I'm probably going to post patches next week,
though.

>
> >
> > To be honest... I simply can't force myself to even try to read 2/3 ;) I'll
> > try to do this later, but I am sure I will never like it, sorry.
>
> This might sound rude, but the goal here is not to make you like it :)
> The goal is to improve performance with minimal complexity. And I'm
> very open to any alternative proposals as to how to make uretprobes
> RCU-protected to avoid refcounting in the hot path.
>
> I think #3 proposal above will make it a bit more palatable (but there
> is still locklessness, cmpxchg, etc, I see no way around that,
> unfortunately).
>
> >
> > Oleg.
> >

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-10-04 20:18 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-09 22:49 [PATCH 0/3] SRCU-protected uretprobes hot path Andrii Nakryiko
2024-09-09 22:49 ` [PATCH 1/3] uprobes: allow put_uprobe() from non-sleepable softirq context Andrii Nakryiko
2024-09-10  2:51   ` Alexei Starovoitov
2024-09-10  5:13     ` Andrii Nakryiko
2024-09-10 15:56       ` Alexei Starovoitov
2024-09-10 17:46         ` Andrii Nakryiko
2024-09-15 14:49   ` Oleg Nesterov
2024-09-17  8:19     ` Andrii Nakryiko
2024-10-04 20:18       ` Andrii Nakryiko
2024-09-09 22:49 ` [PATCH 2/3] uprobes: SRCU-protect uretprobe lifetime (with timeout) Andrii Nakryiko
2024-09-09 22:49 ` [PATCH 3/3] uprobes: implement SRCU-protected lifetime for single-stepped uprobe Andrii Nakryiko
2024-09-15 14:51   ` Oleg Nesterov
2024-09-17  8:20     ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).