[PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
@ 2024-07-01 22:39 Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 01/12] uprobes: update outdated comment Andrii Nakryiko
                   ` (13 more replies)
  0 siblings, 14 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

This patch set, ultimately, switches global uprobes_treelock from RW spinlock
to per-CPU RW semaphore, which has better performance and scales better under
contention and multiple parallel threads triggering lots of uprobes.

To make this work well with attaching multiple uprobes (through BPF
multi-uprobe), we need to add batched versions of uprobe register/unregister
APIs. This is what most of the patch set is actually doing. The actual switch
to per-CPU RW semaphore is trivial after that and is done in the very last
patch #12. See commit message with some comparison numbers.

Patch #4 is probably the most important patch in the series, revamping uprobe
lifetime management and refcounting. See patch description and added code
comments for all the details.

With changes in patch #4, we open up the way to refactor uprobe_register() and
uprobe_unregister() implementations in such a way that we can avoid taking
uprobes_treelock many times during a single batched attachment/detachment.
This allows to accommodate a much higher latency of taking per-CPU RW
semaphore for write. The end result of this patch set is that attaching 50
thousand uprobes with BPF multi-uprobes doesn't regress and takes about 200ms
both before and after the changes in this patch set.

Patch #5 updates existing uprobe consumers to put all the relevant necessary
pieces into struct uprobe_consumer, without having to pass around
offset/ref_ctr_offset. Existing consumers already keep this data around, we
just formalize the interface.

Patches #6 through #10 add batched versions of register/unregister APIs and
gradually factor them in such a way as to allow taking single (batched)
uprobes_treelock, splitting the logic into multiple independent phases.

Patch #11 switched BPF multi-uprobes to batched uprobe APIs.

As mentioned, a very straightforward patch #12 takes advantage of all the prep
work and just switches uprobes_treelock to per-CPU RW semaphore.

v1->v2:
  - added RCU-delayed uprobe freeing to put_uprobe() (Masami);
  - fixed clean up handling in uprobe_register_batch (Jiri);
  - adjusted UPROBE_REFCNT_* constants to be more meaningful (Oleg);
  - dropped the "fix" to switch to write-protected mmap_sem, adjusted invalid
    comment instead (Oleg).

Andrii Nakryiko (12):
  uprobes: update outdated comment
  uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  uprobes: simplify error handling for alloc_uprobe()
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: move offset and ref_ctr_offset into uprobe_consumer
  uprobes: add batch uprobe register/unregister APIs
  uprobes: inline alloc_uprobe() logic into __uprobe_register()
  uprobes: split uprobe allocation and uprobes_tree insertion steps
  uprobes: batch uprobes_treelock during registration
  uprobes: improve lock batching for uprobe_unregister_batch
  uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  uprobes: switch uprobes_treelock to per-CPU RW semaphore

 include/linux/uprobes.h                       |  29 +-
 kernel/events/uprobes.c                       | 550 ++++++++++++------
 kernel/trace/bpf_trace.c                      |  40 +-
 kernel/trace/trace_uprobe.c                   |  53 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  22 +-
 5 files changed, 447 insertions(+), 247 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v2 01/12] uprobes: update outdated comment
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-03 11:38   ` Oleg Nesterov
  2024-07-01 22:39 ` [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode() Andrii Nakryiko
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

There is no task_struct passed into get_user_pages_remote() anymore,
drop the parts of comment mentioning NULL tsk, it's just confusing at
this point.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..081821fd529a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 		goto out;
 
 	/*
-	 * The NULL 'tsk' here ensures that any faults that occur here
-	 * will not be accounted to the task.  'mm' *is* current->mm,
-	 * but we treat this as a 'remote' access since it is
-	 * essentially a kernel access to the memory.
+	 * 'mm' *is* current->mm, but we treat this as a 'remote' access since
+	 * it is essentially a kernel access to the memory.
 	 */
 	result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
 	if (result < 0)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 01/12] uprobes: update outdated comment Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-03 11:41   ` Oleg Nesterov
  2024-07-03 13:15   ` Masami Hiramatsu
  2024-07-01 22:39 ` [PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe() Andrii Nakryiko
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

It seems like uprobe_write_opcode() doesn't require writer locked
mmap_sem, any lock (reader or writer) should be sufficient. This was
established in a discussion in [0] and looking through existing code
seems to confirm that there is no need for write-locked mmap_sem.

Fix the comment to state this clearly.

  [0] https://lore.kernel.org/linux-trace-kernel/20240625190748.GC14254@redhat.com/

Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into uprobe_write_opcode()")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 081821fd529a..f87049c08ee9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
  * @vaddr: the virtual address to store the opcode.
  * @opcode: opcode to be written at @vaddr.
  *
- * Called with mm->mmap_lock held for write.
+ * Called with mm->mmap_lock held for read or write.
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe()
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 01/12] uprobes: update outdated comment Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode() Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Return -ENOMEM instead of NULL, which makes caller's error handling just
a touch simpler.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f87049c08ee9..23449a8c5e7e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 
 	uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
 	if (!uprobe)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	uprobe->inode = inode;
 	uprobe->offset = offset;
@@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 
  retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-	if (!uprobe)
-		return -ENOMEM;
 	if (IS_ERR(uprobe))
 		return PTR_ERR(uprobe);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe() Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-02 10:22   ` Peter Zijlstra
                     ` (2 more replies)
  2024-07-01 22:39 ` [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Andrii Nakryiko
                   ` (9 subsequent siblings)
  13 siblings, 3 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:
  - uprobes_tree RB tree assumes one refcount when uprobe is registered
    and added to the lookup tree;
  - while uprobe is triggered and kernel is handling it in the breakpoint
    handler code, temporary refcount bump is done to keep uprobe from
    being freed;
  - if we have uretprobe requested on a given struct uprobe instance, we
    take another refcount to keep uprobe alive until user space code
    returns from the function and triggers return handler.

The uprobe_tree's extra refcount of 1 is problematic and inconvenient.
Because of it, we have extra retry logic in uprobe_register(), and we
have an extra logic in __uprobe_unregister(), which checks that uprobe
has no more consumers, and if that's the case, it removes struct uprobe
from uprobes_tree (through delete_uprobe(), which takes writer lock on
uprobes_tree), decrementing refcount after that. The latter is the
source of unfortunate race with uprobe_register, necessitating retries.

All of the above is a complication that makes adding batched uprobe
registration/unregistration APIs hard, and generally makes following the
logic harder.

This patch changes refcounting scheme in such a way as to not have
uprobes_tree keeping extra refcount for struct uprobe. Instead,
uprobe_consumer is assuming this extra refcount, which will be dropped
when consumer is unregistered. Other than that, all the active users of
uprobe (entry and return uprobe handling code) keeps exactly the same
refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
of course. This, though, races with uprobe entry handling code in
handle_swbp(), which, though find_active_uprobe()->find_uprobe() lookup
can race with uprobe being destroyed after refcount drops to zero (e.g.,
due to uprobe_consumer unregistering). This is because
find_active_uprobe() bumps refcount without knowing for sure that
uprobe's refcount is already positive (and it has to be this way, there
is no way around that setup).

One, attempted initially, way to solve this is through using
atomic_inc_not_zero() approach, turning get_uprobe() into
try_get_uprobe(), which can fail to bump refcount if uprobe is already
destined to be destroyed. This, unfortunately, turns out to be a rather
expensive due to underlying cmpxchg() operation in
atomic_inc_not_zero() and scales rather poorly with increased amount of
parallel threads triggering uprobes.

So, we devise a refcounting scheme that doesn't require cmpxchg(),
instead relying only on atomic additions, which scale better and are
faster. While the solution has a bit of a trick to it, all the logic is
nicely compartmentalized in __get_uprobe() and put_uprobe() helpers and
doesn't leak outside of those low-level helpers.

We, effectively, structure uprobe's destruction (i.e., put_uprobe() logic)
in such a way that we support "resurrecting" uprobe by bumping its
refcount from zero back to one, and pretending like it never dropped to
zero in the first place. This is done in a race-free way under
exclusive writer uprobes_treelock. Crucially, we take lock only once
refcount drops to zero. If we had to take lock before decrementing
refcount, the approach would be prohibitively expensive.

Anyways, under exclusive writer lock, we double-check that refcount
didn't change and is still zero. If it is, we proceed with destruction,
because at that point we have a guarantee that find_active_uprobe()
can't successfully look up this uprobe instance, as it's going to be
removed in destructor under writer lock. If, on the other hand,
find_active_uprobe() managed to bump refcount from zero to one in
between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
write_lock(&uprobes_treelock), we'll deterministically detect this with
extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
pretend like atomic_dec_and_test() never returned true. There is no
resource freeing or any other irreversible action taken up till this
point, so we just exit early.

One tricky part in the above is actually two CPUs racing and dropping
refcnt to zero, and then attempting to free resources. This can happen
as follows:
  - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
  - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
    which point it decides that it needs to free uprobe as well.

At this point both CPU #0 and CPU #1 will believe they need to destroy
uprobe, which is obviously wrong. To prevent this situations, we augment
refcount with epoch counter, which is always incremented by 1 on either
get or put operation. This allows those two CPUs above to disambiguate
who should actually free uprobe (it's the CPU #1, because it has
up-to-date epoch). See comments in the code and note the specific values
of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
a single atomi64_t is actually a two sort-of-independent 32-bit counters
that are incremented/decremented with a single atomic_add_and_return()
operation. Note also a small and extremely rare (and thus having no
effect on performance) need to clear the highest bit every 2 billion
get/put operations to prevent high 32-bit counter from "bleeding over"
into lower 32-bit counter.

Another aspect with this race is the winning CPU might, at least
theoretically, be so quick that it will free uprobe memory before losing
CPU gets a chance to discover that it lost. To prevent this, we
protected and delay uprobe lifetime with RCU. We can't use
rcu_read_lock() + rcu_read_unlock(), because we need to take locks
inside the RCU critical section. Luckily, we have RCU Tasks Trace
flavor, which supports locking and sleeping. It is already used by BPF
subsystem for sleepable BPF programs (including sleepable BPF uprobe
programs), and is optimized for reader-dominated workflows. It fits
perfectly and doesn't seem to introduce any significant slowdowns in
uprobe hot path.

All the above contained trickery aside, we end up with a nice semantics
for get and put operations, where get always succeeds and put handles
all the races properly and transparently to the caller.

And just to justify this a bit unorthodox refcounting approach, under
uprobe triggering micro-benchmark (using BPF selftests' bench tool) with
8 triggering threads, atomic_inc_not_zero() approach was producing about
3.3 millions/sec total uprobe triggerings across all threads. While the
final atomic_add_and_return()-based approach managed to get 3.6 millions/sec
throughput under the same 8 competing threads.

Furthermore, CPU profiling showed the following overall CPU usage:
  - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for
    atomic_inc_not_zero approach;
  - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for
    atomic_add_and_return approach implemented by this patch.

So, CPU is spending relatively more CPU time in get/put operations while
delivering less total throughput if using atomic_inc_not_zero(). And
this will be even more prominent once we optimize away uprobe->register_rwsem
in the subsequent patch sets. So while slightly less straightforward,
current approach seems to be clearly winning and justified.

We also rename get_uprobe() to __get_uprobe() to indicate it's
a delicate internal helper that is only safe to call under valid
circumstances:
  - while holding uprobes_treelock (to synchronize with exclusive write
    lock in put_uprobe(), as described above);
  - or if we have a guarantee that uprobe's refcount is already positive
    through caller holding at least one refcount (in this case there is
    no risk of refcount dropping to zero by any other CPU).

We also document why it's safe to do unconditional __get_uprobe() at all
call sites, to make it clear that we maintain the above invariants.

Note also, we now don't have a race between registration and
unregistration, so we remove the retry logic completely.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 260 ++++++++++++++++++++++++++++++----------
 1 file changed, 195 insertions(+), 65 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 23449a8c5e7e..560cf1ca512a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
-	refcount_t		ref;
+	atomic64_t		ref;		/* see UPROBE_REFCNT_GET below */
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
+	struct rcu_head		rcu;
 	struct list_head	pending_list;
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
@@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
-static struct uprobe *get_uprobe(struct uprobe *uprobe)
+/*
+ * Uprobe's 64-bit refcount is actually two independent counters co-located in
+ * a single u64 value:
+ *   - lower 32 bits are just a normal refcount with is increment and
+ *   decremented on get and put, respectively, just like normal refcount
+ *   would;
+ *   - upper 32 bits are a tag (or epoch, if you will), which is always
+ *   incremented by one, no matter whether get or put operation is done.
+ *
+ * This upper counter is meant to distinguish between:
+ *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
+ *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
+ *   sequence, also proceeding to "destruction".
+ *
+ * In both cases refcount drops to zero, but in one case it will have epoch N,
+ * while the second drop to zero will have a different epoch N + 2, allowing
+ * first destructor to bail out because epoch changed between refcount going
+ * to zero and put_uprobe() taking uprobes_treelock (under which overall
+ * 64-bit refcount is double-checked, see put_uprobe() for details).
+ *
+ * Lower 32-bit counter is not meant to over overflow, while it's expected
+ * that upper 32-bit counter will overflow occasionally. Note, though, that we
+ * can't allow upper 32-bit counter to "bleed over" into lower 32-bit counter,
+ * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
+ * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
+ * epoch effectively a 31-bit counter with highest bit used as a flag to
+ * perform a fix-up. This ensures epoch and refcnt parts do not "interfere".
+ *
+ * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
+ * epoch and refcnt parts atomically with one atomic_add().
+ * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
+ * *increment* epoch part.
+ */
+#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x0000000100000001LL */
+#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0x00000000ffffffffLL */
+
+/*
+ * Caller has to make sure that:
+ *   a) either uprobe's refcnt is positive before this call;
+ *   b) or uprobes_treelock is held (doesn't matter if for read or write),
+ *      preventing uprobe's destructor from removing it from uprobes_tree.
+ *
+ * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
+ * it detects that its refcount went back to being positive again inbetween it
+ * dropping to zero at some point and (potentially delayed) destructor
+ * callback actually running.
+ */
+static struct uprobe *__get_uprobe(struct uprobe *uprobe)
 {
-	refcount_inc(&uprobe->ref);
+	s64 v;
+
+	v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
+
+	/*
+	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
+	 * we don't retry because there is another CPU that just managed to
+	 * update refcnt and will attempt the same "fix up". Eventually one of
+	 * them will succeed to clear highset bit.
+	 */
+	if (unlikely(v < 0))
+		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
+
 	return uprobe;
 }
 
+static inline bool uprobe_is_active(struct uprobe *uprobe)
+{
+	return !RB_EMPTY_NODE(&uprobe->rb_node);
+}
+
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+
+	kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
-	if (refcount_dec_and_test(&uprobe->ref)) {
+	s64 v;
+
+	/*
+	 * here uprobe instance is guaranteed to be alive, so we use Tasks
+	 * Trace RCU to guarantee that uprobe won't be freed from under us, if
+	 * we end up being a losing "destructor" inside uprobe_treelock'ed
+	 * section double-checking uprobe->ref value below.
+	 * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
+	 */
+	rcu_read_lock_trace();
+
+	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
+
+	if (unlikely((u32)v == 0)) {
+		bool destroy;
+
+		write_lock(&uprobes_treelock);
+		/*
+		 * We might race with find_uprobe()->__get_uprobe() executed
+		 * from inside read-locked uprobes_treelock, which can bump
+		 * refcount from zero back to one, after we got here. Even
+		 * worse, it's possible for another CPU to do 0 -> 1 -> 0
+		 * transition between this CPU doing atomic_add() and taking
+		 * uprobes_treelock. In either case this CPU should bail out
+		 * and not proceed with destruction.
+		 *
+		 * So now that we have exclusive write lock, we double check
+		 * the total 64-bit refcount value, which includes the epoch.
+		 * If nothing changed (i.e., epoch is the same and refcnt is
+		 * still zero), we are good and we proceed with the clean up.
+		 *
+		 * But if it managed to be updated back at least once, we just
+		 * pretend it never went to zero. If lower 32-bit refcnt part
+		 * drops to zero again, another CPU will proceed with
+		 * destruction, due to more up to date epoch.
+		 */
+		destroy = atomic64_read(&uprobe->ref) == v;
+		if (destroy && uprobe_is_active(uprobe))
+			rb_erase(&uprobe->rb_node, &uprobes_tree);
+		write_unlock(&uprobes_treelock);
+
+		/*
+		 * Beyond here we don't need RCU protection, we are either the
+		 * winning destructor and we control the rest of uprobe's
+		 * lifetime; or we lost and we are bailing without accessing
+		 * uprobe fields anymore.
+		 */
+		rcu_read_unlock_trace();
+
+		/* uprobe got resurrected, pretend we never tried to free it */
+		if (!destroy)
+			return;
+
 		/*
 		 * If application munmap(exec_vma) before uprobe_unregister()
 		 * gets called, we don't get a chance to remove uprobe from
@@ -604,8 +728,21 @@ static void put_uprobe(struct uprobe *uprobe)
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
+
+		call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
+		return;
 	}
+
+	/*
+	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
+	 * we don't retry because there is another CPU that just managed to
+	 * update refcnt and will attempt the same "fix up". Eventually one of
+	 * them will succeed to clear highset bit.
+	 */
+	if (unlikely(v < 0))
+		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
+
+	rcu_read_unlock_trace();
 }
 
 static __always_inline
@@ -653,12 +790,15 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 		.inode = inode,
 		.offset = offset,
 	};
-	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
+	struct rb_node *node;
+	struct uprobe *u = NULL;
 
+	node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		/* we hold uprobes_treelock, so it's safe to __get_uprobe() */
+		u = __get_uprobe(__node_2_uprobe(node));
 
-	return NULL;
+	return u;
 }
 
 /*
@@ -676,26 +816,37 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+/*
+ * Attempt to insert a new uprobe into uprobes_tree.
+ *
+ * If uprobe already exists (for given inode+offset), we just increment
+ * refcount of previously existing uprobe.
+ *
+ * If not, a provided new instance of uprobe is inserted into the tree (with
+ * assumed initial refcount == 1).
+ *
+ * In any case, we return a uprobe instance that ends up being in uprobes_tree.
+ * Caller has to clean up new uprobe instance, if it ended up not being
+ * inserted into the tree.
+ *
+ * We assume that uprobes_treelock is held for writing.
+ */
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
+	struct uprobe *u = uprobe;
 
 	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		/* we hold uprobes_treelock, so it's safe to __get_uprobe() */
+		u = __get_uprobe(__node_2_uprobe(node));
 
-	/* get access + creation ref */
-	refcount_set(&uprobe->ref, 2);
-	return NULL;
+	return u;
 }
 
 /*
- * Acquire uprobes_treelock.
- * Matching uprobe already exists in rbtree;
- *	increment (access refcount) and return the matching uprobe.
- *
- * No matching uprobe; insert the uprobe in rb_tree;
- *	get a double refcount (access + creation) and return NULL.
+ * Acquire uprobes_treelock and insert uprobe into uprobes_tree
+ * (or reuse existing one, see __insert_uprobe() comments above).
  */
 static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 {
@@ -732,11 +883,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
+	RB_CLEAR_NODE(&uprobe->rb_node);
+	atomic64_set(&uprobe->ref, 1);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe) {
+	if (cur_uprobe != uprobe) {
 		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 			put_uprobe(cur_uprobe);
@@ -921,27 +1074,6 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad
 	return set_orig_insn(&uprobe->arch, mm, vaddr);
 }
 
-static inline bool uprobe_is_active(struct uprobe *uprobe)
-{
-	return !RB_EMPTY_NODE(&uprobe->rb_node);
-}
-/*
- * There could be threads that have already hit the breakpoint. They
- * will recheck the current insn and restart if find_uprobe() fails.
- * See find_active_uprobe().
- */
-static void delete_uprobe(struct uprobe *uprobe)
-{
-	if (WARN_ON(!uprobe_is_active(uprobe)))
-		return;
-
-	write_lock(&uprobes_treelock);
-	rb_erase(&uprobe->rb_node, &uprobes_tree);
-	write_unlock(&uprobes_treelock);
-	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
-	put_uprobe(uprobe);
-}
-
 struct map_info {
 	struct map_info *next;
 	struct mm_struct *mm;
@@ -1082,15 +1214,11 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 static void
 __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
-	int err;
-
 	if (WARN_ON(!consumer_del(uprobe, uc)))
 		return;
 
-	err = register_for_each_vma(uprobe, NULL);
 	/* TODO : cant unregister? schedule a worker thread */
-	if (!uprobe->consumers && !err)
-		delete_uprobe(uprobe);
+	(void)register_for_each_vma(uprobe, NULL);
 }
 
 /*
@@ -1159,28 +1287,20 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return -EINVAL;
 
- retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (IS_ERR(uprobe))
 		return PTR_ERR(uprobe);
 
-	/*
-	 * We can race with uprobe_unregister()->delete_uprobe().
-	 * Check uprobe_is_active() and retry if it is false.
-	 */
 	down_write(&uprobe->register_rwsem);
-	ret = -EAGAIN;
-	if (likely(uprobe_is_active(uprobe))) {
-		consumer_add(uprobe, uc);
-		ret = register_for_each_vma(uprobe, uc);
-		if (ret)
-			__uprobe_unregister(uprobe, uc);
-	}
+	consumer_add(uprobe, uc);
+	ret = register_for_each_vma(uprobe, uc);
+	if (ret)
+		__uprobe_unregister(uprobe, uc);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 
-	if (unlikely(ret == -EAGAIN))
-		goto retry;
+	if (ret)
+		put_uprobe(uprobe);
+
 	return ret;
 }
 
@@ -1303,15 +1423,15 @@ static void build_probe_list(struct inode *inode,
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset < min)
 				break;
+			__get_uprobe(u); /* uprobes_treelock is held */
 			list_add(&u->pending_list, head);
-			get_uprobe(u);
 		}
 		for (t = n; (t = rb_next(t)); ) {
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset > max)
 				break;
+			__get_uprobe(u); /* uprobes_treelock is held */
 			list_add(&u->pending_list, head);
-			get_uprobe(u);
 		}
 	}
 	read_unlock(&uprobes_treelock);
@@ -1769,7 +1889,14 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 			return -ENOMEM;
 
 		*n = *o;
-		get_uprobe(n->uprobe);
+		/*
+		 * uprobe's refcnt has to be positive at this point, kept by
+		 * utask->return_instances items; return_instances can't be
+		 * removed right now, as task is blocked due to duping; so
+		 * __get_uprobe() is safe to use here without holding
+		 * uprobes_treelock.
+		 */
+		__get_uprobe(n->uprobe);
 		n->next = NULL;
 
 		*p = n;
@@ -1911,8 +2038,11 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
-
-	ri->uprobe = get_uprobe(uprobe);
+	 /*
+	  * uprobe's refcnt is positive, held by caller, so it's safe to
+	  * unconditionally bump it one more time here
+	  */
+	ri->uprobe = __get_uprobe(uprobe);
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-03  8:13   ` Peter Zijlstra
  2024-07-07 12:48   ` Oleg Nesterov
  2024-07-01 22:39 ` [PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs Andrii Nakryiko
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Simplify uprobe registration/unregistration interfaces by making offset
and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
existing users already store these fields somewhere in uprobe_consumer's
containing structure, so this doesn't pose any problem. We just move
some fields around.

On the other hand, this simplifies uprobe_register() and
uprobe_unregister() API by having only struct uprobe_consumer as one
thing representing attachment/detachment entity. This makes batched
versions of uprobe_register() and uprobe_unregister() simpler.

This also makes uprobe_register_refctr() unnecessary, so remove it and
simplify consumers.

No functional changes intended.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h                       | 18 +++----
 kernel/events/uprobes.c                       | 19 ++-----
 kernel/trace/bpf_trace.c                      | 21 +++-----
 kernel/trace/trace_uprobe.c                   | 53 ++++++++-----------
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 22 ++++----
 5 files changed, 55 insertions(+), 78 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a75ba37ce3c8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -42,6 +42,11 @@ struct uprobe_consumer {
 				enum uprobe_filter_ctx ctx,
 				struct mm_struct *mm);
 
+	/* associated file offset of this probe */
+	loff_t offset;
+	/* associated refctr file offset of this probe, or zero */
+	loff_t ref_ctr_offset;
+	/* for internal uprobe infra use, consumers shouldn't touch fields below */
 	struct uprobe_consumer *next;
 };
 
@@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
-extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
-extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
+extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -152,11 +156,7 @@ static inline void uprobes_init(void)
 #define uprobe_get_trap_addr(regs)	instruction_pointer(regs)
 
 static inline int
-uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
-{
-	return -ENOSYS;
-}
-static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
 	return -ENOSYS;
 }
@@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo
 	return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 560cf1ca512a..8759c6d0683e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1224,14 +1224,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 /*
  * uprobe_unregister - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
- * @offset: offset from the start of the file.
- * @uc: identify which probe if multiple probes are colocated.
+ * @uc: identify which probe consumer to unregister.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 	struct uprobe *uprobe;
 
-	uprobe = find_uprobe(inode, offset);
+	uprobe = find_uprobe(inode, uc->offset);
 	if (WARN_ON(!uprobe))
 		return;
 
@@ -1304,20 +1303,12 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	return ret;
 }
 
-int uprobe_register(struct inode *inode, loff_t offset,
-		    struct uprobe_consumer *uc)
+int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
-	return __uprobe_register(inode, offset, 0, uc);
+	return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc);
 }
 EXPORT_SYMBOL_GPL(uprobe_register);
 
-int uprobe_register_refctr(struct inode *inode, loff_t offset,
-			   loff_t ref_ctr_offset, struct uprobe_consumer *uc)
-{
-	return __uprobe_register(inode, offset, ref_ctr_offset, uc);
-}
-EXPORT_SYMBOL_GPL(uprobe_register_refctr);
-
 /*
  * uprobe_apply - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d1daeab1bbc1..ba62baec3152 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3154,8 +3154,6 @@ struct bpf_uprobe_multi_link;
 
 struct bpf_uprobe {
 	struct bpf_uprobe_multi_link *link;
-	loff_t offset;
-	unsigned long ref_ctr_offset;
 	u64 cookie;
 	struct uprobe_consumer consumer;
 };
@@ -3181,8 +3179,7 @@ static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
 	u32 i;
 
 	for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
-				  &uprobes[i].consumer);
+		uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer);
 	}
 }
 
@@ -3262,10 +3259,10 @@ static int bpf_uprobe_multi_link_fill_link_info(const struct bpf_link *link,
 
 	for (i = 0; i < ucount; i++) {
 		if (uoffsets &&
-		    put_user(umulti_link->uprobes[i].offset, uoffsets + i))
+		    put_user(umulti_link->uprobes[i].consumer.offset, uoffsets + i))
 			return -EFAULT;
 		if (uref_ctr_offsets &&
-		    put_user(umulti_link->uprobes[i].ref_ctr_offset, uref_ctr_offsets + i))
+		    put_user(umulti_link->uprobes[i].consumer.ref_ctr_offset, uref_ctr_offsets + i))
 			return -EFAULT;
 		if (ucookies &&
 		    put_user(umulti_link->uprobes[i].cookie, ucookies + i))
@@ -3439,15 +3436,16 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 		goto error_free;
 
 	for (i = 0; i < cnt; i++) {
-		if (__get_user(uprobes[i].offset, uoffsets + i)) {
+		if (__get_user(uprobes[i].consumer.offset, uoffsets + i)) {
 			err = -EFAULT;
 			goto error_free;
 		}
-		if (uprobes[i].offset < 0) {
+		if (uprobes[i].consumer.offset < 0) {
 			err = -EINVAL;
 			goto error_free;
 		}
-		if (uref_ctr_offsets && __get_user(uprobes[i].ref_ctr_offset, uref_ctr_offsets + i)) {
+		if (uref_ctr_offsets &&
+		    __get_user(uprobes[i].consumer.ref_ctr_offset, uref_ctr_offsets + i)) {
 			err = -EFAULT;
 			goto error_free;
 		}
@@ -3477,10 +3475,7 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 		      &bpf_uprobe_multi_link_lops, prog);
 
 	for (i = 0; i < cnt; i++) {
-		err = uprobe_register_refctr(d_real_inode(link->path.dentry),
-					     uprobes[i].offset,
-					     uprobes[i].ref_ctr_offset,
-					     &uprobes[i].consumer);
+		err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer);
 		if (err) {
 			bpf_uprobe_unregister(&path, uprobes, i);
 			goto error_free;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c98e3b3386ba..d786f99114be 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -60,8 +60,6 @@ struct trace_uprobe {
 	struct path			path;
 	struct inode			*inode;
 	char				*filename;
-	unsigned long			offset;
-	unsigned long			ref_ctr_offset;
 	unsigned long			nhit;
 	struct trace_probe		tp;
 };
@@ -205,7 +203,7 @@ static unsigned long translate_user_vaddr(unsigned long file_offset)
 
 	udd = (void *) current->utask->vaddr;
 
-	base_addr = udd->bp_addr - udd->tu->offset;
+	base_addr = udd->bp_addr - udd->tu->consumer.offset;
 	return base_addr + file_offset;
 }
 
@@ -286,13 +284,13 @@ static bool trace_uprobe_match_command_head(struct trace_uprobe *tu,
 	if (strncmp(tu->filename, argv[0], len) || argv[0][len] != ':')
 		return false;
 
-	if (tu->ref_ctr_offset == 0)
-		snprintf(buf, sizeof(buf), "0x%0*lx",
-				(int)(sizeof(void *) * 2), tu->offset);
+	if (tu->consumer.ref_ctr_offset == 0)
+		snprintf(buf, sizeof(buf), "0x%0*llx",
+				(int)(sizeof(void *) * 2), tu->consumer.offset);
 	else
-		snprintf(buf, sizeof(buf), "0x%0*lx(0x%lx)",
-				(int)(sizeof(void *) * 2), tu->offset,
-				tu->ref_ctr_offset);
+		snprintf(buf, sizeof(buf), "0x%0*llx(0x%llx)",
+				(int)(sizeof(void *) * 2), tu->consumer.offset,
+				tu->consumer.ref_ctr_offset);
 	if (strcmp(buf, &argv[0][len + 1]))
 		return false;
 
@@ -410,7 +408,7 @@ static bool trace_uprobe_has_same_uprobe(struct trace_uprobe *orig,
 
 	list_for_each_entry(orig, &tpe->probes, tp.list) {
 		if (comp_inode != d_real_inode(orig->path.dentry) ||
-		    comp->offset != orig->offset)
+		    comp->consumer.offset != orig->consumer.offset)
 			continue;
 
 		/*
@@ -472,8 +470,8 @@ static int validate_ref_ctr_offset(struct trace_uprobe *new)
 
 	for_each_trace_uprobe(tmp, pos) {
 		if (new_inode == d_real_inode(tmp->path.dentry) &&
-		    new->offset == tmp->offset &&
-		    new->ref_ctr_offset != tmp->ref_ctr_offset) {
+		    new->consumer.offset == tmp->consumer.offset &&
+		    new->consumer.ref_ctr_offset != tmp->consumer.ref_ctr_offset) {
 			pr_warn("Reference counter offset mismatch.");
 			return -EINVAL;
 		}
@@ -675,8 +673,8 @@ static int __trace_uprobe_create(int argc, const char **argv)
 		WARN_ON_ONCE(ret != -ENOMEM);
 		goto fail_address_parse;
 	}
-	tu->offset = offset;
-	tu->ref_ctr_offset = ref_ctr_offset;
+	tu->consumer.offset = offset;
+	tu->consumer.ref_ctr_offset = ref_ctr_offset;
 	tu->path = path;
 	tu->filename = filename;
 
@@ -746,12 +744,12 @@ static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev)
 	char c = is_ret_probe(tu) ? 'r' : 'p';
 	int i;
 
-	seq_printf(m, "%c:%s/%s %s:0x%0*lx", c, trace_probe_group_name(&tu->tp),
+	seq_printf(m, "%c:%s/%s %s:0x%0*llx", c, trace_probe_group_name(&tu->tp),
 			trace_probe_name(&tu->tp), tu->filename,
-			(int)(sizeof(void *) * 2), tu->offset);
+			(int)(sizeof(void *) * 2), tu->consumer.offset);
 
-	if (tu->ref_ctr_offset)
-		seq_printf(m, "(0x%lx)", tu->ref_ctr_offset);
+	if (tu->consumer.ref_ctr_offset)
+		seq_printf(m, "(0x%llx)", tu->consumer.ref_ctr_offset);
 
 	for (i = 0; i < tu->tp.nr_args; i++)
 		seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
@@ -1089,12 +1087,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
 	tu->consumer.filter = filter;
 	tu->inode = d_real_inode(tu->path.dentry);
 
-	if (tu->ref_ctr_offset)
-		ret = uprobe_register_refctr(tu->inode, tu->offset,
-				tu->ref_ctr_offset, &tu->consumer);
-	else
-		ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
-
+	ret = uprobe_register(tu->inode, &tu->consumer);
 	if (ret)
 		tu->inode = NULL;
 
@@ -1112,7 +1105,7 @@ static void __probe_event_disable(struct trace_probe *tp)
 		if (!tu->inode)
 			continue;
 
-		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
+		uprobe_unregister(tu->inode, &tu->consumer);
 		tu->inode = NULL;
 	}
 }
@@ -1310,7 +1303,7 @@ static int uprobe_perf_close(struct trace_event_call *call,
 		return 0;
 
 	list_for_each_entry(tu, trace_probe_probe_list(tp), tp.list) {
-		ret = uprobe_apply(tu->inode, tu->offset, &tu->consumer, false);
+		ret = uprobe_apply(tu->inode, tu->consumer.offset, &tu->consumer, false);
 		if (ret)
 			break;
 	}
@@ -1334,7 +1327,7 @@ static int uprobe_perf_open(struct trace_event_call *call,
 		return 0;
 
 	list_for_each_entry(tu, trace_probe_probe_list(tp), tp.list) {
-		err = uprobe_apply(tu->inode, tu->offset, &tu->consumer, true);
+		err = uprobe_apply(tu->inode, tu->consumer.offset, &tu->consumer, true);
 		if (err) {
 			uprobe_perf_close(call, event);
 			break;
@@ -1464,7 +1457,7 @@ int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type,
 	*fd_type = is_ret_probe(tu) ? BPF_FD_TYPE_URETPROBE
 				    : BPF_FD_TYPE_UPROBE;
 	*filename = tu->filename;
-	*probe_offset = tu->offset;
+	*probe_offset = tu->consumer.offset;
 	*probe_addr = 0;
 	return 0;
 }
@@ -1627,9 +1620,9 @@ create_local_trace_uprobe(char *name, unsigned long offs,
 		return ERR_CAST(tu);
 	}
 
-	tu->offset = offs;
+	tu->consumer.offset = offs;
 	tu->path = path;
-	tu->ref_ctr_offset = ref_ctr_offset;
+	tu->consumer.ref_ctr_offset = ref_ctr_offset;
 	tu->filename = kstrdup(name, GFP_KERNEL);
 	if (!tu->filename) {
 		ret = -ENOMEM;
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index b0132a342bb5..ca7122cdbcd3 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -391,25 +391,24 @@ static int testmod_register_uprobe(loff_t offset)
 {
 	int err = -EBUSY;
 
-	if (uprobe.offset)
+	if (uprobe.consumer.offset)
 		return -EBUSY;
 
 	mutex_lock(&testmod_uprobe_mutex);
 
-	if (uprobe.offset)
+	if (uprobe.consumer.offset)
 		goto out;
 
 	err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, &uprobe.path);
 	if (err)
 		goto out;
 
-	err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry),
-				     offset, 0, &uprobe.consumer);
-	if (err)
+	uprobe.consumer.offset = offset;
+	err = uprobe_register(d_real_inode(uprobe.path.dentry), &uprobe.consumer);
+	if (err) {
 		path_put(&uprobe.path);
-	else
-		uprobe.offset = offset;
-
+		uprobe.consumer.offset = 0;
+	}
 out:
 	mutex_unlock(&testmod_uprobe_mutex);
 	return err;
@@ -419,10 +418,9 @@ static void testmod_unregister_uprobe(void)
 {
 	mutex_lock(&testmod_uprobe_mutex);
 
-	if (uprobe.offset) {
-		uprobe_unregister(d_real_inode(uprobe.path.dentry),
-				  uprobe.offset, &uprobe.consumer);
-		uprobe.offset = 0;
+	if (uprobe.consumer.offset) {
+		uprobe_unregister(d_real_inode(uprobe.path.dentry), &uprobe.consumer);
+		uprobe.consumer.offset = 0;
 	}
 
 	mutex_unlock(&testmod_uprobe_mutex);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (4 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register() Andrii Nakryiko
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Introduce batch versions of uprobe registration (attachment) and
unregistration (detachment) APIs.

Unregistration is presumed to never fail, so that's easy.

Batch registration can fail, and so the semantics of
uprobe_register_batch() is such that either all uprobe_consumers are
successfully attached or none of them remain attached after the return.

There is no guarantee of atomicity of attachment, though, and so while
batch attachment is proceeding, some uprobes might start firing before
others are completely attached. Even if overall attachment eventually
fails, some successfully attached uprobes might fire and callers have to
be prepared to handle that. This is in no way a regression compared to
current approach of attaching uprobes one-by-one, though.

One crucial implementation detail is the addition of `struct uprobe
*uprobe` field to `struct uprobe_consumer` which is meant for internal
uprobe subsystem usage only. We use this field both as temporary storage
(to avoid unnecessary allocations) and as a back link to associated
uprobe to simplify and speed up uprobe unregistration, as we now can
avoid yet another tree lookup when unregistering uprobe_consumer.

The general direction with uprobe registration implementation is to do
batch attachment in distinct steps, each step performing some set of
checks or actions on all uprobe_consumers before proceeding to the next
phase. This, after some more changes in next patches, allows to batch
locking for each phase and in such a way amortize any long delays that
might be added by writer locks (especially once we switch
uprobes_treelock to per-CPU R/W semaphore later).

Currently, uprobe_register_batch() performs all the sanity checks first.
Then proceeds to allocate-and-insert (we'll split this up further later
on) uprobe instances, as necessary. And then the last step is actual
uprobe registration for all affected VMAs.

We take care to undo all the actions in the event of an error at any
point in this lengthy process, so end result is all-or-nothing, as
described above.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h |  17 ++++
 kernel/events/uprobes.c | 180 ++++++++++++++++++++++++++++------------
 2 files changed, 146 insertions(+), 51 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a75ba37ce3c8..a6e6eb70539d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -33,6 +33,8 @@ enum uprobe_filter_ctx {
 	UPROBE_FILTER_MMAP,
 };
 
+typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);
+
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	int (*ret_handler)(struct uprobe_consumer *self,
@@ -48,6 +50,8 @@ struct uprobe_consumer {
 	loff_t ref_ctr_offset;
 	/* for internal uprobe infra use, consumers shouldn't touch fields below */
 	struct uprobe_consumer *next;
+	/* associated uprobe instance (or NULL if consumer isn't attached) */
+	struct uprobe *uprobe;
 };
 
 #ifdef CONFIG_UPROBES
@@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
 extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
+extern int uprobe_register_batch(struct inode *inode, int cnt,
+				 uprobe_consumer_fn get_uprobe_consumer, void *ctx);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
 extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
+extern void uprobe_unregister_batch(struct inode *inode, int cnt,
+				    uprobe_consumer_fn get_uprobe_consumer, void *ctx);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -160,6 +168,11 @@ uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
 	return -ENOSYS;
 }
+static inline int uprobe_register_batch(struct inode *inode, int cnt,
+					uprobe_consumer_fn get_uprobe_consumer, void *ctx)
+{
+	return -ENOSYS;
+}
 static inline int
 uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add)
 {
@@ -169,6 +182,10 @@ static inline void
 uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
+static inline void uprobe_unregister_batch(struct inode *inode, int cnt,
+					     uprobe_consumer_fn get_uprobe_consumer, void *ctx)
+{
+}
 static inline int uprobe_mmap(struct vm_area_struct *vma)
 {
 	return 0;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8759c6d0683e..68fdf1b8e4bf 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1221,6 +1221,41 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	(void)register_for_each_vma(uprobe, NULL);
 }
 
+/*
+ * uprobe_unregister_batch - unregister a batch of already registered uprobe
+ * consumers.
+ * @inode: the file in which the probes have to be removed.
+ * @cnt: number of consumers to unregister
+ * @get_uprobe_consumer: a callback that returns Nth uprobe_consumer to attach
+ * @ctx: an arbitrary context passed through into get_uprobe_consumer callback
+ */
+void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn get_uprobe_consumer, void *ctx)
+{
+	struct uprobe *uprobe;
+	struct uprobe_consumer *uc;
+	int i;
+
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
+		if (WARN_ON(!uprobe))
+			continue;
+
+		down_write(&uprobe->register_rwsem);
+		__uprobe_unregister(uprobe, uc);
+		up_write(&uprobe->register_rwsem);
+		put_uprobe(uprobe);
+
+		uc->uprobe = NULL;
+	}
+}
+
+static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
+{
+	return (struct uprobe_consumer *)ctx;
+}
+
 /*
  * uprobe_unregister - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
@@ -1228,84 +1263,127 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  */
 void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
-	struct uprobe *uprobe;
-
-	uprobe = find_uprobe(inode, uc->offset);
-	if (WARN_ON(!uprobe))
-		return;
-
-	down_write(&uprobe->register_rwsem);
-	__uprobe_unregister(uprobe, uc);
-	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
+	uprobe_unregister_batch(inode, 1, uprobe_consumer_identity, uc);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
 /*
- * __uprobe_register - register a probe
- * @inode: the file in which the probe has to be placed.
- * @offset: offset from the start of the file.
- * @uc: information on howto handle the probe..
+ * uprobe_register_batch - register a batch of probes for a given inode
+ * @inode: the file in which the probes have to be placed.
+ * @cnt: number of probes to register
+ * @get_uprobe_consumer: a callback that returns Nth uprobe_consumer
+ * @ctx: an arbitrary context passed through into get_uprobe_consumer callback
+ *
+ * uprobe_consumer instance itself contains offset and (optional)
+ * ref_ctr_offset within inode to attach to.
+ *
+ * On success, each attached uprobe_consumer assumes one refcount taken for
+ * respective uprobe instance (uniquely identified by inode+offset
+ * combination). Each uprobe_consumer is expected to eventually be detached
+ * through uprobe_unregister() or uprobe_unregister_batch() call, dropping
+ * their owning refcount.
+ *
+ * Caller of uprobe_register()/uprobe_register_batch() is required to keep
+ * @inode (and the containing mount) referenced.
  *
- * Apart from the access refcount, __uprobe_register() takes a creation
- * refcount (thro alloc_uprobe) if and only if this @uprobe is getting
- * inserted into the rbtree (i.e first consumer for a @inode:@offset
- * tuple).  Creation refcount stops uprobe_unregister from freeing the
- * @uprobe even before the register operation is complete. Creation
- * refcount is released when the last @uc for the @uprobe
- * unregisters. Caller of __uprobe_register() is required to keep @inode
- * (and the containing mount) referenced.
+ * If not all probes are successfully installed, then all the successfully
+ * installed ones are rolled back. Note, there is no atomicity guarantees
+ * w.r.t. batch attachment. Some probes might start firing before batch
+ * attachment is completed. Even more so, some consumers might fire even if
+ * overall batch attachment ultimately fails.
  *
  * Return errno if it cannot successully install probes
  * else return 0 (success)
  */
-static int __uprobe_register(struct inode *inode, loff_t offset,
-			     loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+int uprobe_register_batch(struct inode *inode, int cnt,
+			  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
 {
 	struct uprobe *uprobe;
-	int ret;
-
-	/* Uprobe must have at least one set consumer */
-	if (!uc->handler && !uc->ret_handler)
-		return -EINVAL;
+	struct uprobe_consumer *uc;
+	int ret, i;
 
 	/* copy_insn() uses read_mapping_page() or shmem_read_mapping_page() */
 	if (!inode->i_mapping->a_ops->read_folio &&
 	    !shmem_mapping(inode->i_mapping))
 		return -EIO;
-	/* Racy, just to catch the obvious mistakes */
-	if (offset > i_size_read(inode))
-		return -EINVAL;
 
-	/*
-	 * This ensures that copy_from_page(), copy_to_page() and
-	 * __update_ref_ctr() can't cross page boundary.
-	 */
-	if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
-		return -EINVAL;
-	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
+	if (cnt <= 0 || !get_uprobe_consumer)
 		return -EINVAL;
 
-	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-	if (IS_ERR(uprobe))
-		return PTR_ERR(uprobe);
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+
+		/* Each consumer must have at least one set consumer */
+		if (!uc || (!uc->handler && !uc->ret_handler))
+			return -EINVAL;
+		/* Racy, just to catch the obvious mistakes */
+		if (uc->offset > i_size_read(inode))
+			return -EINVAL;
+		if (uc->uprobe)
+			return -EINVAL;
+		/*
+		 * This ensures that copy_from_page(), copy_to_page() and
+		 * __update_ref_ctr() can't cross page boundary.
+		 */
+		if (!IS_ALIGNED(uc->offset, UPROBE_SWBP_INSN_SIZE))
+			return -EINVAL;
+		if (!IS_ALIGNED(uc->ref_ctr_offset, sizeof(short)))
+			return -EINVAL;
+	}
 
-	down_write(&uprobe->register_rwsem);
-	consumer_add(uprobe, uc);
-	ret = register_for_each_vma(uprobe, uc);
-	if (ret)
-		__uprobe_unregister(uprobe, uc);
-	up_write(&uprobe->register_rwsem);
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
 
-	if (ret)
-		put_uprobe(uprobe);
+		uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
+		if (IS_ERR(uprobe)) {
+			ret = PTR_ERR(uprobe);
+			goto cleanup_uprobes;
+		}
+
+		uc->uprobe = uprobe;
+	}
 
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
+		down_write(&uprobe->register_rwsem);
+		consumer_add(uprobe, uc);
+		ret = register_for_each_vma(uprobe, uc);
+		if (ret)
+			__uprobe_unregister(uprobe, uc);
+		up_write(&uprobe->register_rwsem);
+
+		if (ret)
+			goto cleanup_unreg;
+	}
+
+	return 0;
+
+cleanup_unreg:
+	/* unregister all uprobes we managed to register until failure */
+	for (i--; i >= 0; i--) {
+		uc = get_uprobe_consumer(i, ctx);
+
+		down_write(&uprobe->register_rwsem);
+		__uprobe_unregister(uc->uprobe, uc);
+		up_write(&uprobe->register_rwsem);
+	}
+cleanup_uprobes:
+	/* put all the successfully allocated/reused uprobes */
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+
+		if (uc->uprobe)
+			put_uprobe(uc->uprobe);
+		uc->uprobe = NULL;
+	}
 	return ret;
 }
 
 int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
-	return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc);
+	return uprobe_register_batch(inode, 1, uprobe_consumer_identity, uc);
 }
 EXPORT_SYMBOL_GPL(uprobe_register);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (5 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps Andrii Nakryiko
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

To allow unbundling alloc-uprobe-and-insert step which is currently
tightly coupled, inline alloc_uprobe() logic into
uprobe_register_batch() loop. It's called from one place, so we don't
really lose much in terms of maintainability.

No functional changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 65 ++++++++++++++++++-----------------------
 1 file changed, 28 insertions(+), 37 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 68fdf1b8e4bf..0f928a72a658 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -869,40 +869,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe)
 		(unsigned long long) uprobe->ref_ctr_offset);
 }
 
-static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
-				   loff_t ref_ctr_offset)
-{
-	struct uprobe *uprobe, *cur_uprobe;
-
-	uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
-	if (!uprobe)
-		return ERR_PTR(-ENOMEM);
-
-	uprobe->inode = inode;
-	uprobe->offset = offset;
-	uprobe->ref_ctr_offset = ref_ctr_offset;
-	init_rwsem(&uprobe->register_rwsem);
-	init_rwsem(&uprobe->consumer_rwsem);
-	RB_CLEAR_NODE(&uprobe->rb_node);
-	atomic64_set(&uprobe->ref, 1);
-
-	/* add to uprobes_tree, sorted on inode:offset */
-	cur_uprobe = insert_uprobe(uprobe);
-	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe != uprobe) {
-		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
-			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-			put_uprobe(cur_uprobe);
-			kfree(uprobe);
-			return ERR_PTR(-EINVAL);
-		}
-		kfree(uprobe);
-		uprobe = cur_uprobe;
-	}
-
-	return uprobe;
-}
-
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	down_write(&uprobe->consumer_rwsem);
@@ -1332,14 +1298,39 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 
 	for (i = 0; i < cnt; i++) {
+		struct uprobe *cur_uprobe;
+
 		uc = get_uprobe_consumer(i, ctx);
 
-		uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
-		if (IS_ERR(uprobe)) {
-			ret = PTR_ERR(uprobe);
+		uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+		if (!uprobe) {
+			ret = -ENOMEM;
 			goto cleanup_uprobes;
 		}
 
+		uprobe->inode = inode;
+		uprobe->offset = uc->offset;
+		uprobe->ref_ctr_offset = uc->ref_ctr_offset;
+		init_rwsem(&uprobe->register_rwsem);
+		init_rwsem(&uprobe->consumer_rwsem);
+		RB_CLEAR_NODE(&uprobe->rb_node);
+		atomic64_set(&uprobe->ref, 1);
+
+		/* add to uprobes_tree, sorted on inode:offset */
+		cur_uprobe = insert_uprobe(uprobe);
+		/* a uprobe exists for this inode:offset combination */
+		if (cur_uprobe != uprobe) {
+			if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
+				ref_ctr_mismatch_warn(cur_uprobe, uprobe);
+				put_uprobe(cur_uprobe);
+				kfree(uprobe);
+				ret = -EINVAL;
+				goto cleanup_uprobes;
+			}
+			kfree(uprobe);
+			uprobe = cur_uprobe;
+		}
+
 		uc->uprobe = uprobe;
 	}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (6 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register() Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 09/12] uprobes: batch uprobes_treelock during registration Andrii Nakryiko
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Now we are ready to split alloc-and-insert coupled step into two
separate phases.

First, we allocate and prepare all potentially-to-be-inserted uprobe
instances, assuming corresponding uprobes are not yet in uprobes_tree.
This is needed so that we don't do memory allocations under
uprobes_treelock (once we batch locking for each step).

Second, we insert new uprobes or reuse already existing ones into
uprobes_tree. Any uprobe that turned out to be not necessary is
immediately freed, as there are no other references to it.

This concludes preparations that make uprobes_register_batch() ready to
batch and optimize locking per each phase.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0f928a72a658..128677ffe662 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1297,9 +1297,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 			return -EINVAL;
 	}
 
+	/* pre-allocate new uprobe instances */
 	for (i = 0; i < cnt; i++) {
-		struct uprobe *cur_uprobe;
-
 		uc = get_uprobe_consumer(i, ctx);
 
 		uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
@@ -1316,6 +1315,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		RB_CLEAR_NODE(&uprobe->rb_node);
 		atomic64_set(&uprobe->ref, 1);
 
+		uc->uprobe = uprobe;
+	}
+
+	for (i = 0; i < cnt; i++) {
+		struct uprobe *cur_uprobe;
+
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
 		/* add to uprobes_tree, sorted on inode:offset */
 		cur_uprobe = insert_uprobe(uprobe);
 		/* a uprobe exists for this inode:offset combination */
@@ -1323,15 +1331,12 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 			if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 				ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 				put_uprobe(cur_uprobe);
-				kfree(uprobe);
 				ret = -EINVAL;
 				goto cleanup_uprobes;
 			}
 			kfree(uprobe);
-			uprobe = cur_uprobe;
+			uc->uprobe = cur_uprobe;
 		}
-
-		uc->uprobe = uprobe;
 	}
 
 	for (i = 0; i < cnt; i++) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 09/12] uprobes: batch uprobes_treelock during registration
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (7 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch Andrii Nakryiko
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Now that we have a good separate of each registration step, take
uprobes_treelock just once for relevant registration step, and then
process all relevant uprobes in one go.

Even if writer lock introduces a relatively large delay (as might happen
with per-CPU RW semaphore), this will keep overall batch attachment
reasonably fast.

We teach put_uprobe(), though __put_uprobe() helper, to optionally take
or not uprobes_treelock, to accommodate this pattern.

With these changes we don't need insert_uprobe() operation that
unconditionally takes uprobes_treelock, so get rid of it, leaving only
lower-level __insert_uprobe() helper.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 45 +++++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 128677ffe662..ced85284bbf4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -665,7 +665,7 @@ static void uprobe_free_rcu(struct rcu_head *rcu)
 	kfree(uprobe);
 }
 
-static void put_uprobe(struct uprobe *uprobe)
+static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 {
 	s64 v;
 
@@ -683,7 +683,8 @@ static void put_uprobe(struct uprobe *uprobe)
 	if (unlikely((u32)v == 0)) {
 		bool destroy;
 
-		write_lock(&uprobes_treelock);
+		if (!tree_locked)
+			write_lock(&uprobes_treelock);
 		/*
 		 * We might race with find_uprobe()->__get_uprobe() executed
 		 * from inside read-locked uprobes_treelock, which can bump
@@ -706,7 +707,8 @@ static void put_uprobe(struct uprobe *uprobe)
 		destroy = atomic64_read(&uprobe->ref) == v;
 		if (destroy && uprobe_is_active(uprobe))
 			rb_erase(&uprobe->rb_node, &uprobes_tree);
-		write_unlock(&uprobes_treelock);
+		if (!tree_locked)
+			write_unlock(&uprobes_treelock);
 
 		/*
 		 * Beyond here we don't need RCU protection, we are either the
@@ -745,6 +747,11 @@ static void put_uprobe(struct uprobe *uprobe)
 	rcu_read_unlock_trace();
 }
 
+static void put_uprobe(struct uprobe *uprobe)
+{
+	__put_uprobe(uprobe, false);
+}
+
 static __always_inline
 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
 	       const struct uprobe *r)
@@ -844,21 +851,6 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 	return u;
 }
 
-/*
- * Acquire uprobes_treelock and insert uprobe into uprobes_tree
- * (or reuse existing one, see __insert_uprobe() comments above).
- */
-static struct uprobe *insert_uprobe(struct uprobe *uprobe)
-{
-	struct uprobe *u;
-
-	write_lock(&uprobes_treelock);
-	u = __insert_uprobe(uprobe);
-	write_unlock(&uprobes_treelock);
-
-	return u;
-}
-
 static void
 ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe)
 {
@@ -1318,6 +1310,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		uc->uprobe = uprobe;
 	}
 
+	ret = 0;
+	write_lock(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		struct uprobe *cur_uprobe;
 
@@ -1325,19 +1319,24 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		uprobe = uc->uprobe;
 
 		/* add to uprobes_tree, sorted on inode:offset */
-		cur_uprobe = insert_uprobe(uprobe);
+		cur_uprobe = __insert_uprobe(uprobe);
 		/* a uprobe exists for this inode:offset combination */
 		if (cur_uprobe != uprobe) {
 			if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 				ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-				put_uprobe(cur_uprobe);
+
+				__put_uprobe(cur_uprobe, true);
 				ret = -EINVAL;
-				goto cleanup_uprobes;
+				goto unlock_treelock;
 			}
 			kfree(uprobe);
 			uc->uprobe = cur_uprobe;
 		}
 	}
+unlock_treelock:
+	write_unlock(&uprobes_treelock);
+	if (ret)
+		goto cleanup_uprobes;
 
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
@@ -1367,13 +1366,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 cleanup_uprobes:
 	/* put all the successfully allocated/reused uprobes */
+	write_lock(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 
 		if (uc->uprobe)
-			put_uprobe(uc->uprobe);
+			__put_uprobe(uc->uprobe, true);
 		uc->uprobe = NULL;
 	}
+	write_unlock(&uprobes_treelock);
 	return ret;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (8 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 09/12] uprobes: batch uprobes_treelock during registration Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes Andrii Nakryiko
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Similarly to what we did for uprobes_register_batch(), split
uprobe_unregister_batch() into two separate phases with different
locking needs.

First, all the VMA unregistration is performed while holding
a per-uprobe register_rwsem.

Then, we take a batched uprobes_treelock once to __put_uprobe() for all
uprobe_consumers. That uprobe_consumer->uprobe field is really handy in
helping with this.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ced85284bbf4..bb480a2400e1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1189,8 +1189,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  */
 void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn get_uprobe_consumer, void *ctx)
 {
-	struct uprobe *uprobe;
 	struct uprobe_consumer *uc;
+	struct uprobe *uprobe;
 	int i;
 
 	for (i = 0; i < cnt; i++) {
@@ -1203,10 +1203,20 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		down_write(&uprobe->register_rwsem);
 		__uprobe_unregister(uprobe, uc);
 		up_write(&uprobe->register_rwsem);
-		put_uprobe(uprobe);
+	}
 
+	write_lock(&uprobes_treelock);
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
+		if (!uprobe)
+			continue;
+
+		__put_uprobe(uprobe, true);
 		uc->uprobe = NULL;
 	}
+	write_unlock(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (9 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-01 22:39 ` [PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore Andrii Nakryiko
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Switch internals of BPF multi-uprobes to batched version of uprobe
registration and unregistration APIs.

This also simplifies BPF clean up code a bit thanks to all-or-nothing
guarantee of uprobes_register_batch().

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/trace/bpf_trace.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ba62baec3152..41bf6736c542 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx {
 	struct bpf_uprobe *uprobe;
 };
 
-static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
-				  u32 cnt)
+static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, void *ctx)
 {
-	u32 i;
+	struct bpf_uprobe_multi_link *link = ctx;
 
-	for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer);
-	}
+	return &link->uprobes[idx].consumer;
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
@@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link *link)
 	struct bpf_uprobe_multi_link *umulti_link;
 
 	umulti_link = container_of(link, struct bpf_uprobe_multi_link, link);
-	bpf_uprobe_unregister(&umulti_link->path, umulti_link->uprobes, umulti_link->cnt);
+	uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), umulti_link->cnt,
+				umulti_link_get_uprobe_consumer, umulti_link);
 	if (umulti_link->task)
 		put_task_struct(umulti_link->task);
 	path_put(&umulti_link->path);
@@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	bpf_link_init(&link->link, BPF_LINK_TYPE_UPROBE_MULTI,
 		      &bpf_uprobe_multi_link_lops, prog);
 
-	for (i = 0; i < cnt; i++) {
-		err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer);
-		if (err) {
-			bpf_uprobe_unregister(&path, uprobes, i);
-			goto error_free;
-		}
-	}
+	err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt,
+				    umulti_link_get_uprobe_consumer, link);
+	if (err)
+		goto error_free;
 
 	err = bpf_link_prime(&link->link, &link_primer);
 	if (err)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (10 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes Andrii Nakryiko
@ 2024-07-01 22:39 ` Andrii Nakryiko
  2024-07-02 10:23 ` [PATCH v2 00/12] uprobes: add batched register/unregister APIs and " Peter Zijlstra
  2024-07-03 21:33 ` Andrii Nakryiko
  13 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:39 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

With all the batch uprobe APIs work we are now finally ready to reap the
benefits. Switch uprobes_treelock from reader-writer spinlock to a much
more efficient and scalable per-CPU RW semaphore.

Benchmarks and numbers time. I've used BPF selftests' bench tool,
trig-uprobe-nop benchmark specifically, to see how uprobe total
throughput scales with number of competing threads (mapped to individual
CPUs). Here are results:

  # threads   BEFORE (mln/s)    AFTER (mln/s)
  ---------   --------------    -------------
  1           3.131             3.140
  2           3.394             3.601
  3           3.630             3.960
  4           3.317             3.551
  5           3.448             3.464
  6           3.345             3.283
  7           3.469             3.444
  8           3.182             3.258
  9           3.138             3.139
  10          2.999             3.212
  11          2.903             3.183
  12          2.802             3.027
  13          2.792             3.027
  14          2.695             3.086
  15          2.822             2.965
  16          2.679             2.939
  17          2.622             2.888
  18          2.628             2.914
  19          2.702             2.836
  20          2.561             2.837

One can see that per-CPU RW semaphore-based implementation scales better
with number of CPUs (especially that single CPU throughput is basically
the same).

Note, scalability is still limited by register_rwsem and this will
hopefully be address in follow up patch set(s).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bb480a2400e1..1d76551e5e23 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
-static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);	/* serialize rbtree access */
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -684,7 +684,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 		bool destroy;
 
 		if (!tree_locked)
-			write_lock(&uprobes_treelock);
+			percpu_down_write(&uprobes_treelock);
 		/*
 		 * We might race with find_uprobe()->__get_uprobe() executed
 		 * from inside read-locked uprobes_treelock, which can bump
@@ -708,7 +708,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 		if (destroy && uprobe_is_active(uprobe))
 			rb_erase(&uprobe->rb_node, &uprobes_tree);
 		if (!tree_locked)
-			write_unlock(&uprobes_treelock);
+			percpu_up_write(&uprobes_treelock);
 
 		/*
 		 * Beyond here we don't need RCU protection, we are either the
@@ -816,9 +816,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
 	struct uprobe *uprobe;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 
 	return uprobe;
 }
@@ -1205,7 +1205,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		up_write(&uprobe->register_rwsem);
 	}
 
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 		uprobe = uc->uprobe;
@@ -1216,7 +1216,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		__put_uprobe(uprobe, true);
 		uc->uprobe = NULL;
 	}
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
@@ -1321,7 +1321,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 
 	ret = 0;
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		struct uprobe *cur_uprobe;
 
@@ -1344,7 +1344,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		}
 	}
 unlock_treelock:
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 	if (ret)
 		goto cleanup_uprobes;
 
@@ -1376,7 +1376,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 cleanup_uprobes:
 	/* put all the successfully allocated/reused uprobes */
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 
@@ -1384,7 +1384,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 			__put_uprobe(uc->uprobe, true);
 		uc->uprobe = NULL;
 	}
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 	return ret;
 }
 
@@ -1492,7 +1492,7 @@ static void build_probe_list(struct inode *inode,
 	min = vaddr_to_offset(vma, start);
 	max = min + (end - start) - 1;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	n = find_node_in_range(inode, min, max);
 	if (n) {
 		for (t = n; t; t = rb_prev(t)) {
@@ -1510,7 +1510,7 @@ static void build_probe_list(struct inode *inode,
 			list_add(&u->pending_list, head);
 		}
 	}
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 }
 
 /* @vma contains reference counter, not the probed instruction. */
@@ -1601,9 +1601,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
 	min = vaddr_to_offset(vma, start);
 	max = min + (end - start) - 1;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	n = find_node_in_range(inode, min, max);
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 
 	return !!n;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-01 22:39 ` [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
@ 2024-07-02 10:22   ` Peter Zijlstra
  2024-07-02 17:54     ` Andrii Nakryiko
  2024-07-03 13:36   ` Peter Zijlstra
  2024-07-05 15:37   ` Oleg Nesterov
  2 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-02 10:22 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, oleg, mingo, bpf, jolsa,
	paulmck, clm

On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:

> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 23449a8c5e7e..560cf1ca512a 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>  
>  struct uprobe {
>  	struct rb_node		rb_node;	/* node in the rb tree */
> -	refcount_t		ref;
> +	atomic64_t		ref;		/* see UPROBE_REFCNT_GET below */
>  	struct rw_semaphore	register_rwsem;
>  	struct rw_semaphore	consumer_rwsem;
> +	struct rcu_head		rcu;
>  	struct list_head	pending_list;
>  	struct uprobe_consumer	*consumers;
>  	struct inode		*inode;		/* Also hold a ref to inode */
> @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
>  			*(uprobe_opcode_t *)&auprobe->insn);
>  }
>  
> -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> +/*
> + * Uprobe's 64-bit refcount is actually two independent counters co-located in
> + * a single u64 value:
> + *   - lower 32 bits are just a normal refcount with is increment and
> + *   decremented on get and put, respectively, just like normal refcount
> + *   would;
> + *   - upper 32 bits are a tag (or epoch, if you will), which is always
> + *   incremented by one, no matter whether get or put operation is done.
> + *
> + * This upper counter is meant to distinguish between:
> + *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
> + *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
> + *   sequence, also proceeding to "destruction".
> + *
> + * In both cases refcount drops to zero, but in one case it will have epoch N,
> + * while the second drop to zero will have a different epoch N + 2, allowing
> + * first destructor to bail out because epoch changed between refcount going
> + * to zero and put_uprobe() taking uprobes_treelock (under which overall
> + * 64-bit refcount is double-checked, see put_uprobe() for details).
> + *
> + * Lower 32-bit counter is not meant to over overflow, while it's expected

So refcount_t very explicitly handles both overflow and underflow and
screams bloody murder if they happen. Your thing does not.. 

> + * that upper 32-bit counter will overflow occasionally. Note, though, that we
> + * can't allow upper 32-bit counter to "bleed over" into lower 32-bit counter,
> + * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
> + * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
> + * epoch effectively a 31-bit counter with highest bit used as a flag to
> + * perform a fix-up. This ensures epoch and refcnt parts do not "interfere".
> + *
> + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> + * epoch and refcnt parts atomically with one atomic_add().
> + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> + * *increment* epoch part.
> + */
> +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x0000000100000001LL */
> +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0x00000000ffffffffLL */
> +
> +/*
> + * Caller has to make sure that:
> + *   a) either uprobe's refcnt is positive before this call;
> + *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> + *      preventing uprobe's destructor from removing it from uprobes_tree.
> + *
> + * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
> + * it detects that its refcount went back to being positive again inbetween it
> + * dropping to zero at some point and (potentially delayed) destructor
> + * callback actually running.
> + */
> +static struct uprobe *__get_uprobe(struct uprobe *uprobe)
>  {
> -	refcount_inc(&uprobe->ref);
> +	s64 v;
> +
> +	v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);

Distinct lack of u32 overflow testing here..

> +
> +	/*
> +	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> +	 * we don't retry because there is another CPU that just managed to
> +	 * update refcnt and will attempt the same "fix up". Eventually one of
> +	 * them will succeed to clear highset bit.
> +	 */
> +	if (unlikely(v < 0))
> +		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> +
>  	return uprobe;
>  }

>  static void put_uprobe(struct uprobe *uprobe)
>  {
> -	if (refcount_dec_and_test(&uprobe->ref)) {
> +	s64 v;
> +
> +	/*
> +	 * here uprobe instance is guaranteed to be alive, so we use Tasks
> +	 * Trace RCU to guarantee that uprobe won't be freed from under us, if

What's wrong with normal RCU?

> +	 * we end up being a losing "destructor" inside uprobe_treelock'ed
> +	 * section double-checking uprobe->ref value below.
> +	 * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> +	 */
> +	rcu_read_lock_trace();
> +
> +	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);

No underflow handling... because nobody ever had a double put bug.

> +	if (unlikely((u32)v == 0)) {
> +		bool destroy;
> +
> +		write_lock(&uprobes_treelock);
> +		/*
> +		 * We might race with find_uprobe()->__get_uprobe() executed
> +		 * from inside read-locked uprobes_treelock, which can bump
> +		 * refcount from zero back to one, after we got here. Even
> +		 * worse, it's possible for another CPU to do 0 -> 1 -> 0
> +		 * transition between this CPU doing atomic_add() and taking
> +		 * uprobes_treelock. In either case this CPU should bail out
> +		 * and not proceed with destruction.
> +		 *
> +		 * So now that we have exclusive write lock, we double check
> +		 * the total 64-bit refcount value, which includes the epoch.
> +		 * If nothing changed (i.e., epoch is the same and refcnt is
> +		 * still zero), we are good and we proceed with the clean up.
> +		 *
> +		 * But if it managed to be updated back at least once, we just
> +		 * pretend it never went to zero. If lower 32-bit refcnt part
> +		 * drops to zero again, another CPU will proceed with
> +		 * destruction, due to more up to date epoch.
> +		 */
> +		destroy = atomic64_read(&uprobe->ref) == v;
> +		if (destroy && uprobe_is_active(uprobe))
> +			rb_erase(&uprobe->rb_node, &uprobes_tree);
> +		write_unlock(&uprobes_treelock);
> +
> +		/*
> +		 * Beyond here we don't need RCU protection, we are either the
> +		 * winning destructor and we control the rest of uprobe's
> +		 * lifetime; or we lost and we are bailing without accessing
> +		 * uprobe fields anymore.
> +		 */
> +		rcu_read_unlock_trace();
> +
> +		/* uprobe got resurrected, pretend we never tried to free it */
> +		if (!destroy)
> +			return;
> +
>  		/*
>  		 * If application munmap(exec_vma) before uprobe_unregister()
>  		 * gets called, we don't get a chance to remove uprobe from
> @@ -604,8 +728,21 @@ static void put_uprobe(struct uprobe *uprobe)
>  		mutex_lock(&delayed_uprobe_lock);
>  		delayed_uprobe_remove(uprobe, NULL);
>  		mutex_unlock(&delayed_uprobe_lock);
> -		kfree(uprobe);
> +
> +		call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
> +		return;
>  	}
> +
> +	/*
> +	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> +	 * we don't retry because there is another CPU that just managed to
> +	 * update refcnt and will attempt the same "fix up". Eventually one of
> +	 * them will succeed to clear highset bit.
> +	 */
> +	if (unlikely(v < 0))
> +		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> +
> +	rcu_read_unlock_trace();
>  }

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (11 preceding siblings ...)
  2024-07-01 22:39 ` [PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore Andrii Nakryiko
@ 2024-07-02 10:23 ` Peter Zijlstra
  2024-07-02 11:54   ` Peter Zijlstra
  2024-07-03 21:33 ` Andrii Nakryiko
  13 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-02 10:23 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, oleg, mingo, bpf, jolsa,
	paulmck, clm

On Mon, Jul 01, 2024 at 03:39:23PM -0700, Andrii Nakryiko wrote:
> This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> to per-CPU RW semaphore, which has better performance and scales better under
> contention and multiple parallel threads triggering lots of uprobes.

Why not RCU + normal lock thing?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-02 10:23 ` [PATCH v2 00/12] uprobes: add batched register/unregister APIs and " Peter Zijlstra
@ 2024-07-02 11:54   ` Peter Zijlstra
  2024-07-02 12:01     ` Peter Zijlstra
  2024-07-02 17:54     ` Andrii Nakryiko
  0 siblings, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-02 11:54 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, oleg, mingo, bpf, jolsa,
	paulmck, clm, linux-kernel


+LKML

On Tue, Jul 02, 2024 at 12:23:53PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 01, 2024 at 03:39:23PM -0700, Andrii Nakryiko wrote:
> > This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> > to per-CPU RW semaphore, which has better performance and scales better under
> > contention and multiple parallel threads triggering lots of uprobes.
> 
> Why not RCU + normal lock thing?

Something like the *completely* untested below.

---
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2c83ba776fc7..03b38f3f7be3 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
 	refcount_t		ref;
+	struct rcu_head		rcu;
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
@@ -67,7 +69,7 @@ struct uprobe {
 	 * The generic code assumes that it has two members of unknown type
 	 * owned by the arch-specific code:
 	 *
-	 * 	insn -	copy_insn() saves the original instruction here for
+	 *	insn -	copy_insn() saves the original instruction here for
 	 *		arch_uprobe_analyze_insn().
 	 *
 	 *	ixol -	potentially modified instruction to execute out of
@@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
 	return uprobe;
 }
 
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+	kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
 	if (refcount_dec_and_test(&uprobe->ref)) {
@@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
+
+		call_rcu(&uprobe->rcu, uprobe_free_rcu);
 	}
 }
 
@@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
 	struct uprobe *uprobe;
+	unsigned seq;
 
-	read_lock(&uprobes_treelock);
-	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	guard(rcu)();
 
-	return uprobe;
+	do {
+		seq = read_seqcount_begin(&uprobes_seqcount);
+		uprobes = __find_uprobe(inode, offset);
+		if (uprobes) {
+			/*
+			 * Lockless RB-tree lookups are prone to false-negatives.
+			 * If they find something, it's good. If they do not find,
+			 * it needs to be validated.
+			 */
+			return uprobes;
+		}
+	} while (read_seqcount_retry(&uprobes_seqcount, seq));
+
+	/* Really didn't find anything. */
+	return NULL;
 }
 
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
@@ -702,7 +724,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 	struct uprobe *u;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	u = __insert_uprobe(uprobe);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 
 	return u;
@@ -936,7 +960,9 @@ static void delete_uprobe(struct uprobe *uprobe)
 		return;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
 	put_uprobe(uprobe);

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-02 11:54   ` Peter Zijlstra
@ 2024-07-02 12:01     ` Peter Zijlstra
  2024-07-02 17:54     ` Andrii Nakryiko
  1 sibling, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-02 12:01 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, oleg, mingo, bpf, jolsa,
	paulmck, clm, linux-kernel

On Tue, Jul 02, 2024 at 01:54:47PM +0200, Peter Zijlstra wrote:

> @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
>  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
>  {
>  	struct uprobe *uprobe;
> +	unsigned seq;
>  
> +	guard(rcu)();
>  
> +	do {
> +		seq = read_seqcount_begin(&uprobes_seqcount);
> +		uprobes = __find_uprobe(inode, offset);
> +		if (uprobes) {
> +			/*
> +			 * Lockless RB-tree lookups are prone to false-negatives.
> +			 * If they find something, it's good. If they do not find,
> +			 * it needs to be validated.
> +			 */
> +			return uprobes;
> +		}
> +	} while (read_seqcount_retry(&uprobes_seqcount, seq));
> +
> +	/* Really didn't find anything. */
> +	return NULL;
>  }
>  
>  static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
> @@ -702,7 +724,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>  	struct uprobe *u;
>  
>  	write_lock(&uprobes_treelock);
> +	write_seqcount_begin(&uprobes_seqcount);
>  	u = __insert_uprobe(uprobe);
> +	write_seqcount_end(&uprobes_seqcount);
>  	write_unlock(&uprobes_treelock);
>  
>  	return u;

Strictly speaking I suppose we should add rb_find_rcu() and
rc_find_add_rcu() that sprinkle some rcu_dereference_raw() and
rb_link_node_rcu() around. See the examples in __lt_find() and
__lt_insert().


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-02 11:54   ` Peter Zijlstra
  2024-07-02 12:01     ` Peter Zijlstra
@ 2024-07-02 17:54     ` Andrii Nakryiko
  2024-07-02 19:18       ` Peter Zijlstra
  1 sibling, 1 reply; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-02 17:54 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E . McKenney
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	mingo, bpf, jolsa, clm, linux-kernel

On Tue, Jul 2, 2024 at 4:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> +LKML
>
> On Tue, Jul 02, 2024 at 12:23:53PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 01, 2024 at 03:39:23PM -0700, Andrii Nakryiko wrote:
> > > This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> > > to per-CPU RW semaphore, which has better performance and scales better under
> > > contention and multiple parallel threads triggering lots of uprobes.
> >
> > Why not RCU + normal lock thing?
>
> Something like the *completely* untested below.
>
> ---
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 2c83ba776fc7..03b38f3f7be3 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
>  #define no_uprobe_events()     RB_EMPTY_ROOT(&uprobes_tree)
>
>  static DEFINE_RWLOCK(uprobes_treelock);        /* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>
>  #define UPROBES_HASH_SZ        13
>  /* serialize uprobe->pending_list */
> @@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>  struct uprobe {
>         struct rb_node          rb_node;        /* node in the rb tree */
>         refcount_t              ref;
> +       struct rcu_head         rcu;
>         struct rw_semaphore     register_rwsem;
>         struct rw_semaphore     consumer_rwsem;
>         struct list_head        pending_list;
> @@ -67,7 +69,7 @@ struct uprobe {
>          * The generic code assumes that it has two members of unknown type
>          * owned by the arch-specific code:
>          *
> -        *      insn -  copy_insn() saves the original instruction here for
> +        *      insn -  copy_insn() saves the original instruction here for
>          *              arch_uprobe_analyze_insn().
>          *
>          *      ixol -  potentially modified instruction to execute out of
> @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
>         return uprobe;
>  }
>
> +static void uprobe_free_rcu(struct rcu_head *rcu)
> +{
> +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> +       kfree(uprobe);
> +}
> +
>  static void put_uprobe(struct uprobe *uprobe)
>  {
>         if (refcount_dec_and_test(&uprobe->ref)) {
> @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)

right above this we have roughly this:

percpu_down_write(&uprobes_treelock);

/* refcount check */
rb_erase(&uprobe->rb_node, &uprobes_tree);

percpu_up_write(&uprobes_treelock);


This writer lock is necessary for modification of the RB tree. And I
was under impression that I shouldn't be doing
percpu_(down|up)_write() inside the normal
rcu_read_lock()/rcu_read_unlock() region (percpu_down_write has
might_sleep() in it). But maybe I'm wrong, hopefully Paul can help to
clarify.

But actually what's wrong with RCU Tasks Trace flavor? I will
ultimately use it anyway to avoid uprobe taking unnecessary refcount
and to protect uprobe->consumers iteration and uc->handler() calls,
which could be sleepable, so would need rcu_read_lock_trace().

>                 mutex_lock(&delayed_uprobe_lock);
>                 delayed_uprobe_remove(uprobe, NULL);
>                 mutex_unlock(&delayed_uprobe_lock);
> -               kfree(uprobe);
> +
> +               call_rcu(&uprobe->rcu, uprobe_free_rcu);
>         }
>  }
>
> @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
>  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
>  {
>         struct uprobe *uprobe;
> +       unsigned seq;
>
> -       read_lock(&uprobes_treelock);
> -       uprobe = __find_uprobe(inode, offset);
> -       read_unlock(&uprobes_treelock);
> +       guard(rcu)();
>
> -       return uprobe;
> +       do {
> +               seq = read_seqcount_begin(&uprobes_seqcount);
> +               uprobes = __find_uprobe(inode, offset);
> +               if (uprobes) {
> +                       /*
> +                        * Lockless RB-tree lookups are prone to false-negatives.
> +                        * If they find something, it's good. If they do not find,
> +                        * it needs to be validated.
> +                        */
> +                       return uprobes;
> +               }
> +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> +
> +       /* Really didn't find anything. */
> +       return NULL;
>  }

Honest question here, as I don't understand the tradeoffs well enough.
Is there a lot of benefit to switching to seqcount lock vs using
percpu RW semaphore (previously recommended by Ingo). The latter is a
nice drop-in replacement and seems to be very fast and scale well.
Right now we are bottlenecked on uprobe->register_rwsem (not
uprobes_treelock anymore), which is currently limiting the scalability
of uprobes and I'm going to work on that next once I'm done with this
series.

>
>  static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
> @@ -702,7 +724,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>         struct uprobe *u;
>
>         write_lock(&uprobes_treelock);
> +       write_seqcount_begin(&uprobes_seqcount);
>         u = __insert_uprobe(uprobe);
> +       write_seqcount_end(&uprobes_seqcount);
>         write_unlock(&uprobes_treelock);
>
>         return u;
> @@ -936,7 +960,9 @@ static void delete_uprobe(struct uprobe *uprobe)
>                 return;
>
>         write_lock(&uprobes_treelock);
> +       write_seqcount_begin(&uprobes_seqcount);
>         rb_erase(&uprobe->rb_node, &uprobes_tree);
> +       write_seqcount_end(&uprobes_seqcount);
>         write_unlock(&uprobes_treelock);
>         RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
>         put_uprobe(uprobe);

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-02 10:22   ` Peter Zijlstra
@ 2024-07-02 17:54     ` Andrii Nakryiko
  0 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-02 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	mingo, bpf, jolsa, paulmck, clm

On Tue, Jul 2, 2024 at 3:23 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:
>
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 23449a8c5e7e..560cf1ca512a 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
> >
> >  struct uprobe {
> >       struct rb_node          rb_node;        /* node in the rb tree */
> > -     refcount_t              ref;
> > +     atomic64_t              ref;            /* see UPROBE_REFCNT_GET below */
> >       struct rw_semaphore     register_rwsem;
> >       struct rw_semaphore     consumer_rwsem;
> > +     struct rcu_head         rcu;
> >       struct list_head        pending_list;
> >       struct uprobe_consumer  *consumers;
> >       struct inode            *inode;         /* Also hold a ref to inode */
> > @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> >                       *(uprobe_opcode_t *)&auprobe->insn);
> >  }
> >
> > -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > +/*
> > + * Uprobe's 64-bit refcount is actually two independent counters co-located in
> > + * a single u64 value:
> > + *   - lower 32 bits are just a normal refcount with is increment and
> > + *   decremented on get and put, respectively, just like normal refcount
> > + *   would;
> > + *   - upper 32 bits are a tag (or epoch, if you will), which is always
> > + *   incremented by one, no matter whether get or put operation is done.
> > + *
> > + * This upper counter is meant to distinguish between:
> > + *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
> > + *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
> > + *   sequence, also proceeding to "destruction".
> > + *
> > + * In both cases refcount drops to zero, but in one case it will have epoch N,
> > + * while the second drop to zero will have a different epoch N + 2, allowing
> > + * first destructor to bail out because epoch changed between refcount going
> > + * to zero and put_uprobe() taking uprobes_treelock (under which overall
> > + * 64-bit refcount is double-checked, see put_uprobe() for details).
> > + *
> > + * Lower 32-bit counter is not meant to over overflow, while it's expected
>
> So refcount_t very explicitly handles both overflow and underflow and
> screams bloody murder if they happen. Your thing does not..
>

Correct, because I considered that to be practically impossible to
overflow this refcount. The main source of refcounts are uretprobes
that are holding uprobe references. We limit the depth of supported
recursion to 64, so you'd need 30+ millions of threads all hitting the
same uprobe/uretprobe to overflow this. I guess in theory it could
happen (not sure if we have some limits on total number of threads in
the system and if they can be bumped to over 30mln), but it just
seemed out of realm of practical possibility.

Having said that, I can add similar checks that refcount_t does in
refcount_add and do what refcount_warn_saturate does as well.

> > + * that upper 32-bit counter will overflow occasionally. Note, though, that we
> > + * can't allow upper 32-bit counter to "bleed over" into lower 32-bit counter,
> > + * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
> > + * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
> > + * epoch effectively a 31-bit counter with highest bit used as a flag to
> > + * perform a fix-up. This ensures epoch and refcnt parts do not "interfere".
> > + *
> > + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> > + * epoch and refcnt parts atomically with one atomic_add().
> > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> > + * *increment* epoch part.
> > + */
> > +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x0000000100000001LL */
> > +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0x00000000ffffffffLL */
> > +
> > +/*
> > + * Caller has to make sure that:
> > + *   a) either uprobe's refcnt is positive before this call;
> > + *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> > + *      preventing uprobe's destructor from removing it from uprobes_tree.
> > + *
> > + * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
> > + * it detects that its refcount went back to being positive again inbetween it
> > + * dropping to zero at some point and (potentially delayed) destructor
> > + * callback actually running.
> > + */
> > +static struct uprobe *__get_uprobe(struct uprobe *uprobe)
> >  {
> > -     refcount_inc(&uprobe->ref);
> > +     s64 v;
> > +
> > +     v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
>
> Distinct lack of u32 overflow testing here..
>
> > +
> > +     /*
> > +      * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> > +      * we don't retry because there is another CPU that just managed to
> > +      * update refcnt and will attempt the same "fix up". Eventually one of
> > +      * them will succeed to clear highset bit.
> > +      */
> > +     if (unlikely(v < 0))
> > +             (void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> > +
> >       return uprobe;
> >  }
>
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> > -     if (refcount_dec_and_test(&uprobe->ref)) {
> > +     s64 v;
> > +
> > +     /*
> > +      * here uprobe instance is guaranteed to be alive, so we use Tasks
> > +      * Trace RCU to guarantee that uprobe won't be freed from under us, if
>
> What's wrong with normal RCU?
>

will reply in another thread to keep things focused

> > +      * we end up being a losing "destructor" inside uprobe_treelock'ed
> > +      * section double-checking uprobe->ref value below.
> > +      * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > +      */
> > +     rcu_read_lock_trace();
> > +
> > +     v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
>
> No underflow handling... because nobody ever had a double put bug.
>

ack, see above, will add checks and special saturation value.

[...]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-02 17:54     ` Andrii Nakryiko
@ 2024-07-02 19:18       ` Peter Zijlstra
  2024-07-02 23:56         ` Paul E. McKenney
  2024-07-03  4:47         ` Andrii Nakryiko
  0 siblings, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-02 19:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Paul E . McKenney, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:

> > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> >         return uprobe;
> >  }
> >
> > +static void uprobe_free_rcu(struct rcu_head *rcu)
> > +{
> > +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > +       kfree(uprobe);
> > +}
> > +
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> >         if (refcount_dec_and_test(&uprobe->ref)) {
> > @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
> 
> right above this we have roughly this:
> 
> percpu_down_write(&uprobes_treelock);
> 
> /* refcount check */
> rb_erase(&uprobe->rb_node, &uprobes_tree);
> 
> percpu_up_write(&uprobes_treelock);
> 
> 
> This writer lock is necessary for modification of the RB tree. And I
> was under impression that I shouldn't be doing
> percpu_(down|up)_write() inside the normal
> rcu_read_lock()/rcu_read_unlock() region (percpu_down_write has
> might_sleep() in it). But maybe I'm wrong, hopefully Paul can help to
> clarify.

preemptible RCU or SRCU would work.

> 
> But actually what's wrong with RCU Tasks Trace flavor? 

Paul, isn't this the RCU flavour you created to deal with
!rcu_is_watching()? The flavour that never should have been created in
favour of just cleaning up the mess instead of making more.

> I will
> ultimately use it anyway to avoid uprobe taking unnecessary refcount
> and to protect uprobe->consumers iteration and uc->handler() calls,
> which could be sleepable, so would need rcu_read_lock_trace().

I don't think you need trace-rcu for that. SRCU would do nicely I think.

> >                 mutex_lock(&delayed_uprobe_lock);
> >                 delayed_uprobe_remove(uprobe, NULL);
> >                 mutex_unlock(&delayed_uprobe_lock);
> > -               kfree(uprobe);
> > +
> > +               call_rcu(&uprobe->rcu, uprobe_free_rcu);
> >         }
> >  }
> >
> > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> >  {
> >         struct uprobe *uprobe;
> > +       unsigned seq;
> >
> > -       read_lock(&uprobes_treelock);
> > -       uprobe = __find_uprobe(inode, offset);
> > -       read_unlock(&uprobes_treelock);
> > +       guard(rcu)();
> >
> > -       return uprobe;
> > +       do {
> > +               seq = read_seqcount_begin(&uprobes_seqcount);
> > +               uprobes = __find_uprobe(inode, offset);
> > +               if (uprobes) {
> > +                       /*
> > +                        * Lockless RB-tree lookups are prone to false-negatives.
> > +                        * If they find something, it's good. If they do not find,
> > +                        * it needs to be validated.
> > +                        */
> > +                       return uprobes;
> > +               }
> > +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > +
> > +       /* Really didn't find anything. */
> > +       return NULL;
> >  }
> 
> Honest question here, as I don't understand the tradeoffs well enough.
> Is there a lot of benefit to switching to seqcount lock vs using
> percpu RW semaphore (previously recommended by Ingo). The latter is a
> nice drop-in replacement and seems to be very fast and scale well.

As you noted, that percpu-rwsem write side is quite insane. And you're
creating this batch complexity to mitigate that.

The patches you propose are quite complex, this alternative not so much.

> Right now we are bottlenecked on uprobe->register_rwsem (not
> uprobes_treelock anymore), which is currently limiting the scalability
> of uprobes and I'm going to work on that next once I'm done with this
> series.

Right, but it looks fairly simple to replace that rwsem with a mutex and
srcu.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-02 19:18       ` Peter Zijlstra
@ 2024-07-02 23:56         ` Paul E. McKenney
  2024-07-03  4:54           ` Andrii Nakryiko
  2024-07-03  7:50           ` Peter Zijlstra
  2024-07-03  4:47         ` Andrii Nakryiko
  1 sibling, 2 replies; 67+ messages in thread
From: Paul E. McKenney @ 2024-07-02 23:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Tue, Jul 02, 2024 at 09:18:57PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:
> 
> > > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > >         return uprobe;
> > >  }
> > >
> > > +static void uprobe_free_rcu(struct rcu_head *rcu)
> > > +{
> > > +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > > +       kfree(uprobe);
> > > +}
> > > +
> > >  static void put_uprobe(struct uprobe *uprobe)
> > >  {
> > >         if (refcount_dec_and_test(&uprobe->ref)) {
> > > @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
> > 
> > right above this we have roughly this:
> > 
> > percpu_down_write(&uprobes_treelock);
> > 
> > /* refcount check */
> > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > 
> > percpu_up_write(&uprobes_treelock);
> > 
> > 
> > This writer lock is necessary for modification of the RB tree. And I
> > was under impression that I shouldn't be doing
> > percpu_(down|up)_write() inside the normal
> > rcu_read_lock()/rcu_read_unlock() region (percpu_down_write has
> > might_sleep() in it). But maybe I'm wrong, hopefully Paul can help to
> > clarify.
> 
> preemptible RCU or SRCU would work.

I agree that SRCU would work from a functional viewpoint.  No so for
preemptible RCU, which permits preemption (and on -rt, blocking for
spinlocks), it does not permit full-up blocking, and for good reason.

> > But actually what's wrong with RCU Tasks Trace flavor? 
> 
> Paul, isn't this the RCU flavour you created to deal with
> !rcu_is_watching()? The flavour that never should have been created in
> favour of just cleaning up the mess instead of making more.

My guess is that you are instead thinking of RCU Tasks Rude, which can
be eliminated once all architectures get their entry/exit/deep-idle
functions either inlined or marked noinstr.

> > I will
> > ultimately use it anyway to avoid uprobe taking unnecessary refcount
> > and to protect uprobe->consumers iteration and uc->handler() calls,
> > which could be sleepable, so would need rcu_read_lock_trace().
> 
> I don't think you need trace-rcu for that. SRCU would do nicely I think.

From a functional viewpoint, agreed.

However, in the past, the memory-barrier and array-indexing overhead
of SRCU has made it a no-go for lightweight probes into fastpath code.
And these cases were what motivated RCU Tasks Trace (as opposed to RCU
Tasks Rude).

The other rule for RCU Tasks Trace is that although readers are permitted
to block, this blocking can be for no longer than a major page fault.
If you need longer-term blocking, then you should instead use SRCU.

							Thanx, Paul

> > >                 mutex_lock(&delayed_uprobe_lock);
> > >                 delayed_uprobe_remove(uprobe, NULL);
> > >                 mutex_unlock(&delayed_uprobe_lock);
> > > -               kfree(uprobe);
> > > +
> > > +               call_rcu(&uprobe->rcu, uprobe_free_rcu);
> > >         }
> > >  }
> > >
> > > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> > >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >         struct uprobe *uprobe;
> > > +       unsigned seq;
> > >
> > > -       read_lock(&uprobes_treelock);
> > > -       uprobe = __find_uprobe(inode, offset);
> > > -       read_unlock(&uprobes_treelock);
> > > +       guard(rcu)();
> > >
> > > -       return uprobe;
> > > +       do {
> > > +               seq = read_seqcount_begin(&uprobes_seqcount);
> > > +               uprobes = __find_uprobe(inode, offset);
> > > +               if (uprobes) {
> > > +                       /*
> > > +                        * Lockless RB-tree lookups are prone to false-negatives.
> > > +                        * If they find something, it's good. If they do not find,
> > > +                        * it needs to be validated.
> > > +                        */
> > > +                       return uprobes;
> > > +               }
> > > +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > > +
> > > +       /* Really didn't find anything. */
> > > +       return NULL;
> > >  }
> > 
> > Honest question here, as I don't understand the tradeoffs well enough.
> > Is there a lot of benefit to switching to seqcount lock vs using
> > percpu RW semaphore (previously recommended by Ingo). The latter is a
> > nice drop-in replacement and seems to be very fast and scale well.
> 
> As you noted, that percpu-rwsem write side is quite insane. And you're
> creating this batch complexity to mitigate that.
> 
> The patches you propose are quite complex, this alternative not so much.
> 
> > Right now we are bottlenecked on uprobe->register_rwsem (not
> > uprobes_treelock anymore), which is currently limiting the scalability
> > of uprobes and I'm going to work on that next once I'm done with this
> > series.
> 
> Right, but it looks fairly simple to replace that rwsem with a mutex and
> srcu.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-02 19:18       ` Peter Zijlstra
  2024-07-02 23:56         ` Paul E. McKenney
@ 2024-07-03  4:47         ` Andrii Nakryiko
  2024-07-03  8:07           ` Peter Zijlstra
  1 sibling, 1 reply; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-03  4:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E . McKenney, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Tue, Jul 2, 2024 at 12:19 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:
>
> > > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > >         return uprobe;
> > >  }
> > >

[...]

> > > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> > >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >         struct uprobe *uprobe;
> > > +       unsigned seq;
> > >
> > > -       read_lock(&uprobes_treelock);
> > > -       uprobe = __find_uprobe(inode, offset);
> > > -       read_unlock(&uprobes_treelock);
> > > +       guard(rcu)();
> > >
> > > -       return uprobe;
> > > +       do {
> > > +               seq = read_seqcount_begin(&uprobes_seqcount);
> > > +               uprobes = __find_uprobe(inode, offset);
> > > +               if (uprobes) {
> > > +                       /*
> > > +                        * Lockless RB-tree lookups are prone to false-negatives.
> > > +                        * If they find something, it's good. If they do not find,
> > > +                        * it needs to be validated.
> > > +                        */
> > > +                       return uprobes;
> > > +               }
> > > +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > > +
> > > +       /* Really didn't find anything. */
> > > +       return NULL;
> > >  }
> >
> > Honest question here, as I don't understand the tradeoffs well enough.
> > Is there a lot of benefit to switching to seqcount lock vs using
> > percpu RW semaphore (previously recommended by Ingo). The latter is a
> > nice drop-in replacement and seems to be very fast and scale well.
>
> As you noted, that percpu-rwsem write side is quite insane. And you're
> creating this batch complexity to mitigate that.

Note that batch API is needed regardless of percpu RW semaphore or
not. As I mentioned, once uprobes_treelock is mitigated one way or the
other, the next one is uprobe->register_rwsem. For scalability, we
need to get rid of it and preferably not add any locking at all. So
tentatively I'd like to have lockless RCU-protected iteration over
uprobe->consumers list and call consumer->handler(). This means that
on uprobes_unregister we'd need synchronize_rcu (for whatever RCU
flavor we end up using), to ensure that we don't free uprobe_consumer
memory from under handle_swbp() while it is actually triggering
consumers.

So, without batched unregistration we'll be back to the same problem
I'm solving here: doing synchronize_rcu() for each attached uprobe one
by one is prohibitively slow. We went through this exercise with
ftrace/kprobes already and fixed it with batched APIs. Doing that for
uprobes seems unavoidable as well.

>
> The patches you propose are quite complex, this alternative not so much.

I agree that this custom refcounting is not trivial, but at least it's
pretty well contained within two low-level helpers which are all used
within this single .c file.

On the other hand, it actually gives us a) speed and better
scalability (I showed comparisons with refcount_inc_not_zero approach
earlier, I believe) and b) it actually simplifies logic during
registration (which is even more important aspect with batched API),
where we don't need to handle uprobe suddenly going away after we
already looked it up.

I believe overall it's an improvement worth doing.

>
> > Right now we are bottlenecked on uprobe->register_rwsem (not
> > uprobes_treelock anymore), which is currently limiting the scalability
> > of uprobes and I'm going to work on that next once I'm done with this
> > series.
>
> Right, but it looks fairly simple to replace that rwsem with a mutex and
> srcu.

srcu vs RCU Tasks Trace aside (which Paul addressed), see above about
the need for batched API and synchronize_rcu().

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-02 23:56         ` Paul E. McKenney
@ 2024-07-03  4:54           ` Andrii Nakryiko
  2024-07-03  7:50           ` Peter Zijlstra
  1 sibling, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-03  4:54 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Tue, Jul 2, 2024 at 4:56 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 09:18:57PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:
> >
> > > > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > > >         return uprobe;
> > > >  }
> > > >
> > > > +static void uprobe_free_rcu(struct rcu_head *rcu)
> > > > +{
> > > > +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > > > +       kfree(uprobe);
> > > > +}
> > > > +
> > > >  static void put_uprobe(struct uprobe *uprobe)
> > > >  {
> > > >         if (refcount_dec_and_test(&uprobe->ref)) {
> > > > @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
> > >
> > > right above this we have roughly this:
> > >
> > > percpu_down_write(&uprobes_treelock);
> > >
> > > /* refcount check */
> > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > >
> > > percpu_up_write(&uprobes_treelock);
> > >
> > >
> > > This writer lock is necessary for modification of the RB tree. And I
> > > was under impression that I shouldn't be doing
> > > percpu_(down|up)_write() inside the normal
> > > rcu_read_lock()/rcu_read_unlock() region (percpu_down_write has
> > > might_sleep() in it). But maybe I'm wrong, hopefully Paul can help to
> > > clarify.
> >
> > preemptible RCU or SRCU would work.
>
> I agree that SRCU would work from a functional viewpoint.  No so for
> preemptible RCU, which permits preemption (and on -rt, blocking for
> spinlocks), it does not permit full-up blocking, and for good reason.
>
> > > But actually what's wrong with RCU Tasks Trace flavor?
> >
> > Paul, isn't this the RCU flavour you created to deal with
> > !rcu_is_watching()? The flavour that never should have been created in
> > favour of just cleaning up the mess instead of making more.
>
> My guess is that you are instead thinking of RCU Tasks Rude, which can
> be eliminated once all architectures get their entry/exit/deep-idle
> functions either inlined or marked noinstr.
>
> > > I will
> > > ultimately use it anyway to avoid uprobe taking unnecessary refcount
> > > and to protect uprobe->consumers iteration and uc->handler() calls,
> > > which could be sleepable, so would need rcu_read_lock_trace().
> >
> > I don't think you need trace-rcu for that. SRCU would do nicely I think.
>
> From a functional viewpoint, agreed.
>
> However, in the past, the memory-barrier and array-indexing overhead
> of SRCU has made it a no-go for lightweight probes into fastpath code.
> And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> Tasks Rude).

Yep, and this is a similar case here. I've actually implemented
SRCU-based protection and benchmarked it (all other things being the
same). I see 5% slowdown for the fastest uprobe kind (entry uprobe on
nop) for the single-threaded use case. We go down from 3.15 millions/s
triggerings to slightly below 3 millions/s. With more threads the
difference increases a bit, though numbers vary a bit from run to run,
so I don't want to put out the exact number. But I see that for
SRCU-based implementation total aggregated peak achievable throughput
is about 3.5-3.6 mln/s vs this implementation reaching 4-4.1 mln/s.
Again, some of that could be variability, but I did run multiple
rounds and that's the trend I'm seeing.

>
> The other rule for RCU Tasks Trace is that although readers are permitted
> to block, this blocking can be for no longer than a major page fault.
> If you need longer-term blocking, then you should instead use SRCU.
>

And this is the case here. Right now rcu_read_lock_trace() is
protecting uprobes_treelock, which is only taken for the duration of
RB tree lookup/insert/delete. In my subsequent changes to eliminate
register_rwsem we might be executing uprobe_consumer under this RCU
lock, but those also should be only sleeping for page faults.

On the other hand, hot path (reader side) is quite hot with
millions/second executions and should add as little overhead as
possible (which is why I'm seeing SRCU-based implementation being
slower, as I mentioned above).

>                                                         Thanx, Paul
>
> > > >                 mutex_lock(&delayed_uprobe_lock);
> > > >                 delayed_uprobe_remove(uprobe, NULL);
> > > >                 mutex_unlock(&delayed_uprobe_lock);
> > > > -               kfree(uprobe);
> > > > +
> > > > +               call_rcu(&uprobe->rcu, uprobe_free_rcu);
> > > >         }
> > > >  }
> > > >
> > > > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> > > >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > >  {
> > > >         struct uprobe *uprobe;
> > > > +       unsigned seq;
> > > >
> > > > -       read_lock(&uprobes_treelock);
> > > > -       uprobe = __find_uprobe(inode, offset);
> > > > -       read_unlock(&uprobes_treelock);
> > > > +       guard(rcu)();
> > > >
> > > > -       return uprobe;
> > > > +       do {
> > > > +               seq = read_seqcount_begin(&uprobes_seqcount);
> > > > +               uprobes = __find_uprobe(inode, offset);
> > > > +               if (uprobes) {
> > > > +                       /*
> > > > +                        * Lockless RB-tree lookups are prone to false-negatives.
> > > > +                        * If they find something, it's good. If they do not find,
> > > > +                        * it needs to be validated.
> > > > +                        */
> > > > +                       return uprobes;
> > > > +               }
> > > > +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > > > +
> > > > +       /* Really didn't find anything. */
> > > > +       return NULL;
> > > >  }
> > >
> > > Honest question here, as I don't understand the tradeoffs well enough.
> > > Is there a lot of benefit to switching to seqcount lock vs using
> > > percpu RW semaphore (previously recommended by Ingo). The latter is a
> > > nice drop-in replacement and seems to be very fast and scale well.
> >
> > As you noted, that percpu-rwsem write side is quite insane. And you're
> > creating this batch complexity to mitigate that.
> >
> > The patches you propose are quite complex, this alternative not so much.
> >
> > > Right now we are bottlenecked on uprobe->register_rwsem (not
> > > uprobes_treelock anymore), which is currently limiting the scalability
> > > of uprobes and I'm going to work on that next once I'm done with this
> > > series.
> >
> > Right, but it looks fairly simple to replace that rwsem with a mutex and
> > srcu.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-02 23:56         ` Paul E. McKenney
  2024-07-03  4:54           ` Andrii Nakryiko
@ 2024-07-03  7:50           ` Peter Zijlstra
  2024-07-03 14:08             ` Paul E. McKenney
  2024-07-03 21:57             ` Steven Rostedt
  1 sibling, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-03  7:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Tue, Jul 02, 2024 at 04:56:53PM -0700, Paul E. McKenney wrote:

> > Paul, isn't this the RCU flavour you created to deal with
> > !rcu_is_watching()? The flavour that never should have been created in
> > favour of just cleaning up the mess instead of making more.
> 
> My guess is that you are instead thinking of RCU Tasks Rude, which can
> be eliminated once all architectures get their entry/exit/deep-idle
> functions either inlined or marked noinstr.

Would it make sense to disable it for those architectures that have
already done this work?

> > > I will
> > > ultimately use it anyway to avoid uprobe taking unnecessary refcount
> > > and to protect uprobe->consumers iteration and uc->handler() calls,
> > > which could be sleepable, so would need rcu_read_lock_trace().
> > 
> > I don't think you need trace-rcu for that. SRCU would do nicely I think.
> 
> From a functional viewpoint, agreed.
> 
> However, in the past, the memory-barrier and array-indexing overhead
> of SRCU has made it a no-go for lightweight probes into fastpath code.
> And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> Tasks Rude).

I'm thinking we're growing too many RCU flavours again :/ I suppose I'll
have to go read up on rcu/tasks.* and see what's what.

> The other rule for RCU Tasks Trace is that although readers are permitted
> to block, this blocking can be for no longer than a major page fault.
> If you need longer-term blocking, then you should instead use SRCU.

I think this would render it unsuitable for uprobes. The whole point of
having a sleepable handler is to be able to take faults.



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-03  4:47         ` Andrii Nakryiko
@ 2024-07-03  8:07           ` Peter Zijlstra
  2024-07-03 20:55             ` Andrii Nakryiko
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-03  8:07 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Paul E . McKenney, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Tue, Jul 02, 2024 at 09:47:41PM -0700, Andrii Nakryiko wrote:

> > As you noted, that percpu-rwsem write side is quite insane. And you're
> > creating this batch complexity to mitigate that.
> 
> 
> Note that batch API is needed regardless of percpu RW semaphore or
> not. As I mentioned, once uprobes_treelock is mitigated one way or the
> other, the next one is uprobe->register_rwsem. For scalability, we
> need to get rid of it and preferably not add any locking at all. So
> tentatively I'd like to have lockless RCU-protected iteration over
> uprobe->consumers list and call consumer->handler(). This means that
> on uprobes_unregister we'd need synchronize_rcu (for whatever RCU
> flavor we end up using), to ensure that we don't free uprobe_consumer
> memory from under handle_swbp() while it is actually triggering
> consumers.
> 
> So, without batched unregistration we'll be back to the same problem
> I'm solving here: doing synchronize_rcu() for each attached uprobe one
> by one is prohibitively slow. We went through this exercise with
> ftrace/kprobes already and fixed it with batched APIs. Doing that for
> uprobes seems unavoidable as well.

I'm not immediately seeing how you need that terrible refcount stuff for
the batching though. If all you need is group a few unregisters together
in order to share a sync_rcu() that seems way overkill.

You seem to have muddled the order of things, which makes the actual
reason for doing things utterly unclear.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
  2024-07-01 22:39 ` [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Andrii Nakryiko
@ 2024-07-03  8:13   ` Peter Zijlstra
  2024-07-03 10:13     ` Masami Hiramatsu
  2024-07-07 12:48   ` Oleg Nesterov
  1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-03  8:13 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, oleg, mingo, bpf, jolsa,
	paulmck, clm

On Mon, Jul 01, 2024 at 03:39:28PM -0700, Andrii Nakryiko wrote:
> Simplify uprobe registration/unregistration interfaces by making offset
> and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
> existing users already store these fields somewhere in uprobe_consumer's
> containing structure, so this doesn't pose any problem. We just move
> some fields around.
> 
> On the other hand, this simplifies uprobe_register() and
> uprobe_unregister() API by having only struct uprobe_consumer as one
> thing representing attachment/detachment entity. This makes batched
> versions of uprobe_register() and uprobe_unregister() simpler.
> 
> This also makes uprobe_register_refctr() unnecessary, so remove it and
> simplify consumers.
> 
> No functional changes intended.
> 
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/uprobes.h                       | 18 +++----
>  kernel/events/uprobes.c                       | 19 ++-----
>  kernel/trace/bpf_trace.c                      | 21 +++-----
>  kernel/trace/trace_uprobe.c                   | 53 ++++++++-----------
>  .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 22 ++++----
>  5 files changed, 55 insertions(+), 78 deletions(-)
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index b503fafb7fb3..a75ba37ce3c8 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -42,6 +42,11 @@ struct uprobe_consumer {
>  				enum uprobe_filter_ctx ctx,
>  				struct mm_struct *mm);
>  
> +	/* associated file offset of this probe */
> +	loff_t offset;
> +	/* associated refctr file offset of this probe, or zero */
> +	loff_t ref_ctr_offset;
> +	/* for internal uprobe infra use, consumers shouldn't touch fields below */
>  	struct uprobe_consumer *next;
>  };
>  
> @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
>  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
>  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
>  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> -extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> +extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
> -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> +extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);

It seems very weird and unnatural to split inode and offset like this.
The whole offset thing only makes sense within the context of an inode.

So yeah, lets not do this.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
  2024-07-03  8:13   ` Peter Zijlstra
@ 2024-07-03 10:13     ` Masami Hiramatsu
  2024-07-03 18:23       ` Andrii Nakryiko
  0 siblings, 1 reply; 67+ messages in thread
From: Masami Hiramatsu @ 2024-07-03 10:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	mingo, bpf, jolsa, paulmck, clm

On Wed, 3 Jul 2024 10:13:15 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jul 01, 2024 at 03:39:28PM -0700, Andrii Nakryiko wrote:
> > Simplify uprobe registration/unregistration interfaces by making offset
> > and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
> > existing users already store these fields somewhere in uprobe_consumer's
> > containing structure, so this doesn't pose any problem. We just move
> > some fields around.
> > 
> > On the other hand, this simplifies uprobe_register() and
> > uprobe_unregister() API by having only struct uprobe_consumer as one
> > thing representing attachment/detachment entity. This makes batched
> > versions of uprobe_register() and uprobe_unregister() simpler.
> > 
> > This also makes uprobe_register_refctr() unnecessary, so remove it and
> > simplify consumers.
> > 
> > No functional changes intended.
> > 
> > Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/uprobes.h                       | 18 +++----
> >  kernel/events/uprobes.c                       | 19 ++-----
> >  kernel/trace/bpf_trace.c                      | 21 +++-----
> >  kernel/trace/trace_uprobe.c                   | 53 ++++++++-----------
> >  .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 22 ++++----
> >  5 files changed, 55 insertions(+), 78 deletions(-)
> > 
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index b503fafb7fb3..a75ba37ce3c8 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -42,6 +42,11 @@ struct uprobe_consumer {
> >  				enum uprobe_filter_ctx ctx,
> >  				struct mm_struct *mm);
> >  
> > +	/* associated file offset of this probe */
> > +	loff_t offset;
> > +	/* associated refctr file offset of this probe, or zero */
> > +	loff_t ref_ctr_offset;
> > +	/* for internal uprobe infra use, consumers shouldn't touch fields below */
> >  	struct uprobe_consumer *next;
> >  };
> >  
> > @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
> >  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
> >  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> >  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> > -extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> > -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> > +extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
> >  extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
> > -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> > +extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
> 
> It seems very weird and unnatural to split inode and offset like this.
> The whole offset thing only makes sense within the context of an inode.

Hm, so would you mean we should have inode inside the uprobe_consumer?
If so, I think it is reasonable.

Thank you,

> 
> So yeah, lets not do this.


-- 
Masami Hiramatsu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 01/12] uprobes: update outdated comment
  2024-07-01 22:39 ` [PATCH v2 01/12] uprobes: update outdated comment Andrii Nakryiko
@ 2024-07-03 11:38   ` Oleg Nesterov
  2024-07-03 18:24     ` Andrii Nakryiko
                       ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Oleg Nesterov @ 2024-07-03 11:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, peterz, mingo, bpf, jolsa,
	paulmck, clm

Sorry for the late reply. I'll try to read this version/discussion
when I have time... yes, I have already promised this before, sorry :/

On 07/01, Andrii Nakryiko wrote:
>
> There is no task_struct passed into get_user_pages_remote() anymore,
> drop the parts of comment mentioning NULL tsk, it's just confusing at
> this point.

Agreed.

> @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
>  		goto out;
>
>  	/*
> -	 * The NULL 'tsk' here ensures that any faults that occur here
> -	 * will not be accounted to the task.  'mm' *is* current->mm,
> -	 * but we treat this as a 'remote' access since it is
> -	 * essentially a kernel access to the memory.
> +	 * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> +	 * it is essentially a kernel access to the memory.
>  	 */
>  	result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);

OK, this makes it less confusing, so

Acked-by: Oleg Nesterov <oleg@redhat.com>


---------------------------------------------------------------------
but it still looks confusing to me. This code used to pass tsk = NULL
only to avoid tsk->maj/min_flt++ in faultin_page().

But today mm_account_fault() increments these counters without checking
FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
just use get_user_pages() and remove this comment?

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  2024-07-01 22:39 ` [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode() Andrii Nakryiko
@ 2024-07-03 11:41   ` Oleg Nesterov
  2024-07-03 13:15   ` Masami Hiramatsu
  1 sibling, 0 replies; 67+ messages in thread
From: Oleg Nesterov @ 2024-07-03 11:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, peterz, mingo, bpf, jolsa,
	paulmck, clm

On 07/01, Andrii Nakryiko wrote:
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
>   * @vaddr: the virtual address to store the opcode.
>   * @opcode: opcode to be written at @vaddr.
>   *
> - * Called with mm->mmap_lock held for write.
> + * Called with mm->mmap_lock held for read or write.
>   * Return 0 (success) or a negative errno.

Thanks,

Acked-by: Oleg Nesterov <oleg@redhat.com>


I'll try to send the patch which explains the reasons for mmap_write_lock()
in register_for_each_vma() later.

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  2024-07-01 22:39 ` [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode() Andrii Nakryiko
  2024-07-03 11:41   ` Oleg Nesterov
@ 2024-07-03 13:15   ` Masami Hiramatsu
  2024-07-03 18:25     ` Andrii Nakryiko
  1 sibling, 1 reply; 67+ messages in thread
From: Masami Hiramatsu @ 2024-07-03 13:15 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, oleg, peterz, mingo, bpf, jolsa,
	paulmck, clm

On Mon,  1 Jul 2024 15:39:25 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> It seems like uprobe_write_opcode() doesn't require writer locked
> mmap_sem, any lock (reader or writer) should be sufficient. This was
> established in a discussion in [0] and looking through existing code
> seems to confirm that there is no need for write-locked mmap_sem.
> 
> Fix the comment to state this clearly.
> 
>   [0] https://lore.kernel.org/linux-trace-kernel/20240625190748.GC14254@redhat.com/
> 
> Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into uprobe_write_opcode()")

nit: why this has Fixes but [01/12] doesn't?

Should I pick both to fixes branch?

Thank you,

> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 081821fd529a..f87049c08ee9 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
>   * @vaddr: the virtual address to store the opcode.
>   * @opcode: opcode to be written at @vaddr.
>   *
> - * Called with mm->mmap_lock held for write.
> + * Called with mm->mmap_lock held for read or write.
>   * Return 0 (success) or a negative errno.
>   */
>  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-01 22:39 ` [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
  2024-07-02 10:22   ` Peter Zijlstra
@ 2024-07-03 13:36   ` Peter Zijlstra
  2024-07-03 20:47     ` Andrii Nakryiko
  2024-07-05 15:37   ` Oleg Nesterov
  2 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-03 13:36 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, oleg, mingo, bpf, jolsa,
	paulmck, clm

On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:

> One, attempted initially, way to solve this is through using
> atomic_inc_not_zero() approach, turning get_uprobe() into
> try_get_uprobe(),

This is the canonical thing to do. Everybody does this.

> which can fail to bump refcount if uprobe is already
> destined to be destroyed. This, unfortunately, turns out to be a rather
> expensive due to underlying cmpxchg() operation in
> atomic_inc_not_zero() and scales rather poorly with increased amount of
> parallel threads triggering uprobes.

Different archs different trade-offs. You'll not see this on LL/SC archs
for example.

> Furthermore, CPU profiling showed the following overall CPU usage:
>   - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for
>     atomic_inc_not_zero approach;
>   - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for
>     atomic_add_and_return approach implemented by this patch.

I think those numbers suggest trying to not have a refcount in the first
place. Both are pretty terrible, yes one is less terrible than the
other, but still terrible.

Specifically, I'm thinking it is the refcounting in handlw_swbp() that
is actually the problem, all the other stuff is noise. 

So if you have SRCU protected consumers, what is the reason for still
having a refcount in handlw_swbp() ? Simply have the whole of it inside
a single SRCU critical section, then all consumers you find get a hit.

Hmm, return probes are a pain, they require the uprobe to stay extant
between handle_swbp() and handle_trampoline(). I'm thinking we can do
that with SRCU as well.

When I cobble all that together (it really shouldn't be one patch, but
you get the idea I hope) it looks a little something like the below.

I *think* it should work, but perhaps I've missed something?

TL;DR replace treelock with seqcount+SRCU
      replace register_rwsem with SRCU
      replace handle_swbp() refcount with SRCU
      replace return_instance refcount with a second SRCU

Paul, I had to do something vile with SRCU. The basic problem is that we
want to keep a SRCU critical section across fork(), which leads to both
parent and child doing srcu_read_unlock(&srcu, idx). As such, I need an
extra increment on the @idx ssp counter to even things out, see
__srcu_read_clone_lock().

---
 include/linux/rbtree.h  |  45 +++++++++++++
 include/linux/srcu.h    |   2 +
 include/linux/uprobes.h |   2 +
 kernel/events/uprobes.c | 166 +++++++++++++++++++++++++++++++-----------------
 kernel/rcu/srcutree.c   |   5 ++
 5 files changed, 161 insertions(+), 59 deletions(-)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index f7edca369eda..9847fa58a287 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -244,6 +244,31 @@ rb_find_add(struct rb_node *node, struct rb_root *tree,
 	return NULL;
 }
 
+static __always_inline struct rb_node *
+rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
+		int (*cmp)(struct rb_node *, const struct rb_node *))
+{
+	struct rb_node **link = &tree->rb_node;
+	struct rb_node *parent = NULL;
+	int c;
+
+	while (*link) {
+		parent = *link;
+		c = cmp(node, parent);
+
+		if (c < 0)
+			link = &parent->rb_left;
+		else if (c > 0)
+			link = &parent->rb_right;
+		else
+			return parent;
+	}
+
+	rb_link_node_rcu(node, parent, link);
+	rb_insert_color(node, tree);
+	return NULL;
+}
+
 /**
  * rb_find() - find @key in tree @tree
  * @key: key to match
@@ -272,6 +297,26 @@ rb_find(const void *key, const struct rb_root *tree,
 	return NULL;
 }
 
+static __always_inline struct rb_node *
+rb_find_rcu(const void *key, const struct rb_root *tree,
+	    int (*cmp)(const void *key, const struct rb_node *))
+{
+	struct rb_node *node = tree->rb_node;
+
+	while (node) {
+		int c = cmp(key, node);
+
+		if (c < 0)
+			node = rcu_dereference_raw(node->rb_left);
+		else if (c > 0)
+			node = rcu_dereference_raw(node->rb_right);
+		else
+			return node;
+	}
+
+	return NULL;
+}
+
 /**
  * rb_find_first() - find the first @key in @tree
  * @key: key to match
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 236610e4a8fa..9b14acecbb9d 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -55,7 +55,9 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
 		void (*func)(struct rcu_head *head));
 void cleanup_srcu_struct(struct srcu_struct *ssp);
 int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp);
+void __srcu_read_clone_lock(struct srcu_struct *ssp, int idx);
 void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
+
 void synchronize_srcu(struct srcu_struct *ssp);
 unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp);
 unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp);
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..354cab634341 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -78,6 +78,7 @@ struct uprobe_task {
 
 	struct return_instance		*return_instances;
 	unsigned int			depth;
+	unsigned int			active_srcu_idx;
 };
 
 struct return_instance {
@@ -86,6 +87,7 @@ struct return_instance {
 	unsigned long		stack;		/* stack pointer */
 	unsigned long		orig_ret_vaddr; /* original return address */
 	bool			chained;	/* true, if instance is nested */
+	int			srcu_idx;
 
 	struct return_instance	*next;		/* keep as stack */
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2c83ba776fc7..0b7574a54093 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -26,6 +26,7 @@
 #include <linux/task_work.h>
 #include <linux/shmem_fs.h>
 #include <linux/khugepaged.h>
+#include <linux/srcu.h>
 
 #include <linux/uprobes.h>
 
@@ -40,6 +41,17 @@ static struct rb_root uprobes_tree = RB_ROOT;
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
+
+/*
+ * Used for both the uprobes_tree and the uprobe->consumer list.
+ */
+DEFINE_STATIC_SRCU(uprobe_srcu);
+/*
+ * Used for return_instance and single-step uprobe lifetime. Separate from
+ * uprobe_srcu in order to minimize the synchronize_srcu() cost at unregister.
+ */
+DEFINE_STATIC_SRCU(uretprobe_srcu);
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -54,7 +66,8 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
 	refcount_t		ref;
-	struct rw_semaphore	register_rwsem;
+	struct rcu_head		rcu;
+	struct mutex		register_mutex;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
 	struct uprobe_consumer	*consumers;
@@ -67,7 +80,7 @@ struct uprobe {
 	 * The generic code assumes that it has two members of unknown type
 	 * owned by the arch-specific code:
 	 *
-	 * 	insn -	copy_insn() saves the original instruction here for
+	 *	insn -	copy_insn() saves the original instruction here for
 	 *		arch_uprobe_analyze_insn().
 	 *
 	 *	ixol -	potentially modified instruction to execute out of
@@ -205,7 +218,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	folio_put(old_folio);
 
 	err = 0;
- unlock:
+unlock:
 	mmu_notifier_invalidate_range_end(&range);
 	folio_unlock(old_folio);
 	return err;
@@ -593,6 +606,22 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
 	return uprobe;
 }
 
+static void uprobe_free_stage2(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+	kfree(uprobe);
+}
+
+static void uprobe_free_stage1(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+	/*
+	 * At this point all the consumers are complete and gone, but retprobe
+	 * and single-step might still reference the uprobe itself.
+	 */
+	call_srcu(&uretprobe_srcu, &uprobe->rcu, uprobe_free_stage2);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
 	if (refcount_dec_and_test(&uprobe->ref)) {
@@ -604,7 +633,8 @@ static void put_uprobe(struct uprobe *uprobe)
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
+
+		call_srcu(&uprobe_srcu, &uprobe->rcu, uprobe_free_stage1);
 	}
 }
 
@@ -653,10 +683,10 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 		.inode = inode,
 		.offset = offset,
 	};
-	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
+	struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		return __node_2_uprobe(node);
 
 	return NULL;
 }
@@ -667,20 +697,32 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
  */
 static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
-	struct uprobe *uprobe;
+	unsigned seq;
 
-	read_lock(&uprobes_treelock);
-	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	lockdep_assert(srcu_read_lock_held(&uprobe_srcu));
 
-	return uprobe;
+	do {
+		seq = read_seqcount_begin(&uprobes_seqcount);
+		struct uprobe *uprobe = __find_uprobe(inode, offset);
+		if (uprobe) {
+			/*
+			 * Lockless RB-tree lookups are prone to false-negatives.
+			 * If they find something, it's good. If they do not find,
+			 * it needs to be validated.
+			 */
+			return uprobe;
+		}
+	} while (read_seqcount_retry(&uprobes_seqcount, seq));
+
+	/* Really didn't find anything. */
+	return NULL;
 }
 
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
 
-	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
+	node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
 	if (node)
 		return get_uprobe(__node_2_uprobe(node));
 
@@ -702,7 +744,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 	struct uprobe *u;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	u = __insert_uprobe(uprobe);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 
 	return u;
@@ -730,7 +774,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->inode = inode;
 	uprobe->offset = offset;
 	uprobe->ref_ctr_offset = ref_ctr_offset;
-	init_rwsem(&uprobe->register_rwsem);
+	mutex_init(&uprobe->register_mutex);
 	init_rwsem(&uprobe->consumer_rwsem);
 
 	/* add to uprobes_tree, sorted on inode:offset */
@@ -754,7 +798,7 @@ static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	down_write(&uprobe->consumer_rwsem);
 	uc->next = uprobe->consumers;
-	uprobe->consumers = uc;
+	rcu_assign_pointer(uprobe->consumers, uc);
 	up_write(&uprobe->consumer_rwsem);
 }
 
@@ -771,7 +815,7 @@ static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	down_write(&uprobe->consumer_rwsem);
 	for (con = &uprobe->consumers; *con; con = &(*con)->next) {
 		if (*con == uc) {
-			*con = uc->next;
+			rcu_assign_pointer(*con, uc->next);
 			ret = true;
 			break;
 		}
@@ -857,7 +901,7 @@ static int prepare_uprobe(struct uprobe *uprobe, struct file *file,
 	smp_wmb(); /* pairs with the smp_rmb() in handle_swbp() */
 	set_bit(UPROBE_COPY_INSN, &uprobe->flags);
 
- out:
+out:
 	up_write(&uprobe->consumer_rwsem);
 
 	return ret;
@@ -936,7 +980,9 @@ static void delete_uprobe(struct uprobe *uprobe)
 		return;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
 	put_uprobe(uprobe);
@@ -965,7 +1011,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 	struct map_info *info;
 	int more = 0;
 
- again:
+again:
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		if (!valid_vma(vma, is_register))
@@ -1019,7 +1065,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 	} while (--more);
 
 	goto again;
- out:
+out:
 	while (prev)
 		prev = free_map_info(prev);
 	return curr;
@@ -1068,13 +1114,13 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 				err |= remove_breakpoint(uprobe, mm, info->vaddr);
 		}
 
- unlock:
+unlock:
 		mmap_write_unlock(mm);
- free:
+free:
 		mmput(mm);
 		info = free_map_info(info);
 	}
- out:
+out:
 	percpu_up_write(&dup_mmap_sem);
 	return err;
 }
@@ -1101,16 +1147,17 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  */
 void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
 {
-	struct uprobe *uprobe;
+	scoped_guard (srcu, &uprobe_srcu) {
+		struct uprobe *uprobe = find_uprobe(inode, offset);
+		if (WARN_ON(!uprobe))
+			return;
 
-	uprobe = find_uprobe(inode, offset);
-	if (WARN_ON(!uprobe))
-		return;
+		mutex_lock(&uprobe->register_mutex);
+		__uprobe_unregister(uprobe, uc);
+		mutex_unlock(&uprobe->register_mutex);
+	}
 
-	down_write(&uprobe->register_rwsem);
-	__uprobe_unregister(uprobe, uc);
-	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
+	synchronize_srcu(&uprobe_srcu); // XXX amortize / batch
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
@@ -1159,7 +1206,7 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return -EINVAL;
 
- retry:
+retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (!uprobe)
 		return -ENOMEM;
@@ -1170,7 +1217,7 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	 * We can race with uprobe_unregister()->delete_uprobe().
 	 * Check uprobe_is_active() and retry if it is false.
 	 */
-	down_write(&uprobe->register_rwsem);
+	mutex_lock(&uprobe->register_mutex);
 	ret = -EAGAIN;
 	if (likely(uprobe_is_active(uprobe))) {
 		consumer_add(uprobe, uc);
@@ -1178,7 +1225,7 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 		if (ret)
 			__uprobe_unregister(uprobe, uc);
 	}
-	up_write(&uprobe->register_rwsem);
+	mutex_unlock(&uprobe->register_mutex);
 	put_uprobe(uprobe);
 
 	if (unlikely(ret == -EAGAIN))
@@ -1214,17 +1261,18 @@ int uprobe_apply(struct inode *inode, loff_t offset,
 	struct uprobe_consumer *con;
 	int ret = -ENOENT;
 
+	guard(srcu)(&uprobe_srcu);
+
 	uprobe = find_uprobe(inode, offset);
 	if (WARN_ON(!uprobe))
 		return ret;
 
-	down_write(&uprobe->register_rwsem);
+	mutex_lock(&uprobe->register_mutex);
 	for (con = uprobe->consumers; con && con != uc ; con = con->next)
 		;
 	if (con)
 		ret = register_for_each_vma(uprobe, add ? uc : NULL);
-	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
+	mutex_unlock(&uprobe->register_mutex);
 
 	return ret;
 }
@@ -1468,7 +1516,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	ret = 0;
 	/* pairs with get_xol_area() */
 	smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
- fail:
+fail:
 	mmap_write_unlock(mm);
 
 	return ret;
@@ -1512,7 +1560,7 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
 	kfree(area->bitmap);
  free_area:
 	kfree(area);
- out:
+out:
 	return NULL;
 }
 
@@ -1700,7 +1748,7 @@ unsigned long uprobe_get_trap_addr(struct pt_regs *regs)
 static struct return_instance *free_ret_instance(struct return_instance *ri)
 {
 	struct return_instance *next = ri->next;
-	put_uprobe(ri->uprobe);
+	srcu_read_unlock(&uretprobe_srcu, ri->srcu_idx);
 	kfree(ri);
 	return next;
 }
@@ -1718,7 +1766,7 @@ void uprobe_free_utask(struct task_struct *t)
 		return;
 
 	if (utask->active_uprobe)
-		put_uprobe(utask->active_uprobe);
+		srcu_read_unlock(&uretprobe_srcu, utask->active_srcu_idx);
 
 	ri = utask->return_instances;
 	while (ri)
@@ -1761,7 +1809,7 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 			return -ENOMEM;
 
 		*n = *o;
-		get_uprobe(n->uprobe);
+		__srcu_read_clone_lock(&uretprobe_srcu, n->srcu_idx);
 		n->next = NULL;
 
 		*p = n;
@@ -1904,7 +1952,8 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
 
-	ri->uprobe = get_uprobe(uprobe);
+	ri->srcu_idx = srcu_read_lock(&uretprobe_srcu);
+	ri->uprobe = uprobe;
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
@@ -1915,7 +1964,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 	utask->return_instances = ri;
 
 	return;
- fail:
+fail:
 	kfree(ri);
 }
 
@@ -1944,6 +1993,7 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
 		return err;
 	}
 
+	utask->active_srcu_idx = srcu_read_lock(&uretprobe_srcu);
 	utask->active_uprobe = uprobe;
 	utask->state = UTASK_SSTEP;
 	return 0;
@@ -2031,7 +2081,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 
 	copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 	put_page(page);
- out:
+out:
 	/* This needs to return true for any variant of the trap insn */
 	return is_trap_insn(&opcode);
 }
@@ -2071,8 +2121,9 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 	int remove = UPROBE_HANDLER_REMOVE;
 	bool need_prep = false; /* prepare return uprobe, when needed */
 
-	down_read(&uprobe->register_rwsem);
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+	lockdep_assert(srcu_read_lock_held(&uprobe_srcu));
+
+	for (uc = rcu_dereference_raw(uprobe->consumers); uc; uc = rcu_dereference(uc->next)) {
 		int rc = 0;
 
 		if (uc->handler) {
@@ -2094,7 +2145,6 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 		WARN_ON(!uprobe_is_active(uprobe));
 		unapply_uprobe(uprobe, current->mm);
 	}
-	up_read(&uprobe->register_rwsem);
 }
 
 static void
@@ -2103,12 +2153,12 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
 	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
 
-	down_read(&uprobe->register_rwsem);
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+	guard(srcu)(&uprobe_srcu);
+
+	for (uc = rcu_dereference_raw(uprobe->consumers); uc; uc = rcu_dereference_raw(uc->next)) {
 		if (uc->ret_handler)
 			uc->ret_handler(uc, ri->func, regs);
 	}
-	up_read(&uprobe->register_rwsem);
 }
 
 static struct return_instance *find_next_ret_chain(struct return_instance *ri)
@@ -2159,7 +2209,7 @@ static void handle_trampoline(struct pt_regs *regs)
 	utask->return_instances = ri;
 	return;
 
- sigill:
+sigill:
 	uprobe_warn(current, "handle uretprobe, sending SIGILL.");
 	force_sig(SIGILL);
 
@@ -2190,6 +2240,8 @@ static void handle_swbp(struct pt_regs *regs)
 	if (bp_vaddr == get_trampoline_vaddr())
 		return handle_trampoline(regs);
 
+	guard(srcu)(&uprobe_srcu);
+
 	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
 	if (!uprobe) {
 		if (is_swbp > 0) {
@@ -2218,7 +2270,7 @@ static void handle_swbp(struct pt_regs *regs)
 	 * new and not-yet-analyzed uprobe at the same address, restart.
 	 */
 	if (unlikely(!test_bit(UPROBE_COPY_INSN, &uprobe->flags)))
-		goto out;
+		return;
 
 	/*
 	 * Pairs with the smp_wmb() in prepare_uprobe().
@@ -2231,22 +2283,18 @@ static void handle_swbp(struct pt_regs *regs)
 
 	/* Tracing handlers use ->utask to communicate with fetch methods */
 	if (!get_utask())
-		goto out;
+		return;
 
 	if (arch_uprobe_ignore(&uprobe->arch, regs))
-		goto out;
+		return;
 
 	handler_chain(uprobe, regs);
 
 	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
-		goto out;
+		return;
 
 	if (!pre_ssout(uprobe, regs, bp_vaddr))
 		return;
-
-	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
-out:
-	put_uprobe(uprobe);
 }
 
 /*
@@ -2266,7 +2314,7 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 	else
 		WARN_ON_ONCE(1);
 
-	put_uprobe(uprobe);
+	srcu_read_unlock(&uretprobe_srcu, utask->active_srcu_idx);
 	utask->active_uprobe = NULL;
 	utask->state = UTASK_RUNNING;
 	xol_free_insn_slot(current);
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index bc4b58b0204e..d8cda9003da4 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -720,6 +720,11 @@ int __srcu_read_lock(struct srcu_struct *ssp)
 }
 EXPORT_SYMBOL_GPL(__srcu_read_lock);
 
+int __srcu_read_clone_lock(struct srcu_struct *ssp, int idx)
+{
+	this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter);
+}
+
 /*
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-03  7:50           ` Peter Zijlstra
@ 2024-07-03 14:08             ` Paul E. McKenney
  2024-07-04  8:39               ` Peter Zijlstra
  2024-07-03 21:57             ` Steven Rostedt
  1 sibling, 1 reply; 67+ messages in thread
From: Paul E. McKenney @ 2024-07-03 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Wed, Jul 03, 2024 at 09:50:57AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 02, 2024 at 04:56:53PM -0700, Paul E. McKenney wrote:
> 
> > > Paul, isn't this the RCU flavour you created to deal with
> > > !rcu_is_watching()? The flavour that never should have been created in
> > > favour of just cleaning up the mess instead of making more.
> > 
> > My guess is that you are instead thinking of RCU Tasks Rude, which can
> > be eliminated once all architectures get their entry/exit/deep-idle
> > functions either inlined or marked noinstr.
> 
> Would it make sense to disable it for those architectures that have
> already done this work?

It might well.  Any architectures other than x86 at this point?

But this is still used in common code, so let's see...  In that case,
synchronize_rcu_tasks_rude() becomes a no-op, call_rcu_tasks_rude() can be
a wrapper around something like queue_work(), and rcu_barrier_tasks_rude()
can be a wrapper around something like flush_work().

Except that call_rcu_tasks_rude() and rcu_barrier_tasks_rude() are not
actually used outside of testing, so maybe they can be dropped globally.

Let me see what happens when I do this:

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 7d18b90356fd..5c8492a054f5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -936,8 +936,8 @@ static struct rcu_torture_ops tasks_rude_ops = {
 	.deferred_free	= rcu_tasks_rude_torture_deferred_free,
 	.sync		= synchronize_rcu_tasks_rude,
 	.exp_sync	= synchronize_rcu_tasks_rude,
-	.call		= call_rcu_tasks_rude,
-	.cb_barrier	= rcu_barrier_tasks_rude,
+	// .call		= call_rcu_tasks_rude,
+	// .cb_barrier	= rcu_barrier_tasks_rude,
 	.gp_kthread_dbg	= show_rcu_tasks_rude_gp_kthread,
 	.get_gp_data	= rcu_tasks_rude_get_gp_data,
 	.cbflood_max	= 50000,

It should be at least mildly amusing...

> > > > I will
> > > > ultimately use it anyway to avoid uprobe taking unnecessary refcount
> > > > and to protect uprobe->consumers iteration and uc->handler() calls,
> > > > which could be sleepable, so would need rcu_read_lock_trace().
> > > 
> > > I don't think you need trace-rcu for that. SRCU would do nicely I think.
> > 
> > From a functional viewpoint, agreed.
> > 
> > However, in the past, the memory-barrier and array-indexing overhead
> > of SRCU has made it a no-go for lightweight probes into fastpath code.
> > And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> > Tasks Rude).
> 
> I'm thinking we're growing too many RCU flavours again :/ I suppose I'll
> have to go read up on rcu/tasks.* and see what's what.

Well, you are in luck.  I am well along with the task of putting together
the 2024 LWN RCU API article, which will include RCU Tasks Trace.  ;-)

And I do sympathize with discomfort with lots of RCU flavors.  After all,
hhad you told me 30 years ago that there would be more than one flavor,
I would have been quite surprised.  Of course, I would also have been
surprised by a great many other things (just how many flavors of locking
and reference counting???), so maybe having three flavors (four if we
cannot drop RCU Tasks RUDE) is not so bad.

Oh, and no one is yet using srcu_down_read() and srcu_up_read(), so
if they stay unused for another year or so...

> > The other rule for RCU Tasks Trace is that although readers are permitted
> > to block, this blocking can be for no longer than a major page fault.
> > If you need longer-term blocking, then you should instead use SRCU.
> 
> I think this would render it unsuitable for uprobes. The whole point of
> having a sleepable handler is to be able to take faults.

???

I said "longer than a major page fault", so it is OK to take a fault,
just not one that potentially blocks forever.  (And those faulting onto
things like NFS filesystems have enough other problems that this would
be in the noise for them, right?)

And yes, RCU Tasks Trace is specialized.  I would expect that unexpected
new uses would get serious scrutiny by those currently using it.

							Thanx, Paul

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
  2024-07-03 10:13     ` Masami Hiramatsu
@ 2024-07-03 18:23       ` Andrii Nakryiko
  0 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-03 18:23 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, paulmck, clm

On Wed, Jul 3, 2024 at 3:13 AM Masami Hiramatsu
<masami.hiramatsu@gmail.com> wrote:
>
> On Wed, 3 Jul 2024 10:13:15 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Mon, Jul 01, 2024 at 03:39:28PM -0700, Andrii Nakryiko wrote:
> > > Simplify uprobe registration/unregistration interfaces by making offset
> > > and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
> > > existing users already store these fields somewhere in uprobe_consumer's
> > > containing structure, so this doesn't pose any problem. We just move
> > > some fields around.
> > >
> > > On the other hand, this simplifies uprobe_register() and
> > > uprobe_unregister() API by having only struct uprobe_consumer as one
> > > thing representing attachment/detachment entity. This makes batched
> > > versions of uprobe_register() and uprobe_unregister() simpler.
> > >
> > > This also makes uprobe_register_refctr() unnecessary, so remove it and
> > > simplify consumers.
> > >
> > > No functional changes intended.
> > >
> > > Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  include/linux/uprobes.h                       | 18 +++----
> > >  kernel/events/uprobes.c                       | 19 ++-----
> > >  kernel/trace/bpf_trace.c                      | 21 +++-----
> > >  kernel/trace/trace_uprobe.c                   | 53 ++++++++-----------
> > >  .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 22 ++++----
> > >  5 files changed, 55 insertions(+), 78 deletions(-)
> > >
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index b503fafb7fb3..a75ba37ce3c8 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -42,6 +42,11 @@ struct uprobe_consumer {
> > >                             enum uprobe_filter_ctx ctx,
> > >                             struct mm_struct *mm);
> > >
> > > +   /* associated file offset of this probe */
> > > +   loff_t offset;
> > > +   /* associated refctr file offset of this probe, or zero */
> > > +   loff_t ref_ctr_offset;
> > > +   /* for internal uprobe infra use, consumers shouldn't touch fields below */
> > >     struct uprobe_consumer *next;
> > >  };
> > >
> > > @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
> > >  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
> > >  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> > >  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> > > -extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> > > -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> > > +extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
> > >  extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
> > > -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> > > +extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
> >
> > It seems very weird and unnatural to split inode and offset like this.
> > The whole offset thing only makes sense within the context of an inode.
>
> Hm, so would you mean we should have inode inside the uprobe_consumer?
> If so, I think it is reasonable.
>

I don't think so, for at least two reasons.

1) We will be wasting 8 bytes per consumer saving exactly the same
inode pointer for no good reason, while uprobe itself already stores
this inode. One can argue that having offset and ref_ctr_offset inside
uprobe_consumer is wasteful, in principle, and I agree. But all
existing users already store them in the same struct that contains
uprobe_consumer, so we are not regressing anything, while making the
interface simpler. We can always optimize that further, if necessary,
but right now that would give us nothing.

But moving inode into uprobe_consumer will regress memory usage.

2) In the context of batched APIs, offset and ref_ctr_offset varies
with each uprobe_consumer, while inode is explicitly the same for all
consumers in that batch. This API makes it very clear.

Technically, we can remove inode completely from *uprobe_unregister*,
because we now can access it from uprobe_consumer->uprobe->inode, but
it still feels right for symmetry and explicitly making a point that
callers should ensure that inode is kept alive.


> Thank you,
>
> >
> > So yeah, lets not do this.
>
>
> --
> Masami Hiramatsu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 01/12] uprobes: update outdated comment
  2024-07-03 11:38   ` Oleg Nesterov
@ 2024-07-03 18:24     ` Andrii Nakryiko
  2024-07-03 21:51     ` Andrii Nakryiko
  2024-07-10 13:31     ` Oleg Nesterov
  2 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-03 18:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On Wed, Jul 3, 2024 at 4:40 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Sorry for the late reply. I'll try to read this version/discussion
> when I have time... yes, I have already promised this before, sorry :/
>
> On 07/01, Andrii Nakryiko wrote:
> >
> > There is no task_struct passed into get_user_pages_remote() anymore,
> > drop the parts of comment mentioning NULL tsk, it's just confusing at
> > this point.
>
> Agreed.
>
> > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> >               goto out;
> >
> >       /*
> > -      * The NULL 'tsk' here ensures that any faults that occur here
> > -      * will not be accounted to the task.  'mm' *is* current->mm,
> > -      * but we treat this as a 'remote' access since it is
> > -      * essentially a kernel access to the memory.
> > +      * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > +      * it is essentially a kernel access to the memory.
> >        */
> >       result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
>
> OK, this makes it less confusing, so
>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
>
>
> ---------------------------------------------------------------------
> but it still looks confusing to me. This code used to pass tsk = NULL
> only to avoid tsk->maj/min_flt++ in faultin_page().
>
> But today mm_account_fault() increments these counters without checking
> FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> just use get_user_pages() and remove this comment?

I don't know, it was a drive-by cleanup which I'm starting to regret already :)

>
> Oleg.
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  2024-07-03 13:15   ` Masami Hiramatsu
@ 2024-07-03 18:25     ` Andrii Nakryiko
  2024-07-03 21:47       ` Masami Hiramatsu
  0 siblings, 1 reply; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-03 18:25 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Wed, Jul 3, 2024 at 6:15 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Mon,  1 Jul 2024 15:39:25 -0700
> Andrii Nakryiko <andrii@kernel.org> wrote:
>
> > It seems like uprobe_write_opcode() doesn't require writer locked
> > mmap_sem, any lock (reader or writer) should be sufficient. This was
> > established in a discussion in [0] and looking through existing code
> > seems to confirm that there is no need for write-locked mmap_sem.
> >
> > Fix the comment to state this clearly.
> >
> >   [0] https://lore.kernel.org/linux-trace-kernel/20240625190748.GC14254@redhat.com/
> >
> > Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into uprobe_write_opcode()")
>
> nit: why this has Fixes but [01/12] doesn't?
>
> Should I pick both to fixes branch?

I'd keep both of them in probes/for-next, tbh. They are not literally
fixing anything, just cleaning up comments. I can drop the Fixes tag
from this one, if you'd like.

>
> Thank you,
>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  kernel/events/uprobes.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 081821fd529a..f87049c08ee9 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
> >   * @vaddr: the virtual address to store the opcode.
> >   * @opcode: opcode to be written at @vaddr.
> >   *
> > - * Called with mm->mmap_lock held for write.
> > + * Called with mm->mmap_lock held for read or write.
> >   * Return 0 (success) or a negative errno.
> >   */
> >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-03 13:36   ` Peter Zijlstra
@ 2024-07-03 20:47     ` Andrii Nakryiko
  2024-07-04  8:03       ` Peter Zijlstra
  2024-07-04  8:31       ` Peter Zijlstra
  0 siblings, 2 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-03 20:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	mingo, bpf, jolsa, paulmck, clm

On Wed, Jul 3, 2024 at 6:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:
>
> > One, attempted initially, way to solve this is through using
> > atomic_inc_not_zero() approach, turning get_uprobe() into
> > try_get_uprobe(),
>
> This is the canonical thing to do. Everybody does this.

Sure, and I provided arguments why I don't do it. Can you provide your
counter argument, please? "Everybody does this." is hardly one.

>
> > which can fail to bump refcount if uprobe is already
> > destined to be destroyed. This, unfortunately, turns out to be a rather
> > expensive due to underlying cmpxchg() operation in
> > atomic_inc_not_zero() and scales rather poorly with increased amount of
> > parallel threads triggering uprobes.
>
> Different archs different trade-offs. You'll not see this on LL/SC archs
> for example.

Clearly x86-64 is the highest priority target, and I've shown that it
benefits from atomic addition vs cmpxchg. Sure, other architecture
might benefit less. But will atomic addition be slower than cmpxchg on
any other architecture?

It's clearly beneficial for x86-64 and not regressing other
architectures, right?

>
> > Furthermore, CPU profiling showed the following overall CPU usage:
> >   - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for
> >     atomic_inc_not_zero approach;
> >   - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for
> >     atomic_add_and_return approach implemented by this patch.
>
> I think those numbers suggest trying to not have a refcount in the first
> place. Both are pretty terrible, yes one is less terrible than the
> other, but still terrible.

Good, we are on the same page here, yes.

>
> Specifically, I'm thinking it is the refcounting in handlw_swbp() that
> is actually the problem, all the other stuff is noise.
>
> So if you have SRCU protected consumers, what is the reason for still
> having a refcount in handlw_swbp() ? Simply have the whole of it inside
> a single SRCU critical section, then all consumers you find get a hit.

That's the goal (except SRCU vs RCU Tasks Trace) and that's the next
step. I didn't want to add all that complexity to an already pretty
big and complex patch set. I do believe that batch APIs are the first
necessary step.

Your innocuous "// XXX amortize / batch" comment below is *the major
point of this patch set*. Try to appreciate that. It's not a small
todo, it took this entire patch set to allow for that.

Now, if you are so against percpu RW semapshore, I can just drop the
last patch for now, but the rest is necessary regardless.

Note how I didn't really touch locking *at all*. uprobes_treelock used
to be a spinlock, which we 1-to-1 replaced with rw_spinlock. And now
I'm replacing it, again 1-to-1, with percpu RW semaphore. Specifically
not to entangle batching with the locking schema changes.

>
> Hmm, return probes are a pain, they require the uprobe to stay extant
> between handle_swbp() and handle_trampoline(). I'm thinking we can do
> that with SRCU as well.

I don't think we can, and I'm surprised you don't think that way.

uretprobe might never be triggered for various reasons. Either user
space never returns from the function, or uretprobe was never
installed in the right place (and so uprobe part will trigger, but
there will never be returning probe triggering). I don't think it's
acceptable to delay whole global uprobes SRCU unlocking indefinitely
and keep that at user space's code will.

So, with that, I think refcounting *for return probe* will have to
stay. And will have to be fast.

>
> When I cobble all that together (it really shouldn't be one patch, but
> you get the idea I hope) it looks a little something like the below.
>
> I *think* it should work, but perhaps I've missed something?

Well, at the very least you missed that we can't delay SRCU (or any
other sleepable RCU flavor) potentially indefinitely for uretprobes,
which are completely under user space control.

>
> TL;DR replace treelock with seqcount+SRCU
>       replace register_rwsem with SRCU
>       replace handle_swbp() refcount with SRCU
>       replace return_instance refcount with a second SRCU

So, as I mentioned. I haven't considered seqcount just yet, and I will
think that through. This patch set was meant to add batched API to
unblock all of the above you describe. Percpu RW semaphore switch was
a no-brainer with batched APIs, so I went for that to get more
performance with zero added effort and complexity. If you hate that
part, I can drop it. But batching APIs are unavoidable, no matter what
specific RCU-protected locking schema we end up doing.

Can we agree on that and move this forward, please?

>
> Paul, I had to do something vile with SRCU. The basic problem is that we
> want to keep a SRCU critical section across fork(), which leads to both
> parent and child doing srcu_read_unlock(&srcu, idx). As such, I need an
> extra increment on the @idx ssp counter to even things out, see
> __srcu_read_clone_lock().
>
> ---
>  include/linux/rbtree.h  |  45 +++++++++++++
>  include/linux/srcu.h    |   2 +
>  include/linux/uprobes.h |   2 +
>  kernel/events/uprobes.c | 166 +++++++++++++++++++++++++++++++-----------------
>  kernel/rcu/srcutree.c   |   5 ++
>  5 files changed, 161 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
> index f7edca369eda..9847fa58a287 100644
> --- a/include/linux/rbtree.h
> +++ b/include/linux/rbtree.h
> @@ -244,6 +244,31 @@ rb_find_add(struct rb_node *node, struct rb_root *tree,
>         return NULL;
>  }
>
> +static __always_inline struct rb_node *
> +rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
> +               int (*cmp)(struct rb_node *, const struct rb_node *))
> +{
> +       struct rb_node **link = &tree->rb_node;
> +       struct rb_node *parent = NULL;
> +       int c;
> +
> +       while (*link) {
> +               parent = *link;
> +               c = cmp(node, parent);
> +
> +               if (c < 0)
> +                       link = &parent->rb_left;
> +               else if (c > 0)
> +                       link = &parent->rb_right;
> +               else
> +                       return parent;
> +       }
> +
> +       rb_link_node_rcu(node, parent, link);
> +       rb_insert_color(node, tree);
> +       return NULL;
> +}
> +
>  /**
>   * rb_find() - find @key in tree @tree
>   * @key: key to match
> @@ -272,6 +297,26 @@ rb_find(const void *key, const struct rb_root *tree,
>         return NULL;
>  }
>
> +static __always_inline struct rb_node *
> +rb_find_rcu(const void *key, const struct rb_root *tree,
> +           int (*cmp)(const void *key, const struct rb_node *))
> +{
> +       struct rb_node *node = tree->rb_node;
> +
> +       while (node) {
> +               int c = cmp(key, node);
> +
> +               if (c < 0)
> +                       node = rcu_dereference_raw(node->rb_left);
> +               else if (c > 0)
> +                       node = rcu_dereference_raw(node->rb_right);
> +               else
> +                       return node;
> +       }
> +
> +       return NULL;
> +}
> +
>  /**
>   * rb_find_first() - find the first @key in @tree
>   * @key: key to match
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 236610e4a8fa..9b14acecbb9d 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -55,7 +55,9 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
>                 void (*func)(struct rcu_head *head));
>  void cleanup_srcu_struct(struct srcu_struct *ssp);
>  int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp);
> +void __srcu_read_clone_lock(struct srcu_struct *ssp, int idx);
>  void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
> +
>  void synchronize_srcu(struct srcu_struct *ssp);
>  unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp);
>  unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp);
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index f46e0ca0169c..354cab634341 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -78,6 +78,7 @@ struct uprobe_task {
>
>         struct return_instance          *return_instances;
>         unsigned int                    depth;
> +       unsigned int                    active_srcu_idx;
>  };
>
>  struct return_instance {
> @@ -86,6 +87,7 @@ struct return_instance {
>         unsigned long           stack;          /* stack pointer */
>         unsigned long           orig_ret_vaddr; /* original return address */
>         bool                    chained;        /* true, if instance is nested */
> +       int                     srcu_idx;
>
>         struct return_instance  *next;          /* keep as stack */
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 2c83ba776fc7..0b7574a54093 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -26,6 +26,7 @@
>  #include <linux/task_work.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/khugepaged.h>
> +#include <linux/srcu.h>
>
>  #include <linux/uprobes.h>
>
> @@ -40,6 +41,17 @@ static struct rb_root uprobes_tree = RB_ROOT;
>  #define no_uprobe_events()     RB_EMPTY_ROOT(&uprobes_tree)
>
>  static DEFINE_RWLOCK(uprobes_treelock);        /* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
> +
> +/*
> + * Used for both the uprobes_tree and the uprobe->consumer list.
> + */
> +DEFINE_STATIC_SRCU(uprobe_srcu);
> +/*
> + * Used for return_instance and single-step uprobe lifetime. Separate from
> + * uprobe_srcu in order to minimize the synchronize_srcu() cost at unregister.
> + */
> +DEFINE_STATIC_SRCU(uretprobe_srcu);
>
>  #define UPROBES_HASH_SZ        13
>  /* serialize uprobe->pending_list */
> @@ -54,7 +66,8 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>  struct uprobe {
>         struct rb_node          rb_node;        /* node in the rb tree */
>         refcount_t              ref;
> -       struct rw_semaphore     register_rwsem;
> +       struct rcu_head         rcu;
> +       struct mutex            register_mutex;
>         struct rw_semaphore     consumer_rwsem;
>         struct list_head        pending_list;
>         struct uprobe_consumer  *consumers;
> @@ -67,7 +80,7 @@ struct uprobe {
>          * The generic code assumes that it has two members of unknown type
>          * owned by the arch-specific code:
>          *
> -        *      insn -  copy_insn() saves the original instruction here for
> +        *      insn -  copy_insn() saves the original instruction here for
>          *              arch_uprobe_analyze_insn().
>          *
>          *      ixol -  potentially modified instruction to execute out of
> @@ -205,7 +218,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>         folio_put(old_folio);
>
>         err = 0;
> - unlock:
> +unlock:
>         mmu_notifier_invalidate_range_end(&range);
>         folio_unlock(old_folio);
>         return err;
> @@ -593,6 +606,22 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
>         return uprobe;
>  }
>
> +static void uprobe_free_stage2(struct rcu_head *rcu)
> +{
> +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> +       kfree(uprobe);
> +}
> +
> +static void uprobe_free_stage1(struct rcu_head *rcu)
> +{
> +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> +       /*
> +        * At this point all the consumers are complete and gone, but retprobe
> +        * and single-step might still reference the uprobe itself.
> +        */
> +       call_srcu(&uretprobe_srcu, &uprobe->rcu, uprobe_free_stage2);
> +}
> +
>  static void put_uprobe(struct uprobe *uprobe)
>  {
>         if (refcount_dec_and_test(&uprobe->ref)) {
> @@ -604,7 +633,8 @@ static void put_uprobe(struct uprobe *uprobe)
>                 mutex_lock(&delayed_uprobe_lock);
>                 delayed_uprobe_remove(uprobe, NULL);
>                 mutex_unlock(&delayed_uprobe_lock);
> -               kfree(uprobe);
> +
> +               call_srcu(&uprobe_srcu, &uprobe->rcu, uprobe_free_stage1);
>         }
>  }
>
> @@ -653,10 +683,10 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
>                 .inode = inode,
>                 .offset = offset,
>         };
> -       struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
> +       struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
>
>         if (node)
> -               return get_uprobe(__node_2_uprobe(node));
> +               return __node_2_uprobe(node);
>
>         return NULL;
>  }
> @@ -667,20 +697,32 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
>   */
>  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
>  {
> -       struct uprobe *uprobe;
> +       unsigned seq;
>
> -       read_lock(&uprobes_treelock);
> -       uprobe = __find_uprobe(inode, offset);
> -       read_unlock(&uprobes_treelock);
> +       lockdep_assert(srcu_read_lock_held(&uprobe_srcu));
>
> -       return uprobe;
> +       do {
> +               seq = read_seqcount_begin(&uprobes_seqcount);
> +               struct uprobe *uprobe = __find_uprobe(inode, offset);
> +               if (uprobe) {
> +                       /*
> +                        * Lockless RB-tree lookups are prone to false-negatives.
> +                        * If they find something, it's good. If they do not find,
> +                        * it needs to be validated.
> +                        */
> +                       return uprobe;
> +               }
> +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> +
> +       /* Really didn't find anything. */
> +       return NULL;
>  }
>
>  static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
>  {
>         struct rb_node *node;
>
> -       node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
> +       node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
>         if (node)
>                 return get_uprobe(__node_2_uprobe(node));
>
> @@ -702,7 +744,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>         struct uprobe *u;
>
>         write_lock(&uprobes_treelock);
> +       write_seqcount_begin(&uprobes_seqcount);
>         u = __insert_uprobe(uprobe);
> +       write_seqcount_end(&uprobes_seqcount);
>         write_unlock(&uprobes_treelock);
>
>         return u;
> @@ -730,7 +774,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
>         uprobe->inode = inode;
>         uprobe->offset = offset;
>         uprobe->ref_ctr_offset = ref_ctr_offset;
> -       init_rwsem(&uprobe->register_rwsem);
> +       mutex_init(&uprobe->register_mutex);
>         init_rwsem(&uprobe->consumer_rwsem);
>
>         /* add to uprobes_tree, sorted on inode:offset */
> @@ -754,7 +798,7 @@ static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
>  {
>         down_write(&uprobe->consumer_rwsem);
>         uc->next = uprobe->consumers;
> -       uprobe->consumers = uc;
> +       rcu_assign_pointer(uprobe->consumers, uc);
>         up_write(&uprobe->consumer_rwsem);
>  }
>
> @@ -771,7 +815,7 @@ static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
>         down_write(&uprobe->consumer_rwsem);
>         for (con = &uprobe->consumers; *con; con = &(*con)->next) {
>                 if (*con == uc) {
> -                       *con = uc->next;
> +                       rcu_assign_pointer(*con, uc->next);
>                         ret = true;
>                         break;
>                 }
> @@ -857,7 +901,7 @@ static int prepare_uprobe(struct uprobe *uprobe, struct file *file,
>         smp_wmb(); /* pairs with the smp_rmb() in handle_swbp() */
>         set_bit(UPROBE_COPY_INSN, &uprobe->flags);
>
> - out:
> +out:
>         up_write(&uprobe->consumer_rwsem);
>
>         return ret;
> @@ -936,7 +980,9 @@ static void delete_uprobe(struct uprobe *uprobe)
>                 return;
>
>         write_lock(&uprobes_treelock);
> +       write_seqcount_begin(&uprobes_seqcount);
>         rb_erase(&uprobe->rb_node, &uprobes_tree);
> +       write_seqcount_end(&uprobes_seqcount);
>         write_unlock(&uprobes_treelock);
>         RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
>         put_uprobe(uprobe);
> @@ -965,7 +1011,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
>         struct map_info *info;
>         int more = 0;
>
> - again:
> +again:
>         i_mmap_lock_read(mapping);
>         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>                 if (!valid_vma(vma, is_register))
> @@ -1019,7 +1065,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
>         } while (--more);
>
>         goto again;
> - out:
> +out:
>         while (prev)
>                 prev = free_map_info(prev);
>         return curr;
> @@ -1068,13 +1114,13 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>                                 err |= remove_breakpoint(uprobe, mm, info->vaddr);
>                 }
>
> - unlock:
> +unlock:
>                 mmap_write_unlock(mm);
> - free:
> +free:
>                 mmput(mm);
>                 info = free_map_info(info);
>         }
> - out:
> +out:
>         percpu_up_write(&dup_mmap_sem);
>         return err;
>  }
> @@ -1101,16 +1147,17 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>   */
>  void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
>  {
> -       struct uprobe *uprobe;
> +       scoped_guard (srcu, &uprobe_srcu) {
> +               struct uprobe *uprobe = find_uprobe(inode, offset);
> +               if (WARN_ON(!uprobe))
> +                       return;
>
> -       uprobe = find_uprobe(inode, offset);
> -       if (WARN_ON(!uprobe))
> -               return;
> +               mutex_lock(&uprobe->register_mutex);
> +               __uprobe_unregister(uprobe, uc);
> +               mutex_unlock(&uprobe->register_mutex);
> +       }
>
> -       down_write(&uprobe->register_rwsem);
> -       __uprobe_unregister(uprobe, uc);
> -       up_write(&uprobe->register_rwsem);
> -       put_uprobe(uprobe);
> +       synchronize_srcu(&uprobe_srcu); // XXX amortize / batch
>  }
>  EXPORT_SYMBOL_GPL(uprobe_unregister);
>
> @@ -1159,7 +1206,7 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
>         if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
>                 return -EINVAL;
>
> - retry:
> +retry:
>         uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
>         if (!uprobe)
>                 return -ENOMEM;
> @@ -1170,7 +1217,7 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
>          * We can race with uprobe_unregister()->delete_uprobe().
>          * Check uprobe_is_active() and retry if it is false.
>          */
> -       down_write(&uprobe->register_rwsem);
> +       mutex_lock(&uprobe->register_mutex);
>         ret = -EAGAIN;
>         if (likely(uprobe_is_active(uprobe))) {
>                 consumer_add(uprobe, uc);
> @@ -1178,7 +1225,7 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
>                 if (ret)
>                         __uprobe_unregister(uprobe, uc);
>         }
> -       up_write(&uprobe->register_rwsem);
> +       mutex_unlock(&uprobe->register_mutex);
>         put_uprobe(uprobe);
>
>         if (unlikely(ret == -EAGAIN))
> @@ -1214,17 +1261,18 @@ int uprobe_apply(struct inode *inode, loff_t offset,
>         struct uprobe_consumer *con;
>         int ret = -ENOENT;
>
> +       guard(srcu)(&uprobe_srcu);
> +
>         uprobe = find_uprobe(inode, offset);
>         if (WARN_ON(!uprobe))
>                 return ret;
>
> -       down_write(&uprobe->register_rwsem);
> +       mutex_lock(&uprobe->register_mutex);
>         for (con = uprobe->consumers; con && con != uc ; con = con->next)
>                 ;
>         if (con)
>                 ret = register_for_each_vma(uprobe, add ? uc : NULL);
> -       up_write(&uprobe->register_rwsem);
> -       put_uprobe(uprobe);
> +       mutex_unlock(&uprobe->register_mutex);
>
>         return ret;
>  }
> @@ -1468,7 +1516,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
>         ret = 0;
>         /* pairs with get_xol_area() */
>         smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
> - fail:
> +fail:
>         mmap_write_unlock(mm);
>
>         return ret;
> @@ -1512,7 +1560,7 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
>         kfree(area->bitmap);
>   free_area:
>         kfree(area);
> - out:
> +out:
>         return NULL;
>  }
>
> @@ -1700,7 +1748,7 @@ unsigned long uprobe_get_trap_addr(struct pt_regs *regs)
>  static struct return_instance *free_ret_instance(struct return_instance *ri)
>  {
>         struct return_instance *next = ri->next;
> -       put_uprobe(ri->uprobe);
> +       srcu_read_unlock(&uretprobe_srcu, ri->srcu_idx);
>         kfree(ri);
>         return next;
>  }
> @@ -1718,7 +1766,7 @@ void uprobe_free_utask(struct task_struct *t)
>                 return;
>
>         if (utask->active_uprobe)
> -               put_uprobe(utask->active_uprobe);
> +               srcu_read_unlock(&uretprobe_srcu, utask->active_srcu_idx);
>
>         ri = utask->return_instances;
>         while (ri)
> @@ -1761,7 +1809,7 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
>                         return -ENOMEM;
>
>                 *n = *o;
> -               get_uprobe(n->uprobe);
> +               __srcu_read_clone_lock(&uretprobe_srcu, n->srcu_idx);
>                 n->next = NULL;
>
>                 *p = n;
> @@ -1904,7 +1952,8 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
>                 orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
>         }
>
> -       ri->uprobe = get_uprobe(uprobe);
> +       ri->srcu_idx = srcu_read_lock(&uretprobe_srcu);
> +       ri->uprobe = uprobe;
>         ri->func = instruction_pointer(regs);
>         ri->stack = user_stack_pointer(regs);
>         ri->orig_ret_vaddr = orig_ret_vaddr;
> @@ -1915,7 +1964,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
>         utask->return_instances = ri;
>
>         return;
> - fail:
> +fail:
>         kfree(ri);
>  }
>
> @@ -1944,6 +1993,7 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
>                 return err;
>         }
>
> +       utask->active_srcu_idx = srcu_read_lock(&uretprobe_srcu);
>         utask->active_uprobe = uprobe;
>         utask->state = UTASK_SSTEP;
>         return 0;
> @@ -2031,7 +2081,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
>
>         copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
>         put_page(page);
> - out:
> +out:
>         /* This needs to return true for any variant of the trap insn */
>         return is_trap_insn(&opcode);
>  }
> @@ -2071,8 +2121,9 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
>         int remove = UPROBE_HANDLER_REMOVE;
>         bool need_prep = false; /* prepare return uprobe, when needed */
>
> -       down_read(&uprobe->register_rwsem);
> -       for (uc = uprobe->consumers; uc; uc = uc->next) {
> +       lockdep_assert(srcu_read_lock_held(&uprobe_srcu));
> +
> +       for (uc = rcu_dereference_raw(uprobe->consumers); uc; uc = rcu_dereference(uc->next)) {
>                 int rc = 0;
>
>                 if (uc->handler) {
> @@ -2094,7 +2145,6 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
>                 WARN_ON(!uprobe_is_active(uprobe));
>                 unapply_uprobe(uprobe, current->mm);
>         }
> -       up_read(&uprobe->register_rwsem);
>  }
>
>  static void
> @@ -2103,12 +2153,12 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
>         struct uprobe *uprobe = ri->uprobe;
>         struct uprobe_consumer *uc;
>
> -       down_read(&uprobe->register_rwsem);
> -       for (uc = uprobe->consumers; uc; uc = uc->next) {
> +       guard(srcu)(&uprobe_srcu);
> +
> +       for (uc = rcu_dereference_raw(uprobe->consumers); uc; uc = rcu_dereference_raw(uc->next)) {
>                 if (uc->ret_handler)
>                         uc->ret_handler(uc, ri->func, regs);
>         }
> -       up_read(&uprobe->register_rwsem);
>  }
>
>  static struct return_instance *find_next_ret_chain(struct return_instance *ri)
> @@ -2159,7 +2209,7 @@ static void handle_trampoline(struct pt_regs *regs)
>         utask->return_instances = ri;
>         return;
>
> - sigill:
> +sigill:
>         uprobe_warn(current, "handle uretprobe, sending SIGILL.");
>         force_sig(SIGILL);
>
> @@ -2190,6 +2240,8 @@ static void handle_swbp(struct pt_regs *regs)
>         if (bp_vaddr == get_trampoline_vaddr())
>                 return handle_trampoline(regs);
>
> +       guard(srcu)(&uprobe_srcu);
> +
>         uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
>         if (!uprobe) {
>                 if (is_swbp > 0) {
> @@ -2218,7 +2270,7 @@ static void handle_swbp(struct pt_regs *regs)
>          * new and not-yet-analyzed uprobe at the same address, restart.
>          */
>         if (unlikely(!test_bit(UPROBE_COPY_INSN, &uprobe->flags)))
> -               goto out;
> +               return;
>
>         /*
>          * Pairs with the smp_wmb() in prepare_uprobe().
> @@ -2231,22 +2283,18 @@ static void handle_swbp(struct pt_regs *regs)
>
>         /* Tracing handlers use ->utask to communicate with fetch methods */
>         if (!get_utask())
> -               goto out;
> +               return;
>
>         if (arch_uprobe_ignore(&uprobe->arch, regs))
> -               goto out;
> +               return;
>
>         handler_chain(uprobe, regs);
>
>         if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> -               goto out;
> +               return;
>
>         if (!pre_ssout(uprobe, regs, bp_vaddr))
>                 return;
> -
> -       /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
> -out:
> -       put_uprobe(uprobe);
>  }
>
>  /*
> @@ -2266,7 +2314,7 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
>         else
>                 WARN_ON_ONCE(1);
>
> -       put_uprobe(uprobe);
> +       srcu_read_unlock(&uretprobe_srcu, utask->active_srcu_idx);
>         utask->active_uprobe = NULL;
>         utask->state = UTASK_RUNNING;
>         xol_free_insn_slot(current);
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index bc4b58b0204e..d8cda9003da4 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -720,6 +720,11 @@ int __srcu_read_lock(struct srcu_struct *ssp)
>  }
>  EXPORT_SYMBOL_GPL(__srcu_read_lock);
>
> +int __srcu_read_clone_lock(struct srcu_struct *ssp, int idx)
> +{
> +       this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter);
> +}
> +
>  /*
>   * Removes the count for the old reader from the appropriate per-CPU
>   * element of the srcu_struct.  Note that this may well be a different

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-03  8:07           ` Peter Zijlstra
@ 2024-07-03 20:55             ` Andrii Nakryiko
  0 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-03 20:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E . McKenney, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Wed, Jul 3, 2024 at 1:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 02, 2024 at 09:47:41PM -0700, Andrii Nakryiko wrote:
>
> > > As you noted, that percpu-rwsem write side is quite insane. And you're
> > > creating this batch complexity to mitigate that.
> >
> >
> > Note that batch API is needed regardless of percpu RW semaphore or
> > not. As I mentioned, once uprobes_treelock is mitigated one way or the
> > other, the next one is uprobe->register_rwsem. For scalability, we
> > need to get rid of it and preferably not add any locking at all. So
> > tentatively I'd like to have lockless RCU-protected iteration over
> > uprobe->consumers list and call consumer->handler(). This means that
> > on uprobes_unregister we'd need synchronize_rcu (for whatever RCU
> > flavor we end up using), to ensure that we don't free uprobe_consumer
> > memory from under handle_swbp() while it is actually triggering
> > consumers.
> >
> > So, without batched unregistration we'll be back to the same problem
> > I'm solving here: doing synchronize_rcu() for each attached uprobe one
> > by one is prohibitively slow. We went through this exercise with
> > ftrace/kprobes already and fixed it with batched APIs. Doing that for
> > uprobes seems unavoidable as well.
>
> I'm not immediately seeing how you need that terrible refcount stuff for

Which part is terrible, please be more specific. I can switch to
refcount_inc_not_zero() and leave performance improvement on the
table, but why is that a good idea?

> the batching though. If all you need is group a few unregisters together
> in order to share a sync_rcu() that seems way overkill.
>
> You seem to have muddled the order of things, which makes the actual
> reason for doing things utterly unclear.

See -EGAIN handling in uprobe_register() code in current upstream
kernel. We manage to allocate and insert (or update existing) uprobe
in uprobes_tree. And then when we try to register we can post factum
detect that uprobe was removed from RB tree from under us. And we have
to go on a retry, allocating/inserting/updating it again.

This is quite problematic for batched API, in which I split the whole
attachment into few independent phase:

  - preallocate uprobe instances (for all consumers/uprobes)
  - insert them or reuse pre-existing ones (again, for all consumers
in one batch, protected by single writer lock on uprobes_treelock);
  - then register/apply for each VMA (you get it, for all consumers in one go).

Having this retry for some of uprobes because of this race is hugely
problematic, so I wanted to make it cleaner and simpler: once you
manage to insert/reuse uprobe, it's not going away from under me.
Which is why the change to refcounting schema.

And I think it's a major improvement. We can argue about
refcount_inc_not_zero vs this custom refcounting schema, but I think
the change should be made.

Now, imagine I also did all the seqcount and RCU stuff across entire
uprobe functionality. Wouldn't that be mind bending a little bit to
wrap your head around this?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (12 preceding siblings ...)
  2024-07-02 10:23 ` [PATCH v2 00/12] uprobes: add batched register/unregister APIs and " Peter Zijlstra
@ 2024-07-03 21:33 ` Andrii Nakryiko
  2024-07-04  9:15   ` Peter Zijlstra
  13 siblings, 1 reply; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-03 21:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, oleg, peterz, mingo, bpf,
	jolsa, paulmck, clm, open list

On Mon, Jul 1, 2024 at 3:39 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> to per-CPU RW semaphore, which has better performance and scales better under
> contention and multiple parallel threads triggering lots of uprobes.
>
> To make this work well with attaching multiple uprobes (through BPF
> multi-uprobe), we need to add batched versions of uprobe register/unregister
> APIs. This is what most of the patch set is actually doing. The actual switch
> to per-CPU RW semaphore is trivial after that and is done in the very last
> patch #12. See commit message with some comparison numbers.
>

Peter,

I think I've addressed all the questions so far, but I wanted to take
a moment and bring all the discussions into a single palace, summarize
what I think are the main points of contention and hopefully make some
progress, or at least get us to a bit more constructive discussion
where *both sides* provide arguments. Right now there is a lot of "you
are doing X, but why don't you just do Y" with no argument for a) why
X is bad/wrong/inferior and b) why Y is better (and not just
equivalent or, even worse, inferior).

I trust you have the best intentions in mind for this piece of kernel
infrastructure, so do I, so let's try to find a path forward.

1. Strategically, uprobes/uretprobes have to be improved. Customers do
complain more and more that "uprobes are slow", justifiably so. Both
single-threaded performance matters, but also, critically, uprobes
scalability. I.e., if the kernel can handle N uprobe per second on a
single uncontended CPU, then triggering uprobes across M CPUs should,
ideally and roughly, give us about N * M total throughput.

This doesn't seem controversial, but I wanted to make it clear that
this is the end goal of my work. And no, this patch set alone doesn't,
yet, get us there. But it's a necessary step, IMO. Jiri Olsa took
single-threaded performance and is improving it with sys_uretprobe and
soon sys_uprobe, I'm looking into scalability and other smaller
single-threaded wins, where possible.

2. More tactically, RCU protection seems like the best way forward. We
got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
clarified that RCU Tasks Trace has nothing to do with Tasks Rude
flavor (whatever that is, I have no idea).

Now, RCU Tasks Trace were specifically designed for least overhead
hotpath (reader side) performance, at the expense of slowing down much
rarer writers. My microbenchmarking does show at least 5% difference.
Both flavors can handle sleepable uprobes waiting for page faults.
Tasks Trace flavor is already used for tracing in the BPF realm,
including for sleepable uprobes and works well. It's not going away.

Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
haven't seen a single argument why. Please provide that, or let's
stick to RCU Tasks Trace, because uprobe's use case is an ideal case
of what Tasks Trace flavor was designed for.

3. Regardless of RCU flavor, due to RCU protection, we have to add
batched register/unregister APIs, so we can amortize sync_rcu cost
during deregistration. Can we please agree on that as well? This is
the main goal of this patch set and I'd like to land it before working
further on changing and improving the rest of the locking schema.

I won't be happy about it, but just to move things forward, I can drop
a) custom refcounting and/or b) percpu RW semaphore. Both are
beneficial but not essential for batched APIs work. But if you force
me to do that, please state clearly your reasons/arguments. No one had
yet pointed out why refcounting is broken and why percpu RW semaphore
is bad. On the contrary, Ingo Molnar did suggest percpu RW semaphore
in the first place (see [0]), but we postponed it due to the lack of
batched APIs, and promised to do this work. Here I am, doing the
promised work. Not purely because of percpu RW semaphore, but
benefiting from it just as well.

  [0] https://lore.kernel.org/linux-trace-kernel/Zf+d9twfyIDosINf@gmail.com/

4. Another tactical thing, but an important one. Refcounting schema
for uprobes. I've replied already, but I think refcounting is
unavoidable for uretprobes, and current refcounting schema is
problematic for batched APIs due to race between finding uprobe and
there still being a possibility we'd need to undo all that and retry
again.

I think the main thing is to agree to change refcounting to avoid this
race, allowing for simpler batched registration. Hopefully we can
agree on that.

But also, refcount_inc_not_zero() which is another limiting factor for
scalability (see above about the end goal of scalability) vs
atomic64_add()-based epoch+refcount approach I took, which is
noticeably better on x86-64, and I don't think hurts any other
architecture, to say the least. I think the latter could be
generalized as an alternative flavor of refcount_t, but I'd prefer to
land it in uprobes in current shape, and if we think it's a good idea
to generalize, we can always do that refactoring once things stabilize
a bit.

You seem to have problems with the refcounting implementation I did
(besides overflow detection, which I'll address in the next revision,
so not a problem). My arguments are a) performance and b) it's well
contained within get/put helpers and doesn't leak outside of them *at
all*, while providing a nice always successful get_uprobe() primitive.

Can I please hear the arguments for not doing it, besides "Everyone is
using refcount_inc_not_zero", which isn't much of a reason (we'd never
do anything novel in the kernel if that was a good enough reason to
not do something new).

Again, thanks for engagement, I do appreciate it. But let's try to
move this forward. Thanks!

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  2024-07-03 18:25     ` Andrii Nakryiko
@ 2024-07-03 21:47       ` Masami Hiramatsu
  0 siblings, 0 replies; 67+ messages in thread
From: Masami Hiramatsu @ 2024-07-03 21:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Wed, 3 Jul 2024 11:25:35 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Wed, Jul 3, 2024 at 6:15 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Mon,  1 Jul 2024 15:39:25 -0700
> > Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > > It seems like uprobe_write_opcode() doesn't require writer locked
> > > mmap_sem, any lock (reader or writer) should be sufficient. This was
> > > established in a discussion in [0] and looking through existing code
> > > seems to confirm that there is no need for write-locked mmap_sem.
> > >
> > > Fix the comment to state this clearly.
> > >
> > >   [0] https://lore.kernel.org/linux-trace-kernel/20240625190748.GC14254@redhat.com/
> > >
> > > Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into uprobe_write_opcode()")
> >
> > nit: why this has Fixes but [01/12] doesn't?
> >
> > Should I pick both to fixes branch?
> 
> I'd keep both of them in probes/for-next, tbh. They are not literally
> fixing anything, just cleaning up comments. I can drop the Fixes tag
> from this one, if you'd like.

Got it. Usually cleanup patch will not have Fixed tag, so if you don't mind,
please drop it.

Thank you,

> 
> >
> > Thank you,
> >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  kernel/events/uprobes.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 081821fd529a..f87049c08ee9 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
> > >   * @vaddr: the virtual address to store the opcode.
> > >   * @opcode: opcode to be written at @vaddr.
> > >   *
> > > - * Called with mm->mmap_lock held for write.
> > > + * Called with mm->mmap_lock held for read or write.
> > >   * Return 0 (success) or a negative errno.
> > >   */
> > >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > > --
> > > 2.43.0
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 01/12] uprobes: update outdated comment
  2024-07-03 11:38   ` Oleg Nesterov
  2024-07-03 18:24     ` Andrii Nakryiko
@ 2024-07-03 21:51     ` Andrii Nakryiko
  2024-07-10 13:31     ` Oleg Nesterov
  2 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-03 21:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On Wed, Jul 3, 2024 at 4:40 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Sorry for the late reply. I'll try to read this version/discussion
> when I have time... yes, I have already promised this before, sorry :/
>

No worries, there will be v3 next week (I'm going on short PTO
starting tomorrow). So you still have time. I appreciate you looking
at it, though!

> On 07/01, Andrii Nakryiko wrote:
> >
> > There is no task_struct passed into get_user_pages_remote() anymore,
> > drop the parts of comment mentioning NULL tsk, it's just confusing at
> > this point.
>
> Agreed.
>
> > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> >               goto out;
> >
> >       /*
> > -      * The NULL 'tsk' here ensures that any faults that occur here
> > -      * will not be accounted to the task.  'mm' *is* current->mm,
> > -      * but we treat this as a 'remote' access since it is
> > -      * essentially a kernel access to the memory.
> > +      * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > +      * it is essentially a kernel access to the memory.
> >        */
> >       result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
>
> OK, this makes it less confusing, so
>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
>
>
> ---------------------------------------------------------------------
> but it still looks confusing to me. This code used to pass tsk = NULL
> only to avoid tsk->maj/min_flt++ in faultin_page().
>
> But today mm_account_fault() increments these counters without checking
> FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> just use get_user_pages() and remove this comment?
>
> Oleg.
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-03  7:50           ` Peter Zijlstra
  2024-07-03 14:08             ` Paul E. McKenney
@ 2024-07-03 21:57             ` Steven Rostedt
  2024-07-03 22:07               ` Paul E. McKenney
  1 sibling, 1 reply; 67+ messages in thread
From: Steven Rostedt @ 2024-07-03 21:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Andrii Nakryiko, Andrii Nakryiko,
	linux-trace-kernel, mhiramat, oleg, mingo, bpf, jolsa, clm,
	linux-kernel

On Wed, 3 Jul 2024 09:50:57 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > However, in the past, the memory-barrier and array-indexing overhead
> > of SRCU has made it a no-go for lightweight probes into fastpath code.
> > And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> > Tasks Rude).  
> 
> I'm thinking we're growing too many RCU flavours again :/ I suppose I'll
> have to go read up on rcu/tasks.* and see what's what.

This RCU flavor is the one to handle trampolines. If the trampoline
never voluntarily schedules, then the quiescent state is a voluntary
schedule. The issue with trampolines is that if something was preempted
as it was jumping to a trampoline, there's no way to know when it is
safe to free that trampoline, as some preempted task's next instruction
is on that trampoline.

Any trampoline that does not voluntary schedule can use RCU task
synchronization. As it will wait till all tasks have voluntarily
scheduled or have entered user space (IIRC, Paul can correct me if I'm
wrong).

Now, if a trampoline does schedule, it would need to incorporate some
ref counting on the trampoline to handle the scheduling, but could
still use RCU task synchronization up to the point of the ref count.

And yes, the rude flavor was to handle the !rcu_is_watching case, and
can now be removed.

-- Steve

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-03 21:57             ` Steven Rostedt
@ 2024-07-03 22:07               ` Paul E. McKenney
  0 siblings, 0 replies; 67+ messages in thread
From: Paul E. McKenney @ 2024-07-03 22:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Andrii Nakryiko, Andrii Nakryiko,
	linux-trace-kernel, mhiramat, oleg, mingo, bpf, jolsa, clm,
	linux-kernel

On Wed, Jul 03, 2024 at 05:57:54PM -0400, Steven Rostedt wrote:
> On Wed, 3 Jul 2024 09:50:57 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > However, in the past, the memory-barrier and array-indexing overhead
> > > of SRCU has made it a no-go for lightweight probes into fastpath code.
> > > And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> > > Tasks Rude).  
> > 
> > I'm thinking we're growing too many RCU flavours again :/ I suppose I'll
> > have to go read up on rcu/tasks.* and see what's what.
> 
> This RCU flavor is the one to handle trampolines. If the trampoline
> never voluntarily schedules, then the quiescent state is a voluntary
> schedule. The issue with trampolines is that if something was preempted
> as it was jumping to a trampoline, there's no way to know when it is
> safe to free that trampoline, as some preempted task's next instruction
> is on that trampoline.
> 
> Any trampoline that does not voluntary schedule can use RCU task
> synchronization. As it will wait till all tasks have voluntarily
> scheduled or have entered user space (IIRC, Paul can correct me if I'm
> wrong).

Agreed!

> Now, if a trampoline does schedule, it would need to incorporate some
> ref counting on the trampoline to handle the scheduling, but could
> still use RCU task synchronization up to the point of the ref count.

Or, if the schedule is due at most to a page fault, it can use RCU
Tasks Trace.

> And yes, the rude flavor was to handle the !rcu_is_watching case, and
> can now be removed.

From x86, agreed.

But have the other architectures done all the inlining and addition of
noistr required to permit this?  (Maybe they have, I honestly do not know.
But last I checked a few months ago, ARMv8 was not ready yet.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-03 20:47     ` Andrii Nakryiko
@ 2024-07-04  8:03       ` Peter Zijlstra
  2024-07-04  8:45         ` Peter Zijlstra
  2024-07-04  8:31       ` Peter Zijlstra
  1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-04  8:03 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	mingo, bpf, jolsa, paulmck, clm

On Wed, Jul 03, 2024 at 01:47:23PM -0700, Andrii Nakryiko wrote:
> Your innocuous "// XXX amortize / batch" comment below is *the major
> point of this patch set*. Try to appreciate that. It's not a small
> todo, it took this entire patch set to allow for that.

Tada!

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 354cab634341..c9c9ec87ab9a 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -115,7 +115,8 @@ extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm
 extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
 extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+#define URF_NO_SYNC	0x01
+extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, unsigned int flags);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -165,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo
 	return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, unsigned int flags)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0b7574a54093..1f4151c518ed 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1145,7 +1145,7 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  * @offset: offset from the start of the file.
  * @uc: identify which probe if multiple probes are colocated.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, unsigned int flags)
 {
 	scoped_guard (srcu, &uprobe_srcu) {
 		struct uprobe *uprobe = find_uprobe(inode, offset);
@@ -1157,7 +1157,8 @@ void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consume
 		mutex_unlock(&uprobe->register_mutex);
 	}
 
-	synchronize_srcu(&uprobe_srcu); // XXX amortize / batch
+	if (!(flags & URF_NO_SYNC))
+		synchronize_srcu(&uprobe_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d1daeab1bbc1..950b5241244a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3182,7 +3182,7 @@ static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
 
 	for (i = 0; i < cnt; i++) {
 		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
-				  &uprobes[i].consumer);
+				  &uprobes[i].consumer, i != cnt-1 ? URF_NO_SYNC : 0);
 	}
 }
 
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c98e3b3386ba..4aafb4485be7 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1112,7 +1112,8 @@ static void __probe_event_disable(struct trace_probe *tp)
 		if (!tu->inode)
 			continue;
 
-		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
+		uprobe_unregister(tu->inode, tu->offset, &tu->consumer,
+				  list_is_last(trace_probe_probe_list(tp), &tu->tp.list) ? 0 : URF_NO_SYNC);
 		tu->inode = NULL;
 	}
 }

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-03 20:47     ` Andrii Nakryiko
  2024-07-04  8:03       ` Peter Zijlstra
@ 2024-07-04  8:31       ` Peter Zijlstra
  1 sibling, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-04  8:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	mingo, bpf, jolsa, paulmck, clm

On Wed, Jul 03, 2024 at 01:47:23PM -0700, Andrii Nakryiko wrote:

> > When I cobble all that together (it really shouldn't be one patch, but
> > you get the idea I hope) it looks a little something like the below.
> >
> > I *think* it should work, but perhaps I've missed something?
> 
> Well, at the very least you missed that we can't delay SRCU (or any
> other sleepable RCU flavor) potentially indefinitely for uretprobes,
> which are completely under user space control.

Sure, but that's fixable. You can work around that by having (u)tasks
with a non-empty return_instance list carry a timer. When/if that timer
fires, it goes and converts the SRCU references to actual references.

Not so very hard to do, but very much not needed for a PoC.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-03 14:08             ` Paul E. McKenney
@ 2024-07-04  8:39               ` Peter Zijlstra
  2024-07-04 15:13                 ` Paul E. McKenney
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-04  8:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Wed, Jul 03, 2024 at 07:08:21AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 03, 2024 at 09:50:57AM +0200, Peter Zijlstra wrote:

> > Would it make sense to disable it for those architectures that have
> > already done this work?
> 
> It might well.  Any architectures other than x86 at this point?

Per 408b961146be ("tracing: WARN on rcuidle")
and git grep "select.*ARCH_WANTS_NO_INSTR"
arch/arm64/Kconfig:     select ARCH_WANTS_NO_INSTR
arch/loongarch/Kconfig: select ARCH_WANTS_NO_INSTR
arch/riscv/Kconfig:     select ARCH_WANTS_NO_INSTR
arch/s390/Kconfig:      select ARCH_WANTS_NO_INSTR
arch/x86/Kconfig:       select ARCH_WANTS_NO_INSTR

I'm thinking you can simply use that same condition here?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-04  8:03       ` Peter Zijlstra
@ 2024-07-04  8:45         ` Peter Zijlstra
  2024-07-04 14:40           ` Masami Hiramatsu
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-04  8:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	mingo, bpf, jolsa, paulmck, clm

On Thu, Jul 04, 2024 at 10:03:48AM +0200, Peter Zijlstra wrote:

> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index c98e3b3386ba..4aafb4485be7 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1112,7 +1112,8 @@ static void __probe_event_disable(struct trace_probe *tp)
>  		if (!tu->inode)
>  			continue;
>  
> -		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> +		uprobe_unregister(tu->inode, tu->offset, &tu->consumer,
> +				  list_is_last(trace_probe_probe_list(tp), &tu->tp.list) ? 0 : URF_NO_SYNC);
>  		tu->inode = NULL;
>  	}
>  }


Hmm, that continue clause might ruin things. Still easy enough to add
uprobe_unregister_sync() and simpy always pass URF_NO_SYNC.

I really don't see why we should make this more complicated than it
needs to be.

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 354cab634341..681741a51df3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -115,7 +115,9 @@ extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm
 extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
 extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+#define URF_NO_SYNC	0x01
+extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, unsigned int flags);
+extern void uprobe_unregister_sync(void);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -165,7 +167,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo
 	return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, unsigned int flags)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0b7574a54093..d09f7b942076 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1145,7 +1145,7 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  * @offset: offset from the start of the file.
  * @uc: identify which probe if multiple probes are colocated.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, unsigned int flags)
 {
 	scoped_guard (srcu, &uprobe_srcu) {
 		struct uprobe *uprobe = find_uprobe(inode, offset);
@@ -1157,10 +1157,17 @@ void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consume
 		mutex_unlock(&uprobe->register_mutex);
 	}
 
-	synchronize_srcu(&uprobe_srcu); // XXX amortize / batch
+	if (!(flags & URF_NO_SYNC))
+		synchronize_srcu(&uprobe_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
+void uprobe_unregister_sync(void)
+{
+	synchronize_srcu(&uprobe_srcu);
+}
+EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
+
 /*
  * __uprobe_register - register a probe
  * @inode: the file in which the probe has to be placed.
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d1daeab1bbc1..1f6adabbb1e7 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3181,9 +3181,10 @@ static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
 	u32 i;
 
 	for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
-				  &uprobes[i].consumer);
+		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset, URF_NO_SYNC);
 	}
+	if (cnt > 0)
+		uprobe_unregister_sync();
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c98e3b3386ba..6b64470a1c5c 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1104,6 +1104,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
 static void __probe_event_disable(struct trace_probe *tp)
 {
 	struct trace_uprobe *tu;
+	bool sync = false;
 
 	tu = container_of(tp, struct trace_uprobe, tp);
 	WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
@@ -1112,9 +1113,12 @@ static void __probe_event_disable(struct trace_probe *tp)
 		if (!tu->inode)
 			continue;
 
-		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
+		uprobe_unregister(tu->inode, tu->offset, &tu->consumer, URF_NO_SYNC);
+		sync = true;
 		tu->inode = NULL;
 	}
+	if (sync)
+		uprobe_unregister_sync();
 }
 
 static int probe_event_enable(struct trace_event_call *call,

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-03 21:33 ` Andrii Nakryiko
@ 2024-07-04  9:15   ` Peter Zijlstra
  2024-07-04 13:56     ` Steven Rostedt
                       ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Peter Zijlstra @ 2024-07-04  9:15 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	mingo, bpf, jolsa, paulmck, clm, open list

On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote:

> 2. More tactically, RCU protection seems like the best way forward. We
> got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
> clarified that RCU Tasks Trace has nothing to do with Tasks Rude
> flavor (whatever that is, I have no idea).
> 
> Now, RCU Tasks Trace were specifically designed for least overhead
> hotpath (reader side) performance, at the expense of slowing down much
> rarer writers. My microbenchmarking does show at least 5% difference.
> Both flavors can handle sleepable uprobes waiting for page faults.
> Tasks Trace flavor is already used for tracing in the BPF realm,
> including for sleepable uprobes and works well. It's not going away.

I need to look into this new RCU flavour and why it exists -- for
example, why can't SRCU be improved to gain the same benefits. This is
what we've always done, improve SRCU.

> Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
> haven't seen a single argument why. Please provide that, or let's
> stick to RCU Tasks Trace, because uprobe's use case is an ideal case
> of what Tasks Trace flavor was designed for.

Because I actually know SRCU, and because it provides a local scope.
It isolates the unregister waiters from other random users. I'm not
going to use this funky new flavour until I truly understand it.

Also, we actually want two scopes here, there is no reason for the
consumer unreg to wait for the retprobe stuff.

> 3. Regardless of RCU flavor, due to RCU protection, we have to add
> batched register/unregister APIs, so we can amortize sync_rcu cost
> during deregistration. Can we please agree on that as well? This is
> the main goal of this patch set and I'd like to land it before working
> further on changing and improving the rest of the locking schema.

See my patch here:

  https://lkml.kernel.org/r/20240704084524.GC28838@noisy.programming.kicks-ass.net

I don't think it needs to be more complicated than that.

> I won't be happy about it, but just to move things forward, I can drop
> a) custom refcounting and/or b) percpu RW semaphore. Both are
> beneficial but not essential for batched APIs work. But if you force
> me to do that, please state clearly your reasons/arguments.

The reason I'm pushing RCU here is because AFAICT uprobes doesn't
actually need the stronger serialisation that rwlock (any flavour)
provide. It is a prime candidate for RCU, and I think you'll find plenty
papers / articles (by both Paul and others) that show that RCU scales
better.

As a bonus, you avoid that horrific write side cost that per-cpu rwsem
has.

The reason I'm not keen on that refcount thing was initially because I
did not understand the justification for it, but worse, once I did read
your justification, your very own numbers convinced me that the refcount
is fundamentally problematic, in any way shape or form.

> No one had yet pointed out why refcounting is broken 

Your very own numbers point out that refcounting is a problem here. 

> and why percpu RW semaphore is bad. 

Literature and history show us that RCU -- where possible -- is
always better than any reader-writer locking scheme.

> 4. Another tactical thing, but an important one. Refcounting schema
> for uprobes. I've replied already, but I think refcounting is
> unavoidable for uretprobes,

I think we can fix that, I replied here:

  https://lkml.kernel.org/r/20240704083152.GQ11386@noisy.programming.kicks-ass.net

> and current refcounting schema is
> problematic for batched APIs due to race between finding uprobe and
> there still being a possibility we'd need to undo all that and retry
> again.

Right, I've not looked too deeply at that, because I've not seen a
reason to actually change that. I can go think about it if you want, but
meh.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-04  9:15   ` Peter Zijlstra
@ 2024-07-04 13:56     ` Steven Rostedt
  2024-07-04 15:44     ` Paul E. McKenney
  2024-07-08 17:48     ` Andrii Nakryiko
  2 siblings, 0 replies; 67+ messages in thread
From: Steven Rostedt @ 2024-07-04 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, mhiramat,
	oleg, mingo, bpf, jolsa, paulmck, clm, open list

On Thu, 4 Jul 2024 11:15:59 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > Now, RCU Tasks Trace were specifically designed for least overhead
> > hotpath (reader side) performance, at the expense of slowing down much
> > rarer writers. My microbenchmarking does show at least 5% difference.
> > Both flavors can handle sleepable uprobes waiting for page faults.
> > Tasks Trace flavor is already used for tracing in the BPF realm,
> > including for sleepable uprobes and works well. It's not going away.  
> 
> I need to look into this new RCU flavour and why it exists -- for
> example, why can't SRCU be improved to gain the same benefits. This is
> what we've always done, improve SRCU.

I don't know about this use case, but for the trampoline use case SRCU
doesn't work as it requires calling a srcu_read_lock() which isn't
possible when you need to take that lock from all function calls just
before it jumps to the ftrace trampoline. That is, it needs to be taken
before "call fentry".

I'm just stating this to provide the reason why we needed that flavor
of RCU.

-- Steve

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-04  8:45         ` Peter Zijlstra
@ 2024-07-04 14:40           ` Masami Hiramatsu
  0 siblings, 0 replies; 67+ messages in thread
From: Masami Hiramatsu @ 2024-07-04 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, paulmck, clm

On Thu, 4 Jul 2024 10:45:24 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jul 04, 2024 at 10:03:48AM +0200, Peter Zijlstra wrote:
> 
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index c98e3b3386ba..4aafb4485be7 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1112,7 +1112,8 @@ static void __probe_event_disable(struct trace_probe *tp)
> >  		if (!tu->inode)
> >  			continue;
> >  
> > -		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> > +		uprobe_unregister(tu->inode, tu->offset, &tu->consumer,
> > +				  list_is_last(trace_probe_probe_list(tp), &tu->tp.list) ? 0 : URF_NO_SYNC);
> >  		tu->inode = NULL;
> >  	}
> >  }
> 
> 
> Hmm, that continue clause might ruin things. Still easy enough to add
> uprobe_unregister_sync() and simpy always pass URF_NO_SYNC.
> 
> I really don't see why we should make this more complicated than it
> needs to be.
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 354cab634341..681741a51df3 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -115,7 +115,9 @@ extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm
>  extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
>  extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
> -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> +#define URF_NO_SYNC	0x01
> +extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, unsigned int flags);
> +extern void uprobe_unregister_sync(void);
>  extern int uprobe_mmap(struct vm_area_struct *vma);
>  extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
>  extern void uprobe_start_dup_mmap(void);
> @@ -165,7 +167,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo
>  	return -ENOSYS;
>  }
>  static inline void
> -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> +uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, unsigned int flags)

nit: IMHO, I would like to see uprobe_unregister_nosync() variant instead of
adding flags.

Thank you,

>  {
>  }
>  static inline int uprobe_mmap(struct vm_area_struct *vma)
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 0b7574a54093..d09f7b942076 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1145,7 +1145,7 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>   * @offset: offset from the start of the file.
>   * @uc: identify which probe if multiple probes are colocated.
>   */
> -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> +void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, unsigned int flags)
>  {
>  	scoped_guard (srcu, &uprobe_srcu) {
>  		struct uprobe *uprobe = find_uprobe(inode, offset);
> @@ -1157,10 +1157,17 @@ void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consume
>  		mutex_unlock(&uprobe->register_mutex);
>  	}
>  
> -	synchronize_srcu(&uprobe_srcu); // XXX amortize / batch
> +	if (!(flags & URF_NO_SYNC))
> +		synchronize_srcu(&uprobe_srcu);
>  }
>  EXPORT_SYMBOL_GPL(uprobe_unregister);
>  
> +void uprobe_unregister_sync(void)
> +{
> +	synchronize_srcu(&uprobe_srcu);
> +}
> +EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
> +
>  /*
>   * __uprobe_register - register a probe
>   * @inode: the file in which the probe has to be placed.
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d1daeab1bbc1..1f6adabbb1e7 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -3181,9 +3181,10 @@ static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
>  	u32 i;
>  
>  	for (i = 0; i < cnt; i++) {
> -		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
> -				  &uprobes[i].consumer);
> +		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset, URF_NO_SYNC);
>  	}
> +	if (cnt > 0)
> +		uprobe_unregister_sync();
>  }
>  
>  static void bpf_uprobe_multi_link_release(struct bpf_link *link)
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index c98e3b3386ba..6b64470a1c5c 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1104,6 +1104,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
>  static void __probe_event_disable(struct trace_probe *tp)
>  {
>  	struct trace_uprobe *tu;
> +	bool sync = false;
>  
>  	tu = container_of(tp, struct trace_uprobe, tp);
>  	WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
> @@ -1112,9 +1113,12 @@ static void __probe_event_disable(struct trace_probe *tp)
>  		if (!tu->inode)
>  			continue;
>  
> -		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> +		uprobe_unregister(tu->inode, tu->offset, &tu->consumer, URF_NO_SYNC);
> +		sync = true;
>  		tu->inode = NULL;
>  	}
> +	if (sync)
> +		uprobe_unregister_sync();
>  }
>  
>  static int probe_event_enable(struct trace_event_call *call,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-04  8:39               ` Peter Zijlstra
@ 2024-07-04 15:13                 ` Paul E. McKenney
  0 siblings, 0 replies; 67+ messages in thread
From: Paul E. McKenney @ 2024-07-04 15:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, linux-kernel

On Thu, Jul 04, 2024 at 10:39:35AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2024 at 07:08:21AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 03, 2024 at 09:50:57AM +0200, Peter Zijlstra wrote:
> 
> > > Would it make sense to disable it for those architectures that have
> > > already done this work?
> > 
> > It might well.  Any architectures other than x86 at this point?
> 
> Per 408b961146be ("tracing: WARN on rcuidle")
> and git grep "select.*ARCH_WANTS_NO_INSTR"
> arch/arm64/Kconfig:     select ARCH_WANTS_NO_INSTR
> arch/loongarch/Kconfig: select ARCH_WANTS_NO_INSTR
> arch/riscv/Kconfig:     select ARCH_WANTS_NO_INSTR
> arch/s390/Kconfig:      select ARCH_WANTS_NO_INSTR
> arch/x86/Kconfig:       select ARCH_WANTS_NO_INSTR
> 
> I'm thinking you can simply use that same condition here?

New one on me!  And it does look like that would work, and it also
looks like other code assumes that these architectures have all of their
deep-idle and entry/exit functions either inlined or noinstr-ed, so it
should be OK for Tasks Trace Rude to do likewise.  Thank you!!!

If you would like a sneak preview, please see the last few commits on
the "dev" branch of -rcu.  And this is easier than my original plan
immortalized (at least temporarily) on the "dev.2024.07.02a" branch.

Things left to do: (1) Rebase fixes into original commits. (2)
Make RCU Tasks stop ignoring idle tasks.  (3) Reorder the commits
for bisectability.  (4) Make rcutorture test RCU Tasks Rude even when
running on platforms that don't need it.  (5) Fix other bugs that I have
not yet spotted.

I expect to post an RFC patch early next week.  Unless there is some
emergency, I will slate these for the v6.12 merge window to give them
some soak time.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-04  9:15   ` Peter Zijlstra
  2024-07-04 13:56     ` Steven Rostedt
@ 2024-07-04 15:44     ` Paul E. McKenney
  2024-07-08 17:47       ` Andrii Nakryiko
  2024-07-08 17:48     ` Andrii Nakryiko
  2 siblings, 1 reply; 67+ messages in thread
From: Paul E. McKenney @ 2024-07-04 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, open list

On Thu, Jul 04, 2024 at 11:15:59AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote:
> 
> > 2. More tactically, RCU protection seems like the best way forward. We
> > got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
> > clarified that RCU Tasks Trace has nothing to do with Tasks Rude
> > flavor (whatever that is, I have no idea).
> > 
> > Now, RCU Tasks Trace were specifically designed for least overhead
> > hotpath (reader side) performance, at the expense of slowing down much
> > rarer writers. My microbenchmarking does show at least 5% difference.
> > Both flavors can handle sleepable uprobes waiting for page faults.
> > Tasks Trace flavor is already used for tracing in the BPF realm,
> > including for sleepable uprobes and works well. It's not going away.
> 
> I need to look into this new RCU flavour and why it exists -- for
> example, why can't SRCU be improved to gain the same benefits. This is
> what we've always done, improve SRCU.

Well, it is all software.  And I certainly pushed SRCU hard.  If I recall
correctly, it took them a year to convince me that they needed something
more than SRCU could reasonably be convinced to do.

The big problem is that they need to be able to hook a simple BPF program
(for example, count the number of calls with given argument values) on
a fastpath function on a system running in production without causing
the automation to decide that this system is too slow, thus whacking it
over the head.  Any appreciable overhead is a no-go in this use case.
It is not just that the srcu_read_lock() function's smp_mb() call would
disqualify SRCU, its other added overhead would as well.  Plus this needs
RCU Tasks Trace CPU stall warnings to catch abuse, and SRCU doesn't
impose any limits on readers (how long to set the stall time?) and
doesn't track tasks.

> > Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
> > haven't seen a single argument why. Please provide that, or let's
> > stick to RCU Tasks Trace, because uprobe's use case is an ideal case
> > of what Tasks Trace flavor was designed for.
> 
> Because I actually know SRCU, and because it provides a local scope.
> It isolates the unregister waiters from other random users. I'm not
> going to use this funky new flavour until I truly understand it.

It is only a few hundred lines of code on top of the infrastructure
that also supports RCU Tasks and RCU Tasks Rude.  If you understand
SRCU and preemptible RCU, there will be nothing exotic there, and it is
simpler than Tree SRCU, to say nothing of preemptible RCU.  I would be
more than happy to take you through it if you would like, but not before
this coming Monday.

> Also, we actually want two scopes here, there is no reason for the
> consumer unreg to wait for the retprobe stuff.

I don't know that the performance requirements for userspace retprobes are
as severe as for function-call probes -- on that, I must defer to Andrii.
To your two-scopes point, it is quite possible that SRCU could be used
for userspace retprobes and RCU Tasks Trace for the others.  It certainly
seems to me that SRCU would be better than explicit reference counting,
but I could be missing something.  (Memory footprint, perhaps?  Though
maybe a single srcu_struct could be shared among all userspace retprobes.
Given the time-bounded reads, maybe stall warnings aren't needed,
give or take things like interrupts, preemption, and vCPU preemption.
Plus it is not like it would be hard to figure out which read-side code
region was at fault when the synchronize_srcu() took too long.)

							Thanx, Paul

> > 3. Regardless of RCU flavor, due to RCU protection, we have to add
> > batched register/unregister APIs, so we can amortize sync_rcu cost
> > during deregistration. Can we please agree on that as well? This is
> > the main goal of this patch set and I'd like to land it before working
> > further on changing and improving the rest of the locking schema.
> 
> See my patch here:
> 
>   https://lkml.kernel.org/r/20240704084524.GC28838@noisy.programming.kicks-ass.net
> 
> I don't think it needs to be more complicated than that.
> 
> > I won't be happy about it, but just to move things forward, I can drop
> > a) custom refcounting and/or b) percpu RW semaphore. Both are
> > beneficial but not essential for batched APIs work. But if you force
> > me to do that, please state clearly your reasons/arguments.
> 
> The reason I'm pushing RCU here is because AFAICT uprobes doesn't
> actually need the stronger serialisation that rwlock (any flavour)
> provide. It is a prime candidate for RCU, and I think you'll find plenty
> papers / articles (by both Paul and others) that show that RCU scales
> better.
> 
> As a bonus, you avoid that horrific write side cost that per-cpu rwsem
> has.
> 
> The reason I'm not keen on that refcount thing was initially because I
> did not understand the justification for it, but worse, once I did read
> your justification, your very own numbers convinced me that the refcount
> is fundamentally problematic, in any way shape or form.
> 
> > No one had yet pointed out why refcounting is broken 
> 
> Your very own numbers point out that refcounting is a problem here. 
> 
> > and why percpu RW semaphore is bad. 
> 
> Literature and history show us that RCU -- where possible -- is
> always better than any reader-writer locking scheme.
> 
> > 4. Another tactical thing, but an important one. Refcounting schema
> > for uprobes. I've replied already, but I think refcounting is
> > unavoidable for uretprobes,
> 
> I think we can fix that, I replied here:
> 
>   https://lkml.kernel.org/r/20240704083152.GQ11386@noisy.programming.kicks-ass.net
> 
> > and current refcounting schema is
> > problematic for batched APIs due to race between finding uprobe and
> > there still being a possibility we'd need to undo all that and retry
> > again.
> 
> Right, I've not looked too deeply at that, because I've not seen a
> reason to actually change that. I can go think about it if you want, but
> meh.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-01 22:39 ` [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
  2024-07-02 10:22   ` Peter Zijlstra
  2024-07-03 13:36   ` Peter Zijlstra
@ 2024-07-05 15:37   ` Oleg Nesterov
  2024-07-06 17:00     ` Jiri Olsa
                       ` (2 more replies)
  2 siblings, 3 replies; 67+ messages in thread
From: Oleg Nesterov @ 2024-07-05 15:37 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, peterz, mingo, bpf, jolsa,
	paulmck, clm

Tried to read this patch, but I fail to understand it. It looks
obvioulsy wrong to me, see below.

I tend to agree with the comments from Peter, but lets ignore them
for the moment.

On 07/01, Andrii Nakryiko wrote:
>
>  static void put_uprobe(struct uprobe *uprobe)
>  {
> -	if (refcount_dec_and_test(&uprobe->ref)) {
> +	s64 v;
> +
> +	/*
> +	 * here uprobe instance is guaranteed to be alive, so we use Tasks
> +	 * Trace RCU to guarantee that uprobe won't be freed from under us, if
> +	 * we end up being a losing "destructor" inside uprobe_treelock'ed
> +	 * section double-checking uprobe->ref value below.
> +	 * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> +	 */
> +	rcu_read_lock_trace();
> +
> +	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> +
> +	if (unlikely((u32)v == 0)) {

I must have missed something, but how can this ever happen?

Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
that this binary is not used, so _register() doesn't install breakpoints/etc.

IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.

Now suppose that uprobe_unregister() is called right after that. It does

	uprobe = find_uprobe(inode, offset);

this increments the counter, (u32)uprobe->ref == 2

	__uprobe_unregister(...);

this wont't change the counter,

	put_uprobe(uprobe);

this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.

Where should the "final" put_uprobe() come from?

IIUC, this patch lacks another put_uprobe() after consumer_del(), no?

Oleg.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-05 15:37   ` Oleg Nesterov
@ 2024-07-06 17:00     ` Jiri Olsa
  2024-07-06 17:05       ` Jiri Olsa
  2024-07-07 14:46     ` Oleg Nesterov
  2024-07-08 17:47     ` Andrii Nakryiko
  2 siblings, 1 reply; 67+ messages in thread
From: Jiri Olsa @ 2024-07-06 17:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, paulmck, clm

On Fri, Jul 05, 2024 at 05:37:05PM +0200, Oleg Nesterov wrote:
> Tried to read this patch, but I fail to understand it. It looks
> obvioulsy wrong to me, see below.
> 
> I tend to agree with the comments from Peter, but lets ignore them
> for the moment.
> 
> On 07/01, Andrii Nakryiko wrote:
> >
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> > -	if (refcount_dec_and_test(&uprobe->ref)) {
> > +	s64 v;
> > +
> > +	/*
> > +	 * here uprobe instance is guaranteed to be alive, so we use Tasks
> > +	 * Trace RCU to guarantee that uprobe won't be freed from under us, if
> > +	 * we end up being a losing "destructor" inside uprobe_treelock'ed
> > +	 * section double-checking uprobe->ref value below.
> > +	 * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > +	 */
> > +	rcu_read_lock_trace();
> > +
> > +	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > +
> > +	if (unlikely((u32)v == 0)) {
> 
> I must have missed something, but how can this ever happen?
> 
> Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> that this binary is not used, so _register() doesn't install breakpoints/etc.
> 
> IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.
> 
> Now suppose that uprobe_unregister() is called right after that. It does
> 
> 	uprobe = find_uprobe(inode, offset);
> 
> this increments the counter, (u32)uprobe->ref == 2
> 
> 	__uprobe_unregister(...);
> 
> this wont't change the counter,

__uprobe_unregister calls delete_uprobe that calls put_uprobe ?

jirka

> 
> 	put_uprobe(uprobe);
> 
> this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
> 
> Where should the "final" put_uprobe() come from?
> 
> IIUC, this patch lacks another put_uprobe() after consumer_del(), no?
> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-06 17:00     ` Jiri Olsa
@ 2024-07-06 17:05       ` Jiri Olsa
  0 siblings, 0 replies; 67+ messages in thread
From: Jiri Olsa @ 2024-07-06 17:05 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, peterz, mingo, bpf, paulmck, clm

On Sat, Jul 06, 2024 at 07:00:34PM +0200, Jiri Olsa wrote:
> On Fri, Jul 05, 2024 at 05:37:05PM +0200, Oleg Nesterov wrote:
> > Tried to read this patch, but I fail to understand it. It looks
> > obvioulsy wrong to me, see below.
> > 
> > I tend to agree with the comments from Peter, but lets ignore them
> > for the moment.
> > 
> > On 07/01, Andrii Nakryiko wrote:
> > >
> > >  static void put_uprobe(struct uprobe *uprobe)
> > >  {
> > > -	if (refcount_dec_and_test(&uprobe->ref)) {
> > > +	s64 v;
> > > +
> > > +	/*
> > > +	 * here uprobe instance is guaranteed to be alive, so we use Tasks
> > > +	 * Trace RCU to guarantee that uprobe won't be freed from under us, if
> > > +	 * we end up being a losing "destructor" inside uprobe_treelock'ed
> > > +	 * section double-checking uprobe->ref value below.
> > > +	 * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > > +	 */
> > > +	rcu_read_lock_trace();
> > > +
> > > +	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > > +
> > > +	if (unlikely((u32)v == 0)) {
> > 
> > I must have missed something, but how can this ever happen?
> > 
> > Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> > that this binary is not used, so _register() doesn't install breakpoints/etc.
> > 
> > IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.
> > 
> > Now suppose that uprobe_unregister() is called right after that. It does
> > 
> > 	uprobe = find_uprobe(inode, offset);
> > 
> > this increments the counter, (u32)uprobe->ref == 2
> > 
> > 	__uprobe_unregister(...);
> > 
> > this wont't change the counter,
> 
> __uprobe_unregister calls delete_uprobe that calls put_uprobe ?

ugh, wrong sources.. ok, don't know ;-)

jirka

> 
> jirka
> 
> > 
> > 	put_uprobe(uprobe);
> > 
> > this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
> > 
> > Where should the "final" put_uprobe() come from?
> > 
> > IIUC, this patch lacks another put_uprobe() after consumer_del(), no?
> > 
> > Oleg.
> > 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
  2024-07-01 22:39 ` [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Andrii Nakryiko
  2024-07-03  8:13   ` Peter Zijlstra
@ 2024-07-07 12:48   ` Oleg Nesterov
  2024-07-08 17:56     ` Andrii Nakryiko
  1 sibling, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2024-07-07 12:48 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, peterz, mingo, bpf, jolsa,
	paulmck, clm

On 07/01, Andrii Nakryiko wrote:
>
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -42,6 +42,11 @@ struct uprobe_consumer {
>  				enum uprobe_filter_ctx ctx,
>  				struct mm_struct *mm);
>  
> +	/* associated file offset of this probe */
> +	loff_t offset;
> +	/* associated refctr file offset of this probe, or zero */
> +	loff_t ref_ctr_offset;
> +	/* for internal uprobe infra use, consumers shouldn't touch fields below */
>  	struct uprobe_consumer *next;


Well, I don't really like this patch either...

If nothing else because all the consumers in uprobe->consumers list
must have the same offset/ref_ctr_offset.

--------------------------------------------------------------------------
But I agree, the ugly uprobe_register_refctr() must die, we need a single
function

	int uprobe_register(inode, offset, ref_ctr_offset, consumer);

This change is trivial.

--------------------------------------------------------------------------
And speaking of cleanups, I think another change makes sense:

	-	int uprobe_register(...);
	+	struct uprobe* uprobe_register(...);

so that uprobe_register() returns uprobe or ERR_PTR.

	-	void uprobe_unregister(inode, offset, consumer);
	+	void uprobe_unregister(uprobe, consumer);

this way unregister() doesn't need the extra find_uprobe() + put_uprobe().
The same for uprobe_apply().

The necessary changes in kernel/trace/trace_uprobe.c are trivial, we just
need to change struct trace_uprobe

	-	struct inode                    *inode;
	+	struct uprobe			*uprobe;

and fix the compilation errors.


As for kernel/trace/bpf_trace.c, I guess struct bpf_uprobe  needs the new
->uprobe member, we can't kill bpf_uprobe->offset because of
bpf_uprobe_multi_link_fill_link_info(), but I think this is not that bad.

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-05 15:37   ` Oleg Nesterov
  2024-07-06 17:00     ` Jiri Olsa
@ 2024-07-07 14:46     ` Oleg Nesterov
  2024-07-08 17:47       ` Andrii Nakryiko
  2024-07-08 17:47     ` Andrii Nakryiko
  2 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2024-07-07 14:46 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, peterz, mingo, bpf, jolsa,
	paulmck, clm

And I forgot to mention...

In any case __uprobe_unregister() can't ignore the error code from
register_for_each_vma(). If it fails to restore the original insn,
we should not remove this uprobe from uprobes_tree.

Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
probed application.

On 07/05, Oleg Nesterov wrote:
>
> Tried to read this patch, but I fail to understand it. It looks
> obvioulsy wrong to me, see below.
>
> I tend to agree with the comments from Peter, but lets ignore them
> for the moment.
>
> On 07/01, Andrii Nakryiko wrote:
> >
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> > -	if (refcount_dec_and_test(&uprobe->ref)) {
> > +	s64 v;
> > +
> > +	/*
> > +	 * here uprobe instance is guaranteed to be alive, so we use Tasks
> > +	 * Trace RCU to guarantee that uprobe won't be freed from under us, if
> > +	 * we end up being a losing "destructor" inside uprobe_treelock'ed
> > +	 * section double-checking uprobe->ref value below.
> > +	 * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > +	 */
> > +	rcu_read_lock_trace();
> > +
> > +	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > +
> > +	if (unlikely((u32)v == 0)) {
>
> I must have missed something, but how can this ever happen?
>
> Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> that this binary is not used, so _register() doesn't install breakpoints/etc.
>
> IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.
>
> Now suppose that uprobe_unregister() is called right after that. It does
>
> 	uprobe = find_uprobe(inode, offset);
>
> this increments the counter, (u32)uprobe->ref == 2
>
> 	__uprobe_unregister(...);
>
> this wont't change the counter,
>
> 	put_uprobe(uprobe);
>
> this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
>
> Where should the "final" put_uprobe() come from?
>
> IIUC, this patch lacks another put_uprobe() after consumer_del(), no?
>
> Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-05 15:37   ` Oleg Nesterov
  2024-07-06 17:00     ` Jiri Olsa
  2024-07-07 14:46     ` Oleg Nesterov
@ 2024-07-08 17:47     ` Andrii Nakryiko
  2 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-08 17:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On Fri, Jul 5, 2024 at 8:38 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Tried to read this patch, but I fail to understand it. It looks
> obvioulsy wrong to me, see below.
>
> I tend to agree with the comments from Peter, but lets ignore them
> for the moment.
>
> On 07/01, Andrii Nakryiko wrote:
> >
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> > -     if (refcount_dec_and_test(&uprobe->ref)) {
> > +     s64 v;
> > +
> > +     /*
> > +      * here uprobe instance is guaranteed to be alive, so we use Tasks
> > +      * Trace RCU to guarantee that uprobe won't be freed from under us, if
> > +      * we end up being a losing "destructor" inside uprobe_treelock'ed
> > +      * section double-checking uprobe->ref value below.
> > +      * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > +      */
> > +     rcu_read_lock_trace();
> > +
> > +     v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > +
> > +     if (unlikely((u32)v == 0)) {
>
> I must have missed something, but how can this ever happen?
>
> Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> that this binary is not used, so _register() doesn't install breakpoints/etc.
>
> IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.
>
> Now suppose that uprobe_unregister() is called right after that. It does
>
>         uprobe = find_uprobe(inode, offset);
>
> this increments the counter, (u32)uprobe->ref == 2
>
>         __uprobe_unregister(...);
>
> this wont't change the counter,
>
>         put_uprobe(uprobe);
>
> this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
>
> Where should the "final" put_uprobe() come from?
>
> IIUC, this patch lacks another put_uprobe() after consumer_del(), no?

Argh, this is an artifact of splitting the overall change into
separate patches. The final version of uprobe_unregister() doesn't do
find_uprobe(), we just get it from uprobe_consumer->uprobe pointer
without any tree lookup.

>
> Oleg.
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-07 14:46     ` Oleg Nesterov
@ 2024-07-08 17:47       ` Andrii Nakryiko
  2024-07-09 18:47         ` Oleg Nesterov
  0 siblings, 1 reply; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-08 17:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On Sun, Jul 7, 2024 at 7:48 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> And I forgot to mention...
>
> In any case __uprobe_unregister() can't ignore the error code from
> register_for_each_vma(). If it fails to restore the original insn,
> we should not remove this uprobe from uprobes_tree.
>
> Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
> probed application.

Yep, that would be unfortunate (just like SIGILL sent when uretprobe
detects "improper" stack pointer progression, for example), but from
what I gather it's not really expected to fail on unregistration given
we successfully registered uprobe. I guess it's a decision between
leaking memory with an uprobe stuck in the tree or killing process due
to some very rare (or buggy) condition?


>
> On 07/05, Oleg Nesterov wrote:
> >
> > Tried to read this patch, but I fail to understand it. It looks
> > obvioulsy wrong to me, see below.
> >
> > I tend to agree with the comments from Peter, but lets ignore them
> > for the moment.
> >
> > On 07/01, Andrii Nakryiko wrote:
> > >
> > >  static void put_uprobe(struct uprobe *uprobe)
> > >  {
> > > -   if (refcount_dec_and_test(&uprobe->ref)) {
> > > +   s64 v;
> > > +
> > > +   /*
> > > +    * here uprobe instance is guaranteed to be alive, so we use Tasks
> > > +    * Trace RCU to guarantee that uprobe won't be freed from under us, if
> > > +    * we end up being a losing "destructor" inside uprobe_treelock'ed
> > > +    * section double-checking uprobe->ref value below.
> > > +    * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > > +    */
> > > +   rcu_read_lock_trace();
> > > +
> > > +   v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
> > > +
> > > +   if (unlikely((u32)v == 0)) {
> >
> > I must have missed something, but how can this ever happen?
> >
> > Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> > that this binary is not used, so _register() doesn't install breakpoints/etc.
> >
> > IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.
> >
> > Now suppose that uprobe_unregister() is called right after that. It does
> >
> >       uprobe = find_uprobe(inode, offset);
> >
> > this increments the counter, (u32)uprobe->ref == 2
> >
> >       __uprobe_unregister(...);
> >
> > this wont't change the counter,
> >
> >       put_uprobe(uprobe);
> >
> > this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
> >
> > Where should the "final" put_uprobe() come from?
> >
> > IIUC, this patch lacks another put_uprobe() after consumer_del(), no?
> >
> > Oleg.
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-04 15:44     ` Paul E. McKenney
@ 2024-07-08 17:47       ` Andrii Nakryiko
  0 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-08 17:47 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Andrii Nakryiko, linux-trace-kernel, rostedt,
	mhiramat, oleg, mingo, bpf, jolsa, clm, open list

On Thu, Jul 4, 2024 at 8:44 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Jul 04, 2024 at 11:15:59AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote:
> >
> > > 2. More tactically, RCU protection seems like the best way forward. We
> > > got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
> > > clarified that RCU Tasks Trace has nothing to do with Tasks Rude
> > > flavor (whatever that is, I have no idea).
> > >
> > > Now, RCU Tasks Trace were specifically designed for least overhead
> > > hotpath (reader side) performance, at the expense of slowing down much
> > > rarer writers. My microbenchmarking does show at least 5% difference.
> > > Both flavors can handle sleepable uprobes waiting for page faults.
> > > Tasks Trace flavor is already used for tracing in the BPF realm,
> > > including for sleepable uprobes and works well. It's not going away.
> >
> > I need to look into this new RCU flavour and why it exists -- for
> > example, why can't SRCU be improved to gain the same benefits. This is
> > what we've always done, improve SRCU.
>
> Well, it is all software.  And I certainly pushed SRCU hard.  If I recall
> correctly, it took them a year to convince me that they needed something
> more than SRCU could reasonably be convinced to do.
>
> The big problem is that they need to be able to hook a simple BPF program
> (for example, count the number of calls with given argument values) on
> a fastpath function on a system running in production without causing
> the automation to decide that this system is too slow, thus whacking it
> over the head.  Any appreciable overhead is a no-go in this use case.
> It is not just that the srcu_read_lock() function's smp_mb() call would
> disqualify SRCU, its other added overhead would as well.  Plus this needs
> RCU Tasks Trace CPU stall warnings to catch abuse, and SRCU doesn't
> impose any limits on readers (how long to set the stall time?) and
> doesn't track tasks.
>
> > > Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
> > > haven't seen a single argument why. Please provide that, or let's
> > > stick to RCU Tasks Trace, because uprobe's use case is an ideal case
> > > of what Tasks Trace flavor was designed for.
> >
> > Because I actually know SRCU, and because it provides a local scope.
> > It isolates the unregister waiters from other random users. I'm not
> > going to use this funky new flavour until I truly understand it.
>
> It is only a few hundred lines of code on top of the infrastructure
> that also supports RCU Tasks and RCU Tasks Rude.  If you understand
> SRCU and preemptible RCU, there will be nothing exotic there, and it is
> simpler than Tree SRCU, to say nothing of preemptible RCU.  I would be
> more than happy to take you through it if you would like, but not before
> this coming Monday.
>
> > Also, we actually want two scopes here, there is no reason for the
> > consumer unreg to wait for the retprobe stuff.
>
> I don't know that the performance requirements for userspace retprobes are
> as severe as for function-call probes -- on that, I must defer to Andrii.

uretprobes are just as important (performance-wise and just in term of
functionality), as they are often used simultaneously (e.g., to time
some user function or capture input args and make decision whether to
log them based on return value). uretprobes are inherently slower
(because they are entry probe + some extra bookkeeping and overhead),
but we should do the best we can to ensure they are as performant as
possible


> To your two-scopes point, it is quite possible that SRCU could be used
> for userspace retprobes and RCU Tasks Trace for the others.  It certainly
> seems to me that SRCU would be better than explicit reference counting,
> but I could be missing something.  (Memory footprint, perhaps?  Though
> maybe a single srcu_struct could be shared among all userspace retprobes.
> Given the time-bounded reads, maybe stall warnings aren't needed,
> give or take things like interrupts, preemption, and vCPU preemption.
> Plus it is not like it would be hard to figure out which read-side code
> region was at fault when the synchronize_srcu() took too long.)
>
>                                                         Thanx, Paul
>
> > > 3. Regardless of RCU flavor, due to RCU protection, we have to add
> > > batched register/unregister APIs, so we can amortize sync_rcu cost
> > > during deregistration. Can we please agree on that as well? This is
> > > the main goal of this patch set and I'd like to land it before working
> > > further on changing and improving the rest of the locking schema.
> >
> > See my patch here:
> >
> >   https://lkml.kernel.org/r/20240704084524.GC28838@noisy.programming.kicks-ass.net
> >
> > I don't think it needs to be more complicated than that.
> >
> > > I won't be happy about it, but just to move things forward, I can drop
> > > a) custom refcounting and/or b) percpu RW semaphore. Both are
> > > beneficial but not essential for batched APIs work. But if you force
> > > me to do that, please state clearly your reasons/arguments.
> >
> > The reason I'm pushing RCU here is because AFAICT uprobes doesn't
> > actually need the stronger serialisation that rwlock (any flavour)
> > provide. It is a prime candidate for RCU, and I think you'll find plenty
> > papers / articles (by both Paul and others) that show that RCU scales
> > better.
> >
> > As a bonus, you avoid that horrific write side cost that per-cpu rwsem
> > has.
> >
> > The reason I'm not keen on that refcount thing was initially because I
> > did not understand the justification for it, but worse, once I did read
> > your justification, your very own numbers convinced me that the refcount
> > is fundamentally problematic, in any way shape or form.
> >
> > > No one had yet pointed out why refcounting is broken
> >
> > Your very own numbers point out that refcounting is a problem here.
> >
> > > and why percpu RW semaphore is bad.
> >
> > Literature and history show us that RCU -- where possible -- is
> > always better than any reader-writer locking scheme.
> >
> > > 4. Another tactical thing, but an important one. Refcounting schema
> > > for uprobes. I've replied already, but I think refcounting is
> > > unavoidable for uretprobes,
> >
> > I think we can fix that, I replied here:
> >
> >   https://lkml.kernel.org/r/20240704083152.GQ11386@noisy.programming.kicks-ass.net
> >
> > > and current refcounting schema is
> > > problematic for batched APIs due to race between finding uprobe and
> > > there still being a possibility we'd need to undo all that and retry
> > > again.
> >
> > Right, I've not looked too deeply at that, because I've not seen a
> > reason to actually change that. I can go think about it if you want, but
> > meh.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  2024-07-04  9:15   ` Peter Zijlstra
  2024-07-04 13:56     ` Steven Rostedt
  2024-07-04 15:44     ` Paul E. McKenney
@ 2024-07-08 17:48     ` Andrii Nakryiko
  2 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-08 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	mingo, bpf, jolsa, paulmck, clm, open list

On Thu, Jul 4, 2024 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote:
>
> > 2. More tactically, RCU protection seems like the best way forward. We
> > got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
> > clarified that RCU Tasks Trace has nothing to do with Tasks Rude
> > flavor (whatever that is, I have no idea).
> >
> > Now, RCU Tasks Trace were specifically designed for least overhead
> > hotpath (reader side) performance, at the expense of slowing down much
> > rarer writers. My microbenchmarking does show at least 5% difference.
> > Both flavors can handle sleepable uprobes waiting for page faults.
> > Tasks Trace flavor is already used for tracing in the BPF realm,
> > including for sleepable uprobes and works well. It's not going away.
>
> I need to look into this new RCU flavour and why it exists -- for
> example, why can't SRCU be improved to gain the same benefits. This is
> what we've always done, improve SRCU.

Yes, that makes sense, in principle. But if it takes too much time to
improve SRCU, I'd say it's reasonable to use the faster solution until
it can be unified (if at all, of course).

>
> > Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
> > haven't seen a single argument why. Please provide that, or let's
> > stick to RCU Tasks Trace, because uprobe's use case is an ideal case
> > of what Tasks Trace flavor was designed for.
>
> Because I actually know SRCU, and because it provides a local scope.
> It isolates the unregister waiters from other random users. I'm not
> going to use this funky new flavour until I truly understand it.
>
> Also, we actually want two scopes here, there is no reason for the
> consumer unreg to wait for the retprobe stuff.
>

Uprobe attachment/detachment (i.e., register/unregister) is a very
rare operation. Its performance doesn't really matter in the great
scheme of things. In the sense that whether it takes 1, 10, or 200
milliseconds is immaterial compared to uprobe/uretprobe triggering
performance. The only important thing is that it doesn't take multiple
seconds and minutes (or even hours, if we do synchronize_rcu
unconditionally after each unregister) to attach/detach 100s/1000s+
uprobes.

I'm just saying this is the wrong target to optimize for if we just
ensure that it's reasonably performant in the face of multiple uprobes
registering/unregistering. (so one common SRCU scope for
registration/unregistration is totally fine, IMO)


> > 3. Regardless of RCU flavor, due to RCU protection, we have to add
> > batched register/unregister APIs, so we can amortize sync_rcu cost
> > during deregistration. Can we please agree on that as well? This is
> > the main goal of this patch set and I'd like to land it before working
> > further on changing and improving the rest of the locking schema.
>
> See my patch here:
>
>   https://lkml.kernel.org/r/20240704084524.GC28838@noisy.programming.kicks-ass.net
>
> I don't think it needs to be more complicated than that.

Alright, I'll take a closer look this week and will run it through my
tests and benchmarks, thanks for working on this and sending it out!

>
> > I won't be happy about it, but just to move things forward, I can drop
> > a) custom refcounting and/or b) percpu RW semaphore. Both are
> > beneficial but not essential for batched APIs work. But if you force
> > me to do that, please state clearly your reasons/arguments.
>
> The reason I'm pushing RCU here is because AFAICT uprobes doesn't
> actually need the stronger serialisation that rwlock (any flavour)
> provide. It is a prime candidate for RCU, and I think you'll find plenty
> papers / articles (by both Paul and others) that show that RCU scales
> better.
>
> As a bonus, you avoid that horrific write side cost that per-cpu rwsem
> has.
>
> The reason I'm not keen on that refcount thing was initially because I
> did not understand the justification for it, but worse, once I did read
> your justification, your very own numbers convinced me that the refcount
> is fundamentally problematic, in any way shape or form.
>
> > No one had yet pointed out why refcounting is broken
>
> Your very own numbers point out that refcounting is a problem here.

Yes, I already agreed on avoiding refcounting if possible. The
question above was why the refcounting I added was broken by itself.
But it's a moot point (at least for now), let me go look at your
patches.

>
> > and why percpu RW semaphore is bad.
>
> Literature and history show us that RCU -- where possible -- is
> always better than any reader-writer locking scheme.
>
> > 4. Another tactical thing, but an important one. Refcounting schema
> > for uprobes. I've replied already, but I think refcounting is
> > unavoidable for uretprobes,
>
> I think we can fix that, I replied here:
>
>   https://lkml.kernel.org/r/20240704083152.GQ11386@noisy.programming.kicks-ass.net
>
> > and current refcounting schema is
> > problematic for batched APIs due to race between finding uprobe and
> > there still being a possibility we'd need to undo all that and retry
> > again.
>
> Right, I've not looked too deeply at that, because I've not seen a
> reason to actually change that. I can go think about it if you want, but
> meh.

Ok, let's postpone that if we can get away with just sync/nosync
uprobe_unregister.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
  2024-07-07 12:48   ` Oleg Nesterov
@ 2024-07-08 17:56     ` Andrii Nakryiko
  0 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-08 17:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On Sun, Jul 7, 2024 at 5:50 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/01, Andrii Nakryiko wrote:
> >
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -42,6 +42,11 @@ struct uprobe_consumer {
> >                               enum uprobe_filter_ctx ctx,
> >                               struct mm_struct *mm);
> >
> > +     /* associated file offset of this probe */
> > +     loff_t offset;
> > +     /* associated refctr file offset of this probe, or zero */
> > +     loff_t ref_ctr_offset;
> > +     /* for internal uprobe infra use, consumers shouldn't touch fields below */
> >       struct uprobe_consumer *next;
>
>
> Well, I don't really like this patch either...
>
> If nothing else because all the consumers in uprobe->consumers list
> must have the same offset/ref_ctr_offset.

You are thinking from a per-uprobe's perspective. But during
attachment you are attaching multiple consumers at different locations
within a given inode (and that matches for consumers are already
doing, they remember those offsets in their own structs), so each
consumer has a different offset.

Again, I'm just saying that I'm codifying what uprobe users already do
and simplifying the interface (otherwise we'd need another set of
callbacks or some new struct just to pass those
offsets/ref_ctr_offset).

But we can put all that on hold if Peter's approach works well enough.
My goal is to have faster uprobes, not to land *my* patches.

>
> --------------------------------------------------------------------------
> But I agree, the ugly uprobe_register_refctr() must die, we need a single
> function
>
>         int uprobe_register(inode, offset, ref_ctr_offset, consumer);
>
> This change is trivial.
>
> --------------------------------------------------------------------------
> And speaking of cleanups, I think another change makes sense:
>
>         -       int uprobe_register(...);
>         +       struct uprobe* uprobe_register(...);
>
> so that uprobe_register() returns uprobe or ERR_PTR.
>
>         -       void uprobe_unregister(inode, offset, consumer);
>         +       void uprobe_unregister(uprobe, consumer);
>
> this way unregister() doesn't need the extra find_uprobe() + put_uprobe().
> The same for uprobe_apply().

I'm achieving this by keeping uprobe pointer inside uprobe_consumer
(and not requiring callers to keep explicit track of that)

>
> The necessary changes in kernel/trace/trace_uprobe.c are trivial, we just
> need to change struct trace_uprobe
>
>         -       struct inode                    *inode;
>         +       struct uprobe                   *uprobe;
>
> and fix the compilation errors.
>
>
> As for kernel/trace/bpf_trace.c, I guess struct bpf_uprobe  needs the new
> ->uprobe member, we can't kill bpf_uprobe->offset because of
> bpf_uprobe_multi_link_fill_link_info(), but I think this is not that bad.
>
> What do you think?

I'd add an uprobe field to uprobe_consumer, tbh, and keep callers
simpler (less aware of uprobe existence in principle). Even if we
don't do batch register/unregister APIs.

>
> Oleg.
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-08 17:47       ` Andrii Nakryiko
@ 2024-07-09 18:47         ` Oleg Nesterov
  2024-07-09 20:59           ` Andrii Nakryiko
  0 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2024-07-09 18:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On 07/08, Andrii Nakryiko wrote:
>
> On Sun, Jul 7, 2024 at 7:48 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > And I forgot to mention...
> >
> > In any case __uprobe_unregister() can't ignore the error code from
> > register_for_each_vma(). If it fails to restore the original insn,
> > we should not remove this uprobe from uprobes_tree.
> >
> > Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
> > probed application.
>
> Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> detects "improper" stack pointer progression, for example),

In this case we a) assume that user-space tries to fool the kernel and
b) the kernel can't handle this case in any case, thus uprobe_warn().

> but from
> what I gather it's not really expected to fail on unregistration given
> we successfully registered uprobe.

Not really expected, and that is why the "TODO" comment in _unregister()
was never implemented. Although the real reason is that we are lazy ;)

But register_for_each_vma(NULL) can fail. Say, simply because
kmalloc(GFP_KERNEL) in build_map_info() can fail even if it "never" should.
A lot of other reasons.

> I guess it's a decision between
> leaking memory with an uprobe stuck in the tree or killing process due
> to some very rare (or buggy) condition?

Yes. I think in this case it is better to leak uprobe than kill the
no longer probed task.

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-09 18:47         ` Oleg Nesterov
@ 2024-07-09 20:59           ` Andrii Nakryiko
  2024-07-09 21:31             ` Oleg Nesterov
  0 siblings, 1 reply; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-09 20:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On Tue, Jul 9, 2024 at 11:49 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/08, Andrii Nakryiko wrote:
> >
> > On Sun, Jul 7, 2024 at 7:48 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > And I forgot to mention...
> > >
> > > In any case __uprobe_unregister() can't ignore the error code from
> > > register_for_each_vma(). If it fails to restore the original insn,
> > > we should not remove this uprobe from uprobes_tree.
> > >
> > > Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
> > > probed application.
> >
> > Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> > detects "improper" stack pointer progression, for example),
>
> In this case we a) assume that user-space tries to fool the kernel and

Well, it's a bad assumption. User space might just be using fibers and
managing its own stack. Not saying SIGILL is good, but it's part of
the uprobe system regardless.

> b) the kernel can't handle this case in any case, thus uprobe_warn().
>
> > but from
> > what I gather it's not really expected to fail on unregistration given
> > we successfully registered uprobe.
>
> Not really expected, and that is why the "TODO" comment in _unregister()
> was never implemented. Although the real reason is that we are lazy ;)

Worked fine for 10+ years, which says something ;)

>
> But register_for_each_vma(NULL) can fail. Say, simply because
> kmalloc(GFP_KERNEL) in build_map_info() can fail even if it "never" should.
> A lot of other reasons.
>
> > I guess it's a decision between
> > leaking memory with an uprobe stuck in the tree or killing process due
> > to some very rare (or buggy) condition?
>
> Yes. I think in this case it is better to leak uprobe than kill the
> no longer probed task.

Ok, I think it's not hard to keep uprobe around if
__uprobe_unregister() fails, should be an easy addition from what I
can see.

>
> Oleg.
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-09 20:59           ` Andrii Nakryiko
@ 2024-07-09 21:31             ` Oleg Nesterov
  2024-07-09 21:45               ` Andrii Nakryiko
  0 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2024-07-09 21:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On 07/09, Andrii Nakryiko wrote:
>
> On Tue, Jul 9, 2024 at 11:49 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> > > detects "improper" stack pointer progression, for example),
> >
> > In this case we a) assume that user-space tries to fool the kernel and
>
> Well, it's a bad assumption. User space might just be using fibers and
> managing its own stack.

Do you mean something like the "go" language?

Yes, not supported. And from the kernel perspective it still looks as if
user-space tries to fool the kernel. I mean, if you insert a ret-probe,
the kernel assumes that it "owns" the stack, if nothing else the kernel
has to change the ret-address on stack.

I agree, this is not good. But again, what else the kernel can do in
this case?

> > Not really expected, and that is why the "TODO" comment in _unregister()
> > was never implemented. Although the real reason is that we are lazy ;)
>
> Worked fine for 10+ years, which says something ;)

Or may be it doesn't but we do not know because this code doesn't do
uprobe_warn() ;)

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-07-09 21:31             ` Oleg Nesterov
@ 2024-07-09 21:45               ` Andrii Nakryiko
  0 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-09 21:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On Tue, Jul 9, 2024 at 2:33 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/09, Andrii Nakryiko wrote:
> >
> > On Tue, Jul 9, 2024 at 11:49 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > > Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> > > > detects "improper" stack pointer progression, for example),
> > >
> > > In this case we a) assume that user-space tries to fool the kernel and
> >
> > Well, it's a bad assumption. User space might just be using fibers and
> > managing its own stack.
>
> Do you mean something like the "go" language?
>

No, I think it was C++ application. I think we have some uses of
fibers in which an application does its own user-space scheduling and
manages stack in user space. But it's basically the same class of
problems that you'd get with Go, yes.

> Yes, not supported. And from the kernel perspective it still looks as if
> user-space tries to fool the kernel. I mean, if you insert a ret-probe,
> the kernel assumes that it "owns" the stack, if nothing else the kernel
> has to change the ret-address on stack.
>
> I agree, this is not good. But again, what else the kernel can do in
> this case?

Not that I'm proposing this, but kernel could probably maintain a
lookup table keyed by thread stack pointer, instead of maintaining
implicit stack (but that would probably be more expensive). With some
limits and stuff this probably would work fine.

>
> > > Not really expected, and that is why the "TODO" comment in _unregister()
> > > was never implemented. Although the real reason is that we are lazy ;)
> >
> > Worked fine for 10+ years, which says something ;)
>
> Or may be it doesn't but we do not know because this code doesn't do
> uprobe_warn() ;)

sure :)

>
> Oleg.
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 01/12] uprobes: update outdated comment
  2024-07-03 11:38   ` Oleg Nesterov
  2024-07-03 18:24     ` Andrii Nakryiko
  2024-07-03 21:51     ` Andrii Nakryiko
@ 2024-07-10 13:31     ` Oleg Nesterov
  2024-07-10 15:14       ` Andrii Nakryiko
  2 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2024-07-10 13:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, peterz, mingo, bpf, jolsa,
	paulmck, clm

On 07/03, Oleg Nesterov wrote:
>
> >  	/*
> > -	 * The NULL 'tsk' here ensures that any faults that occur here
> > -	 * will not be accounted to the task.  'mm' *is* current->mm,
> > -	 * but we treat this as a 'remote' access since it is
> > -	 * essentially a kernel access to the memory.
> > +	 * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > +	 * it is essentially a kernel access to the memory.
> >  	 */
> >  	result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
>
> OK, this makes it less confusing, so
>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
>
> ---------------------------------------------------------------------
> but it still looks confusing to me. This code used to pass tsk = NULL
> only to avoid tsk->maj/min_flt++ in faultin_page().
>
> But today mm_account_fault() increments these counters without checking
> FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> just use get_user_pages() and remove this comment?

Well, yes, it still looks confusing, imo.

Andrii, I hope you won't mind if I redo/resend this and the next cleanup?

The next one only updates the comment above uprobe_write_opcode(), but
it would be nice to explain mmap_write_lock() in register_for_each_vma().

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v2 01/12] uprobes: update outdated comment
  2024-07-10 13:31     ` Oleg Nesterov
@ 2024-07-10 15:14       ` Andrii Nakryiko
  0 siblings, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2024-07-10 15:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On Wed, Jul 10, 2024 at 6:33 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/03, Oleg Nesterov wrote:
> >
> > >     /*
> > > -    * The NULL 'tsk' here ensures that any faults that occur here
> > > -    * will not be accounted to the task.  'mm' *is* current->mm,
> > > -    * but we treat this as a 'remote' access since it is
> > > -    * essentially a kernel access to the memory.
> > > +    * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > > +    * it is essentially a kernel access to the memory.
> > >      */
> > >     result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
> >
> > OK, this makes it less confusing, so
> >
> > Acked-by: Oleg Nesterov <oleg@redhat.com>
> >
> > ---------------------------------------------------------------------
> > but it still looks confusing to me. This code used to pass tsk = NULL
> > only to avoid tsk->maj/min_flt++ in faultin_page().
> >
> > But today mm_account_fault() increments these counters without checking
> > FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> > just use get_user_pages() and remove this comment?
>
> Well, yes, it still looks confusing, imo.
>
> Andrii, I hope you won't mind if I redo/resend this and the next cleanup?
>
> The next one only updates the comment above uprobe_write_opcode(), but
> it would be nice to explain mmap_write_lock() in register_for_each_vma().
>

I don't mind a bit, thanks for sending the patches!

> Oleg.
>
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2024-07-10 15:14 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-01 22:39 [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 01/12] uprobes: update outdated comment Andrii Nakryiko
2024-07-03 11:38   ` Oleg Nesterov
2024-07-03 18:24     ` Andrii Nakryiko
2024-07-03 21:51     ` Andrii Nakryiko
2024-07-10 13:31     ` Oleg Nesterov
2024-07-10 15:14       ` Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode() Andrii Nakryiko
2024-07-03 11:41   ` Oleg Nesterov
2024-07-03 13:15   ` Masami Hiramatsu
2024-07-03 18:25     ` Andrii Nakryiko
2024-07-03 21:47       ` Masami Hiramatsu
2024-07-01 22:39 ` [PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe() Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
2024-07-02 10:22   ` Peter Zijlstra
2024-07-02 17:54     ` Andrii Nakryiko
2024-07-03 13:36   ` Peter Zijlstra
2024-07-03 20:47     ` Andrii Nakryiko
2024-07-04  8:03       ` Peter Zijlstra
2024-07-04  8:45         ` Peter Zijlstra
2024-07-04 14:40           ` Masami Hiramatsu
2024-07-04  8:31       ` Peter Zijlstra
2024-07-05 15:37   ` Oleg Nesterov
2024-07-06 17:00     ` Jiri Olsa
2024-07-06 17:05       ` Jiri Olsa
2024-07-07 14:46     ` Oleg Nesterov
2024-07-08 17:47       ` Andrii Nakryiko
2024-07-09 18:47         ` Oleg Nesterov
2024-07-09 20:59           ` Andrii Nakryiko
2024-07-09 21:31             ` Oleg Nesterov
2024-07-09 21:45               ` Andrii Nakryiko
2024-07-08 17:47     ` Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Andrii Nakryiko
2024-07-03  8:13   ` Peter Zijlstra
2024-07-03 10:13     ` Masami Hiramatsu
2024-07-03 18:23       ` Andrii Nakryiko
2024-07-07 12:48   ` Oleg Nesterov
2024-07-08 17:56     ` Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register() Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 09/12] uprobes: batch uprobes_treelock during registration Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes Andrii Nakryiko
2024-07-01 22:39 ` [PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore Andrii Nakryiko
2024-07-02 10:23 ` [PATCH v2 00/12] uprobes: add batched register/unregister APIs and " Peter Zijlstra
2024-07-02 11:54   ` Peter Zijlstra
2024-07-02 12:01     ` Peter Zijlstra
2024-07-02 17:54     ` Andrii Nakryiko
2024-07-02 19:18       ` Peter Zijlstra
2024-07-02 23:56         ` Paul E. McKenney
2024-07-03  4:54           ` Andrii Nakryiko
2024-07-03  7:50           ` Peter Zijlstra
2024-07-03 14:08             ` Paul E. McKenney
2024-07-04  8:39               ` Peter Zijlstra
2024-07-04 15:13                 ` Paul E. McKenney
2024-07-03 21:57             ` Steven Rostedt
2024-07-03 22:07               ` Paul E. McKenney
2024-07-03  4:47         ` Andrii Nakryiko
2024-07-03  8:07           ` Peter Zijlstra
2024-07-03 20:55             ` Andrii Nakryiko
2024-07-03 21:33 ` Andrii Nakryiko
2024-07-04  9:15   ` Peter Zijlstra
2024-07-04 13:56     ` Steven Rostedt
2024-07-04 15:44     ` Paul E. McKenney
2024-07-08 17:47       ` Andrii Nakryiko
2024-07-08 17:48     ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).