From: Peter Zijlstra <peterz@infradead.org>
To: Andrii Nakryiko <andrii@kernel.org>
Cc: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org,
mhiramat@kernel.org, oleg@redhat.com, mingo@redhat.com,
bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org,
clm@meta.com
Subject: Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
Date: Tue, 2 Jul 2024 12:22:54 +0200
Message-ID: <20240702102254.GF11386@noisy.programming.kicks-ass.net>
In-Reply-To: <20240701223935.3783951-5-andrii@kernel.org>
On Mon, Jul 01, 2024 at 03:39:27PM -0700, Andrii Nakryiko wrote:
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 23449a8c5e7e..560cf1ca512a 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>
> struct uprobe {
> struct rb_node rb_node; /* node in the rb tree */
> - refcount_t ref;
> + atomic64_t ref; /* see UPROBE_REFCNT_GET below */
> struct rw_semaphore register_rwsem;
> struct rw_semaphore consumer_rwsem;
> + struct rcu_head rcu;
> struct list_head pending_list;
> struct uprobe_consumer *consumers;
> struct inode *inode; /* Also hold a ref to inode */
> @@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> *(uprobe_opcode_t *)&auprobe->insn);
> }
>
> -static struct uprobe *get_uprobe(struct uprobe *uprobe)
> +/*
> + * Uprobe's 64-bit refcount is actually two independent counters co-located in
> + * a single u64 value:
> + *   - lower 32 bits are a normal refcount, which is incremented and
> + *     decremented on get and put, respectively, just as a normal refcount
> + *     would be;
> + * - upper 32 bits are a tag (or epoch, if you will), which is always
> + * incremented by one, no matter whether get or put operation is done.
> + *
> + * This upper counter is meant to distinguish between:
> + * - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
> + *   - while another CPU meanwhile continues with a 0 -> 1 -> 0 refcnt
> + * sequence, also proceeding to "destruction".
> + *
> + * In both cases refcount drops to zero, but in one case it will have epoch N,
> + * while the second drop to zero will have a different epoch N + 2, allowing
> + * first destructor to bail out because epoch changed between refcount going
> + * to zero and put_uprobe() taking uprobes_treelock (under which overall
> + * 64-bit refcount is double-checked, see put_uprobe() for details).
> + *
> + * Lower 32-bit counter is not meant to overflow, while it's expected
So refcount_t very explicitly handles both overflow and underflow and
screams bloody murder if they happen. Your thing does not..
> + * that upper 32-bit counter will overflow occasionally. Note, though, that we
> + * can't allow upper 32-bit counter to "bleed over" into lower 32-bit counter,
> + * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
> + * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
> + * epoch effectively a 31-bit counter with highest bit used as a flag to
> + * perform a fix-up. This ensures epoch and refcnt parts do not "interfere".
> + *
> + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> + * epoch and refcnt parts atomically with one atomic_add().
> + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> + * *increment* epoch part.
> + */
> +#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x0000000100000001LL */
> +#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0x00000000ffffffffLL */
> +
> +/*
> + * Caller has to make sure that:
> + * a) either uprobe's refcnt is positive before this call;
> + * b) or uprobes_treelock is held (doesn't matter if for read or write),
> + * preventing uprobe's destructor from removing it from uprobes_tree.
> + *
> + * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
> + * it detects that its refcount went back to being positive again in between
> + * it dropping to zero at some point and the (potentially delayed) destructor
> + * callback actually running.
> + */
> +static struct uprobe *__get_uprobe(struct uprobe *uprobe)
> {
> - refcount_inc(&uprobe->ref);
> + s64 v;
> +
> + v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
Distinct lack of u32 overflow testing here..
> +
> + /*
> + * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> + * we don't retry because there is another CPU that just managed to
> + * update refcnt and will attempt the same "fix up". Eventually one of
> + * them will succeed in clearing the highest bit.
> + */
> + if (unlikely(v < 0))
> + (void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> +
> return uprobe;
> }
> static void put_uprobe(struct uprobe *uprobe)
> {
> - if (refcount_dec_and_test(&uprobe->ref)) {
> + s64 v;
> +
> + /*
> + * here uprobe instance is guaranteed to be alive, so we use Tasks
> + * Trace RCU to guarantee that uprobe won't be freed from under us, if
What's wrong with normal RCU?
> + * we end up being a losing "destructor" inside uprobes_treelock'ed
> + * section double-checking uprobe->ref value below.
> + * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> + */
> + rcu_read_lock_trace();
> +
> + v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
No underflow handling... because nobody ever had a double put bug.
> + if (unlikely((u32)v == 0)) {
> + bool destroy;
> +
> + write_lock(&uprobes_treelock);
> + /*
> + * We might race with find_uprobe()->__get_uprobe() executed
> + * from inside read-locked uprobes_treelock, which can bump
> + * refcount from zero back to one, after we got here. Even
> + * worse, it's possible for another CPU to do 0 -> 1 -> 0
> + * transition between this CPU doing atomic_add() and taking
> + * uprobes_treelock. In either case this CPU should bail out
> + * and not proceed with destruction.
> + *
> + * So now that we have exclusive write lock, we double check
> + * the total 64-bit refcount value, which includes the epoch.
> + * If nothing changed (i.e., epoch is the same and refcnt is
> + * still zero), we are good and we proceed with the clean up.
> + *
> + * But if it managed to be updated back at least once, we just
> + * pretend it never went to zero. If lower 32-bit refcnt part
> + * drops to zero again, another CPU will proceed with
> + * destruction, due to the more up-to-date epoch.
> + */
> + destroy = atomic64_read(&uprobe->ref) == v;
> + if (destroy && uprobe_is_active(uprobe))
> + rb_erase(&uprobe->rb_node, &uprobes_tree);
> + write_unlock(&uprobes_treelock);
> +
> + /*
> + * Beyond here we don't need RCU protection, we are either the
> + * winning destructor and we control the rest of uprobe's
> + * lifetime; or we lost and we are bailing without accessing
> + * uprobe fields anymore.
> + */
> + rcu_read_unlock_trace();
> +
> + /* uprobe got resurrected, pretend we never tried to free it */
> + if (!destroy)
> + return;
> +
> /*
> * If application munmap(exec_vma) before uprobe_unregister()
> * gets called, we don't get a chance to remove uprobe from
> @@ -604,8 +728,21 @@ static void put_uprobe(struct uprobe *uprobe)
> mutex_lock(&delayed_uprobe_lock);
> delayed_uprobe_remove(uprobe, NULL);
> mutex_unlock(&delayed_uprobe_lock);
> - kfree(uprobe);
> +
> + call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
> + return;
> }
> +
> + /*
> + * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> + * we don't retry because there is another CPU that just managed to
> + * update refcnt and will attempt the same "fix up". Eventually one of
> + * them will succeed in clearing the highest bit.
> + */
> + if (unlikely(v < 0))
> + (void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> +
> + rcu_read_unlock_trace();
> }