From: Dmitry Vyukov <dvyukov@google.com>
To: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Frederic Weisbecker <frederic@kernel.org>,
Ingo Molnar <mingo@kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>,
Michael Ellerman <mpe@ellerman.id.au>,
linuxppc-dev@lists.ozlabs.org, linux-perf-users@vger.kernel.org,
x86@kernel.org, linux-sh@vger.kernel.org,
kasan-dev@googlegroups.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 03/13] perf/hw_breakpoint: Optimize list of per-task breakpoints
Date: Tue, 28 Jun 2022 17:27:37 +0200
Message-ID: <CACT4Y+YQibtAk0y=SVTSp27Ythjk4c1jCV2_BNAL5Uiw-fMo_w@mail.gmail.com>
In-Reply-To: <CANpmjNOT=npm9Bu9QGNO=SgCJVB2fr8ojO4-u-Ffgw4gmRuSfw@mail.gmail.com>

On Tue, 28 Jun 2022 at 16:54, Marco Elver <elver@google.com> wrote:
> > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > benchmark results in:
> > >
> > > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > | # Running 'breakpoint/thread' benchmark:
> > > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > | Total time: 236.418 [sec]
> > > |
> > > | 123134.794271 usecs/op
> > > | 7880626.833333 usecs/op/cpu
> > >
> > > The benchmark tests inherited breakpoint perf events across many
> > > threads.
> > >
> > > Looking at a perf profile, we can see that the majority of the time is
> > > spent in various hw_breakpoint.c functions, which execute within the
> > > 'nr_bp_mutex' critical sections, which in turn results in contention on
> > > that mutex as well:
> > >
> > > 37.27% [kernel] [k] osq_lock
> > > 34.92% [kernel] [k] mutex_spin_on_owner
> > > 12.15% [kernel] [k] toggle_bp_slot
> > > 11.90% [kernel] [k] __reserve_bp_slot
> > >
> > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > O(#tasks) due to storing all task breakpoints in the same list and
> > > iterating through that list looking for a matching task. Clearly, this
> > > does not scale to thousands of tasks.
> > >
> > > Instead, make use of the "rhashtable" variant "rhltable", which stores
> > > multiple items with the same key in a list. This results in an average
> > > runtime complexity of O(1) for task_bp_pinned().
> > >
> > > With the optimization, the benchmark shows:
> > >
> > > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > | # Running 'breakpoint/thread' benchmark:
> > > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > | Total time: 0.208 [sec]
> > > |
> > > | 108.422396 usecs/op
> > > | 6939.033333 usecs/op/cpu
> > >
> > > On this particular setup that's a speedup of ~1135x.
> > >
> > > While one option would be to embed a breakpoint list node in
> > > task_struct, this would only further bloat task_struct with
> > > infrequently used data.
> > > Furthermore, after all optimizations in this series, there's no evidence
> > > it would result in better performance: later optimizations make the time
> > > spent looking up entries in the hash table negligible (we'll reach the
> > > theoretical ideal performance i.e. no constraints).
> > >
> > > Signed-off-by: Marco Elver <elver@google.com>
> > > ---
> > > v2:
> > > * Commit message tweaks.
> > > ---
> > > include/linux/perf_event.h | 3 +-
> > > kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
> > > 2 files changed, 37 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > > index 01231f1d976c..e27360436dc6 100644
> > > --- a/include/linux/perf_event.h
> > > +++ b/include/linux/perf_event.h
> > > @@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
> > > };
> > >
> > > #ifdef CONFIG_HAVE_HW_BREAKPOINT
> > > +#include <linux/rhashtable-types.h>
> > > #include <asm/hw_breakpoint.h>
> > > #endif
> > >
> > > @@ -178,7 +179,7 @@ struct hw_perf_event {
> > > * creation and event initialization.
> > > */
> > > struct arch_hw_breakpoint info;
> > > - struct list_head bp_list;
> > > + struct rhlist_head bp_list;
> > > };
> > > #endif
> > > struct { /* amd_iommu */
> > > diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> > > index 1b013968b395..add1b9c59631 100644
> > > --- a/kernel/events/hw_breakpoint.c
> > > +++ b/kernel/events/hw_breakpoint.c
> > > @@ -26,10 +26,10 @@
> > > #include <linux/irqflags.h>
> > > #include <linux/kdebug.h>
> > > #include <linux/kernel.h>
> > > -#include <linux/list.h>
> > > #include <linux/mutex.h>
> > > #include <linux/notifier.h>
> > > #include <linux/percpu.h>
> > > +#include <linux/rhashtable.h>
> > > #include <linux/sched.h>
> > > #include <linux/slab.h>
> > >
> > > @@ -54,7 +54,13 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
> > > }
> > >
> > > /* Keep track of the breakpoints attached to tasks */
> > > -static LIST_HEAD(bp_task_head);
> > > +static struct rhltable task_bps_ht;
> > > +static const struct rhashtable_params task_bps_ht_params = {
> > > + .head_offset = offsetof(struct hw_perf_event, bp_list),
> > > + .key_offset = offsetof(struct hw_perf_event, target),
> > > + .key_len = sizeof_field(struct hw_perf_event, target),
> > > + .automatic_shrinking = true,
> > > +};
> > >
> > > static int constraints_initialized;
> > >
> > > @@ -103,17 +109,23 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
> > > */
> > > static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
> > > {
> > > - struct task_struct *tsk = bp->hw.target;
> > > + struct rhlist_head *head, *pos;
> > > struct perf_event *iter;
> > > int count = 0;
> > >
> > > - list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
> > > - if (iter->hw.target == tsk &&
> > > - find_slot_idx(iter->attr.bp_type) == type &&
> > > + rcu_read_lock();
> > > + head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
> > > + if (!head)
> > > + goto out;
> > > +
> > > + rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
> > > + if (find_slot_idx(iter->attr.bp_type) == type &&
> > > (iter->cpu < 0 || cpu == iter->cpu))
> > > count += hw_breakpoint_weight(iter);
> > > }
> > >
> > > +out:
> > > + rcu_read_unlock();
> > > return count;
> > > }
> > >
> > > @@ -186,7 +198,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
> > > /*
> > > * Add/remove the given breakpoint in our constraint table
> > > */
> > > -static void
> > > +static int
> > > toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
> > > int weight)
> > > {
> > > @@ -199,7 +211,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
> > > /* Pinned counter cpu profiling */
> > > if (!bp->hw.target) {
> > > get_bp_info(bp->cpu, type)->cpu_pinned += weight;
> > > - return;
> > > + return 0;
> > > }
> > >
> > > /* Pinned counter task profiling */
> > > @@ -207,9 +219,9 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
> > > toggle_bp_task_slot(bp, cpu, type, weight);
> > >
> > > if (enable)
> > > - list_add_tail(&bp->hw.bp_list, &bp_task_head);
> > > + return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
> > > else
> > > - list_del(&bp->hw.bp_list);
> > > + return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
> > > }
> > >
> > > __weak int arch_reserve_bp_slot(struct perf_event *bp)
> > > @@ -307,9 +319,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
> > > if (ret)
> > > return ret;
> > >
> > > - toggle_bp_slot(bp, true, type, weight);
> > > -
> > > - return 0;
> > > + return toggle_bp_slot(bp, true, type, weight);
> > > }
> > >
> > > int reserve_bp_slot(struct perf_event *bp)
> > > @@ -334,7 +344,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
> > >
> > > type = find_slot_idx(bp_type);
> > > weight = hw_breakpoint_weight(bp);
> > > - toggle_bp_slot(bp, false, type, weight);
> > > + WARN_ON(toggle_bp_slot(bp, false, type, weight));
> > > }
> > >
> > > void release_bp_slot(struct perf_event *bp)
> > > @@ -678,7 +688,7 @@ static struct pmu perf_breakpoint = {
> > > int __init init_hw_breakpoint(void)
> > > {
> > > int cpu, err_cpu;
> > > - int i;
> > > + int i, ret;
> > >
> > > for (i = 0; i < TYPE_MAX; i++)
> > > nr_slots[i] = hw_breakpoint_slots(i);
> > > @@ -689,18 +699,24 @@ int __init init_hw_breakpoint(void)
> > >
> > > info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
> > > GFP_KERNEL);
> > > - if (!info->tsk_pinned)
> > > - goto err_alloc;
> > > + if (!info->tsk_pinned) {
> > > + ret = -ENOMEM;
> > > + goto err;
> > > + }
> > > }
> > > }
> > >
> > > + ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
> > > + if (ret)
> > > + goto err;
> > > +
> > > constraints_initialized = 1;
> > >
> > > perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
> > >
> > > return register_die_notifier(&hw_breakpoint_exceptions_nb);
> >
> > It seems there is a latent bug here:
> > if register_die_notifier() fails, we also need to execute the err: label code.
>
> I think we should ignore it, because it's just a notifier for when the
> kernel dies. I'd rather have working breakpoints (which we have if we
> made it to this point) while the kernel is live, and accept some bad
> behaviour when the kernel dies.

I don't have a strong opinion either way. If ignoring failures from such
functions is acceptable practice, this sounds fine.
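
For reference, handling it would be something like this minimal sketch
(hypothetical, not part of the patch; it assumes the existing err: label,
whose hunk isn't quoted above, unwinds the per-CPU tsk_pinned allocations):

	/* Propagate a register_die_notifier() failure to the caller. */
	ret = register_die_notifier(&hw_breakpoint_exceptions_nb);
	if (ret)
		goto err;

	return 0;
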
> > Otherwise the patch looks good.
> >
> > Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
>
> Thanks,
> -- Marco