public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Joel Fernandes <joel@joelfernandes.org>
To: Qais Yousef <qais.yousef@arm.com>
Cc: "Valentin Schneider" <valentin.schneider@arm.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	linux-kernel@vger.kernel.org, "Ingo Molnar" <mingo@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Borislav Petkov" <bp@alien8.de>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Jirka Hladký" <jhladky@redhat.com>,
	"Jiří Vozár" <jvozar@redhat.com>,
	x86@kernel.org
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
Date: Wed, 4 Sep 2019 09:06:28 -0400	[thread overview]
Message-ID: <20190904130628.GE144846@google.com> (raw)
In-Reply-To: <20190904104332.ogsjtbtuadhsglxh@e107158-lin.cambridge.arm.com>

On Wed, Sep 04, 2019 at 11:43:33AM +0100, Qais Yousef wrote:
> On 09/04/19 00:23, Joel Fernandes wrote:
> > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > > custom data gathering points to better understand what was going on in
> > > > the scheduler.
> > > > Red Hat adapted one of them for the tracepoint framework and created a
> > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > > tracepoint is being used for fine grained monitoring of scheduling
> > > > imbalance.
> > > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > > > 
> > > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > > which requires some shenanigans to make it work as they are defined
> > > > inside sched.h.
> > > > The tracepoints have to be included from sched.h, which means that
> > > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > > > 
> > > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > > headers, because they don't include dependecies properly and expect
> > > > sched.h to do it, so it is simpler to keep sched.h there and
> > > > preventively undefine CREATE_TRACE_POINTS right after.
> > > > 
> > > > Exports of the pelt tracepoints remain because they don't need to be
> > > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > > unsightly.
> > > > 
> > > 
> > > Pure trace events are frowned upon in scheduler world, try going with
> > > trace points. Qais did something very similar recently:
> > > 
> > > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.yousef@arm.com/
> > > 
> > > You'll have to implement the associated trace events in a module, which
> > > lets you define your own event format and doesn't form an ABI :).
> > 
> > Is that really true? eBPF programs loaded from userspace can access
> > tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
> > https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103
> > 
> > I don't have a strong opinion about considering tracepoints as ABI / API or
> > not, but just want to get the facts straight :)
> 
> It is actually true.
>
> But you need to make the distinction between a tracepoint
> and a trace event first.

I know this distinction well.

> What Valentin is talking about here is the *bare*
> tracepoint without any event associated with them like the one I added to the
> scheduler recently. These ones are not accessible via eBPF, unless something
> has changed since I last tried.

Can this tracepoint be registered on with tracepoint_probe_register()?
Quickly looking at these new tracepoint, they can be otherwise how would they
even work right? If so, then eBPF can very well access it. Look at
__bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
BPF_RAW_TRACEPOINT_OPEN.

> The current infrastructure needs to be expanded to allow eBPF to attach these
> bare tracepoints. Something similar to what I have in [1] is needed - but
> instead of creating a new macro it needs to expand the current macro. [2] might
> give full context of when I was trying to come up with alternatives to using
> trace events.
> 
> [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> [2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45v6k@e107158-lin.cambridge.arm.com/


As I was mentioning, tracepoints, not "trace events" can already be opened
directly with BPF. I don't see how these new tracepoints are different.

I wonder if this distinction of "tracepoint" being non-ABI can be documented
somewhere. I would be happy to do that if there is a place for the same. I
really want some general "policy" in the kernel on where we draw a line in
the sand with respect to tracepoints and ABI :).

For instance, perhaps VFS can also start having non-ABI tracepoints for the
benefit of people tracing the VFS.

thanks,

 - Joel


  reply	other threads:[~2019-09-04 13:06 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-03 15:43 [PATCH 0/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
2019-09-03 15:43 ` [PATCH 1/2] x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h Radim Krčmář
2019-09-03 15:43 ` [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
2019-09-03 16:05   ` Valentin Schneider
2019-09-04  4:23     ` Joel Fernandes
2019-09-04  8:14       ` Peter Zijlstra
2019-09-04 10:52         ` Qais Yousef
2019-09-04 13:11         ` Joel Fernandes
2019-09-04 15:26           ` Alexei Starovoitov
2019-09-04 15:33             ` Joel Fernandes
2019-09-04 15:37               ` Alexei Starovoitov
2019-09-04 15:41                 ` Joel Fernandes
2019-09-04 10:43       ` Qais Yousef
2019-09-04 13:06         ` Joel Fernandes [this message]
2019-09-04 14:20           ` Qais Yousef
2019-09-04 14:41             ` Joel Fernandes
2019-09-04 14:57               ` Qais Yousef
2019-09-04 15:46                 ` Joel Fernandes
2019-09-04 15:25           ` Alexei Starovoitov
2019-09-04 15:40             ` Joel Fernandes
2019-09-04 15:51               ` Alexei Starovoitov
2019-09-04 17:47                 ` Peter Zijlstra
2019-09-04 17:53                   ` Alexei Starovoitov
2019-09-05  8:13                     ` Ingo Molnar
2019-09-05 16:49                       ` Alexei Starovoitov
2019-09-04  6:55     ` chengjian (D)
2019-09-04 13:13   ` Peter Zijlstra
2019-09-04 13:21     ` Valentin Schneider
2019-09-04 14:37   ` Qais Yousef
2019-09-04 17:48     ` Peter Zijlstra
2019-09-09 10:59       ` Qais Yousef

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190904130628.GE144846@google.com \
    --to=joel@joelfernandes.org \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=hpa@zytor.com \
    --cc=jhladky@redhat.com \
    --cc=jvozar@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=qais.yousef@arm.com \
    --cc=rkrcmar@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=valentin.schneider@arm.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox