public inbox for linux-kernel@vger.kernel.org
From: Namhyung Kim <namhyung@kernel.org>
To: Dmitry Vyukov <dvyukov@google.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>,
	Ian Rogers <irogers@google.com>,
	Kan Liang <kan.liang@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-perf-users@vger.kernel.org, Andi Kleen <ak@linux.intel.com>
Subject: Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
Date: Wed, 7 May 2025 16:43:16 -0700	[thread overview]
Message-ID: <aBvwFPRwA2LVQJkO@google.com> (raw)
In-Reply-To: <CACT4Y+aiU-dHVgTKEpyJtn=RUUyYJp8U5BjyWSOHm6b2ODp9cA@mail.gmail.com>

On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > >
> > > > Basically yes, but you can use --sample-cpu to add it.
> > >
> > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> >
> > Yep, that's why I said "Basically".  So it's not a 100% guarantee.
> >
> > We may disable the latency column by default in this case and show a
> > warning if it's requested.  Or we may add a new attribute to emit sched-switch
> > records only for idle tasks and enable the latency report only if the
> > data has sched-switch records.
> >
> > What do you think?
> 
> Depends on what problem we are trying to solve:
> 
> 1. Enabling latency profiling for system-wide mode.
> 
> 2. Switch events bloating trace too much.
> 
> 3. Lost switch events lead to imprecise accounting.
> 
> The patch mentions all 3 :)
> But I think 2 and 3 are not really specific to system-wide mode.
> An active single process profile can emit more samples than a
> system-wide profile on a lightly loaded system.

True.  But we don't need to care about lightly loaded systems as they
won't cause problems.


> Similarly, if we rely on switch events for system-wide mode, then it's
> equally subject to the lost events problem.

Right, but I'm afraid that in practice it'll increase the chance of
lost events in system-wide mode.  The default sample size for
system-wide mode is 56 bytes and a switch record is 48 bytes.  And
while the default sample frequency is 4000 Hz, the rate of switch
events cannot be controlled.  I saw around 10000 switches per second
per CPU on my work environment.
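
For a rough sense of the volume, here is the arithmetic as a small
Python sketch (the sizes and rates are the approximate figures quoted
above, not exact for every configuration):

```python
# Back-of-envelope comparison of per-CPU data rates for one busy CPU.
# Figures are the approximate values discussed above.
SAMPLE_BYTES = 56    # default system-wide sample size
SWITCH_BYTES = 48    # context-switch record size
SAMPLE_HZ = 4000     # default sample frequency (user-controllable)
SWITCH_HZ = 10000    # observed switch rate per CPU (not controllable)

sample_rate = SAMPLE_BYTES * SAMPLE_HZ   # bytes/sec from samples
switch_rate = SWITCH_BYTES * SWITCH_HZ   # bytes/sec from switch records

print(sample_rate)  # 224000
print(switch_rate)  # 480000
```

So on a busy CPU the switch records alone can more than double the
per-CPU data rate, which is why lost events become more likely.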

> 
> For problem 1: we can just permit --latency for system wide mode and
> fully rely on switch events.
> It's not any worse than we do now (wrt both profile size and lost events).

This can be an option and it'd work well on lightly loaded systems.
Maybe we can just try it first.  But I think it's better to also have
a way to make it work on heavily loaded systems.

> 
> For problem 2: yes, we could emit only switches to idle tasks. Or
> maybe just a fake CPU sample for an idle task? That's effectively what
> we want, then your current accounting code will work w/o any changes.
> This should help wrt trace size only for system-wide mode (provided
> that user already enables CPU accounting for other reasons, otherwise
> it's unclear what's better -- attaching CPU to each sample, or writing
> switch events).

I'm not sure how we can add the fake samples.  The switch events come
from the kernel, so maybe we can add the condition (emitting them only
for idle tasks) to the event attribute.

And PERF_SAMPLE_CPU is on by default in system-wide mode.

> 
> For problem 3: switches to idle task won't really help. There can be
> lots of them, and missing any will lead to wrong accounting.

I don't know how severe the situation will be.  On heavily loaded
systems, the idle task won't run much so the data size won't increase.
On lightly loaded systems, the increased data will likely be handled
well.


> A principled approach would be to attach a per-thread scheduler
> quantum sequence number to each CPU sample. The sequence number would
> be incremented on every context switch. Then any subset of CPU should
> be enough to understand when a task was scheduled in and out
> (scheduled in on the first CPU sample with sequence number N, and
> switched out on the last sample with sequence number N).

I'm not sure how it can help.  We don't need the switch info itself.
What's needed is when the CPU was idle, right?
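
For what it's worth, a toy sketch of how the sequence numbers you
describe could be decoded (hypothetical, not from the patch): if every
CPU sample carried (tid, seq, timestamp), any surviving subset of
samples would bound each scheduling quantum, and the gaps between
quanta on a CPU would be the idle time.

```python
# Hypothetical decoder for per-thread scheduler-quantum sequence numbers.
# Samples sharing the same (tid, seq) belong to one on-CPU quantum, so the
# first/last timestamps bound when the task was scheduled in and out,
# even if some samples in between were lost.
def quanta(samples):
    """samples: iterable of (tid, seq, ts) tuples.
    Returns {(tid, seq): (first_ts, last_ts)}."""
    bounds = {}
    for tid, seq, ts in samples:
        key = (tid, seq)
        if key in bounds:
            lo, hi = bounds[key]
            bounds[key] = (min(lo, ts), max(hi, ts))
        else:
            bounds[key] = (ts, ts)
    return bounds

samples = [(1, 7, 100), (1, 7, 110), (2, 3, 105), (1, 7, 130), (1, 8, 200)]
bounds = quanta(samples)
print(bounds[(1, 7)])  # (100, 130)
print(bounds[(1, 8)])  # (200, 200)
```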

Thanks,
Namhyung


Thread overview: 29+ messages
2025-05-03  0:36 [RFC/PATCH] perf report: Support latency profiling in system-wide mode Namhyung Kim
2025-05-04  8:22 ` Ingo Molnar
2025-05-04 19:52   ` Namhyung Kim
2025-05-05  8:03     ` Dmitry Vyukov
2025-05-06 14:57       ` Arnaldo Carvalho de Melo
2025-05-06 16:58         ` Arnaldo Carvalho de Melo
2025-05-07  9:58           ` Dmitry Vyukov
2025-05-07 15:47             ` Arnaldo Carvalho de Melo
2025-05-07 23:51               ` Namhyung Kim
2025-05-05  8:08 ` Dmitry Vyukov
2025-05-06  5:30   ` Namhyung Kim
2025-05-06  5:55     ` Dmitry Vyukov
2025-05-06  6:43       ` Namhyung Kim
2025-05-06  6:46         ` Dmitry Vyukov
2025-05-06  7:09           ` Namhyung Kim
2025-05-06  7:40             ` Dmitry Vyukov
2025-05-07 23:43               ` Namhyung Kim [this message]
2025-05-08 12:24                 ` Dmitry Vyukov
2025-05-16 16:33                   ` Namhyung Kim
2025-05-19  6:00                     ` Dmitry Vyukov
2025-05-20  1:43                       ` Namhyung Kim
2025-05-20  6:45                         ` Dmitry Vyukov
2025-05-20 22:50                           ` Namhyung Kim
2025-05-21  7:30                             ` Dmitry Vyukov
2025-05-27  7:14                               ` Dmitry Vyukov
2025-05-28 18:38                                 ` Namhyung Kim
2025-05-30  5:50                                   ` Dmitry Vyukov
2025-05-30 22:05                                     ` Namhyung Kim
2025-05-31  6:31                                       ` Dmitry Vyukov
