public inbox for linux-perf-users@vger.kernel.org
From: Swapnil Sapkal <swapnil.sapkal@amd.com>
To: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: <ravi.bangoria@amd.com>, <yu.c.chen@intel.com>,
	<mark.rutland@arm.com>, <alexander.shishkin@linux.intel.com>,
	<jolsa@kernel.org>, <rostedt@goodmis.org>,
	<vincent.guittot@linaro.org>, <adrian.hunter@intel.com>,
	<kan.liang@linux.intel.com>, <gautham.shenoy@amd.com>,
	<kprateek.nayak@amd.com>, <juri.lelli@redhat.com>,
	<yangjihong@bytedance.com>, <void@manifault.com>, <tj@kernel.org>,
	<ctshao@google.com>, <quic_zhonhan@quicinc.com>,
	<thomas.falcon@intel.com>, <blakejones@google.com>,
	<ashelat@redhat.com>, <leo.yan@arm.com>, <dvyukov@google.com>,
	<ak@linux.intel.com>, <yujie.liu@intel.com>,
	<graham.woodward@arm.com>, <ben.gainey@arm.com>,
	<vineethr@linux.ibm.com>, <tim.c.chen@linux.intel.com>,
	<linux@treblig.org>, <santosh.shukla@amd.com>,
	<sandipan.das@amd.com>, <linux-kernel@vger.kernel.org>,
	<linux-perf-users@vger.kernel.org>, <peterz@infradead.org>,
	<mingo@redhat.com>, <acme@kernel.org>, <namhyung@kernel.org>,
	<irogers@google.com>, <james.clark@arm.com>
Subject: Re: [PATCH v5 00/10] perf sched: Introduce stats tool
Date: Fri, 23 Jan 2026 21:49:45 +0530	[thread overview]
Message-ID: <0778a20a-00cb-4a90-9e8e-99ff033dc23d@amd.com> (raw)
In-Reply-To: <ae2b9422-d203-4fc2-ab52-e0fd86e93f39@linux.ibm.com>

Hi Shrikanth,

On 21-01-2026 23:22, Shrikanth Hegde wrote:
> 
> 
> On 1/19/26 11:28 PM, Swapnil Sapkal wrote:
>> MOTIVATION
>> ----------
>>
>> Existing `perf sched` is quite exhaustive and provides a lot of insight
>> into scheduler behavior, but it quickly becomes impractical to use for
>> long-running or scheduler-intensive workloads. For example, `perf sched
>> record` has ~7.77% overhead on hackbench (with 25 groups each running
>> 700K loops on a 2-socket 128-core 256-thread 3rd Generation EPYC
>> server), and it generates a huge 56G perf.data file, which perf takes
>> ~137 mins to prepare and write to disk [1].
>>
>> Unlike `perf sched record`, which hooks onto a set of scheduler
>> tracepoints and generates samples on a tracepoint hit, `perf sched
>> stats record` takes a snapshot of the /proc/schedstat file before and
>> after the workload, i.e. there is almost zero interference with the
>> workload run. Also, it takes very little time to parse /proc/schedstat,
>> convert it into perf samples, and save those samples into a perf.data
>> file. The resulting perf.data file is much smaller, so overall `perf
>> sched stats record` is much more lightweight compared to `perf sched
>> record`.
>>
>> We, internally at AMD, have been using this (a variant of it, known as
>> "sched-scoreboard"[2]) and found it very useful for analysing the
>> impact of scheduler code changes[3][4]. Prateek used v2[5] of this
>> patch series to report the analysis[6][7].
>>
>> Please note that this is not a replacement for perf sched record/report.
>> The intended users of the new tool are scheduler developers, not regular
>> users.
>>
>> USAGE
>> -----
>>
>>    # perf sched stats record
>>    # perf sched stats report
>>    # perf sched stats diff
>>
>> Note: Although the `perf sched stats` tool supports the workload
>> profiling syntax (i.e. -- <workload>), the recorded profile is still
>> systemwide, since /proc/schedstat is a systemwide file.
>>
>> HOW TO INTERPRET THE REPORT
>> ---------------------------
>>
>> The `perf sched stats report` starts with a description of the columns
>> present in the report. These column names are given before the CPU and
>> domain stats to improve the readability of the report.
>>
>>    ----------------------------------------------------------------------------------------------------
>>    DESC                    -> Description of the field
>>    COUNT                   -> Value of the field
>>    PCT_CHANGE              -> Percent change with corresponding base value
>>    AVG_JIFFIES             -> Avg time in jiffies between two consecutive occurrences of the event
>>    ----------------------------------------------------------------------------------------------------
>>
>> Next is the total profiling time in terms of jiffies:
>>
>>    ----------------------------------------------------------------------------------------------------
>>    Time elapsed (in jiffies)                                        :       24537
>>    ----------------------------------------------------------------------------------------------------
>>
> 
> nit:
> Is there a way to export HZ value too here?

To my knowledge, this value can be read from /proc/config.gz, which 
depends on 'CONFIG_IKCONFIG_PROC' being enabled.

Peter, is it okay to export the HZ value through a debugfs file, say 
something like '/sys/kernel/debug/sched/hz_value'? Though I am not sure 
if this is useful for anything else.
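For reference, a small sketch (illustrative only, not part of this
series) of how a userspace tool could read CONFIG_HZ from
/proc/config.gz when CONFIG_IKCONFIG_PROC is enabled; `parse_config_hz`
and `read_proc_config_hz` are hypothetical helper names:

```python
import gzip
import re


def parse_config_hz(config_text: str):
    """Extract the CONFIG_HZ value from kernel config text, or None."""
    # Anchor the match so CONFIG_HZ_250=y and friends are not picked up.
    m = re.search(r"^CONFIG_HZ=(\d+)$", config_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def read_proc_config_hz(path="/proc/config.gz"):
    """Read HZ from /proc/config.gz; returns None if the file is absent
    (i.e. CONFIG_IKCONFIG_PROC is not enabled on this kernel)."""
    try:
        with gzip.open(path, "rt") as f:
            return parse_config_hz(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print(read_proc_config_hz())
```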

> 
>> Next are the CPU scheduling statistics. These are simple diffs of the
>> /proc/schedstat CPU lines along with descriptions. The report also
>> prints % relative to a base stat.
>>
>> In the example below, schedule() left CPU0 idle 36.58% of the time.
>> 0.45% of all try_to_wake_up() calls were to wake up the local CPU. And
>> the total wait time of tasks on CPU0 is 48.70% of the total runtime of
>> tasks on the same CPU.
>>
>>    ----------------------------------------------------------------------------------------------------
>>    CPU 0
>>    ----------------------------------------------------------------------------------------------------
>>    DESC                                                                     COUNT   PCT_CHANGE
>>    ----------------------------------------------------------------------------------------------------
>>    yld_count                                                        :           0
>>    array_exp                                                        :           0
>>    sched_count                                                      :      402267
>>    sched_goidle                                                     :      147161  (    36.58% )
>>    ttwu_count                                                       :      236309
>>    ttwu_local                                                       :        1062  (     0.45% )
>>    rq_cpu_time                                                      :  7083791148
>>    run_delay                                                        :  3449973971  (    48.70% )
>>    pcount                                                           :      255035
>>    ----------------------------------------------------------------------------------------------------
>>
>> Next are the load-balancing statistics. For each of the sched domains
>> (eg: `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
>> the following three categories:
>>
>>    1) Idle Load Balance: Load balancing performed on behalf of a long
>>                          idling CPU by some other CPU.
>>    2) Busy Load Balance: Load balancing performed when the CPU was busy.
>>    3) New Idle Balance : Load balancing performed when a CPU just became
>>                          idle.
>>
>> Under each of these three categories, the sched stats report provides
>> different load-balancing statistics. Along with the direct stats, the
>> report also contains derived metrics prefixed with *. Example:
>>
>>    ----------------------------------------------------------------------------------------------------
>>    CPU: 0 | DOMAIN: SMT | DOMAIN_CPUS: 0,64
>>    ----------------------------------------------------------------------------------------------------
>>    DESC                                                                     COUNT    AVG_JIFFIES
>>    ----------------------------------------- <Category busy> ------------------------------------------
>>    busy_lb_count                                                    :         136  $       17.08 $
>>    busy_lb_balanced                                                 :         131  $       17.73 $
>>    busy_lb_failed                                                   :           0  $        0.00 $
>>    busy_lb_imbalance_load                                           :          58
>>    busy_lb_imbalance_util                                           :           0
>>    busy_lb_imbalance_task                                           :           0
>>    busy_lb_imbalance_misfit                                         :           0
>>    busy_lb_gained                                                   :           7
>>    busy_lb_hot_gained                                               :           0
>>    busy_lb_nobusyq                                                  :           2  $     1161.50 $
>>    busy_lb_nobusyg                                                  :         129  $       18.01 $
>>    *busy_lb_success_count                                           :           5
>>    *busy_lb_avg_pulled                                              :        1.40
>>    ----------------------------------------- <Category idle> ------------------------------------------
>>    idle_lb_count                                                    :         449  $        5.17 $
>>    idle_lb_balanced                                                 :         382  $        6.08 $
>>    idle_lb_failed                                                   :           3  $      774.33 $
>>    idle_lb_imbalance_load                                           :           0
>>    idle_lb_imbalance_util                                           :           0
>>    idle_lb_imbalance_task                                           :          71
>>    idle_lb_imbalance_misfit                                         :           0
>>    idle_lb_gained                                                   :          67
>>    idle_lb_hot_gained                                               :           0
>>    idle_lb_nobusyq                                                  :           0  $        0.00 $
>>    idle_lb_nobusyg                                                  :         382  $        6.08 $
>>    *idle_lb_success_count                                           :          64
>>    *idle_lb_avg_pulled                                              :        1.05
>>    ---------------------------------------- <Category newidle> ----------------------------------------
>>    newidle_lb_count                                                 :       30471  $        0.08 $
>>    newidle_lb_balanced                                              :       28490  $        0.08 $
>>    newidle_lb_failed                                                :         633  $        3.67 $
>>    newidle_lb_imbalance_load                                        :           0
>>    newidle_lb_imbalance_util                                        :           0
>>    newidle_lb_imbalance_task                                        :        2040
>>    newidle_lb_imbalance_misfit                                      :           0
>>    newidle_lb_gained                                                :        1348
>>    newidle_lb_hot_gained                                            :           0
>>    newidle_lb_nobusyq                                               :           6  $      387.17 $
>>    newidle_lb_nobusyg                                               :       26634  $        0.09 $
>>    *newidle_lb_success_count                                        :        1348
>>    *newidle_lb_avg_pulled                                           :        1.00
>>    ----------------------------------------------------------------------------------------------------
>>
>> Consider following line:
>>
>> newidle_lb_balanced                                              :       28490  $        0.08 $
>>
>> While profiling was active, the load-balancer found 28490 times the load
>> needs to be balanced on a newly idle CPU 0. Following value encapsulated
>> inside $ is average jiffies between two events (28490 / 24537 = 0.08).
>>
> 
> Could you please explain this? I couldn't understand.
> 
> IIUC, you are parsing two instances of /proc/schedstat,
> once in the beginning and once in the end.
> 
> newidle_lb_balanced is a counter. In the beginning every iteration could
> have decided domain is imbalanced and once load stabilized, it could have
> decided now domain is balanced more often. i.e initially counter would add
> quickly and then may stay more or less same value.
> 
> Also, what is this logic ? (28490 / 24537 = 0.08)?
> 

Thanks for catching this. This is a miss from my end while writing the 
cover letter. Here the jiffies value and the counter value are from two 
different runs. The total jiffies for the run was 2323.

The value inside $ .. $ represents the average number of jiffies between 
two consecutive occurrences of the event. The calculation is 
(total_jiffies / counter_value); for that run it is (2323 / 28490 = 0.08).

I will fix this in the man page as well.
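To make the arithmetic concrete, a small sketch (illustrative only, not
taken from the tool's source) of the two derived values in the report,
using example numbers from the cover letter; the function names are
hypothetical:

```python
def avg_jiffies(total_jiffies: int, count: int) -> float:
    """AVG_JIFFIES: average jiffies between two consecutive occurrences
    of an event, i.e. total_jiffies / counter_value."""
    return total_jiffies / count if count else 0.0


def pct_change(value: int, base: int) -> float:
    """PCT_CHANGE: a stat as a percentage of its base stat
    (e.g. sched_goidle relative to sched_count)."""
    return value * 100.0 / base if base else 0.0


# newidle_lb_balanced: 28490 events over 2323 jiffies
print(f"{avg_jiffies(2323, 28490):.2f}")    # -> 0.08

# sched_goidle (147161) relative to sched_count (402267)
print(f"{pct_change(147161, 402267):.2f}")  # -> 36.58
```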

--
Thanks and Regards,
Swapnil

Thread overview: 36+ messages
2026-01-19 17:58 [PATCH v5 00/10] perf sched: Introduce stats tool Swapnil Sapkal
2026-01-19 17:58 ` [PATCH v5 01/10] tools/lib: Add list_is_first() Swapnil Sapkal
2026-01-19 17:58 ` [PATCH v5 02/10] perf header: Support CPU DOMAIN relation info Swapnil Sapkal
2026-01-21 17:18   ` Shrikanth Hegde
2026-01-23 15:19     ` Swapnil Sapkal
2026-01-23 16:31       ` Arnaldo Carvalho de Melo
2026-01-26 14:46   ` kernel test robot
2026-01-26 17:12     ` Ian Rogers
2026-01-27  2:46       ` Oliver Sang
2026-01-19 17:58 ` [PATCH v5 03/10] perf sched stats: Add record and rawdump support Swapnil Sapkal
2026-01-19 17:58 ` [PATCH v5 04/10] perf sched stats: Add schedstat v16 support Swapnil Sapkal
2026-01-19 17:58 ` [PATCH v5 05/10] perf sched stats: Add schedstat v17 support Swapnil Sapkal
2026-01-19 17:58 ` [PATCH v5 06/10] perf sched stats: Add support for report subcommand Swapnil Sapkal
2026-01-19 17:58 ` [PATCH v5 07/10] perf sched stats: Add support for live mode Swapnil Sapkal
2026-01-22  0:54   ` Arnaldo Carvalho de Melo
2026-01-22  1:32     ` Arnaldo Carvalho de Melo
2026-01-22 15:34       ` Arnaldo Carvalho de Melo
2026-01-22 15:42     ` Arnaldo Carvalho de Melo
2026-03-03 18:47   ` Ian Rogers
2026-03-10 10:08     ` Swapnil Sapkal
2026-01-19 17:58 ` [PATCH v5 08/10] perf sched stats: Add support for diff subcommand Swapnil Sapkal
2026-01-19 17:58 ` [PATCH v5 09/10] perf sched stats: Add basic perf sched stats test Swapnil Sapkal
2026-01-19 17:58 ` [PATCH v5 10/10] perf sched stats: Add details in man page Swapnil Sapkal
2026-01-21 16:09 ` [PATCH v5 00/10] perf sched: Introduce stats tool Chen, Yu C
2026-01-21 16:33   ` Peter Zijlstra
2026-01-21 17:12     ` Ian Rogers
2026-01-21 19:51       ` Peter Zijlstra
2026-01-21 20:08         ` Arnaldo Carvalho de Melo
2026-01-21 20:10           ` Peter Zijlstra
2026-01-22 16:00             ` Arnaldo Carvalho de Melo
2026-01-23 16:32               ` Swapnil Sapkal
2026-01-23 16:37                 ` Arnaldo Carvalho de Melo
2026-01-23 16:42                   ` Swapnil Sapkal
2026-01-21 17:52 ` Shrikanth Hegde
2026-01-23 16:19   ` Swapnil Sapkal [this message]
2026-01-21 22:59 ` Namhyung Kim
