From: "Sapkal, Swapnil" <swapnil.sapkal@amd.com>
To: James Clark <james.clark@linaro.org>,
Ravi Bangoria <ravi.bangoria@amd.com>
Cc: <yu.c.chen@intel.com>, <mark.rutland@arm.com>,
<alexander.shishkin@linux.intel.com>, <jolsa@kernel.org>,
<rostedt@goodmis.org>, <vincent.guittot@linaro.org>,
<bristot@redhat.com>, <adrian.hunter@intel.com>,
<james.clark@arm.com>, <kan.liang@linux.intel.com>,
<gautham.shenoy@amd.com>, <kprateek.nayak@amd.com>,
<juri.lelli@redhat.com>, <yangjihong@bytedance.com>,
<void@manifault.com>, <tj@kernel.org>,
<linux-kernel@vger.kernel.org>,
<linux-perf-users@vger.kernel.org>, <santosh.shukla@amd.com>,
<ananth.narayan@amd.com>, <sandipan.das@amd.com>,
<peterz@infradead.org>, <mingo@redhat.com>, <acme@kernel.org>,
<namhyung@kernel.org>, <irogers@google.com>
Subject: Re: [PATCH 0/5] perf sched: Introduce stats tool
Date: Wed, 18 Sep 2024 18:49:52 +0530
Message-ID: <124e9602-77c6-4e85-907c-907dfdc42087@amd.com>
In-Reply-To: <f022afad-793d-4a3d-acad-13fd27dd5e62@linaro.org>

Hi James,

On 9/17/2024 4:05 PM, James Clark wrote:
>
>
> On 16/09/2024 17:47, Ravi Bangoria wrote:
>> MOTIVATION
>> ----------
>>
>> Existing `perf sched` is quite exhaustive and provides a lot of insight
>> into scheduler behavior, but it quickly becomes impractical for long
>> running or scheduler-intensive workloads. For example, `perf sched record`
>> has ~7.77% overhead on hackbench (with 25 groups, each running 700K loops,
>> on a 2-socket, 128-core, 256-thread 3rd Generation EPYC server), and it
>> generates a huge 56G perf.data file, which perf takes ~137 mins to
>> prepare and write to disk [1].
>>
>> Unlike `perf sched record`, which hooks onto a set of scheduler
>> tracepoints and generates samples on each tracepoint hit, `perf sched
>> stats record` takes snapshots of the /proc/schedstat file before and
>> after the workload, so there is almost zero interference with the
>> workload run. It also takes very little time to parse /proc/schedstat,
>> convert it into perf samples, and save those samples to the perf.data
>> file, and the resulting perf.data file is much smaller. Overall, `perf
>> sched stats record` is much more lightweight than `perf sched record`.
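>>
>> To make the approach concrete, here is a minimal standalone sketch of
>> the snapshot idea (illustrative only, not the actual perf
>> implementation; the real tool synthesizes perf samples from each
>> snapshot instead of printing it, and the report diffs the two
>> snapshots field by field):
>>
>>   #include <stdio.h>
>>   #include <sys/wait.h>
>>   #include <unistd.h>
>>
>>   /* Dump the current contents of /proc/schedstat with a tag. */
>>   static void snapshot(const char *tag)
>>   {
>>           char buf[4096];
>>           size_t n;
>>           FILE *f = fopen("/proc/schedstat", "r");
>>
>>           if (!f)
>>                   return;
>>           printf("--- %s ---\n", tag);
>>           while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
>>                   fwrite(buf, 1, n, stdout);
>>           fclose(f);
>>   }
>>
>>   int main(int argc, char **argv)
>>   {
>>           snapshot("before");
>>           if (argc > 1) {
>>                   if (fork() == 0) {      /* child: run the workload */
>>                           execvp(argv[1], &argv[1]);
>>                           _exit(127);
>>                   }
>>                   wait(NULL);             /* parent: wait for it to finish */
>>           }
>>           snapshot("after");
>>           return 0;
>>   }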
>>
>> We have been using this internally at AMD (a variant of it, known as
>> "sched-scoreboard" [2]) and have found it very useful for analysing
>> the impact of scheduler code changes [3][4].
>>
>> Please note that this is not a replacement for perf sched
>> record/report. The intended users of the new tool are scheduler
>> developers, not regular users.
>>
>> USAGE
>> -----
>>
>> # perf sched stats record
>> # perf sched stats report
>>
>> Note: Although the `perf sched stats` tool supports the workload
>> profiling syntax (i.e. -- <workload>), the recorded profile is still
>> systemwide, since /proc/schedstat is a systemwide file.
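>>
>> For example (the workload here is hypothetical; any command works):
>>
>>   # perf sched stats record -- sleep 10
>>   # perf sched stats report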
>>
>> HOW TO INTERPRET THE REPORT
>> ---------------------------
>>
>> The `perf sched stats report` starts with the total time that
>> profiling was active, in jiffies:
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> Time elapsed (in jiffies)                                     :      24537
>> ----------------------------------------------------------------------------------------------------
>>
>> Next come the CPU scheduling statistics. These are simple diffs of the
>> /proc/schedstat CPU lines, along with descriptions. The report also
>> prints each stat as a % relative to a base stat.
>>
>> In the example below, schedule() left CPU0 idle 98.19% of the time,
>> 16.54% of all try_to_wake_up() calls were to wake up the local CPU,
>> and the total waittime of tasks on CPU0 was 0.49% of their total
>> runtime on the same CPU.
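>>
>> (Each percentage is simply the ratio of the two stats involved:
>> 16827 / 17138 = 98.19%, 84 / 508 = 16.54%, and
>> 11731825 / 2408959243 = 0.49%.)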
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> CPU 0
>> ----------------------------------------------------------------------------------------------------
>> sched_yield() count                                           :          0
>> Legacy counter can be ignored                                 :          0
>> schedule() called                                             :      17138
>> schedule() left the processor idle                            :      16827  (  98.19% )
>> try_to_wake_up() was called                                   :        508
>> try_to_wake_up() was called to wake up the local cpu          :         84  (  16.54% )
>> total runtime by tasks on this processor (in jiffies)         : 2408959243
>> total waittime by tasks on this processor (in jiffies)        :   11731825  (   0.49% )
>> total timeslices run on this cpu                              :        311
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> Next come the load balancing statistics. For each of the sched domains
>> (e.g. `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
>> the following three categories:
>>
>> 1) Idle Load Balance: load balancing performed on behalf of a
>>    long-idling CPU by some other CPU.
>> 2) Busy Load Balance: load balancing performed when the CPU was busy.
>> 3) New Idle Balance: load balancing performed when a CPU just became
>>    idle.
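>>
>> (These three categories correspond to the per-idle-type groups of
>> fields on the "domainN" lines of /proc/schedstat; the exact field
>> layout depends on the schedstat version given on its first line.)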
>>
>> Under each of these three categories, the sched stats report provides
>> several load balancing statistics. Along with the direct stats, the
>> report also contains derived metrics, prefixed with *. Example:
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> CPU 0 DOMAIN SMT CPUS <0, 64>
>> ----------------------------------------- <Category idle> ------------------------------------------
>> load_balance() count on cpu idle                              :         50   $    490.74 $
>> load_balance() found balanced on cpu idle                     :         42   $    584.21 $
>> load_balance() move task failed on cpu idle                   :          8   $   3067.12 $
>> imbalance sum on cpu idle                                     :          8
>> pull_task() count on cpu idle                                 :          0
>> pull_task() when target task was cache-hot on cpu idle        :          0
>> load_balance() failed to find busier queue on cpu idle        :          0   $      0.00 $
>> load_balance() failed to find busier group on cpu idle        :         42   $    584.21 $
>> *load_balance() success count on cpu idle                     :          0
>> *avg task pulled per successful lb attempt (cpu idle)         :       0.00
>> ----------------------------------------- <Category busy> ------------------------------------------
>> load_balance() count on cpu busy                              :          2   $  12268.50 $
>> load_balance() found balanced on cpu busy                     :          2   $  12268.50 $
>> load_balance() move task failed on cpu busy                   :          0   $      0.00 $
>> imbalance sum on cpu busy                                     :          0
>> pull_task() count on cpu busy                                 :          0
>> pull_task() when target task was cache-hot on cpu busy        :          0
>> load_balance() failed to find busier queue on cpu busy        :          0   $      0.00 $
>> load_balance() failed to find busier group on cpu busy        :          1   $  24537.00 $
>> *load_balance() success count on cpu busy                     :          0
>> *avg task pulled per successful lb attempt (cpu busy)         :       0.00
>> ---------------------------------------- <Category newidle> ----------------------------------------
>> load_balance() count on cpu newly idle                        :        427   $     57.46 $
>> load_balance() found balanced on cpu newly idle               :        382   $     64.23 $
>> load_balance() move task failed on cpu newly idle             :         45   $    545.27 $
>> imbalance sum on cpu newly idle                               :         48
>> pull_task() count on cpu newly idle                           :          0
>> pull_task() when target task was cache-hot on cpu newly idle  :          0
>> load_balance() failed to find busier queue on cpu newly idle  :          0   $      0.00 $
>> load_balance() failed to find busier group on cpu newly idle  :        382   $     64.23 $
>> *load_balance() success count on cpu newly idle               :          0
>> *avg task pulled per successful lb attempt (cpu newly idle)   :       0.00
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> Consider the following line:
>>
>> load_balance() found balanced on cpu newly idle               :        382   $     64.23 $
>>
>> While profiling was active, the load balancer found, 382 times, that
>> the load needed to be balanced on the newly idle CPU 0. The value
>> enclosed in $ signs is the average number of jiffies between two such
>> events (24537 / 382 = 64.23).
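>>
>> In general, for a stat that occurred N times over a run of T jiffies,
>> the value between the $ signs is T / N, i.e. the average gap in
>> jiffies between two occurrences of that event.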
>
> This explanation of the $ fields is quite buried. Is there a way of
> making it clearer with a column header in the report? I think even if it
> was documented in the man pages it might not be that useful.
>
Thank you for the suggestion. I will add a header in the report to
explain what the values in each column represent.
> There are also other jiffies fields that don't use $. Maybe if it were
> formatted like this it could be semi self-documenting:
>
> ----------------------------------------------------------------------
> Time elapsed (in jiffies) : $ 24537 $
> ----------------------------------------------------------------------
>
> ------------------<Category newidle> ---------------------------------
> load_balance() count on cpu newly idle : 427 $ 57.46 avg $
> ----------------------------------------------------------------------
>
>
> Other than that:
>
> Tested-by: James Clark <james.clark@linaro.org>
Ack.
--
Thanks and Regards,
Swapnil