From: Ingo Molnar <mingo@elte.hu>
To: Nikhil Rao <ncrao@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Mike Galbraith <efault@gmx.de>,
	linux-kernel@vger.kernel.org,
	"Nikunj A. Dadhania" <nikunj@linux.vnet.ibm.com>,
	Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
	Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de>
Subject: Re: [PATCH v1 00/19] Increase resolution of load weights
Date: Wed, 11 May 2011 08:59:10 +0200	[thread overview]
Message-ID: <20110511065910.GD22551@elte.hu> (raw)
In-Reply-To: <BANLkTimevJFjvoDSE2q0bu6gQaajmaZATw@mail.gmail.com>


* Nikhil Rao <ncrao@google.com> wrote:

> I think we need the branch for 64-bit kernels. I don't like the branch but I 
> can't think of a better way to avoid it. Do you have any suggestion?

It was just a quick stab in the dark; I was fishing for more 
micro-optimizations on 64-bit (32-bit, as long as we leave its resolution 
alone, should not matter much): clearly there is *some* new overhead on 64-bit 
kernels too, so it would be nice to reduce that to the absolute minimum.

> For 32-bit systems, the compiler should ideally optimize this branch away. 
> Unfortunately gcc-4.4.3 doesn't do that (and I'm not sure if a later version 
> does it either). We could add a macro around this check to avoid the branch 
> for 32-bit and do the check for 64-bit kernels?

I'd rather keep it easy to read. If we keep the unit of load a 32-bit word on 
32-bit kernels, then 32-bit should see basically no extra overhead, right? 
(modulo the compiler failing to notice such optimizations.)

Also, it's a good idea to do performance measurements with the newest gcc (4.6) 
if possible: by the time such a change hits distros it will be the established 
stock distro compiler that kernels get built with. Maybe your figures will get 
better, and maybe it can optimize this branch away as well.

> > Also, the above (and the other scale-adjustment changes) probably explains 
> > why the instruction count went up on 64-bit.
> 
> Yes, that makes sense. We see an increase in instruction count of about 2% 
> with the new version of the patchset, down from 5.8% (will post the new 
> patchset soon). Assuming 30% of the cost of pipe test is scheduling, that is 
> an effective increase of approx. 6.7%. I'll post the data and some analysis 
> along with the new version.
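The arithmetic behind that estimate can be checked quickly (a sketch; the ~30% 
scheduling share is the assumption stated above, not a measured value):

```python
# Back-of-the-envelope check of the quoted overhead estimate.
insn_increase_total = 0.02  # 2% more instructions over the whole pipe test
sched_share = 0.30          # assumed fraction of pipe-test cost that is scheduling

# If the extra instructions all land in scheduler code, the effective
# increase in the scheduling path is the total increase scaled up by
# the inverse of the scheduler's share of the workload.
effective_increase = insn_increase_total / sched_share
print(f"{effective_increase:.1%}")  # -> 6.7%
```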

An instruction count increase does not necessarily mean a linear slowdown: if 
those instructions are cheaper, or are scheduled better by the CPU, the 
slowdown will often be smaller.

Sometimes a 1% increase in the instruction count can slow down a workload by 
5%, if the added instructions do divisions, have complex data-path 
dependencies, or miss the branch predictor a lot.

So you should keep an eye on the cycle count as well. Latest -tip's perf stat 
can also measure 'stalled cycles':

aldebaran:~/sched-tests> taskset 1 perf stat --repeat 3 ./pipe-test-1m

 Performance counter stats for './pipe-test-1m' (3 runs):

       6499.787926 task-clock               #    0.437 CPUs utilized            ( +-  0.41% )
         2,000,108 context-switches         #    0.308 M/sec                    ( +-  0.00% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-100.00% )
               147 page-faults              #    0.000 M/sec                    ( +-  0.00% )
    14,226,565,939 cycles                   #    2.189 GHz                      ( +-  0.49% )
     6,897,331,129 stalled-cycles-frontend  #   48.48% frontend cycles idle     ( +-  0.90% )
     4,230,895,459 stalled-cycles-backend   #   29.74% backend  cycles idle     ( +-  1.31% )
    14,002,256,109 instructions             #    0.98  insns per cycle        
                                            #    0.49  stalled cycles per insn  ( +-  0.02% )
     2,703,891,945 branches                 #  415.997 M/sec                    ( +-  0.02% )
        44,994,805 branch-misses            #    1.66% of all branches          ( +-  0.27% )

       14.859234036  seconds time elapsed  ( +-  0.19% )
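The derived columns in that output are simple ratios of the raw counts; they 
can be recomputed directly from the numbers as printed (a quick sketch):

```python
# Recompute perf stat's derived metrics from the raw counts above.
cycles     = 14_226_565_939
insns      = 14_002_256_109
stalled_fe =  6_897_331_129   # stalled-cycles-frontend
stalled_be =  4_230_895_459   # stalled-cycles-backend

ipc             = insns / cycles       # instructions retired per cycle
stalls_per_insn = stalled_fe / insns   # stalled cycles per instruction
frontend_idle   = stalled_fe / cycles  # fraction of cycles the frontend idled
backend_idle    = stalled_be / cycles  # fraction of cycles the backend idled

print(f"{ipc:.2f} insns per cycle")           # -> 0.98
print(f"{stalls_per_insn:.2f} stalled/insn")  # -> 0.49
print(f"{frontend_idle:.2%} frontend idle")   # -> 48.48%
print(f"{backend_idle:.2%} backend idle")     # -> 29.74%
```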

The stalled-cycles frontend/backend metrics indicate whether a workload 
utilizes the CPU's resources optimally. Looking at 'perf record -e 
stalled-cycles-frontend' and 'perf report' output will show you the problem 
areas.
 
Most of the 'problem areas' will be unrelated to your code.

A 'near perfectly utilized' CPU looks like this:

aldebaran:~/opt> taskset 1 perf stat --repeat 10 ./fill_1b

 Performance counter stats for './fill_1b' (10 runs):

       1880.489837 task-clock               #    0.998 CPUs utilized            ( +-  0.15% )
                36 context-switches         #    0.000 M/sec                    ( +- 19.87% )
                 1 CPU-migrations           #    0.000 M/sec                    ( +- 59.63% )
                99 page-faults              #    0.000 M/sec                    ( +-  0.10% )
     6,027,432,226 cycles                   #    3.205 GHz                      ( +-  0.15% )
        22,138,455 stalled-cycles-frontend  #    0.37% frontend cycles idle     ( +- 36.56% )
        16,400,224 stalled-cycles-backend   #    0.27% backend  cycles idle     ( +- 38.12% )
    18,008,803,113 instructions             #    2.99  insns per cycle        
                                            #    0.00  stalled cycles per insn  ( +-  0.00% )
     1,001,802,536 branches                 #  532.735 M/sec                    ( +-  0.01% )
            22,842 branch-misses            #    0.00% of all branches          ( +-  9.07% )

        1.884595529  seconds time elapsed  ( +-  0.15% )
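The contrast with the pipe-test run above can be made explicit with the same 
arithmetic (a sketch using the counts as printed in both runs):

```python
# Compare CPU utilization of the two runs shown above.
def ipc(insns, cycles):
    # Instructions retired per cycle: roughly 3-4 is the practical
    # ceiling on this class of x86 hardware; well below 1 means the
    # CPU spends most of its time waiting.
    return insns / cycles

pipe_test = ipc(14_002_256_109, 14_226_565_939)  # context-switch heavy
fill_1b   = ipc(18_008_803_113,  6_027_432_226)  # tight compute loop

print(f"pipe-test: {pipe_test:.2f} insns/cycle")  # -> 0.98
print(f"fill_1b:   {fill_1b:.2f} insns/cycle")    # -> 2.99
```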

Both stall counts are very low. This is pretty hard to achieve in general, so 
in practice before/after comparisons are used. For those there's 'perf diff', 
which you can use to compare two profiles:

 aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
 [ perf record: Woken up 2 times to write data ]
 [ perf record: Captured and wrote 0.427 MB perf.data (~18677 samples) ]
 aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
 [ perf record: Woken up 2 times to write data ]
 [ perf record: Captured and wrote 0.428 MB perf.data (~18685 samples) ]
 aldebaran:~/sched-tests> perf diff | head -10
 # Baseline  Delta          Shared Object                         Symbol
 # ........ ..........  .................  .............................
 #
     2.68%     +0.84%  [kernel.kallsyms]  [k] select_task_rq_fair
     3.28%     -0.17%  [kernel.kallsyms]  [k] fsnotify
     2.67%     +0.13%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     2.46%     +0.11%  [kernel.kallsyms]  [k] pipe_read
     2.42%             [kernel.kallsyms]  [k] schedule
     2.11%     +0.28%  [kernel.kallsyms]  [k] copy_user_generic_string
     2.13%     +0.18%  [kernel.kallsyms]  [k] mutex_lock

 ( Note: these were two short runs on the same kernel so the diff shows the 
   natural noise of the profile of this workload. Longer runs are needed to 
   measure effects smaller than 1%. )

So there's a wide range of tools you can use to understand the precise 
performance impact of your patch, and in turn present to us what you learned 
about it.

Such analysis saves quite a bit of time for us scheduler maintainers and makes 
performance-impacting patches a lot easier to apply :-)

Thanks,

	Ingo
