public inbox for linux-kernel@vger.kernel.org
* busted CFS group load balancer?
@ 2008-11-15  1:14 Ken Chen
  2008-11-17 15:37 ` Chris Friesen
  0 siblings, 1 reply; 10+ messages in thread
From: Ken Chen @ 2008-11-15  1:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra; +Cc: Linux Kernel Mailing List

It appears that the fair-group load balancer in 2.6.27 does not work
properly.  It broke down severely with real-world application workloads
here at Google.  We are trying to use the CFS cgroup scheduler as a
proportional scheduler to control CPU cycle allocation among concurrent
groups of tasks on the same server machine.  For example, we have one
cgroup named 'bee' with a CFS weight of 1024, and another cgroup named
'ant' with a CFS weight of 2.  Ideally, I would expect 99.8% of CPU
cycles to go to the 'bee' cgroup and the remaining 0.2% to 'ant'.
However, when evaluating 2.6.27, the result deviates significantly from
that expectation: the fair-group load balancer is subpar and not doing
its job properly.
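(For reference, this arithmetic is mine, not part of the original mail: with
only these two groups runnable, the proportional split follows directly from
the ratio of the CFS weights.)

```python
# Expected CPU split between two cgroups competing for the same CPUs,
# proportional to their CFS weights (1024 for 'bee', 2 for 'ant').
def expected_share(weight, total_weight):
    return 100.0 * weight / total_weight

bee, ant = 1024, 2
total = bee + ant
print(f"bee: {expected_share(bee, total):.1f}%")   # ~99.8
print(f"ant: {expected_share(ant, total):.1f}%")   # ~0.2
```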

Attached is a snapshot of /proc/sched_debug.  Two CPUs were dominated by
tasks from the very low weight 'ant' cgroup, while multiple tasks in the
higher weight 'bee' cgroup landed on the same CPU and competed among
themselves.  I think the load balancer is doing its job poorly.  Have you
seen similar load-balancer deficiencies?

- Ken


Sched Debug Version: v0.07, 2.6.27 #1
now at 62931690.049344 msecs
  .sysctl_sched_latency                    : 80.000000
  .sysctl_sched_min_granularity            : 16.000000
  .sysctl_sched_wakeup_granularity         : 20.000000
  .sysctl_sched_child_runs_first           : 0.000001
  .sysctl_sched_features                   : 7935

cpu#0, 2333.417 MHz
  .nr_running                    : 2
  .load                          : 103
  .nr_switches                   : 12943039
  .nr_load_updates               : 62738664
  .nr_uninterruptible            : -260
  .jiffies                       : 4357598986
  .next_balance                  : 4357.599094
  .curr->pid                     : 12721
  .clock                         : 62969376.294574
  .cpu_load[0]                   : 103
  .cpu_load[1]                   : 121
  .cpu_load[2]                   : 143
  .cpu_load[3]                   : 158
  .cpu_load[4]                   : 166

cfs_rq[0]:/ant
  .exec_clock                    : 53371626.919067
  .MIN_vruntime                  : 60790257.707167
  .min_vruntime                  : 44998074.291781
  .max_vruntime                  : 60790257.707167
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 14251
  .sched_switch                  : 0
  .sched_count                   : 12981786
  .sched_goidle                  : 288336
  .ttwu_count                    : 6565365
  .ttwu_local                    : 5442677
  .bkl_count                     : 46781
  .nr_spread_over                : 12567
  .shares                        : 0

cfs_rq[0]:/bee
  .exec_clock                    : 9183712.055274
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 44998074.291781
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 14251
  .sched_switch                  : 0
  .sched_count                   : 12981786
  .sched_goidle                  : 288336
  .ttwu_count                    : 6565365
  .ttwu_local                    : 5442677
  .bkl_count                     : 46781
  .nr_spread_over                : 442289
  .shares                        : 101

cfs_rq[0]:/
  .exec_clock                    : 62567144.106498
  .MIN_vruntime                  : 44998074.291781
  .min_vruntime                  : 44998074.291781
  .max_vruntime                  : 44998074.291781
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 2
  .load                          : 103
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 14251
  .sched_switch                  : 0
  .sched_count                   : 12981786
  .sched_goidle                  : 288336
  .ttwu_count                    : 6565365
  .ttwu_local                    : 5442677
  .bkl_count                     : 46781
  .nr_spread_over                : 43480
  .shares                        : 0

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
             hog  4202  60790257.707167   5421910   120  60790257.707167  53371613.637974         0.000000 /ant
R           hpro 12721  10434396.891434    425169   120  10434396.891434   2351102.444642   1015320.568226 /bee

cpu#1, 2333.417 MHz
  .nr_running                    : 2
  .load                          : 102
  .nr_switches                   : 35564399
  .nr_load_updates               : 62728095
  .nr_uninterruptible            : 371
  .jiffies                       : 4357598986
  .next_balance                  : 4357.599133
  .curr->pid                     : 12727
  .clock                         : 62969376.338300
  .cpu_load[0]                   : 103
  .cpu_load[1]                   : 109
  .cpu_load[2]                   : 121
  .cpu_load[3]                   : 136
  .cpu_load[4]                   : 150

cfs_rq[1]:/ant
  .exec_clock                    : 50553231.251769
  .MIN_vruntime                  : 98877044.002500
  .min_vruntime                  : 49317332.550674
  .max_vruntime                  : 98877044.002500
  .spread                        : 0.000000
  .spread0                       : 4319258.258893
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 19005
  .sched_switch                  : 0
  .sched_count                   : 35601762
  .sched_goidle                  : 284437
  .ttwu_count                    : 18517385
  .ttwu_local                    : 16585787
  .bkl_count                     : 24693
  .nr_spread_over                : 21321
  .shares                        : 0

cfs_rq[1]:/bee
  .exec_clock                    : 12002883.375364
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 49317332.550674
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 4319258.258893
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 19005
  .sched_switch                  : 0
  .sched_count                   : 35601762
  .sched_goidle                  : 284437
  .ttwu_count                    : 18517385
  .ttwu_local                    : 16585787
  .bkl_count                     : 24693
  .nr_spread_over                : 341849
  .shares                        : 100

cfs_rq[1]:/
  .exec_clock                    : 62568079.552049
  .MIN_vruntime                  : 49317332.550674
  .min_vruntime                  : 49317332.550674
  .max_vruntime                  : 49317332.550674
  .spread                        : 0.000000
  .spread0                       : 4319258.258893
  .nr_running                    : 2
  .load                          : 102
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 19005
  .sched_switch                  : 0
  .sched_count                   : 35601762
  .sched_goidle                  : 284437
  .ttwu_count                    : 18517385
  .ttwu_local                    : 16585787
  .bkl_count                     : 24693
  .nr_spread_over                : 50967
  .shares                        : 0

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
             hog  4200  98877044.002500  14797192   120  98877044.002500  50553231.092595         0.000000 /ant
R           hpro 12727  14039095.103079    422222   120  14039095.103079   2347785.641546   1017814.287309 /bee

cpu#2, 2333.417 MHz
  .nr_running                    : 3
  .load                          : 205
  .nr_switches                   : 5452324
  .nr_load_updates               : 62703726
  .nr_uninterruptible            : 1143
  .jiffies                       : 4357598986
  .next_balance                  : 4357.599024
  .curr->pid                     : 12728
  .clock                         : 62969376.294927
  .cpu_load[0]                   : 205
  .cpu_load[1]                   : 198
  .cpu_load[2]                   : 190
  .cpu_load[3]                   : 184
  .cpu_load[4]                   : 183

cfs_rq[2]:/ant
  .exec_clock                    : 53265000.243513
  .MIN_vruntime                  : 62505737.284348
  .min_vruntime                  : 44986496.033419
  .max_vruntime                  : 62505737.284348
  .spread                        : 0.000000
  .spread0                       : -11578.258362
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 14697
  .sched_switch                  : 0
  .sched_count                   : 5470658
  .sched_goidle                  : 47945
  .ttwu_count                    : 2792641
  .ttwu_local                    : 1831019
  .bkl_count                     : 43426
  .nr_spread_over                : 16899
  .shares                        : 0

cfs_rq[2]:/bee
  .exec_clock                    : 9290734.241942
  .MIN_vruntime                  : 10838701.944779
  .min_vruntime                  : 44986496.033419
  .max_vruntime                  : 10838701.944779
  .spread                        : 0.000000
  .spread0                       : -11578.258362
  .nr_running                    : 2
  .load                          : 2048
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 14697
  .sched_switch                  : 0
  .sched_count                   : 5470658
  .sched_goidle                  : 47945
  .ttwu_count                    : 2792641
  .ttwu_local                    : 1831019
  .bkl_count                     : 43426
  .nr_spread_over                : 403151
  .shares                        : 203

cfs_rq[2]:/
  .exec_clock                    : 62557630.621219
  .MIN_vruntime                  : 44986496.033419
  .min_vruntime                  : 44986496.033419
  .max_vruntime                  : 44986496.033419
  .spread                        : 0.000000
  .spread0                       : -11578.258362
  .nr_running                    : 2
  .load                          : 205
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 14697
  .sched_switch                  : 0
  .sched_count                   : 5470658
  .sched_goidle                  : 47945
  .ttwu_count                    : 2792641
  .ttwu_local                    : 1831019
  .bkl_count                     : 43426
  .nr_spread_over                : 83787
  .shares                        : 0

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
             hog  4205  62505737.284348   1919921   120  62505737.284348  53264991.605391         0.000000 /ant
            hpro 12726  10838701.944779    423772   120  10838701.944779   2344927.648792   1017455.081051 /bee
R           hpro 12728  10838780.058709    421699   120  10838780.058709   2345123.009711   1019061.435473 /bee

cpu#3, 2333.417 MHz
  .nr_running                    : 1
  .load                          : 2
  .nr_switches                   : 4682853
  .nr_load_updates               : 62711242
  .nr_uninterruptible            : -1414
  .jiffies                       : 4357598986
  .next_balance                  : 4357.599007
  .curr->pid                     : 4204
  .clock                         : 62969376.294206
  .cpu_load[0]                   : 2
  .cpu_load[1]                   : 48
  .cpu_load[2]                   : 68
  .cpu_load[3]                   : 78
  .cpu_load[4]                   : 82

cfs_rq[3]:/ant
  .exec_clock                    : 53314990.134015
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 44976333.787158
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -21740.504623
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 14720
  .sched_switch                  : 0
  .sched_count                   : 4699788
  .sched_goidle                  : 23093
  .ttwu_count                    : 2413046
  .ttwu_local                    : 1516294
  .bkl_count                     : 38944
  .nr_spread_over                : 36602
  .shares                        : 0

cfs_rq[3]:/bee
  .exec_clock                    : 9240891.195912
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 44976333.787158
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -21740.504623
  .nr_running                    : 0
  .load                          : 0
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 14720
  .sched_switch                  : 0
  .sched_count                   : 4699788
  .sched_goidle                  : 23093
  .ttwu_count                    : 2413046
  .ttwu_local                    : 1516294
  .bkl_count                     : 38944
  .nr_spread_over                : 368326
  .shares                        : 101

cfs_rq[3]:/
  .exec_clock                    : 62560353.760642
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 44976333.787158
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -21740.504623
  .nr_running                    : 1
  .load                          : 2
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 14720
  .sched_switch                  : 0
  .sched_count                   : 4699788
  .sched_goidle                  : 23093
  .ttwu_count                    : 2413046
  .ttwu_local                    : 1516294
  .bkl_count                     : 38944
  .nr_spread_over                : 278060
  .shares                        : 0

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
R            hog  4204  61265712.229407   1590630   120  61265712.229407  53314966.642964         0.000000 /ant

cpu#4, 2333.417 MHz
  .nr_running                    : 3
  .load                          : 205
  .nr_switches                   : 6034988
  .nr_load_updates               : 62701982
  .nr_uninterruptible            : -489
  .jiffies                       : 4357598986
  .next_balance                  : 4357.599098
  .curr->pid                     : 12722
  .clock                         : 62969376.294969
  .cpu_load[0]                   : 205
  .cpu_load[1]                   : 198
  .cpu_load[2]                   : 190
  .cpu_load[3]                   : 181
  .cpu_load[4]                   : 163

cfs_rq[4]:/ant
  .exec_clock                    : 57463524.397256
  .MIN_vruntime                  : 61439338.118634
  .min_vruntime                  : 43422727.063373
  .max_vruntime                  : 61439338.118634
  .spread                        : 0.000000
  .spread0                       : -1575347.228408
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 7979
  .sched_switch                  : 0
  .sched_count                   : 6043766
  .sched_goidle                  : 18552
  .ttwu_count                    : 3180073
  .ttwu_local                    : 2221510
  .bkl_count                     : 8680
  .nr_spread_over                : 19946
  .shares                        : 0

cfs_rq[4]:/bee
  .exec_clock                    : 5093116.134219
  .MIN_vruntime                  : 5520870.121561
  .min_vruntime                  : 43422727.063373
  .max_vruntime                  : 5520870.121561
  .spread                        : 0.000000
  .spread0                       : -1575347.228408
  .nr_running                    : 2
  .load                          : 2048
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 7979
  .sched_switch                  : 0
  .sched_count                   : 6043766
  .sched_goidle                  : 18552
  .ttwu_count                    : 3180073
  .ttwu_local                    : 2221510
  .bkl_count                     : 8680
  .nr_spread_over                : 369128
  .shares                        : 203

cfs_rq[4]:/
  .exec_clock                    : 62557994.778412
  .MIN_vruntime                  : 43422727.063373
  .min_vruntime                  : 43422727.063373
  .max_vruntime                  : 43422727.063373
  .spread                        : 0.000000
  .spread0                       : -1575347.228408
  .nr_running                    : 2
  .load                          : 205
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 7979
  .sched_switch                  : 0
  .sched_count                   : 6043766
  .sched_goidle                  : 18552
  .ttwu_count                    : 3180073
  .ttwu_local                    : 2221510
  .bkl_count                     : 8680
  .nr_spread_over                : 191336
  .shares                        : 0

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
             hog  4199  61439338.118634   2252113   120  61439338.118634  57463477.501906         0.000000 /ant
R           hpro 12722   5520893.049620    423940   120   5520893.049620   2353601.970167   1016247.261338 /bee
            hpro 12725   5520870.121561    417763   120   5520870.121561   2352852.333553   1020343.133648 /bee

cpu#5, 2333.417 MHz
  .nr_running                    : 2
  .load                          : 102
  .nr_switches                   : 36228963
  .nr_load_updates               : 62712542
  .nr_uninterruptible            : 1684
  .jiffies                       : 4357598986
  .next_balance                  : 4357.599094
  .curr->pid                     : 12724
  .clock                         : 62969376.294760
  .cpu_load[0]                   : 103
  .cpu_load[1]                   : 109
  .cpu_load[2]                   : 115
  .cpu_load[3]                   : 112
  .cpu_load[4]                   : 105

cfs_rq[5]:/ant
  .exec_clock                    : 50479803.539169
  .MIN_vruntime                  : 97702539.108896
  .min_vruntime                  : 49951986.086422
  .max_vruntime                  : 97702539.108896
  .spread                        : 0.000000
  .spread0                       : 4953911.794641
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 19372
  .sched_switch                  : 0
  .sched_count                   : 36250568
  .sched_goidle                  : 40214
  .ttwu_count                    : 19127707
  .ttwu_local                    : 17723882
  .bkl_count                     : 1632
  .nr_spread_over                : 17524
  .shares                        : 0

cfs_rq[5]:/bee
  .exec_clock                    : 12079214.598498
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 49951986.086422
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 4953911.794641
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 19372
  .sched_switch                  : 0
  .sched_count                   : 36250568
  .sched_goidle                  : 40214
  .ttwu_count                    : 19127707
  .ttwu_local                    : 17723882
  .bkl_count                     : 1632
  .nr_spread_over                : 257170
  .shares                        : 100

cfs_rq[5]:/
  .exec_clock                    : 62560310.280302
  .MIN_vruntime                  : 49951986.086422
  .min_vruntime                  : 49951986.086422
  .max_vruntime                  : 49951986.086422
  .spread                        : 0.000000
  .spread0                       : 4953911.794641
  .nr_running                    : 2
  .load                          : 102
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 19372
  .sched_switch                  : 0
  .sched_count                   : 36250568
  .sched_goidle                  : 40214
  .ttwu_count                    : 19127707
  .ttwu_local                    : 17723882
  .bkl_count                     : 1632
  .nr_spread_over                : 51023
  .shares                        : 0

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
             hog  4201  97702539.108896  15313291   120  97702539.108896  50479803.305704         0.000000 /ant
R           hpro 12724  13553228.752335    428968   120  13553228.752335   2359267.848257   1021475.119277 /bee

cpu#6, 2333.417 MHz
  .nr_running                    : 2
  .load                          : 103
  .nr_switches                   : 5740323
  .nr_load_updates               : 62705164
  .nr_uninterruptible            : -483
  .jiffies                       : 4357598986
  .next_balance                  : 4357.599088
  .curr->pid                     : 18280
  .clock                         : 62969376.299341
  .cpu_load[0]                   : 103
  .cpu_load[1]                   : 78
  .cpu_load[2]                   : 47
  .cpu_load[3]                   : 26
  .cpu_load[4]                   : 15

cfs_rq[6]:/ant
  .exec_clock                    : 57412441.111827
  .MIN_vruntime                  : 60648319.346763
  .min_vruntime                  : 43681901.531907
  .max_vruntime                  : 60648319.346763
  .spread                        : 0.000000
  .spread0                       : -1316172.759874
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 8292
  .sched_switch                  : 0
  .sched_count                   : 5749245
  .sched_goidle                  : 17290
  .ttwu_count                    : 2988054
  .ttwu_local                    : 2083371
  .bkl_count                     : 8904
  .nr_spread_over                : 11085
  .shares                        : 0

cfs_rq[6]:/bee
  .exec_clock                    : 5144065.252998
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 43681901.531907
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -1316172.759874
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 8292
  .sched_switch                  : 0
  .sched_count                   : 5749245
  .sched_goidle                  : 17290
  .ttwu_count                    : 2988054
  .ttwu_local                    : 2083371
  .bkl_count                     : 8904
  .nr_spread_over                : 372629
  .shares                        : 101

cfs_rq[6]:/
  .exec_clock                    : 62561266.440406
  .MIN_vruntime                  : 43681901.531907
  .min_vruntime                  : 43681901.531907
  .max_vruntime                  : 43681901.531907
  .spread                        : 0.000000
  .spread0                       : -1316172.759874
  .nr_running                    : 2
  .load                          : 103
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 8292
  .sched_switch                  : 0
  .sched_count                   : 5749245
  .sched_goidle                  : 17290
  .ttwu_count                    : 2988054
  .ttwu_local                    : 2083371
  .bkl_count                     : 8904
  .nr_spread_over                : 73618
  .shares                        : 0

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
             hog  4198  60648319.346763   2155112   120  60648319.346763  57412425.922637         0.000000 /ant
R            cat 18280   5564927.823046         3   120   5564927.823046         1.995011         0.012258 /bee

cpu#7, 2333.417 MHz
  .nr_running                    : 2
  .load                          : 103
  .nr_switches                   : 6910780
  .nr_load_updates               : 62705080
  .nr_uninterruptible            : -552
  .jiffies                       : 4357598986
  .next_balance                  : 4357.599016
  .curr->pid                     : 12723
  .clock                         : 62969376.305631
  .cpu_load[0]                   : 103
  .cpu_load[1]                   : 99
  .cpu_load[2]                   : 94
  .cpu_load[3]                   : 93
  .cpu_load[4]                   : 103

cfs_rq[7]:/ant
  .exec_clock                    : 57417395.762701
  .MIN_vruntime                  : 61941066.080427
  .min_vruntime                  : 43566817.296538
  .max_vruntime                  : 61941066.080427
  .spread                        : 0.000000
  .spread0                       : -1431256.995243
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 8673
  .sched_switch                  : 0
  .sched_count                   : 6928312
  .sched_goidle                  : 24889
  .ttwu_count                    : 3645419
  .ttwu_local                    : 2666683
  .bkl_count                     : 16970
  .nr_spread_over                : 12049
  .shares                        : 0

cfs_rq[7]:/bee
  .exec_clock                    : 5142675.452517
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 43566817.296538
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -1431256.995243
  .nr_running                    : 1
  .load                          : 1024
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 8673
  .sched_switch                  : 0
  .sched_count                   : 6928312
  .sched_goidle                  : 24889
  .ttwu_count                    : 3645419
  .ttwu_local                    : 2666683
  .bkl_count                     : 16970
  .nr_spread_over                : 389009
  .shares                        : 101

cfs_rq[7]:/
  .exec_clock                    : 62561246.337270
  .MIN_vruntime                  : 43566817.296538
  .min_vruntime                  : 43566817.296538
  .max_vruntime                  : 43566817.296538
  .spread                        : 0.000000
  .spread0                       : -1431256.995243
  .nr_running                    : 2
  .load                          : 103
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 8673
  .sched_switch                  : 0
  .sched_count                   : 6928312
  .sched_goidle                  : 24889
  .ttwu_count                    : 3645419
  .ttwu_local                    : 2666683
  .bkl_count                     : 16970
  .nr_spread_over                : 31384
  .shares                        : 0

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
             hog  4203  61941066.080427   2636297   120  61941066.080427  57417288.830170         0.000000 /ant
R           hpro 12723   5580819.679520    424397   120   5580819.679520   2359524.675698   1020120.649702 /bee


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: busted CFS group load balancer?
  2008-11-15  1:14 busted CFS group load balancer? Ken Chen
@ 2008-11-17 15:37 ` Chris Friesen
  2008-11-17 20:04   ` Ken Chen
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Friesen @ 2008-11-17 15:37 UTC (permalink / raw)
  To: Ken Chen; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

Ken Chen wrote:
> It appears that the fair-group load balancer in 2.6.27 does not work
> properly.

There was an issue fixed post-2.6.27 where the load balancer didn't work 
properly if there was one task per group per CPU.  You might try 
backporting commit 38736f4 and see if that helps.

Also, caea8a0 fixes up list traversal in load_balance_fair() to use the 
_rcu() version.  I'm not aware of this causing any problems in practice, 
but it might be worth a try.

Chris


* Re: busted CFS group load balancer?
  2008-11-17 15:37 ` Chris Friesen
@ 2008-11-17 20:04   ` Ken Chen
  2008-11-17 21:19     ` Chris Friesen
  2008-11-18  7:52     ` Ken Chen
  0 siblings, 2 replies; 10+ messages in thread
From: Ken Chen @ 2008-11-17 20:04 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Mon, Nov 17, 2008 at 7:37 AM, Chris Friesen wrote:
>> It appears that the fair-group load balancer in 2.6.27 does not work
>> properly.
>
> There was an issue fixed post 2.6.27 where the load balancer didn't work
> properly if there was one task per group per cpu.  You might try
> backporting commit 38736f4 and see if that helps.

Tested git commit 38736f4, it doesn't fix the problem I'm seeing.


* Re: busted CFS group load balancer?
  2008-11-17 20:04   ` Ken Chen
@ 2008-11-17 21:19     ` Chris Friesen
  2008-11-18  5:19       ` Peter Zijlstra
  2008-11-18  7:52     ` Ken Chen
  1 sibling, 1 reply; 10+ messages in thread
From: Chris Friesen @ 2008-11-17 21:19 UTC (permalink / raw)
  To: Ken Chen; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

Ken Chen wrote:
> On Mon, Nov 17, 2008 at 7:37 AM, Chris Friesen wrote:
>>> It appears that the fair-group load balancer in 2.6.27 does not work
>>> properly.
>> There was an issue fixed post 2.6.27 where the load balancer didn't work
>> properly if there was one task per group per cpu.  You might try
>> backporting commit 38736f4 and see if that helps.
> 
> Tested git commit 38736f4, it doesn't fix the problem I'm seeing.
> 

I plugged the same weights into my test app (groups 1 and 2 instead 
of ant/bee) and got the results below for a 10-sec run.  The "actual" 
numbers give the overall average and then the values for each hog 
separately.  In this case we see that both tasks in group 2 ended up 
sharing a cpu with one of the tasks from group 1.

   group      actual(%)      expected(%)  ctx switches   max_latency(ms)
       1  99.69(99.38/99.99)   99.81       160/262          4/0
       2   0.31( 0.31/0.31)     0.19       32/33         391/375

I've only got a 2-way system.  If the results really are that much worse 
on larger systems, then that's going to cause problems for us as well. 
I'll see if I can get some time on a bigger machine.

Chris


* Re: busted CFS group load balancer?
  2008-11-17 21:19     ` Chris Friesen
@ 2008-11-18  5:19       ` Peter Zijlstra
  2008-11-18  7:33         ` Ken Chen
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2008-11-18  5:19 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Ken Chen, Ingo Molnar, Linux Kernel Mailing List

On Mon, 2008-11-17 at 15:19 -0600, Chris Friesen wrote:
> Ken Chen wrote:
> > On Mon, Nov 17, 2008 at 7:37 AM, Chris Friesen wrote:
> >>> It appears that the fair-group load balancer in 2.6.27 does not work
> >>> properly.
> >> There was an issue fixed post 2.6.27 where the load balancer didn't work
> >> properly if there was one task per group per cpu.  You might try
> >> backporting commit 38736f4 and see if that helps.
> > 
> > Tested git commit 38736f4, it doesn't fix the problem I'm seeing.
> > 
> 
> I plugged the same weights into my test app (groups 1 and 2 instead 
> of ant/bee) and got the results below for a 10-sec run.  The "actual" 
> numbers give the overall average and then the values for each hog 
> separately.  In this case we see that both tasks in group 2 ended up 
> sharing a cpu with one of the tasks from group 1.
> 
>    group      actual(%)      expected(%)  ctx switches   max_latency(ms)
>        1  99.69(99.38/99.99)   99.81       160/262          4/0
>        2   0.31( 0.31/0.31)     0.19       32/33         391/375
> 
> I've only got a 2-way system.  If the results really are that much worse 
> on larger systems, then that's going to cause problems for us as well. 
> I'll see if I can get some time on a bigger machine.

Note that with larger cpu count and/or lower group weight we'll quickly
run into numerical trouble...

I would recommend trying this with the minimum weight on the order of
8-16 times the number of cpus on your system.

There is only so much one can do with 10 bit fixed precision math :/




* Re: busted CFS group load balancer?
  2008-11-18  5:19       ` Peter Zijlstra
@ 2008-11-18  7:33         ` Ken Chen
  2008-11-18 12:30           ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Ken Chen @ 2008-11-18  7:33 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Chris Friesen, Ingo Molnar, Linux Kernel Mailing List

On Mon, Nov 17, 2008 at 9:19 PM, Peter Zijlstra wrote:
> Note that with larger cpu count and/or lower group weight we'll quickly
> run into numerical trouble...
>
> I would recommend trying this with the minimum weight on the order of
> 8-16 times the number of cpus on your system.
>
> There is only so much one can do with 10 bit fixed precision math :/

That is probably one of the many problems.  I also found that the
updates to the per-cpu task_group's sched_entity load weight
(tg->se[cpu]->load.weight) are very problematic and erratic.

The total rq_weight is calculated at the beginning of tg_shares_up():

        for_each_cpu_mask(i, sd->span) {
                rq_weight += tg->cfs_rq[i]->load.weight;
                shares += tg->cfs_rq[i]->shares;
        }

However, the scaling of the per-cpu se->load.weight in
__update_group_shares_cpu() re-reads tg->cfs_rq[cpu]->load.weight at a
later time, and cfs_rq[cpu].load.weight isn't always consistent across
these two reads.  Due to this inconsistency in the values taken from
the per-cpu cfs_rq, I've seen tg->se[cpu]->load.weight jumping all
over the place.  In our environment, the cpu loads are very dynamic;
processes queue and dequeue at a high rate.

I'm also very troubled by this calculation in __update_group_shares_cpu():

        shares = (sd_shares * rq_weight) / (sd_rq_weight + 1);

Won't you have a rounding problem here?  The value of 'shares' will
gradually decrease with each iteration of __update_group_shares_cpu().

- Ken


* Re: busted CFS group load balancer?
  2008-11-17 20:04   ` Ken Chen
  2008-11-17 21:19     ` Chris Friesen
@ 2008-11-18  7:52     ` Ken Chen
  1 sibling, 0 replies; 10+ messages in thread
From: Ken Chen @ 2008-11-18  7:52 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Mon, Nov 17, 2008 at 7:37 AM, Chris Friesen wrote:
>> It appears that the fair-group load balancer in 2.6.27 does not work
>> properly.
>
> There was an issue fixed post 2.6.27 where the load balancer didn't work
> properly if there was one task per group per cpu.  You might try
> backporting commit 38736f4 and see if that helps.

On Mon, Nov 17, 2008 at 12:04 PM, Ken Chen wrote:
> Tested git commit 38736f4, it doesn't fix the problem I'm seeing.

I would also like to mention that commit 38736f4 has stability
problems.  I put a kernel with this commit on several test machines,
and several of them had hard lockups within 2 hours of testing.  I
wasn't doing anything fancy; the workload was a pure 'while (1)' loop,
though there were other background services running on those machines.

- Ken


* Re: busted CFS group load balancer?
  2008-11-18  7:33         ` Ken Chen
@ 2008-11-18 12:30           ` Peter Zijlstra
  2008-11-18 13:48             ` Peter Zijlstra
  2008-11-18 17:27             ` Ken Chen
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Zijlstra @ 2008-11-18 12:30 UTC (permalink / raw)
  To: Ken Chen; +Cc: Chris Friesen, Ingo Molnar, Linux Kernel Mailing List

On Mon, 2008-11-17 at 23:33 -0800, Ken Chen wrote:
> On Mon, Nov 17, 2008 at 9:19 PM, Peter Zijlstra wrote:
> > Note that with larger cpu count and/or lower group weight we'll quickly
> > run into numerical trouble...
> >
> > I would recommend trying this with the minimum weight on the order of
> > 8-16 times the number of cpus on your system.
> >
> > There is only so much one can do with 10 bit fixed precision math :/
> 
> That is probably one of the many problems.  I also found that the
> updates to the per-cpu task_group's sched_entity load weight
> (tg->se[cpu]->load.weight) are very problematic and erratic.
> 
> The total rq_weight is calculated at the beginning of tg_shares_up():
> 
>         for_each_cpu_mask(i, sd->span) {
>                 rq_weight += tg->cfs_rq[i]->load.weight;
>                 shares += tg->cfs_rq[i]->shares;
>         }
> 
> However, the scaling of the per-cpu se->load.weight in
> __update_group_shares_cpu() re-reads tg->cfs_rq[cpu]->load.weight at a
> later time, and cfs_rq[cpu].load.weight isn't always consistent across
> these two reads.  Due to this inconsistency in the values taken from
> the per-cpu cfs_rq, I've seen tg->se[cpu]->load.weight jumping all
> over the place.  In our environment, the cpu loads are very dynamic;
> processes queue and dequeue at a high rate.

Ok, if your load values are very unstable on the order of the
load-balance interval, then you're hosed too; the same is true for the
normal smp load-balancer.

The cgroup load-balancer makes that even more problematic.

Again, there's very little you can do about that, except increase the
coupling between cpus and thereby increase the overhead.  Try
decreasing sysctl_sched_shares_ratelimit.


> I'm also very troubled by this calculation in __update_group_shares_cpu():
> 
>         shares = (sd_shares * rq_weight) / (sd_rq_weight + 1);
> 
> Won't you have a rounding problem here?  The value of 'shares' will
> gradually decrease with each iteration of __update_group_shares_cpu().

Yes it will; however, at the top of the sched-domain tree it's reset.

        if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
                shares = tg->shares;





* Re: busted CFS group load balancer?
  2008-11-18 12:30           ` Peter Zijlstra
@ 2008-11-18 13:48             ` Peter Zijlstra
  2008-11-18 17:27             ` Ken Chen
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2008-11-18 13:48 UTC (permalink / raw)
  To: Ken Chen; +Cc: Chris Friesen, Ingo Molnar, Linux Kernel Mailing List

On Tue, 2008-11-18 at 13:30 +0100, Peter Zijlstra wrote:
> On Mon, 2008-11-17 at 23:33 -0800, Ken Chen wrote:
> > On Mon, Nov 17, 2008 at 9:19 PM, Peter Zijlstra wrote:
> > > Note that with larger cpu count and/or lower group weight we'll quickly
> > > run into numerical trouble...
> > >
> > > I would recommend trying this with the minimum weight on the order of
> > > 8-16 times the number of cpus on your system.
> > >
> > > There is only so much one can do with 10 bit fixed precision math :/
> > 
> > That is probably one of the many problems.  I also found that the
> > updates to the per-cpu task_group's sched_entity load weight
> > (tg->se[cpu]->load.weight) are very problematic and erratic.
> > 
> > The total rq_weight is calculated at the beginning of tg_shares_up():
> > 
> >         for_each_cpu_mask(i, sd->span) {
> >                 rq_weight += tg->cfs_rq[i]->load.weight;
> >                 shares += tg->cfs_rq[i]->shares;
> >         }
> > 
> > However, the scaling of the per-cpu se->load.weight in
> > __update_group_shares_cpu() re-reads tg->cfs_rq[cpu]->load.weight at a
> > later time, and cfs_rq[cpu].load.weight isn't always consistent across
> > these two reads.  Due to this inconsistency in the values taken from
> > the per-cpu cfs_rq, I've seen tg->se[cpu]->load.weight jumping all
> > over the place.  In our environment, the cpu loads are very dynamic;
> > processes queue and dequeue at a high rate.
> 
> Ok, if your load values are very unstable on the order of the
> load-balance interval, then you're hosed too; the same is true for the
> normal smp load-balancer.
> 
> The cgroup load-balancer makes that even more problematic.
> 
> Again, there's very little you can do about that, except increase the
> coupling between cpus and thereby increase the overhead.  Try
> decreasing sysctl_sched_shares_ratelimit.


Also, lower sysctl_sched_shares_thresh to 1 or 0.




* Re: busted CFS group load balancer?
  2008-11-18 12:30           ` Peter Zijlstra
  2008-11-18 13:48             ` Peter Zijlstra
@ 2008-11-18 17:27             ` Ken Chen
  1 sibling, 0 replies; 10+ messages in thread
From: Ken Chen @ 2008-11-18 17:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Chris Friesen, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 18, 2008 at 4:30 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> I'm also very troubled by this calculation in __update_group_shares_cpu():
>>
>>         shares = (sd_shares * rq_weight) / (sd_rq_weight + 1);
>>
>> Won't you have a rounding problem here?  The value of 'shares' will
>> gradually decrease with each iteration of __update_group_shares_cpu().
>
> Yes it will; however, at the top of the sched-domain tree it's reset.
>
>        if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
>                shares = tg->shares;

Hmm?  That is mostly true only on a flat smp machine, or on a numa
system where a task is woken up across a sched-domain.

There is this code in try_to_wake_up():

                for_each_domain(this_cpu, sd) {
                        if (cpu_isset(cpu, sd->span)) {
                                update_shares(sd);
                                break;
                        }
                }

It doesn't iterate to the top of the sched-domain tree half of the
time on a numa machine.

- Ken


end of thread, other threads:[~2008-11-18 17:27 UTC | newest]

Thread overview: 10+ messages
2008-11-15  1:14 busted CFS group load balancer? Ken Chen
2008-11-17 15:37 ` Chris Friesen
2008-11-17 20:04   ` Ken Chen
2008-11-17 21:19     ` Chris Friesen
2008-11-18  5:19       ` Peter Zijlstra
2008-11-18  7:33         ` Ken Chen
2008-11-18 12:30           ` Peter Zijlstra
2008-11-18 13:48             ` Peter Zijlstra
2008-11-18 17:27             ` Ken Chen
2008-11-18  7:52     ` Ken Chen
