public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Database regression due to scheduler changes ?
@ 2005-11-07 22:17 Brian Twichell
  2005-11-07 22:35 ` David Lang
  2005-11-07 22:47 ` linux-os (Dick Johnson)
  0 siblings, 2 replies; 18+ messages in thread
From: Brian Twichell @ 2005-11-07 22:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: mbligh, slpratt, anton

Hi,

We observed a 1.5% regression in an OLTP database workload going from
2.6.13-rc4 to 2.6.13-rc5.  The regression has been carried forward
at least as far as 2.6.14-rc5.

Through experimentation, and through examining the changes that
went into 2.6.13-rc5, we found that we can eliminate the regression
in 2.6.13-rc5 with one straightforward change:  eliminating the
NUMA level from the CPU scheduler domain structures.

After observing this, we collected schedstats (provided below)
to try to determine how the scheduler behaves differently
when the NUMA level is eliminated.  It appears to us that
the scheduler is having more success in balancing in this
case.  We tried to duplicate this effect by changing parameters
in the NUMA-level and SMP-level domain definitions to
increase the aggressiveness of the balancing, but none of the
changes could recoup the regression.

We suspect the regression was introduced in the scheduler changes
that went into 2.6.13-rc1.  However, the regression was hidden
from us by a bug in include/asm-ppc64/topology.h that made ppc64
look non-NUMA from 2.6.13-rc1 through 2.6.13-rc4.  That bug was
fixed in 2.6.13-rc5.  Unfortunately the workload does not run to
completion on 2.6.12 or 2.6.13-rc1.  We have measurements on
2.6.12-rc6-git7 that do not show the regression.

One alternative for fixing this in 2.6.13 would have been to #define
ARCH_HAS_SCHED_DOMAINS and to introduce a ppc64-specific version
of build_sched_domains that eliminates the NUMA-level domain for
small (e.g. 4-way) ppc64 systems.  However, ARCH_HAS_SCHED_DOMAINS
has been eliminated from 2.6.14, and anyways that solution doesn't
seem very encompassing to me.

So, at this point I am soliciting assistance from scheduler experts
to determine how this regression can be eliminated.  We are keen
to prevent this regression from going into the next distro versions.
Simply shipping a distro kernel with CONFIG_NUMA off isn't a viable
option because we need it for our larger configurations.

Our system configuration is a 4-way 1.9 GHz Power5-based server.  As
the system supports SMT, it shows eight online CPUs.

Below are the schedstats.  The first set is with the NUMA-level
domain, while the second set is without the NUMA-level domain.

Cheers,
Brian Twichell

Schedstats (NUMA-level domain included)
----------------------------------------------------------------------
00:09:05--------------------------------------------------------------
       2845          sys_sched_yield()
          0(  0.00%) found (only) active queue empty on current cpu
          0(  0.00%) found (only) expired queue empty on current cpu
        157(  5.52%) found both queues empty on current cpu
       2688( 94.48%) found neither queue empty on current cpu


    23287180          schedule()
          1(  0.00%) switched active and expired queues
          0(  0.00%) used existing active queue

          0          active_load_balance()
          0          sched_balance_exec()

      0.19/1.17      avg runtime/latency over all cpus (ms)

[scheduler domain #0]
    1418943          load_balance()
     112240(  7.91%) called while idle
                         499(  0.44%) tried but failed to move any tasks
                       80433( 71.66%) found no busier group
                       31308( 27.89%) succeeded in moving at least one task
                                      (average imbalance:   1.549)
     316022( 22.27%) called while busy
                          21(  0.01%) tried but failed to move any tasks
                      220440( 69.75%) found no busier group
                       95561( 30.24%) succeeded in moving at least one task
                                      (average imbalance:   1.727)
     990681( 69.82%) called when newly idle
                         533(  0.05%) tried but failed to move any tasks
                      808816( 81.64%) found no busier group
                      181332( 18.30%) succeeded in moving at least one task
                                      (average imbalance:   1.500)

          0          sched_balance_exec() tried to push a task

[scheduler domain #1]
     922193          load_balance()
      85822(  9.31%) called while idle
                        4032(  4.70%) tried but failed to move any tasks
                       70982( 82.71%) found no busier group
                       10808( 12.59%) succeeded in moving at least one task
                                      (average imbalance:   1.348)
      27022(  2.93%) called while busy
                         106(  0.39%) tried but failed to move any tasks
                       25478( 94.29%) found no busier group
                        1438(  5.32%) succeeded in moving at least one task
                                      (average imbalance:   1.712)
     809349( 87.76%) called when newly idle
                        6967(  0.86%) tried but failed to move any tasks
                      757097( 93.54%) found no busier group
                       45285(  5.60%) succeeded in moving at least one task
                                      (average imbalance:   1.338)

          0          sched_balance_exec() tried to push a task

[scheduler domain #2]
     825662          load_balance()
      52074(  6.31%) called while idle
                       17791( 34.16%) tried but failed to move any tasks
                       32839( 63.06%) found no busier group
                        1444(  2.77%) succeeded in moving at least one task
                                      (average imbalance:   1.981)
       9524(  1.15%) called while busy
                        1072( 11.26%) tried but failed to move any tasks
                        7654( 80.37%) found no busier group
                         798(  8.38%) succeeded in moving at least one task
                                      (average imbalance:   2.976)
     764064( 92.54%) called when newly idle
                      262831( 34.40%) tried but failed to move any tasks
                      409353( 53.58%) found no busier group
                       91880( 12.03%) succeeded in moving at least one task
                                      (average imbalance:   2.518)

          0          sched_balance_exec() tried to push a task


Schedstats (NUMA-level domain eliminated)
----------------------------------------------------------------------
00:09:03--------------------------------------------------------------
       2576          sys_sched_yield()
          0(  0.00%) found (only) active queue empty on current cpu
          0(  0.00%) found (only) expired queue empty on current cpu
        118(  4.58%) found both queues empty on current cpu
       2458( 95.42%) found neither queue empty on current cpu


    23617887          schedule()
    1106774          goes idle
          0(  0.00%) switched active and expired queues
          0(  0.00%) used existing active queue

          0          active_load_balance()
          0          sched_balance_exec()

      0.19/1.10      avg runtime/latency over all cpus (ms)

[scheduler domain #0]
    1810988          load_balance()
     153509(  8.48%) called while idle
                         680(  0.44%) tried but failed to move any tasks
                      104906( 68.34%) found no busier group
                       47923( 31.22%) succeeded in moving at least one task
                                      (average imbalance:   1.658)
     317016( 17.51%) called while busy
                          30(  0.01%) tried but failed to move any tasks
                      217438( 68.59%) found no busier group
                       99548( 31.40%) succeeded in moving at least one task
                                      (average imbalance:   1.831)
    1340463( 74.02%) called when newly idle
                         762(  0.06%) tried but failed to move any tasks
                     1092960( 81.54%) found no busier group
                      246741( 18.41%) succeeded in moving at least one task
                                      (average imbalance:   1.564)

          0          sched_balance_exec() tried to push a task

[scheduler domain #1]
    1244187          load_balance()
     111326(  8.95%) called while idle
                        8396(  7.54%) tried but failed to move any tasks
                       71276( 64.02%) found no busier group
                       31654( 28.43%) succeeded in moving at least one task
                                      (average imbalance:   1.412)
      39138(  3.15%) called while busy
                         220(  0.56%) tried but failed to move any tasks
                       34676( 88.60%) found no busier group
                        4242( 10.84%) succeeded in moving at least one task
                                      (average imbalance:   1.360)
    1093723( 87.91%) called when newly idle
                       15971(  1.46%) tried but failed to move any tasks
                      932422( 85.25%) found no busier group
                      145330( 13.29%) succeeded in moving at least one task
                                      (average imbalance:   1.189)

          0          sched_balance_exec() tried to push a task



^ permalink raw reply	[flat|nested] 18+ messages in thread
[parent not found: <43715361.3070802@us.ibm.com>]

end of thread, other threads:[~2005-11-14 23:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-11-07 22:17 Database regression due to scheduler changes ? Brian Twichell
2005-11-07 22:35 ` David Lang
2005-11-07 23:06   ` Brian Twichell
2005-11-08  0:51     ` Nick Piggin
2005-11-08  1:15       ` Anton Blanchard
2005-11-08  1:34         ` Martin J. Bligh
2005-11-08  1:46           ` Nick Piggin
2005-11-08  1:48             ` Nick Piggin
2005-11-08  1:58             ` Martin J. Bligh
2005-11-08  2:04             ` David Lang
2005-11-08  2:12               ` Martin J. Bligh
2005-11-08  2:15               ` Nick Piggin
2005-11-09  5:03       ` Brian Twichell
     [not found]         ` <43718DFE.3040600@yahoo.com.au>
2005-11-14 23:03           ` Brian Twichell
2005-11-08  2:31   ` Byron Stanoszek
2005-11-07 22:47 ` linux-os (Dick Johnson)
2005-11-08  3:54   ` Nick Piggin
     [not found] <43715361.3070802@us.ibm.com>
2005-11-09  2:14 ` Andrew Theurer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox