public inbox for linux-kernel@vger.kernel.org
* Database regression due to scheduler changes ?
@ 2005-11-07 22:17 Brian Twichell
  2005-11-07 22:35 ` David Lang
  2005-11-07 22:47 ` linux-os (Dick Johnson)
  0 siblings, 2 replies; 18+ messages in thread
From: Brian Twichell @ 2005-11-07 22:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: mbligh, slpratt, anton

Hi,

We observed a 1.5% regression in an OLTP database workload going from
2.6.13-rc4 to 2.6.13-rc5.  The regression has been carried forward
at least as far as 2.6.14-rc5.

Through experimentation, and through examining the changes that
went into 2.6.13-rc5, we found that we can eliminate the regression
in 2.6.13-rc5 with one straightforward change:  eliminating the
NUMA level from the CPU scheduler domain structures.

After observing this, we collected schedstats (provided below)
to try to determine how the scheduler behaves differently
when the NUMA level is eliminated.  It appears to us that
the scheduler is having more success in balancing in this
case.  We tried to duplicate this effect by changing parameters
in the NUMA-level and SMP-level domain definitions to
increase the aggressiveness of the balancing, but none of the
changes recovered the lost performance.

We suspect the regression was introduced in the scheduler changes
that went into 2.6.13-rc1.  However, the regression was hidden
from us by a bug in include/asm-ppc64/topology.h that made ppc64
look non-NUMA from 2.6.13-rc1 through 2.6.13-rc4.  That bug was
fixed in 2.6.13-rc5.  Unfortunately the workload does not run to
completion on 2.6.12 or 2.6.13-rc1.  We have measurements on
2.6.12-rc6-git7 that do not show the regression.

One alternative for fixing this in 2.6.13 would have been to #define
ARCH_HAS_SCHED_DOMAINS and to introduce a ppc64-specific version
of build_sched_domains that eliminates the NUMA-level domain for
small (e.g. 4-way) ppc64 systems.  However, ARCH_HAS_SCHED_DOMAINS
has been eliminated from 2.6.14, and anyway that solution doesn't
seem very general to me.

So, at this point I am soliciting assistance from scheduler experts
to determine how this regression can be eliminated.  We are keen
to prevent this regression from going into the next distro versions.
Simply shipping a distro kernel with CONFIG_NUMA off isn't a viable
option because we need it for our larger configurations.

Our system configuration is a 4-way 1.9 GHz Power5-based server.  As
the system supports SMT, it shows eight online CPUs.

Below are the schedstats.  The first set is with the NUMA-level
domain, while the second set is without the NUMA-level domain.

Cheers,
Brian Twichell

Schedstats (NUMA-level domain included)
----------------------------------------------------------------------
00:09:05--------------------------------------------------------------
       2845          sys_sched_yield()
          0(  0.00%) found (only) active queue empty on current cpu
          0(  0.00%) found (only) expired queue empty on current cpu
        157(  5.52%) found both queues empty on current cpu
       2688( 94.48%) found neither queue empty on current cpu


    23287180          schedule()
          1(  0.00%) switched active and expired queues
          0(  0.00%) used existing active queue

          0          active_load_balance()
          0          sched_balance_exec()

      0.19/1.17      avg runtime/latency over all cpus (ms)

[scheduler domain #0]
    1418943          load_balance()
     112240(  7.91%) called while idle
                         499(  0.44%) tried but failed to move any tasks
                       80433( 71.66%) found no busier group
                       31308( 27.89%) succeeded in moving at least one task
                                      (average imbalance:   1.549)
     316022( 22.27%) called while busy
                          21(  0.01%) tried but failed to move any tasks
                      220440( 69.75%) found no busier group
                       95561( 30.24%) succeeded in moving at least one task
                                      (average imbalance:   1.727)
     990681( 69.82%) called when newly idle
                         533(  0.05%) tried but failed to move any tasks
                      808816( 81.64%) found no busier group
                      181332( 18.30%) succeeded in moving at least one task
                                      (average imbalance:   1.500)

          0          sched_balance_exec() tried to push a task

[scheduler domain #1]
     922193          load_balance()
      85822(  9.31%) called while idle
                        4032(  4.70%) tried but failed to move any tasks
                       70982( 82.71%) found no busier group
                       10808( 12.59%) succeeded in moving at least one task
                                      (average imbalance:   1.348)
      27022(  2.93%) called while busy
                         106(  0.39%) tried but failed to move any tasks
                       25478( 94.29%) found no busier group
                        1438(  5.32%) succeeded in moving at least one task
                                      (average imbalance:   1.712)
     809349( 87.76%) called when newly idle
                        6967(  0.86%) tried but failed to move any tasks
                      757097( 93.54%) found no busier group
                       45285(  5.60%) succeeded in moving at least one task
                                      (average imbalance:   1.338)

          0          sched_balance_exec() tried to push a task

[scheduler domain #2]
     825662          load_balance()
      52074(  6.31%) called while idle
                       17791( 34.16%) tried but failed to move any tasks
                       32839( 63.06%) found no busier group
                        1444(  2.77%) succeeded in moving at least one task
                                      (average imbalance:   1.981)
       9524(  1.15%) called while busy
                        1072( 11.26%) tried but failed to move any tasks
                        7654( 80.37%) found no busier group
                         798(  8.38%) succeeded in moving at least one task
                                      (average imbalance:   2.976)
     764064( 92.54%) called when newly idle
                      262831( 34.40%) tried but failed to move any tasks
                      409353( 53.58%) found no busier group
                       91880( 12.03%) succeeded in moving at least one task
                                      (average imbalance:   2.518)

          0          sched_balance_exec() tried to push a task


Schedstats (NUMA-level domain eliminated)
----------------------------------------------------------------------
00:09:03--------------------------------------------------------------
       2576          sys_sched_yield()
          0(  0.00%) found (only) active queue empty on current cpu
          0(  0.00%) found (only) expired queue empty on current cpu
        118(  4.58%) found both queues empty on current cpu
       2458( 95.42%) found neither queue empty on current cpu


    23617887          schedule()
    1106774          goes idle
          0(  0.00%) switched active and expired queues
          0(  0.00%) used existing active queue

          0          active_load_balance()
          0          sched_balance_exec()

      0.19/1.10      avg runtime/latency over all cpus (ms)

[scheduler domain #0]
    1810988          load_balance()
     153509(  8.48%) called while idle
                         680(  0.44%) tried but failed to move any tasks
                      104906( 68.34%) found no busier group
                       47923( 31.22%) succeeded in moving at least one task
                                      (average imbalance:   1.658)
     317016( 17.51%) called while busy
                          30(  0.01%) tried but failed to move any tasks
                      217438( 68.59%) found no busier group
                       99548( 31.40%) succeeded in moving at least one task
                                      (average imbalance:   1.831)
    1340463( 74.02%) called when newly idle
                         762(  0.06%) tried but failed to move any tasks
                     1092960( 81.54%) found no busier group
                      246741( 18.41%) succeeded in moving at least one task
                                      (average imbalance:   1.564)

          0          sched_balance_exec() tried to push a task

[scheduler domain #1]
    1244187          load_balance()
     111326(  8.95%) called while idle
                        8396(  7.54%) tried but failed to move any tasks
                       71276( 64.02%) found no busier group
                       31654( 28.43%) succeeded in moving at least one task
                                      (average imbalance:   1.412)
      39138(  3.15%) called while busy
                         220(  0.56%) tried but failed to move any tasks
                       34676( 88.60%) found no busier group
                        4242( 10.84%) succeeded in moving at least one task
                                      (average imbalance:   1.360)
    1093723( 87.91%) called when newly idle
                       15971(  1.46%) tried but failed to move any tasks
                      932422( 85.25%) found no busier group
                      145330( 13.29%) succeeded in moving at least one task
                                      (average imbalance:   1.189)

          0          sched_balance_exec() tried to push a task



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Database regression due to scheduler changes ?
  2005-11-07 22:17 Database regression due to scheduler changes ? Brian Twichell
@ 2005-11-07 22:35 ` David Lang
  2005-11-07 23:06   ` Brian Twichell
  2005-11-08  2:31   ` Byron Stanoszek
  2005-11-07 22:47 ` linux-os (Dick Johnson)
  1 sibling, 2 replies; 18+ messages in thread
From: David Lang @ 2005-11-07 22:35 UTC (permalink / raw)
  To: Brian Twichell; +Cc: linux-kernel, mbligh, slpratt, anton

Brian,
   If I am understanding the data you posted, it looks like you are
using sched_yield extensively in your database. This is known to have
significant problems on SMP machines, and even bigger ones on NUMA
machines, in part because the process doing the sched_yield may get
rescheduled immediately and not allow other processes to run (to free
up whatever resource it's waiting for). This causes the processor to
look busy to the scheduler, and therefore the scheduler doesn't migrate
other processes to the CPU that's spinning on sched_yield. On NUMA
machines this is even more noticeable, as processes now have to migrate
through an additional layer of the scheduler.

Have you tried eliminating the sched_yield to see what difference it
makes?

David Lang


-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare


* Re: Database regression due to scheduler changes ?
  2005-11-07 22:17 Database regression due to scheduler changes ? Brian Twichell
  2005-11-07 22:35 ` David Lang
@ 2005-11-07 22:47 ` linux-os (Dick Johnson)
  2005-11-08  3:54   ` Nick Piggin
  1 sibling, 1 reply; 18+ messages in thread
From: linux-os (Dick Johnson) @ 2005-11-07 22:47 UTC (permalink / raw)
  To: Brian Twichell; +Cc: linux-kernel, mbligh, slpratt, anton


On Mon, 7 Nov 2005, Brian Twichell wrote:

> Hi,
>
> We observed a 1.5% regression in an OLTP database workload going from
> 2.6.13-rc4 to 2.6.13-rc5.  The regression has been carried forward
> at least as far as 2.6.14-rc5.
>
> Through experimentation, and through examining the changes that
> went into 2.6.13-rc5, we found that we can eliminate the regression
> in 2.6.13-rc5 with one straightforward change:  eliminating the
> NUMA level from the CPU scheduler domain structures.
>
> After observing this, we collected schedstats (provided below)
> to try to determine how the scheduler behaves differently
> when the NUMA level is eliminated.  It appears to us that
> the scheduler is having more success in balancing in this
> case.  We tried to duplicate this effect by changing parameters
> in the NUMA-level and SMP-level domain definitions to
> increase the aggressiveness of the balancing, but none of the
> changes could recoup the regression.
>
> We suspect the regression was introduced in the scheduler changes
> that went into 2.6.13-rc1.  However, the regression was hidden
> from us by a bug in include/asm-ppc64/topology.h that made ppc64
> look non-NUMA from 2.6.13-rc1 through 2.6.13-rc4.  That bug was
> fixed in 2.6.13-rc5.  Unfortunately the workload does not run to
> completion on 2.6.12 or 2.6.13-rc1.  We have measurements on
> 2.6.12-rc6-git7 that do not show the regression.
>
> One alternative for fixing this in 2.6.13 would have been to #define
> ARCH_HAS_SCHED_DOMAINS and to introduce a ppc64-specific version
> of build_sched_domains that eliminates the NUMA-level domain for
> small (e.g. 4-way) ppc64 systems.  However, ARCH_HAS_SCHED_DOMAINS
> has been eliminated from 2.6.14, and anyway that solution doesn't
> seem very general to me.
>
> So, at this point I am soliciting assistance from scheduler experts
> to determine how this regression can be eliminated.  We are keen
> to prevent this regression from going into the next distro versions.
> Simply shipping a distro kernel with CONFIG_NUMA off isn't a viable
> option because we need it for our larger configurations.
>
> Our system configuration is a 4-way 1.9 GHz Power5-based server.  As
> the system supports SMT, it shows eight online CPUs.
>
> Below are the schedstats.  The first set is with the NUMA-level
> domain, while the second set is without the NUMA-level domain.
>
> Cheers,
> Brian Twichell
>
> [schedstats snipped]

Can you change sched_yield() to usleep(1) or usleep(0) and see if
that works?  I found that in recent kernels sched_yield() just seems
to spin (it may not actually spin, but it behaves that way, with high
CPU usage).

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.



* Re: Database regression due to scheduler changes ?
  2005-11-07 22:35 ` David Lang
@ 2005-11-07 23:06   ` Brian Twichell
  2005-11-08  0:51     ` Nick Piggin
  2005-11-08  2:31   ` Byron Stanoszek
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Twichell @ 2005-11-07 23:06 UTC (permalink / raw)
  To: David Lang; +Cc: linux-kernel, mbligh, slpratt, anton

David Lang wrote:

>   If I am understanding the data you posted, it looks like you are
> using sched_yield extensively in your database.

Yes, I've seen problems in the past with workloads that use sched_yield 
heavily.

But bear in mind, the ~2700 sched_yields shown in the schedstats
occurred over a 9-minute period.  That means sched_yield is being
called at a rate of around 5 per second -- this is not a heavy user
of sched_yield.

To put this into broader perspective, this workload has around 270
tasks, and the context switch rate is around 45,000 per second.





* Re: Database regression due to scheduler changes ?
  2005-11-07 23:06   ` Brian Twichell
@ 2005-11-08  0:51     ` Nick Piggin
  2005-11-08  1:15       ` Anton Blanchard
  2005-11-09  5:03       ` Brian Twichell
  0 siblings, 2 replies; 18+ messages in thread
From: Nick Piggin @ 2005-11-08  0:51 UTC (permalink / raw)
  To: Brian Twichell; +Cc: David Lang, linux-kernel, mbligh, slpratt, anton

Brian Twichell wrote:
> David Lang wrote:
> 
>>   If I am understanding the data you posted, it looks like you are
>> using sched_yield extensively in your database.
> 
> 
> Yes, I've seen problems in the past with workloads that use sched_yield 
> heavily.
> 
> But bear in mind, the ~2700 sched_yields shown in the schedstats 
> occurred over a 9 minute period. That means that sched_yield is being 
> called at a rate of around 5 per second -- this is not a heavy user of 
> sched_yield.
> 
> To put this into a broader perspective, this workload has around 270 
> tasks, and the context switch rate is around
> 45,000 per second.
> 

Hi,

Thanks for your detailed report (and schedstats analysis). Sorry
I didn't see it until now.

I think you are right that the NUMA domain is probably being too
restrictive of task balancing, and that is where the regression
is coming from.

For some workloads it is definitely important to have the NUMA
domain, because it helps spread load over memory controllers as
well as CPUs - so I guess eliminating that domain is not a good
long term solution.

I would look at changing the parameters of SD_NODE_INIT in
include/asm-powerpc/topology.h so they are closer to the SD_CPU_INIT
parameters (i.e. more aggressive).

Reducing min_interval, max_interval, busy_factor, and cache_hot_time
will all do this.

I would also take a look at removing SD_WAKE_IDLE from the flags.
This flag should make balancing more aggressive, but it can have
problems when applied to a NUMA domain due to too much task
movement.
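
Concretely (and only as a sketch -- the numeric values below are
illustrative guesses, not tested tuning or the upstream defaults),
those two suggestions map onto edits to SD_NODE_INIT along these lines:

```c
/*
 * Hypothetical sketch for include/asm-powerpc/topology.h: pull the
 * NUMA-level domain parameters toward the more aggressive CPU-level
 * ones and drop SD_WAKE_IDLE.  Field names follow the 2.6-era
 * struct sched_domain initializers; values are for illustration only.
 */
#define SD_NODE_INIT (struct sched_domain) {			\
	/* ... other fields as before ... */			\
	.min_interval		= 1,	/* smaller: balance more often */ \
	.max_interval		= 4,					\
	.busy_factor		= 32,	/* lower: back off less when busy */ \
	.cache_hot_time		= (5*1000000/2), /* tasks go cache-cold sooner */ \
	.flags			= SD_LOAD_BALANCE		\
				| SD_BALANCE_NEWIDLE		\
				| SD_BALANCE_EXEC		\
				| SD_WAKE_BALANCE,		\
				/* SD_WAKE_IDLE removed */	\
	/* ... */						\
}
```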

I agree that sched_yield would be unlikely to be a problem at
those rates, and either way it doesn't explain the regression.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: Database regression due to scheduler changes ?
  2005-11-08  0:51     ` Nick Piggin
@ 2005-11-08  1:15       ` Anton Blanchard
  2005-11-08  1:34         ` Martin J. Bligh
  2005-11-09  5:03       ` Brian Twichell
  1 sibling, 1 reply; 18+ messages in thread
From: Anton Blanchard @ 2005-11-08  1:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Brian Twichell, David Lang, linux-kernel, mbligh, slpratt


Hi Nick,

> I would also take a look at removing SD_WAKE_IDLE from the flags.
> This flag should make balancing more aggressive, but it can have
> problems when applied to a NUMA domain due to too much task
> movement.

I was wondering how ppc64 ended up with different parameters in the NODE
definitions (added SD_BALANCE_NEWIDLE and SD_WAKE_IDLE) and it looks
like it was Andrew :)

http://lkml.org/lkml/2004/11/2/205

It looks like balancing was not aggressive enough on his workload too.
I'm a bit uneasy with only ppc64 having the two flags though.

I'm also considering adding balance-on-fork for ppc64; it seems like a
lot of people like to run STREAM-like benchmarks, and I'm getting tired
of telling them to lock their threads down to CPUs.

Anton


* Re: Database regression due to scheduler changes ?
  2005-11-08  1:15       ` Anton Blanchard
@ 2005-11-08  1:34         ` Martin J. Bligh
  2005-11-08  1:46           ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Martin J. Bligh @ 2005-11-08  1:34 UTC (permalink / raw)
  To: Anton Blanchard, Nick Piggin
  Cc: Brian Twichell, David Lang, linux-kernel, slpratt

>> I would also take a look at removing SD_WAKE_IDLE from the flags.
>> This flag should make balancing more aggressive, but it can have
>> problems when applied to a NUMA domain due to too much task
>> movement.
> 
> I was wondering how ppc64 ended up with different parameters in the NODE
> definitions (added SD_BALANCE_NEWIDLE and SD_WAKE_IDLE) and it looks
> like it was Andrew :)
> 
> http://lkml.org/lkml/2004/11/2/205
> 
> It looks like balancing was not aggressive enough on his workload too.
> I'm a bit uneasy with only ppc64 having the two flags though.
> 
> I'm also considering adding balance-on-fork for ppc64; it seems like a
> lot of people like to run STREAM-like benchmarks, and I'm getting tired
> of telling them to lock their threads down to CPUs.

Please don't screw up everything else just for stream. It's a silly 
frigging benchmark. There's very little real-world stuff that really
needs balance on fork, as opposed to balance on clone, and it'll slow
down everything else.

M.




* Re: Database regression due to scheduler changes ?
  2005-11-08  1:34         ` Martin J. Bligh
@ 2005-11-08  1:46           ` Nick Piggin
  2005-11-08  1:48             ` Nick Piggin
                               ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Nick Piggin @ 2005-11-08  1:46 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Anton Blanchard, Brian Twichell, David Lang, linux-kernel,
	slpratt

Martin J. Bligh wrote:

>>I'm also considering adding balance-on-fork for ppc64; it seems like a
>>lot of people like to run STREAM-like benchmarks, and I'm getting tired
>>of telling them to lock their threads down to CPUs.
> 
> 
> Please don't screw up everything else just for stream. It's a silly 
> frigging benchmark. There's very little real-world stuff that really
> needs balance on fork, as opposed to balance on clone, and it'll slow
> down everything else.
> 

Long-lived and memory-intensive cloned or forked tasks will often
[but far from always :(] want to be put on a different memory
controller from their siblings.

On workloads with lots of short-lived ones (some bloated
Java programs), the load balancer should normally detect this and
cut the balance-on-fork/clone.

Of course there are going to be cases where this fails. I haven't
seen significant slowdowns in tests, although I'm sure there would
be some at least small regressions. Have you seen any? Do you have
any tests in mind that might show a problem?

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.



* Re: Database regression due to scheduler changes ?
  2005-11-08  1:46           ` Nick Piggin
@ 2005-11-08  1:48             ` Nick Piggin
  2005-11-08  1:58             ` Martin J. Bligh
  2005-11-08  2:04             ` David Lang
  2 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2005-11-08  1:48 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Anton Blanchard, Brian Twichell, David Lang, linux-kernel,
	slpratt

Nick Piggin wrote:

[...]

> be some at least small regressions. Have you seen any? Do you have
> any tests in mind that might show a problem?
> 

To clarify, I'm not suggesting you should go one way or the other
for POWER4/5, but if you did have regressions I would be interested,
at least so I can try to help platforms that do use balance-on-clone.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.



* Re: Database regression due to scheduler changes ?
  2005-11-08  1:46           ` Nick Piggin
  2005-11-08  1:48             ` Nick Piggin
@ 2005-11-08  1:58             ` Martin J. Bligh
  2005-11-08  2:04             ` David Lang
  2 siblings, 0 replies; 18+ messages in thread
From: Martin J. Bligh @ 2005-11-08  1:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Anton Blanchard, Brian Twichell, David Lang, linux-kernel,
	slpratt

>>> I'm also considering adding balance-on-fork for ppc64; it seems like a
>>> lot of people like to run STREAM-like benchmarks, and I'm getting tired
>>> of telling them to lock their threads down to CPUs.
>> 
>> Please don't screw up everything else just for stream. It's a silly 
>> frigging benchmark. There's very little real-world stuff that really
>> needs balance on fork, as opposed to balance on clone, and it'll slow
>> down everything else.
> 
> Long-lived and memory-intensive cloned or forked tasks will often
> [but far from always :(] want to be put on a different memory
> controller from their siblings.
> 
> On workloads with lots of short-lived ones (some bloated
> Java programs), the load balancer should normally detect this and
> cut the balance-on-fork/clone.
> 
> Of course there are going to be cases where this fails. I haven't
> seen significant slowdowns in tests, although I'm sure there would
> be some at least small regressions. Have you seen any? Do you have
> any tests in mind that might show a problem?

Anything fork/exec-y should show it's slower. Most stuff either forks
and execs (in which case it's silly to do it twice, and much cheaper to
do it at exec time), or it's a clone, in which case a different set of
rules applies for what you want (and actually, I suspect fork w/o exec
is much the same).

Of course the pig is that you can't determine at fork whether it'll exec
or not, so you optimise for the common case, which is "do exec", unless
given a hint otherwise.

For clone, and I suspect fork w/o exec, you have a tightly coupled 
group of processes that really would like to be close to each other.
If you have 1 app on the whole system, you *may* want it spread across
the system. If you have nr_apps >= nr_nodes, you probably want them
node local. Determining which workload you have is messy, and may
change.

Tweak the freak benchmark, not everything else ;-)

M.



* Re: Database regression due to scheduler changes ?
  2005-11-08  1:46           ` Nick Piggin
  2005-11-08  1:48             ` Nick Piggin
  2005-11-08  1:58             ` Martin J. Bligh
@ 2005-11-08  2:04             ` David Lang
  2005-11-08  2:12               ` Martin J. Bligh
  2005-11-08  2:15               ` Nick Piggin
  2 siblings, 2 replies; 18+ messages in thread
From: David Lang @ 2005-11-08  2:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin J. Bligh, Anton Blanchard, Brian Twichell, linux-kernel,
	slpratt

On Tue, 8 Nov 2005, Nick Piggin wrote:

> Martin J. Bligh wrote:
>
>>> I'm also considering adding balance on fork for ppc64; it seems like a
>>> lot of people like to run stream-like benchmarks and I'm getting tired of
>>> telling them to lock their threads down to CPUs.
>> 
>> 
>> Please don't screw up everything else just for stream. It's a silly 
>> frigging benchmark. There's very little real-world stuff that really
>> needs balance on fork, as opposed to balance on clone, and it'll slow
>> down everything else.
>> 
>
> Long lived and memory intensive cloned or forked tasks will often
> [but far from always :(] want to be put on another memory controller
> from their siblings.
>
> On workloads where there are lots of short lived ones (some bloated
> java programs), the load balancer should normally detect this and
> cut the balance-on-fork/clone.

although if the primary workload is short-lived tasks and you don't do 
balance-on-fork/clone, won't you have trouble ever balancing things? 
(anything that you do move over will probably exit quickly and put you 
right back where you started)

at the risk of a slowdown from an extra test, it almost sounds like what 
is needed is to get feedback from the last scheduled balance attempt and 
use that to decide per-fork what to do.

for example, say the scheduled balance attempt leaves a per-cpu value 
that has its high bit tested every fork/clone (and then rotated left 1 
bit), and if it's a 1, do a balance for this new process.

with a reasonably sized item (I would guess the default int size would 
probably be the most efficient to process, but even 8 bits may be enough) 
the scheduled balance attempt can leave quite an extensive range of 
behavior, from 'always balance' to 'never balance' to 'balance every 5th 
and 8th fork', etc.
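[Editor's note: the rotated-pattern scheme described above can be sketched in a few lines of C. Everything here is invented for illustration — the names and the per-cpu handling are hypothetical, not from any actual patch.]

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the scheme described above: the periodic balancer leaves a
 * per-cpu pattern word; every fork/clone tests the high bit to decide
 * whether to balance the new task, then rotates the word left one bit
 * so the pattern repeats. */
static uint8_t balance_pattern;   /* would be per-cpu in a real patch */

/* Periodic balancer sets policy: 0xff = always balance,
 * 0x00 = never balance, 0x88 = roughly every 4th fork, and so on. */
static void set_balance_pattern(uint8_t p)
{
	balance_pattern = p;
}

/* Called on each fork/clone: test the high bit, then rotate left. */
static int should_balance_on_fork(void)
{
	int balance = (balance_pattern & 0x80) != 0;

	balance_pattern = (uint8_t)((balance_pattern << 1) |
				    (balance_pattern >> 7));
	return balance;
}
```

With an 8-bit word the pattern repeats every 8 forks; a full int would allow the longer cycles mentioned above.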

> Of course there are going to be cases where this fails. I haven't
> seen significant slowdowns in tests, although I'm sure there would
> be some at least small regressions. Have you seen any? Do you have
> any tests in mind that might show a problem?

even though people will point out that it's a brain-dead workload (that 
should be converted to a state machine), I would expect that most 
fork-per-connection servers would show problems if the work per connection 
is small

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare


* Re: Database regression due to scheduler changes ?
  2005-11-08  2:04             ` David Lang
@ 2005-11-08  2:12               ` Martin J. Bligh
  2005-11-08  2:15               ` Nick Piggin
  1 sibling, 0 replies; 18+ messages in thread
From: Martin J. Bligh @ 2005-11-08  2:12 UTC (permalink / raw)
  To: David Lang, Nick Piggin
  Cc: Anton Blanchard, Brian Twichell, linux-kernel, slpratt



--On Monday, November 07, 2005 18:04:23 -0800 David Lang <david.lang@digitalinsight.com> wrote:

> On Tue, 8 Nov 2005, Nick Piggin wrote:
> 
>> Martin J. Bligh wrote:
>> 
>>>> I'm also considering adding balance on fork for ppc64; it seems like a
>>>> lot of people like to run stream-like benchmarks and I'm getting tired of
>>>> telling them to lock their threads down to CPUs.
>>> 
>>> 
>>> Please don't screw up everything else just for stream. It's a silly 
>>> frigging benchmark. There's very little real-world stuff that really
>>> needs balance on fork, as opposed to balance on clone, and it'll slow
>>> down everything else.
>>> 
>> 
>> Long lived and memory intensive cloned or forked tasks will often
>> [but far from always :(] want to be put on another memory controller
>> from their siblings.
>> 
>> On workloads where there are lots of short lived ones (some bloated
>> java programs), the load balancer should normally detect this and
>> cut the balance-on-fork/clone.
> 
> although if the primary workload is short-lived tasks and you don't do
> balance-on-fork/clone won't you have trouble ever balancing things?
> (anything that you do move over will probably exit quickly and put you
> right back where you started)

If you fork without execing a lot, with no hints, and they all exit
quickly, then yes. But I don't think that's a common workload ;-)

> at the risk of a slowdown from an extra test it almost sounds like what is needed is to get feedback from the last scheduled balance attempt and use that to decide per-fork what to do.
> 
> for example say the scheduled balance attempt leaves a per-cpu value that has its high bit tested every fork/clone (and then rotated left 1 bit) and if it's a 1 do a balance for this new process.
> 
> with a reasonably sized item (I would guess the default int size would probably be the most efficient to process, but even 8 bits may be enough) the scheduled balance attempt can leave quite an extensive range of behavior, from 'always balance' to 'never balance' to 'balance every 5th and 8th fork', etc.

That might work, yes. But I'd prefer to see a real workload that
suffers before worrying about it too much. You have something in mind?

>> Of course there are going to be cases where this fails. I haven't
>> seen significant slowdowns in tests, although I'm sure there would
>> be some at least small regressions. Have you seen any? Do you have
>> any tests in mind that might show a problem?
> 
> even though people will point out that it's a brain-dead workload (that should be converted to a state machine) I would expect that most fork-per-connection servers would show problems if the work per connection is small

I suspect most of those are either inetd (which execs) or multiple servers
that service requests by now. Maybe not. Threads might be quicker if
it's heavy anyway ;-)

M.


* Re: Database regression due to scheduler changes ?
  2005-11-08  2:04             ` David Lang
  2005-11-08  2:12               ` Martin J. Bligh
@ 2005-11-08  2:15               ` Nick Piggin
  1 sibling, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2005-11-08  2:15 UTC (permalink / raw)
  To: David Lang
  Cc: Martin J. Bligh, Anton Blanchard, Brian Twichell, linux-kernel,
	slpratt

David Lang wrote:
> On Tue, 8 Nov 2005, Nick Piggin wrote:
> 
>>
>> Long lived and memory intensive cloned or forked tasks will often
>> [but far from always :(] want to be put on another memory controller
>> from their siblings.
>>
>> On workloads where there are lots of short lived ones (some bloated
>> java programs), the load balancer should normally detect this and
>> cut the balance-on-fork/clone.
> 
> 
> although if the primary workload is short-lived tasks and you don't do 
> balance-on-fork/clone, won't you have trouble ever balancing things? 
> (anything that you do move over will probably exit quickly and put you 
> right back where you started)
> 

You'll have no trouble if things *need* to be balanced, because
that would imply the runqueue length average is significantly
above the lengths of other runqueues.

As far as the extra test goes, it's really a minuscule overhead
compared with the fork / clone cost itself, and can be really
worthwhile if we get it right.

> 
>> Of course there are going to be cases where this fails. I haven't
>> seen significant slowdowns in tests, although I'm sure there would
>> be some at least small regressions. Have you seen any? Do you have
>> any tests in mind that might show a problem?
> 
> 
> even though people will point out that it's a brain-dead workload (that 
> should be converted to a state machine) I would expect that most 
> fork-per-connection servers would show problems if the work per 
> connection is small
> 

Well it may be brain-dead, but if people use them (and they do)
then I would really be interested to see results.

I did testing with some things like apache and volanomark, however
I was not able to make out much difference on my setups. Though
obviously that's not to say that there won't be with other software
or other workloads / architectures etc.

-- 
SUSE Labs, Novell Inc.



* Re: Database regression due to scheduler changes ?
  2005-11-07 22:35 ` David Lang
  2005-11-07 23:06   ` Brian Twichell
@ 2005-11-08  2:31   ` Byron Stanoszek
  1 sibling, 0 replies; 18+ messages in thread
From: Byron Stanoszek @ 2005-11-08  2:31 UTC (permalink / raw)
  To: David Lang; +Cc: Brian Twichell, linux-kernel, mbligh, slpratt, anton

On Mon, 7 Nov 2005, David Lang wrote:

> Brian,
>  If I am understanding the data you posted, it looks like you are using 
> sched_yield extensively in your database. This is known to have significant 
> problems on SMP machines, and even bigger ones on NUMA machines, in part 
> because the process doing the sched_yield may get rescheduled immediately and 
> not allow other processes to run (to free up whatever resource it's waiting 
> for). This causes the processor to look busy to the scheduler and therefore 
> the scheduler doesn't migrate other processes to the CPU that's spinning on 
> sched_yield. On NUMA machines this is even more noticeable as processes now 
> have to migrate through an additional layer of the scheduler.

I have an application designed on Linux where the only processes running are
'init' and those integral to the application. Each communicates using mutual
exclusion & semaphores across a shared file/memory backing.

The application was designed to mirror, as closely as possible, what Linux
intrinsically does: manage processes. There's only 1 thread per process, and
each process has a different executable for its own task.

One day I plan to extend this application across multiple CPUs using either SMP
or NUMA. Therefore a lot of the mutual exclusion routines I've coded in use
sched_yield().

What should I do instead to alleviate the problem of causing the processor to
look busy? In this case I _want_ other processes to be migrated over to the
CPU in order to free up the critical section faster.

A simple test using a 2-cpu SMP system resulted in sched_yield() being a lot
faster than using futexes, but I don't know for the NUMA case.

Best regards,
  -Byron

--
Byron Stanoszek                         Ph: (330) 644-3059
Systems Programmer                      Fax: (330) 644-8110
Commercial Timesharing Inc.             Email: byron@comtime.com


* Re: Database regression due to scheduler changes ?
  2005-11-07 22:47 ` linux-os (Dick Johnson)
@ 2005-11-08  3:54   ` Nick Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2005-11-08  3:54 UTC (permalink / raw)
  To: linux-os (Dick Johnson)
  Cc: Brian Twichell, linux-kernel, mbligh, slpratt, anton

linux-os (Dick Johnson) wrote:

> 
> Can you change sched_yield() to usleep(1) or usleep(0) and see if
> that works? I found that in recent kernels sched_yield() just seems
> to spin (it may not actually spin, but seems to, with high CPU usage).
> 

I've told you that it *does* spin and always has. Even with 2.4
kernels. In fact, it is *specified* to spin, anything else would
be a bug.

Caveat: it also yields the CPU, but only if there is another
runnable task with a higher priority (which is meaningless
between SCHED_OTHER tasks, though we try to do something sane
there too).

Secondly, Brian actually pinpointed the source of the
regression, and it is not sched_yield(), nor has sched_yield
changed since the regression. So wouldn't this just be a wild
goose chase?

Nick

-- 
SUSE Labs, Novell Inc.



* Re: Database regression due to scheduler changes ?
       [not found] <43715361.3070802@us.ibm.com>
@ 2005-11-09  2:14 ` Andrew Theurer
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Theurer @ 2005-11-09  2:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Theurer, nickpiggin, anton, tbrian

Nick wrote:

>> I would also take a look at removing SD_WAKE_IDLE from the flags.
>> This flag should make balancing more aggressive, but it can have
>> problems when applied to a NUMA domain due to too much task
>> movement.
>
> Anton wrote:
> I was wondering how ppc64 ended up with different parameters in the NODE
> definitions (added SD_BALANCE_NEWIDLE and SD_WAKE_IDLE)    and it looks
> like it was Andrew :)
>
> http://lkml.org/lkml/2004/11/2/205

FWIW I changed all arches, but most (except ppc) got changed back.  At 
the time we had data showing the more aggressive wake-idle and newidle 
balancing was good for things like OLTP.

Brian, do you have cpu util numbers and runqueue lengths for both tests?

>
> It looks like balancing was not aggressive enough on his workload too.
> I'm a bit uneasy with only ppc64 having the two flags though.

Brian wrote:

> We suspect the regression was introduced in the scheduler changes
> that went into 2.6.13-rc1.  However, the regression was hidden
> from us by a bug in include/asm-ppc64/topology.h that made ppc64
> look non-NUMA from 2.6.13-rc1 through 2.6.13-rc4.  That bug was
> fixed in 2.6.13-rc5.  Unfortunately the workload does not run to
> completion on 2.6.12 or 2.6.13-rc1.

Brian, I am not sure if you were thinking of a particular set of sched 
changes, but I suspect it might be one or more in the list below (my 
guess is the first and last).  Would it be possible to back out these 
change-sets from 2.6.13-rc5 and see if there is any difference?  FWIW, 
even if they do help, I am not suggesting, yet, that they should be 
reverted.  I am hoping there is some compromise that can work better in 
all situations.

-Andrew

commit cafb20c1f9976a70d633bb1e1c8c24eab00e4e80
Author: Nick Piggin <nickpiggin@yahoo.com.au>
Date:   Sat Jun 25 14:57:17 2005 -0700

    [PATCH] sched: no aggressive idle balancing
    
    Remove the very aggressive idle stuff that has recently gone into 2.6 - it is
    going against the direction we are trying to go.  Hopefully we can regain
    performance through other methods.
    
    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
Author: Nick Piggin <nickpiggin@yahoo.com.au>
Date:   Sat Jun 25 14:57:15 2005 -0700

    [PATCH] sched: tweak affine wakeups
    
    Do less affine wakeups.  We're trying to reduce dbt2-pgsql idle time
    regressions here...  make sure we don't move tasks the wrong way in an
    imbalance condition.  Also, remove the cache coldness requirement from the
    calculation - this seems to induce sharp cutoff points where behaviour will
    suddenly change on some workloads if the load creeps slightly over or under
    some point.  It is good for periodic balancing because in that case we
    otherwise have no other context to determine what task to move.
    
    But also make a minor tweak to "wake balancing" - the imbalance tolerance is
    now set at half the domain's imbalance, so we get the opportunity to do wake
    balancing before the more random periodic rebalancing gets performed.
    
    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

commit 7897986bad8f6cd50d6149345aca7f6480f49464
Author: Nick Piggin <nickpiggin@yahoo.com.au>
Date:   Sat Jun 25 14:57:13 2005 -0700

    [PATCH] sched: balance timers
    
    Do CPU load averaging over a number of different intervals.  Allow each
    interval to be chosen by sending a parameter to source_load and target_load.
    0 is instantaneous, idx > 0 returns a decaying average with the most recent
    sample weighted at 2^(idx-1).  To a maximum of 3 (could be easily increased).
    
    So generally a higher number will result in more conservative balancing.
    
    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
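[Editor's note: the averaging rule this changelog describes can be illustrated with a small, self-contained sketch. This is a toy model with invented names — the real code updates rq->cpu_load[] in the scheduler tick:]

```c
#include <assert.h>

/* Decaying CPU-load averages as described in the changelog: index 0 is
 * the instantaneous load; for idx > 0 each tick folds the new sample
 * into the old average, weighting the history (scale - 1) parts in
 * scale, where scale = 2^idx.  Higher idx therefore decays more slowly
 * and yields more conservative balancing decisions. */
#define NR_IDX 4

static unsigned long cpu_load[NR_IDX];

static void update_cpu_load(unsigned long this_load)
{
	cpu_load[0] = this_load;	/* instantaneous */
	for (int idx = 1; idx < NR_IDX; idx++) {
		unsigned long scale = 1UL << idx;

		cpu_load[idx] = (cpu_load[idx] * (scale - 1) +
				 this_load) / scale;
	}
}
```

source_load()/target_load() can then pick a smoother or sharper view of the load simply by choosing the index, which is what the per-domain *_idx parameters select.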

commit 99b61ccf0bf0e9a85823d39a5db6a1519caeb13d
Author: Nick Piggin <nickpiggin@yahoo.com.au>
Date:   Sat Jun 25 14:57:12 2005 -0700

    [PATCH] sched: less aggressive idle balancing
    
    Remove the special casing for idle CPU balancing.  Things like this are
    hurting for example on SMT, where a single sibling being idle doesn't really
    warrant a really aggressive pull over the NUMA domain, for example.
    
    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>






* Re: Database regression due to scheduler changes ?
  2005-11-08  0:51     ` Nick Piggin
  2005-11-08  1:15       ` Anton Blanchard
@ 2005-11-09  5:03       ` Brian Twichell
       [not found]         ` <43718DFE.3040600@yahoo.com.au>
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Twichell @ 2005-11-09  5:03 UTC (permalink / raw)
  To: Nick Piggin; +Cc: David Lang, linux-kernel, mbligh, slpratt, anton

Nick Piggin wrote:

>
> I think you are right that the NUMA domain is probably being too
> constrictive of task balancing, and that is where the regression
> is coming from.
>
> For some workloads it is definitely important to have the NUMA
> domain, because it helps spread load over memory controllers as
> well as CPUs - so I guess eliminating that domain is not a good
> long term solution.
>
> I would look at changing parameters of SD_NODE_INIT in include/
> asm-powerpc/topology.h so they are closer to SD_CPU_INIT parameters
> (ie. more aggressive).

I ran with the following:

--- topology.h.orig     2005-11-08 13:11:57.000000000 -0600
+++ topology.h  2005-11-08 13:17:15.000000000 -0600
@@ -43,11 +43,11 @@ static inline int node_to_first_cpu(int
        .span                   = CPU_MASK_NONE,        \
        .parent                 = NULL,                 \
        .groups                 = NULL,                 \
-       .min_interval           = 8,                    \
-       .max_interval           = 32,                   \
-       .busy_factor            = 32,                   \
+       .min_interval           = 1,                    \
+       .max_interval           = 4,                    \
+       .busy_factor            = 64,                   \
        .imbalance_pct          = 125,                  \
-       .cache_hot_time         = (10*1000000),         \
+       .cache_hot_time         = (5*1000000/2),        \
        .cache_nice_tries       = 1,                    \
        .per_cpu_gain           = 100,                  \
        .flags                  = SD_LOAD_BALANCE       \

There was no improvement in performance.  The schedstats from this run 
follow:

       2516          sys_sched_yield()
          0(  0.00%) found (only) active queue empty on current cpu
          0(  0.00%) found (only) expired queue empty on current cpu
         46(  1.83%) found both queues empty on current cpu
       2470( 98.17%) found neither queue empty on current cpu


    22969106          schedule()
     694922          goes idle
          3(  0.00%) switched active and expired queues
          0(  0.00%) used existing active queue

          0          active_load_balance()
          0          sched_balance_exec()

      0.19/1.28      avg runtime/latency over all cpus (ms)

[scheduler domain #0]
    1153606          load_balance()
      82580(  7.16%) called while idle
                         488(  0.59%) tried but failed to move any tasks
                       63876( 77.35%) found no busier group
                       18216( 22.06%) succeeded in moving at least one task
                                      (average imbalance:   1.526)
     317610( 27.53%) called while busy
                          15(  0.00%) tried but failed to move any tasks
                      220139( 69.31%) found no busier group
                       97456( 30.68%) succeeded in moving at least one task
                                      (average imbalance:   1.752)
     753416( 65.31%) called when newly idle
                         487(  0.06%) tried but failed to move any tasks
                      624132( 82.84%) found no busier group
                      128797( 17.10%) succeeded in moving at least one task
                                      (average imbalance:   1.531)

          0          sched_balance_exec() tried to push a task

[scheduler domain #1]
     715638          load_balance()
      68533(  9.58%) called while idle
                        3140(  4.58%) tried but failed to move any tasks
                       60357( 88.07%) found no busier group
                        5036(  7.35%) succeeded in moving at least one task
                                      (average imbalance:   1.251)
      22486(  3.14%) called while busy
                          64(  0.28%) tried but failed to move any tasks
                       21352( 94.96%) found no busier group
                        1070(  4.76%) succeeded in moving at least one task
                                      (average imbalance:   1.922)
     624619( 87.28%) called when newly idle
                        5218(  0.84%) tried but failed to move any tasks
                      591970( 94.77%) found no busier group
                       27431(  4.39%) succeeded in moving at least one task
                                      (average imbalance:   1.382)

          0          sched_balance_exec() tried to push a task

[scheduler domain #2]
     685164          load_balance()
      63247(  9.23%) called while idle
                        7280( 11.51%) tried but failed to move any tasks
                       52200( 82.53%) found no busier group
                        3767(  5.96%) succeeded in moving at least one task
                                      (average imbalance:   1.361)
      24729(  3.61%) called while busy
                         418(  1.69%) tried but failed to move any tasks
                       21025( 85.02%) found no busier group
                        3286( 13.29%) succeeded in moving at least one task
                                      (average imbalance:   3.579)
     597188( 87.16%) called when newly idle
                       67577( 11.32%) tried but failed to move any tasks
                      371377( 62.19%) found no busier group
                      158234( 26.50%) succeeded in moving at least one task
                                      (average imbalance:   2.146)

          0          sched_balance_exec() tried to push a task

>
> I would also take a look at removing SD_WAKE_IDLE from the flags.
> This flag should make balancing more aggressive, but it can have
> problems when applied to a NUMA domain due to too much task
> movement.

Independently of the run above, I ran with the following:

--- topology.h.orig     2005-11-08 19:32:19.000000000 -0600
+++ topology.h  2005-11-08 19:34:25.000000000 -0600
@@ -53,7 +53,6 @@ static inline int node_to_first_cpu(int
        .flags                  = SD_LOAD_BALANCE       \
                                | SD_BALANCE_EXEC       \
                                | SD_BALANCE_NEWIDLE    \
-                               | SD_WAKE_IDLE          \
                                | SD_WAKE_BALANCE,      \
        .last_balance           = jiffies,              \
        .balance_interval       = 1,                    \

There was no improvement in performance. 

I didn't expect any change in performance this time, because I
don't think the SD_WAKE_IDLE flag is effective in the NUMA
domain, due to the following code in wake_idle:

        for_each_domain(cpu, sd) {
                if (sd->flags & SD_WAKE_IDLE) {
                        cpus_and(tmp, sd->span, p->cpus_allowed);
                        for_each_cpu_mask(i, tmp) {
                                if (idle_cpu(i))
                                        return i;
                        }
                }
                else
                        break;
        }
 
If I read that loop correctly, it stops at the first domain
that doesn't have SD_WAKE_IDLE set, which is the CPU domain
(see SD_CPU_INIT), and thus it never gets to the NUMA domain.

Thanks for the suggestions, Nick.  Andrew raises some
good questions that I will address tomorrow.

Cheers,
Brian



* Re: Database regression due to scheduler changes ?
       [not found]         ` <43718DFE.3040600@yahoo.com.au>
@ 2005-11-14 23:03           ` Brian Twichell
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Twichell @ 2005-11-14 23:03 UTC (permalink / raw)
  To: Nick Piggin; +Cc: mbligh, anton, slpratt, habanero, linux-kernel

Nick Piggin wrote:

> Just one other thing - A couple of fields aren't actually getting
> initialised at all, which I didn't pick up on.
>
> This bug looks to have been due to a mismerge between the
> common asm-powerpc directory and one of my scheduler changes
> somewhere along the line.
>
> If you get time to try this out, that would be great.
>
>===================================================================
>--- linux-2.6.orig/include/asm-powerpc/topology.h	2005-11-09 16:43:16.000000000 +1100
>+++ linux-2.6/include/asm-powerpc/topology.h	2005-11-09 16:45:17.000000000 +1100
>@@ -51,6 +51,10 @@ static inline int node_to_first_cpu(int 
> 	.cache_hot_time		= (10*1000000),		\
> 	.cache_nice_tries	= 1,			\
> 	.per_cpu_gain		= 100,			\
>+	.busy_idx		= 3,			\
>+	.idle_idx		= 1,			\
>+	.newidle_idx		= 2,			\
>+	.wake_idx		= 1,			\
> 	.flags			= SD_LOAD_BALANCE	\
> 				| SD_BALANCE_EXEC	\
> 				| SD_BALANCE_NEWIDLE	\
>  
>
Nick,

That patch eliminates the regression on 2.6.13-rc5.  Thanks!
We are currently evaluating it with other workloads.

It also gives a boost on 2.6.14, but unfortunately we are still 1%
regressed on 2.6.14.  (The regression on 2.6.14 was larger than
the regression on 2.6.13-rc5.)  We're trying to isolate the 2.6.14
regression now.  I'll let you know if we isolate it to a
scheduler change.

Cheers,
Brian



end of thread, other threads:[~2005-11-14 23:03 UTC | newest]

Thread overview: 18+ messages
2005-11-07 22:17 Database regression due to scheduler changes ? Brian Twichell
2005-11-07 22:35 ` David Lang
2005-11-07 23:06   ` Brian Twichell
2005-11-08  0:51     ` Nick Piggin
2005-11-08  1:15       ` Anton Blanchard
2005-11-08  1:34         ` Martin J. Bligh
2005-11-08  1:46           ` Nick Piggin
2005-11-08  1:48             ` Nick Piggin
2005-11-08  1:58             ` Martin J. Bligh
2005-11-08  2:04             ` David Lang
2005-11-08  2:12               ` Martin J. Bligh
2005-11-08  2:15               ` Nick Piggin
2005-11-09  5:03       ` Brian Twichell
     [not found]         ` <43718DFE.3040600@yahoo.com.au>
2005-11-14 23:03           ` Brian Twichell
2005-11-08  2:31   ` Byron Stanoszek
2005-11-07 22:47 ` linux-os (Dick Johnson)
2005-11-08  3:54   ` Nick Piggin
     [not found] <43715361.3070802@us.ibm.com>
2005-11-09  2:14 ` Andrew Theurer
