public inbox for linux-kernel@vger.kernel.org
* Volanomark slows by 80% under CFS
@ 2007-07-27 22:01 Tim Chen
  2007-07-28  0:31 ` Chris Snook
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Tim Chen @ 2007-07-27 22:01 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1141 bytes --]

Ingo,

Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  
Benchmark was run on a 2 socket Core2 machine.

The change in scheduler treatment of sched_yield 
could play a part in changing Volanomark behavior.
In CFS, sched_yield is implemented by dequeueing and
requeueing a process.  The time a process has spent running
probably reduces the CPU time due to it by only a bit, so the
process could get re-queued pretty close to the head of the
queue, and may get scheduled again quickly if it is still
due a lot of CPU time.

It may make sense to queue the yielding process a bit
further back in the queue.  As an experiment, I made a
slight change that zeroes out wait_runtime (i.e. has the
process give up the CPU time due to it).  Let's put aside
for a second the gripe that Volanomark should have used a
better mechanism than sched_yield to coordinate threads.
With the change, Volanomark runs better and is only 40%
(instead of 80%) down from the old scheduler without CFS.

Of course we should not tune for Volanomark; this is
reference data.
What is your view on how CFS's sched_yield should behave?

Regards,
Tim




[-- Attachment #2: patch.sched_yield --]
[-- Type: text/plain, Size: 336 bytes --]

--- linux-2.6.23-rc1/kernel/sched_fair.c.orig	2007-07-27 09:39:11.000000000 -0700
+++ linux-2.6.23-rc1/kernel/sched_fair.c	2007-07-27 09:40:41.000000000 -0700
@@ -841,6 +841,7 @@
 	 * position within the tree:
 	 */
 	dequeue_entity(cfs_rq, &p->se, 0, now);
+	p->se.wait_runtime = 0;
 	enqueue_entity(cfs_rq, &p->se, 0, now);
 }
 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Volanomark slows by 80% under CFS
  2007-07-27 22:01 Volanomark slows by 80% under CFS Tim Chen
@ 2007-07-28  0:31 ` Chris Snook
  2007-07-28  0:59   ` Andrea Arcangeli
  2007-07-28 13:28   ` Volanomark slows by 80% under CFS Dmitry Adamushko
  2007-07-28  2:47 ` Rik van Riel
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Chris Snook @ 2007-07-28  0:31 UTC (permalink / raw)
  To: tim.c.chen; +Cc: mingo, linux-kernel

Tim Chen wrote:
> Ingo,
> 
> Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  
> Benchmark was run on a 2 socket Core2 machine.
> 
> The change in scheduler treatment of sched_yield 
> could play a part in changing Volanomark behavior.
> In CFS, sched_yield is implemented by dequeueing and
> requeueing a process.  The time a process has spent running
> probably reduces the CPU time due to it by only a bit, so the
> process could get re-queued pretty close to the head of the
> queue, and may get scheduled again quickly if it is still
> due a lot of CPU time.
> 
> It may make sense to queue the yielding process a bit
> further back in the queue.  As an experiment, I made a
> slight change that zeroes out wait_runtime (i.e. has the
> process give up the CPU time due to it).  Let's put aside
> for a second the gripe that Volanomark should have used a
> better mechanism than sched_yield to coordinate threads.
> With the change, Volanomark runs better and is only 40%
> (instead of 80%) down from the old scheduler without CFS.
> 
> Of course we should not tune for Volanomark; this is
> reference data.
> What is your view on how CFS's sched_yield should behave?
> 
> Regards,
> Tim

The primary purpose of sched_yield is for SCHED_FIFO realtime processes, 
where nothing else will run, ever, unless the running thread blocks or 
yields the CPU.  Under CFS, the yielding process will still be leftmost 
in the rbtree; otherwise it would have already been scheduled out.

Zeroing out wait_runtime on sched_yield strikes me as completely appropriate. 
If the process wanted to sleep a finite duration, it should actually call a 
sleep function, but sched_yield is essentially saying "I don't have anything 
else to do right now", so it's hardly fair to claim you've been waiting for your 
chance when you just gave it up.

As for the remaining 40% degradation, if Volanomark is using it for 
synchronization, the scheduler is probably cycling through threads until 
it gets to the one that actually wants to do work.  The O(1) scheduler 
will do this very quickly, whereas CFS has a bit more overhead.  
Interactivity boosting may have also helped the old scheduler find the 
right thread faster.

I think Volanomark is being pretty stupid, and deserves to run slowly, but 
there are legitimate reasons to want to call sched_yield in a 
non-SCHED_FIFO process.  If I'm performing multiple different calculations 
on the same set of data in multiple threads, and accessing the shared data 
in a linear fashion, I'd like to be able to have one thread give the 
others some CPU time so they all stay at the same point in the stream and 
improve cache hit rates, but this is only an optimization if I can do it 
without wasting CPU or gradually nicing myself into oblivion.  Having 
sched_yield zero out wait_runtime seems like an appropriate way to make 
this use case work to the extent possible.  Any user attempting such an 
optimization should have the good sense to do real work between 
sched_yield calls, to avoid calling the scheduler in a tight loop.

	-- Chris


* Re: Volanomark slows by 80% under CFS
  2007-07-28  0:31 ` Chris Snook
@ 2007-07-28  0:59   ` Andrea Arcangeli
  2007-07-28  3:43     ` pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS) Chris Snook
  2007-07-28 13:28   ` Volanomark slows by 80% under CFS Dmitry Adamushko
  1 sibling, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2007-07-28  0:59 UTC (permalink / raw)
  To: Chris Snook; +Cc: tim.c.chen, mingo, linux-kernel

On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote:
> I think Volanomark is being pretty stupid, and deserves to run slowly, but 

Indeed, any app doing what Volanomark does is pretty inefficient.

But this is not the point. I/O schedulers are pluggable to help
inefficient apps too. If apps were extremely smart they would all
use async I/O for their reads and there would be no need for an
anticipatory scheduler, just for an example.

The fact is that there's no technical reason why we shouldn't be able
to choose between CFS and O(1), at least at boot time.


* Re: Volanomark slows by 80% under CFS
  2007-07-27 22:01 Volanomark slows by 80% under CFS Tim Chen
  2007-07-28  0:31 ` Chris Snook
@ 2007-07-28  2:47 ` Rik van Riel
  2007-07-28 20:26   ` Dave Jones
  2007-07-28 12:36 ` Dmitry Adamushko
  2007-07-29 17:37 ` [patch] sched: yield debugging Ingo Molnar
  3 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2007-07-28  2:47 UTC (permalink / raw)
  To: tim.c.chen; +Cc: mingo, linux-kernel

Tim Chen wrote:
> Ingo,
> 
> Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  
> Benchmark was run on a 2 socket Core2 machine.
> 
> The change in scheduler treatment of sched_yield 
> could play a part in changing Volanomark behavior.
> In CFS, sched_yield is implemented by dequeueing and
> requeueing a process.  The time a process has spent running
> probably reduces the CPU time due to it by only a bit, so the
> process could get re-queued pretty close to the head of the
> queue, and may get scheduled again quickly if it is still
> due a lot of CPU time.

I wonder if this explains the 30% drop in top performance
seen with the MySQL sysbench benchmark when the scheduler
changed to CFS...

See http://people.freebsd.org/~jeff/sysbench.png

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS)
  2007-07-28  0:59   ` Andrea Arcangeli
@ 2007-07-28  3:43     ` Chris Snook
  2007-07-28  5:01       ` pluggable scheduler " Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Snook @ 2007-07-28  3:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: tim.c.chen, mingo, linux-kernel

Andrea Arcangeli wrote:
> On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote:
>> I think Volanomark is being pretty stupid, and deserves to run slowly, but 
> 
> Indeed, any app doing what Volanomark does is pretty inefficient.
> 
> But this is not the point. I/O schedulers are pluggable to help
> inefficient apps too. If apps were extremely smart they would all
> use async I/O for their reads and there would be no need for an
> anticipatory scheduler, just for an example.

I'm pretty sure the point of posting a patch that triples CFS performance on a 
certain benchmark and arguably improves the semantics of sched_yield was to 
improve CFS.  You have a point, but it is a point for a different thread.  I 
have taken the liberty of starting this thread for you.

> The fact is that there's no technical reason why we shouldn't be able
> to choose between CFS and O(1), at least at boot time.

Sure there is.  We can run a fully-functional POSIX OS without using any 
block devices at all.  We cannot run a fully-functional POSIX OS without 
a scheduler.  Any feature without which the OS cannot execute userspace 
code is so primitive that somewhere there is a device on which it will be 
impossible to debug a failure of that feature to initialize.  It is quite 
reasonable to insist on having only one implementation of such features 
in any given kernel build.

Whether or not these alternatives belong in the source tree as config-time 
options is a political question, but preserving boot-time debugging capability 
is a perfectly reasonable technical motivation.

	-- Chris


* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)
  2007-07-28  3:43     ` pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS) Chris Snook
@ 2007-07-28  5:01       ` Andrea Arcangeli
  2007-07-28  6:51         ` Chris Snook
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2007-07-28  5:01 UTC (permalink / raw)
  To: Chris Snook; +Cc: tim.c.chen, mingo, linux-kernel

On Fri, Jul 27, 2007 at 11:43:23PM -0400, Chris Snook wrote:
> I'm pretty sure the point of posting a patch that triples CFS performance 
> on a certain benchmark and arguably improves the semantics of sched_yield 
> was to improve CFS.  You have a point, but it is a point for a different 
> thread.  I have taken the liberty of starting this thread for you.

I've no real interest in starting or participating in flamewars
(especially the ones not backed by hard numbers). So I adjusted the
subject a bit in the hope the discussion will not degenerate as you
predicted, hope you don't mind.

I'm pretty sure the point of posting that email was to show the
remaining performance regression with the sched_yield fix applied
too. Given you considered my post both offtopic and inflammatory, I
guess you think it's possible and reasonably easy to fix that
remaining regression without a pluggable scheduler, right? So please
enlighten us on how you intend to achieve it.

Also consider that the other numbers likely used NPTL, so they
shouldn't be affected by sched_yield changes.

> Sure there is.  We can run a fully-functional POSIX OS without using any 
> block devices at all.  We cannot run a fully-functional POSIX OS without a 
> scheduler. Any feature without which the OS cannot execute userspace code 
>  is sufficiently primitive that somewhere there is a device on which it will 
> be impossible to debug if that feature fails to initialize.  It is quite 
> reasonable to insist on only having one implementation of such features in 
> any given kernel build.

Sounds like a red herring to me... There aren't just pluggable I/O
schedulers in the kernel; there are pluggable packet schedulers too
(see `tc qdisc`). And both are switchable at runtime (not just at boot
time).

Can you run your fully-functional POSIX OS without a packet scheduler
and without an I/O scheduler? I wonder where you are going to
read/write data without a hard disk and network.

Also, those pluggable things don't increase the risk of crashes much,
compared to the complexity of the schedulers.

> Whether or not these alternatives belong in the source tree as config-time 
> options is a political question, but preserving boot-time debugging 
> capability is a perfectly reasonable technical motivation.

The scheduler is invoked very late in the boot process (printk, the
serial console and kdb have been working for ages by the time the
scheduler kicks in), so it's fully debuggable (no debugger depends on
the scheduler; they run inside the NMI handler...), I don't really see
your point.

And even if there were a subtle bug in the scheduler, you'd never
trigger it at boot with so few tasks and so few context switches.


* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)
  2007-07-28  5:01       ` pluggable scheduler " Andrea Arcangeli
@ 2007-07-28  6:51         ` Chris Snook
  2007-07-30 18:49           ` Tim Chen
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Snook @ 2007-07-28  6:51 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: tim.c.chen, mingo, linux-kernel

Andrea Arcangeli wrote:
> On Fri, Jul 27, 2007 at 11:43:23PM -0400, Chris Snook wrote:
>> I'm pretty sure the point of posting a patch that triples CFS performance 
>> on a certain benchmark and arguably improves the semantics of sched_yield 
>> was to improve CFS.  You have a point, but it is a point for a different 
>> thread.  I have taken the liberty of starting this thread for you.
> 
> I've no real interest in starting or participating in flamewars
> (especially the ones not backed by hard numbers). So I adjusted the
> subject a bit in the hope the discussion will not degenerate as you
> predicted, hope you don't mind.

Not at all.  I clearly misread your tone.

> I'm pretty sure the point of posting that email was to show the
> remaining performance regression with the sched_yield fix applied
> too. Given you considered my post both offtopic and inflammatory, I
> guess you think it's possible and reasonably easy to fix that
> remaining regression without a pluggable scheduler, right? So please
> enlighten us on how you intend to achieve it.

There are four possibilities that are immediately obvious to me:

a) The remaining difference is due mostly to the algorithmic complexity 
of the rbtree algorithm in CFS.

If this is the case, we should be able to vary the test parameters (CPU 
count, thread count, etc.), graph the results, and see a roughly 
logarithmic divergence between the schedulers as some parameter(s) vary. 
If this is the problem, we may be able to fix it with data structure 
tweaks or optimized base cases, like how quicksort can be optimized by 
using insertion sort below a certain threshold.

b) The remaining difference is due mostly to how the scheduler handles 
volanomark.

vmstat can give us a comparison of context switches between O(1), CFS, 
and CFS+patch.  If the decrease in throughput correlates with an 
increase in context switches, we may be able to induce more O(1)-like 
behavior by charging tasks for context switch overhead.

c) The remaining difference is due mostly to how the scheduler handles 
something other than volanomark.

If context switch count is not the problem, context switch pattern still 
could be.  I doubt we'd see a 40% difference due to cache misses, but 
it's possible.  Fortunately, oprofile can sample based on cache misses, 
so we can debug this too.

d) The remaining difference is due mostly to some implementation detail 
in CFS.

It's possible there's some constant-factor overhead in CFS that is 
magnified heavily by the context switching volanomark deliberately 
induces.  If this is the case, oprofile sampling on clock cycles should 
catch it.

Tim --

	Since you're already set up to do this benchmarking, would you mind 
varying the parameters a bit and collecting vmstat data?  If you want to 
run oprofile too, that wouldn't hurt.

> Also consider that the other numbers likely used NPTL, so they
> shouldn't be affected by sched_yield changes.
> 
>> Sure there is.  We can run a fully-functional POSIX OS without using any 
>> block devices at all.  We cannot run a fully-functional POSIX OS without a 
>> scheduler. Any feature without which the OS cannot execute userspace code 
>>  is sufficiently primitive that somewhere there is a device on which it will 
>> be impossible to debug if that feature fails to initialize.  It is quite 
>> reasonable to insist on only having one implementation of such features in 
>> any given kernel build.
> 
> Sounds like a red herring to me... There aren't just pluggable I/O
> schedulers in the kernel; there are pluggable packet schedulers too
> (see `tc qdisc`). And both are switchable at runtime (not just at boot
> time).
> 
> Can you run your fully-functional POSIX OS without a packet scheduler
> and without an I/O scheduler? I wonder where you are going to
> read/write data without a hard disk and network.

If I'm missing both, I'm pretty screwed, but if either one is 
functional, I can send something out.

> Also, those pluggable things don't increase the risk of crashes much,
> compared to the complexity of the schedulers.
> 
>> Whether or not these alternatives belong in the source tree as config-time 
>> options is a political question, but preserving boot-time debugging 
>> capability is a perfectly reasonable technical motivation.
> 
> The scheduler is invoked very late in the boot process (printk, the
> serial console and kdb have been working for ages by the time the
> scheduler kicks in), so it's fully debuggable (no debugger depends on
> the scheduler; they run inside the NMI handler...), I don't really see
> your point.

I'm more concerned about embedded systems.  These are the same people 
who want userspace character drivers to control their custom hardware. 
Having the robot point to where it hurts is a lot more convenient than 
hooking up a JTAG debugger.

> And even if there were a subtle bug in the scheduler, you'd never
> trigger it at boot with so few tasks and so few context switches.

Sure, but it's the non-subtle bugs that worry me.  These are usually 
related to low-level hardware setup, so they could miss the mainstream 
developers and clobber unsuspecting embedded developers.

I acknowledge that debugging such problems shouldn't be terribly hard on 
mainstream systems, but some people are going to want to choose a single 
scheduler at build time and avoid the hassle.  If we can improve CFS to 
be regression-free, and I think we can if we give ourselves a few 
percent tolerance and keep tracking down the corner cases, the pluggable 
scheduler infrastructure will just be another disused feature.

	-- Chris


* Re: Volanomark slows by 80% under CFS
  2007-07-27 22:01 Volanomark slows by 80% under CFS Tim Chen
  2007-07-28  0:31 ` Chris Snook
  2007-07-28  2:47 ` Rik van Riel
@ 2007-07-28 12:36 ` Dmitry Adamushko
  2007-07-28 18:55   ` David Schwartz
  2007-07-29 17:37 ` [patch] sched: yield debugging Ingo Molnar
  3 siblings, 1 reply; 18+ messages in thread
From: Dmitry Adamushko @ 2007-07-28 12:36 UTC (permalink / raw)
  To: tim.c.chen; +Cc: Ingo Molnar, Linux Kernel

On 28/07/07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> [ ... ]
> It may make sense to queue the
> yielding process a bit further behind in the queue.
> I made a slight change by zeroing out wait_runtime
> (i.e. have the process gives
> up cpu time due for it to run) for experimentation.

But that's wrong. The 'wait_runtime' might have been negative at this
point (i.e. the task is in a negative 'run-time' balance wrt the
'etalon' nice-0 task). Your change ends up helping such a task actually
stay closer to the 'leftmost' element of the tree (or to be it), not
move "further behind in the queue" as you intend.

I don't know Volanomark's details, so I refrain from speculating on why
this change "improves" benchmark results (maybe some affected tasks
have positive 'wait_runtime's on average for this setup).

If you want to make sure (just for a test) that a yielding task is not
the leftmost (at least) for some short interval of time (likely to be
<= 1 tick), take a look at yield_task_fair() in e.g. cfs-v15.

> Volanomark runs better
> and is only 40% (instead of 80%) down from old scheduler
> without CFS.

40 or 80 % is still a huge regression.


>
> Regards,
> Tim
>

-- 
Best regards,
Dmitry Adamushko


* Re: Volanomark slows by 80% under CFS
  2007-07-28  0:31 ` Chris Snook
  2007-07-28  0:59   ` Andrea Arcangeli
@ 2007-07-28 13:28   ` Dmitry Adamushko
  1 sibling, 0 replies; 18+ messages in thread
From: Dmitry Adamushko @ 2007-07-28 13:28 UTC (permalink / raw)
  To: Chris Snook; +Cc: tim.c.chen, Ingo Molnar, Linux Kernel

On 28/07/07, Chris Snook <csnook@redhat.com> wrote:
> [ ... ]
>   Under CFS, the yielding process will still be leftmost in the rbtree,
> otherwise it would have already been scheduled out.

Not actually true. The position of the 'current' task within the
rb-tree is updated with a timer tick's frequency. Being called
somewhere in between 2 ticks, sched_yield() may trigger a reschedule
which would otherwise take place upon the next tick.

Moreover, 'scheduling granularity' may also take effect. e.g.
effectively, the yielding task's 'fair_key' was already != 'left_most'
upon the previous timer tick but an actual reschedule has been delayed
due to the 'scheduling granularity' taking effect... and sched_yield()
may trigger it.


> Zeroing out wait_runtime on sched_yield strikes me as completely appropriate.
> If the process wanted to sleep a finite duration, it should actually call a
> sleep function, but sched_yield is essentially saying "I don't have anything
> else to do right now", so it's hardly fair to claim you've been waiting for your
> chance when you just gave it up.

'wait_runtime' describes dynamic behavior of a task on the rq. It
doesn't matter what the task is about to do as 'wait_runtime' is
something it fully deserves. Note, 'wait_runtime' can be both positive
and negative, meaning credit/punishment appropriately.

When it's negative, 'zeroing it out' effectively means the task gets
helped -- in the sense that it doesn't get 'punished' for some amount
of time it actually spent running. Which is wrong.

One more thing: we don't take the time accounted to 'wait_runtime' out
of thin air.
E.g. sleepers get an additional bonus to their 'wait_runtime' upon a
wakeup _but_ the amount of "wait_runtime" == "a given bonus" will be
additionally subtracted from tasks which happen to run later on (grep
for "sleeper_bonus" in sched_fair.c). That is, the sum (of
additionally given/taken wait_runtime) is zero.

All in all, I doubt the "zeroing out wait_runtime on sched_yield"
thing is really appropriate.


>
>         -- Chris
>

-- 
Best regards,
Dmitry Adamushko


* RE: Volanomark slows by 80% under CFS
  2007-07-28 12:36 ` Dmitry Adamushko
@ 2007-07-28 18:55   ` David Schwartz
  0 siblings, 0 replies; 18+ messages in thread
From: David Schwartz @ 2007-07-28 18:55 UTC (permalink / raw)
  To: Linux-Kernel@Vger. Kernel. Org


> > Volanomark runs better
> > and is only 40% (instead of 80%) down from old scheduler
> > without CFS.

> 40 or 80 % is still a huge regression.
> Dmitry Adamushko

Can anyone explain precisely what Volanomark is doing? If it's something
dumb like "looping on sched_yield until the 'right' thread runs and finishes
what we're waiting for" then I think any regression can be ignored.

This applies if and only if CFS' sched_yield behavior is sane and Volano's
is insane.

A sane sched_yield implementation must do two things:

1) Reward processes that actually do yield most of their CPU time to another
process.

2) Make an effort to run every ready-to-run process at the same or higher
static priority level before re-scheduling this process. (That won't always
be possible due to SMP issues, but a reasonable effort is needed.)

If CFS is doing these two things, and Volanomark is looping on sched_yield
until the 'right thread' runs, then CFS is doing the right thing and
Volanomark isn't. Volanomark deserves to lose.

If CFS binds processes to processors more tightly and thus sched_yield can't
yield to a process that was planned to run on another CPU in the future,
that would be a legitimate complaint about CFS.

DS




* Re: Volanomark slows by 80% under CFS
  2007-07-28  2:47 ` Rik van Riel
@ 2007-07-28 20:26   ` Dave Jones
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Jones @ 2007-07-28 20:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: tim.c.chen, mingo, linux-kernel

On Fri, Jul 27, 2007 at 10:47:21PM -0400, Rik van Riel wrote:
 > Tim Chen wrote:
 > > Ingo,
 > > 
 > > Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  
 > > Benchmark was run on a 2 socket Core2 machine.
 > > 
 > > The change in scheduler treatment of sched_yield 
 > > could play a part in changing Volanomark behavior.
 > > In CFS, sched_yield is implemented by dequeueing and
 > > requeueing a process.  The time a process has spent running
 > > probably reduces the CPU time due to it by only a bit, so the
 > > process could get re-queued pretty close to the head of the
 > > queue, and may get scheduled again quickly if it is still
 > > due a lot of CPU time.
 > 
 > I wonder if this explains the 30% drop in top performance
 > seen with the MySQL sysbench benchmark when the scheduler
 > changed to CFS...
 > 
 > See http://people.freebsd.org/~jeff/sysbench.png

 From the authors blog when he did that graph:
 http://jeffr-tech.livejournal.com/10103.html

"So I updated the image for the second time today to include Ingo's cfs
 scheduler. This kernel is from the rpm on his website. I double checked
 that it was not using tcmalloc at the time and switching back to a
 2.6.21 kernel returned to the expected perf.

 Basically, it has the same performance as the FreeBSD 4BSD scheduler
 now. Which is to say the peak is terrible but it has virtually no
 dropoff and performs better under load than the default 2.6.21
 scheduler. "



	Dave

-- 
http://www.codemonkey.org.uk


* [patch] sched: yield debugging
  2007-07-27 22:01 Volanomark slows by 80% under CFS Tim Chen
                   ` (2 preceding siblings ...)
  2007-07-28 12:36 ` Dmitry Adamushko
@ 2007-07-29 17:37 ` Ingo Molnar
  2007-07-30 18:10   ` Tim Chen
  3 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2007-07-29 17:37 UTC (permalink / raw)
  To: Tim Chen; +Cc: linux-kernel

Tim,

* Tim Chen <tim.c.chen@linux.intel.com> wrote:

> Ingo,
> 
> Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  Benchmark 
> was run on a 2 socket Core2 machine.

thanks for testing and reporting this!

> The change in scheduler treatment of sched_yield could play a part in 
> changing Volanomark behavior.

Could you try the patch below? It does not change the default behavior 
of yield but introduces 2 other yield strategies which you can activate 
runtime (if CONFIG_SCHED_DEBUG=y) via:

  # default one:
  echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

  # always queues the current task next to the next task:
  echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

  # NOP:
  echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

does variant '1' improve Java's VolanoMark performance perhaps?

i'm also wondering, which JDK is this, and where does Java make use of 
sys_sched_yield()? It's a woefully badly defined (and thus unreliable) 
system call; IMO Java should stop using it ASAP and use a saner locking 
model.

thanks,

	Ingo

------------------------------->
Subject: sched: yield debugging
From: Ingo Molnar <mingo@elte.hu>

introduce various sched_yield implementations:

 # default one:
 echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

 # always queues the current task next to the next task:
 echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

 # NOP:
 echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.

Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    1 
 kernel/sched_fair.c   |   71 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |    8 +++++
 3 files changed, 74 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1401,6 +1401,7 @@ extern unsigned int sysctl_sched_wakeup_
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
 extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
 
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -62,6 +62,16 @@ unsigned int sysctl_sched_stat_granulari
 unsigned int sysctl_sched_runtime_limit __read_mostly;
 
 /*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int sysctl_sched_yield_bug_workaround __read_mostly = 0;
+
+EXPORT_SYMBOL_GPL(sysctl_sched_yield_bug_workaround);
+
+/*
  * Debugging: various feature bits
  */
 enum {
@@ -834,14 +844,63 @@ dequeue_task_fair(struct rq *rq, struct 
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct rb_node *curr, *next, *first;
+	struct task_struct *p_next;
 	u64 now = __rq_clock(rq);
+	s64 yield_key;
 
-	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
-	 */
-	dequeue_entity(cfs_rq, &p->se, 0, now);
-	enqueue_entity(cfs_rq, &p->se, 0, now);
+
+	switch (sysctl_sched_yield_bug_workaround) {
+	default:
+		/*
+		 * Dequeue and enqueue the task to update its
+		 * position within the tree:
+		 */
+		dequeue_entity(cfs_rq, &p->se, 0, now);
+		enqueue_entity(cfs_rq, &p->se, 0, now);
+		break;
+	case 1:
+		curr = &p->se.run_node;
+		first = first_fair(cfs_rq);
+		/*
+		 * Move this task to the second place in the tree:
+		 */
+		if (unlikely(curr != first)) {
+			next = first;
+		} else {
+			next = rb_next(curr);
+			/*
+			 * We were the last one already - nothing to do, return
+			 * and reschedule:
+			 */
+			if (unlikely(!next))
+				return;
+		}
+
+		p_next = rb_entry(next, struct task_struct, se.run_node);
+		/*
+		 * Minimally necessary key value to be the second in the tree:
+		 */
+		yield_key = p_next->se.fair_key + (int)sysctl_sched_granularity;
+
+		dequeue_entity(cfs_rq, &p->se, 0, now);
+
+		/*
+		 * Only update the key if we need to move more backwards
+		 * than the minimally necessary position to be the second:
+		 */
+		if (p->se.fair_key < yield_key)
+			p->se.fair_key = yield_key;
+
+		__enqueue_entity(cfs_rq, &p->se);
+		break;
+	case 2:
+		/*
+		 * Just reschedule, do nothing else:
+		 */
+		resched_task(p);
+		break;
+	}
 }
 
 /*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -278,6 +278,14 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_bug_workaround",
+		.data		= &sysctl_sched_yield_bug_workaround,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),


* Re: [patch] sched: yield debugging
  2007-07-29 17:37 ` [patch] sched: yield debugging Ingo Molnar
@ 2007-07-30 18:10   ` Tim Chen
  2007-07-31 20:33     ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Tim Chen @ 2007-07-30 18:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

On Sun, 2007-07-29 at 19:37 +0200, Ingo Molnar wrote:
> Tim,
> 
> * Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 

> 
> Could you try the patch below? It does not change the default behavior 
> of yield but introduces 2 other yield strategies which you can activate 
> runtime (if CONFIG_SCHED_DEBUG=y) via:
> 
>   # default one:
>   echo 0 > /proc/sys/kernel/sched_yield_bug_workaround
> 
>   # always queues the current task next to the next task:
>   echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
> 
>   # NOP:
>   echo 2 > /proc/sys/kernel/sched_yield_bug_workaround
> 
> does variant '1' improve Java's VolanoMark performance perhaps?
> 

Here's a summary of Volanomark performance numbers:
Variant 0 is 80% down from 2.6.22
Variant 1 is 20% down from 2.6.22  (this indeed helped)
Variant 2 is 89% down from 2.6.22

> i'm also wondering, which JDK is this, and where does Java make use of 
> sys_sched_yield()? It's a woefully badly defined (and thus unreliable) 
> system call, IMO Java should stop using it ASAP and use a saner locking 
> model.

I am using a JRockit JDK.

Thanks.
Tim


* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)
  2007-07-28  6:51         ` Chris Snook
@ 2007-07-30 18:49           ` Tim Chen
  2007-07-30 21:07             ` Chris Snook
  0 siblings, 1 reply; 18+ messages in thread
From: Tim Chen @ 2007-07-30 18:49 UTC (permalink / raw)
  To: Chris Snook; +Cc: Andrea Arcangeli, mingo, linux-kernel

On Sat, 2007-07-28 at 02:51 -0400, Chris Snook wrote:

> 
> Tim --
> 
> 	Since you're already set up to do this benchmarking, would you mind 
> varying the parameters a bit and collecting vmstat data?  If you want to 
> run oprofile too, that wouldn't hurt.
> 

Here's the vmstat data.  There are fewer runnable processes and
more context switches with CFS.
 
The vmstat for 2.6.22 looks like

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
391  0      0 1722564  14416  95472    0    0   169    25   76 6520  3  3 89  5  0
400  0      0 1722372  14416  95496    0    0     0     0  264 641685 47 53  0  0  0
368  0      0 1721504  14424  95496    0    0     0     7  261 648493 46 51  3  0  0
438  0      0 1721504  14432  95496    0    0     0     2  264 690834 46 54  0  0  0
400  0      0 1721380  14432  95496    0    0     0     0  260 657157 46 53  1  0  0
393  0      0 1719892  14440  95496    0    0     0     6  265 671599 45 53  2  0  0
423  0      0 1719892  14440  95496    0    0     0    15  264 701626 44 56  0  0  0
375  0      0 1720240  14472  95504    0    0     0    72  265 671795 43 53  3  0  0
393  0      0 1720140  14480  95504    0    0     0     7  265 733561 45 55  0  0  0
355  0      0 1716052  14480  95504    0    0     0     0  260 670676 43 54  3  0  0
419  0      0 1718900  14480  95504    0    0     0     4  265 680690 43 55  2  0  0
396  0      0 1719148  14488  95504    0    0     0     3  261 712307 43 56  0  0  0
395  0      0 1719148  14488  95504    0    0     0     2  264 692781 44 54  1  0  0
387  0      0 1719148  14492  95504    0    0     0    41  268 709579 43 57  0  0  0
420  0      0 1719148  14500  95504    0    0     0     3  265 690862 44 54  2  0  0
429  0      0 1719396  14500  95504    0    0     0     0  260 704872 46 54  0  0  0
460  0      0 1719396  14500  95504    0    0     0     0  264 716272 46 54  0  0  0
419  0      0 1719396  14508  95504    0    0     0     3  261 685864 43 55  2  0  0
455  0      0 1719396  14508  95504    0    0     0     0  264 703718 44 56  0  0  0
395  0      0 1719372  14540  95512    0    0     0    64  265 692785 45 54  1  0  0
424  0      0 1719396  14548  95512    0    0     0    10  265 732866 45 55  0  0  0

While 2.6.23-rc1 look like
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
23  0      0 1705992  17020  95720    0    0     0     0  261 1010016 53 42  5  0  0
 7  0      0 1706116  17020  95720    0    0     0    13  267 1060997 52 41  7  0  0
 5  0      0 1706116  17020  95720    0    0     0    28  266 1313361 56 41  3  0  0
19  0      0 1706116  17028  95720    0    0     0     8  265 1273669 55 41  4  0  0
18  0      0 1706116  17032  95720    0    0     0     2  262 1403588 55 41  4  0  0
23  0      0 1706116  17032  95720    0    0     0     0  264 1272561 56 40  4  0  0
14  0      0 1706116  17032  95720    0    0     0     0  262 1046795 55 40  5  0  0
16  0      0 1706116  17032  95720    0    0     0     0  260 1361102 58 39  4  0  0
 4  0      0 1706224  17120  95724    0    0     0   126  273 1488711 56 41  3  0  0
24  0      0 1706224  17128  95724    0    0     0     6  261 1408432 55 41  4  0  0
 3  0      0 1706240  17128  95724    0    0     0    48  273 1299203 54 42  4  0  0
16  0      0 1706240  17132  95724    0    0     0     3  261 1356609 54 42  4  0  0
 5  0      0 1706364  17132  95724    0    0     0     0  264 1293198 58 39  3  0  0
 9  0      0 1706364  17132  95724    0    0     0     0  261 1555153 56 41  3  0  0
13  0      0 1706364  17132  95724    0    0     0     0  264 1160296 56 40  4  0  0
 8  0      0 1706364  17132  95724    0    0     0     0  261 1388909 58 38  4  0  0
18  0      0 1706364  17132  95724    0    0     0     0  264 1236774 56 39  5  0  0
11  0      0 1706364  17136  95724    0    0     0     2  261 1360325 57 40  3  0  0
 5  0      0 1706364  17136  95724    0    0     0     1  265 1201912 57 40  3  0  0
 8  0      0 1706364  17136  95724    0    0     0     0  261 1104308 57 39  4  0  0
 7  0      0 1705976  17232  95724    0    0     0   127  274 1205212 58 39  4  0  0

Tim



* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)
  2007-07-30 18:49           ` Tim Chen
@ 2007-07-30 21:07             ` Chris Snook
  2007-07-30 21:24               ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Snook @ 2007-07-30 21:07 UTC (permalink / raw)
  To: tim.c.chen; +Cc: Andrea Arcangeli, mingo, linux-kernel

Tim Chen wrote:
> On Sat, 2007-07-28 at 02:51 -0400, Chris Snook wrote:
> 
>> Tim --
>>
>> 	Since you're already set up to do this benchmarking, would you mind 
>> varying the parameters a bit and collecting vmstat data?  If you want to 
>> run oprofile too, that wouldn't hurt.
>>
> 
> Here's the vmstat data.  There are fewer runnable processes and
> more context switches with CFS.
>  
> [vmstat tables snipped]
> Tim
> 

From a scheduler performance perspective, it looks like CFS is doing much 
better on this workload.  It's spending a lot less time in %sys despite the 
higher context-switch rate, and there are far fewer tasks waiting for CPU time. 
The real problem seems to be that volanomark is optimized for a particular 
scheduler behavior.

That's not to say that we can't improve volanomark performance under CFS, but 
simply that CFS isn't so fundamentally flawed that this is impossible.

When I initially agreed with zeroing out wait time in sched_yield, I didn't 
realize that it could be negative and that this would actually promote processes 
in some cases.  I still think it's reasonable to zero out positive wait times. 
Can you test to see if that optimization does better than unconditionally 
zeroing them out?

	-- Chris


* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)
  2007-07-30 21:07             ` Chris Snook
@ 2007-07-30 21:24               ` Andrea Arcangeli
  0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2007-07-30 21:24 UTC (permalink / raw)
  To: Chris Snook; +Cc: tim.c.chen, mingo, linux-kernel

On Mon, Jul 30, 2007 at 05:07:46PM -0400, Chris Snook wrote:
> [..]  It's spending a lot less time in %sys despite the 
> higher context switches, [..]

The workload takes 40% longer, so you have to add that additional 40%
into your math.  "A lot less time" sounds like an overstatement to
me.  You also have to take into account the cache effects of executing
the scheduler so often, etc.

> [..] and there are far fewer tasks waiting for CPU 
> time. The real problem seems to be that volanomark is optimized for a 

It looks odd that there are far fewer tasks in the R state.  Could you
press SysRq+T to see where those hundreds of tasks are sleeping in the
CFS run?

> That's not to say that we can't improve volanomark performance under CFS, 
> but simply that CFS isn't so fundamentally flawed that this is impossible.

Given the increase in context switches, not all of them can be
"userland mandated", so the first thing to try here is to increase the
granularity with the new sysctl tunable.  Increasing the granularity
should reduce the context-switch rate, and in turn reduce the slowdown
to less than 40%.
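Concretely, the experiment suggested here would be along these lines. This assumes the CFS granularity tunable as exposed in 2.6.23-rc1 with CONFIG_SCHED_DEBUG=y; the proc name may differ between kernel versions.

```shell
# show the current CFS granularity (nanoseconds)
cat /proc/sys/kernel/sched_granularity_ns
# try a coarser granularity to cut the context-switch rate
echo 40000000 > /proc/sys/kernel/sched_granularity_ns
```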

There's nothing necessarily flawed in CFS even if it's slower than
O(1) on this load no matter how you tune it.  The higher context-switch
rate needed to retain complete fairness is a feature, but fairness vs.
global performance is generally a tradeoff.


* Re: [patch] sched: yield debugging
  2007-07-30 18:10   ` Tim Chen
@ 2007-07-31 20:33     ` Ingo Molnar
  2007-08-01 20:53       ` Tim Chen
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2007-07-31 20:33 UTC (permalink / raw)
  To: Tim Chen; +Cc: linux-kernel


* Tim Chen <tim.c.chen@linux.intel.com> wrote:

> Here's a summary of Volanomark performance numbers:
> Variant 0 is 80% down from 2.6.22
> Variant 1 is 20% down from 2.6.22  (this indeed helped)
> Variant 2 is 89% down from 2.6.22

ok, good! Could you try the updated debug patch below? I've made two 
changes: '1' is now the default, and I've added the 
/proc/sys/kernel/sched_yield_granularity_ns tunable (available if 
CONFIG_SCHED_DEBUG=y).

Could you try to change the yield-granularity tunable and see which 
value gives the best performance? A value of '100000' should in theory 
give the current (80% degraded) volanomark performance, the default 
value should give the above '20% down' result. The question is, is '20% 
down' the best we can get out of it? Does larger/smaller 
yield-granularity help perhaps? You can change it to any value between 
100 usecs and 1 second.

	Ingo

------------------------------>
Subject: sched: yield debugging
From: Ingo Molnar <mingo@elte.hu>

introduce various sched_yield implementations:

 # default one:
 echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

 # always queues the current task next to the next task:
 echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

 # NOP:
 echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.

Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    2 +
 kernel/sched_fair.c   |   72 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |   19 +++++++++++++
 3 files changed, 87 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1397,10 +1397,12 @@ static inline void idle_task_exit(void) 
 extern void sched_idle_next(void);
 
 extern unsigned int sysctl_sched_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
 extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
 
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -32,6 +32,7 @@
  * systems, 4x on 8-way systems, 5x on 16-way systems, etc.)
  */
 unsigned int sysctl_sched_granularity __read_mostly = 2000000000ULL/HZ;
+unsigned int sysctl_sched_yield_granularity __read_mostly = 2000000000ULL/HZ;
 
 /*
  * SCHED_BATCH wake-up granularity.
@@ -62,6 +63,16 @@ unsigned int sysctl_sched_stat_granulari
 unsigned int sysctl_sched_runtime_limit __read_mostly;
 
 /*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int sysctl_sched_yield_bug_workaround __read_mostly = 1;
+
+EXPORT_SYMBOL_GPL(sysctl_sched_yield_bug_workaround);
+
+/*
  * Debugging: various feature bits
  */
 enum {
@@ -834,14 +845,63 @@ dequeue_task_fair(struct rq *rq, struct 
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct rb_node *curr, *next, *first;
+	struct task_struct *p_next;
 	u64 now = __rq_clock(rq);
+	s64 yield_key;
 
-	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
-	 */
-	dequeue_entity(cfs_rq, &p->se, 0, now);
-	enqueue_entity(cfs_rq, &p->se, 0, now);
+
+	switch (sysctl_sched_yield_bug_workaround) {
+	default:
+		/*
+		 * Dequeue and enqueue the task to update its
+		 * position within the tree:
+		 */
+		dequeue_entity(cfs_rq, &p->se, 0, now);
+		enqueue_entity(cfs_rq, &p->se, 0, now);
+		break;
+	case 1:
+		curr = &p->se.run_node;
+		first = first_fair(cfs_rq);
+		/*
+		 * Move this task to the second place in the tree:
+		 */
+		if (unlikely(curr != first)) {
+			next = first;
+		} else {
+			next = rb_next(curr);
+			/*
+			 * We were the last one already - nothing to do, return
+			 * and reschedule:
+			 */
+			if (unlikely(!next))
+				return;
+		}
+
+		p_next = rb_entry(next, struct task_struct, se.run_node);
+		/*
+		 * Minimally necessary key value to be the second in the tree:
+		 */
+		yield_key = p_next->se.fair_key + (int)sysctl_sched_yield_granularity;
+
+		dequeue_entity(cfs_rq, &p->se, 0, now);
+
+		/*
+		 * Only update the key if we need to move more backwards
+		 * than the minimally necessary position to be the second:
+		 */
+		if (p->se.fair_key < yield_key)
+			p->se.fair_key = yield_key;
+
+		__enqueue_entity(cfs_rq, &p->se);
+		break;
+	case 2:
+		/*
+		 * Just reschedule, do nothing else:
+		 */
+		resched_task(p);
+		break;
+	}
 }
 
 /*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -234,6 +234,17 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_granularity_ns",
+		.data		= &sysctl_sched_yield_granularity,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &min_sched_granularity_ns,
+		.extra2		= &max_sched_granularity_ns,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_wakeup_granularity_ns",
 		.data		= &sysctl_sched_wakeup_granularity,
 		.maxlen		= sizeof(unsigned int),
@@ -278,6 +289,14 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_bug_workaround",
+		.data		= &sysctl_sched_yield_bug_workaround,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),


* Re: [patch] sched: yield debugging
  2007-07-31 20:33     ` Ingo Molnar
@ 2007-08-01 20:53       ` Tim Chen
  0 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2007-08-01 20:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

On Tue, 2007-07-31 at 22:33 +0200, Ingo Molnar wrote:

> ok, good! Could you try the updated debug patch below? I've done two 
> changes: made '1' the default, and added the 
> /proc/sys/kernel/sched_yield_granularity_ns tunable. (available if 
> CONFIG_SCHED_DEBUG=y)
> 
> Could you try to change the yield-granularity tunable and see which 
> value gives the best performance? A value of '100000' should in theory 
> give the current (80% degraded) volanomark performance, the default 
> value should give the above '20% down' result. The question is, is '20% 
> down' the best we can get out of it? Does larger/smaller 
> yield-granularity help perhaps? You can change it to any value between 
> 100 usecs and 1 second.
> 

Turning up the granularity helped.  Here's the data I got
for Volanomark performance relative to 2.6.22:
Granularity
1000000000  (max)	9% down
 800000000 		8% down
  80000000		13% down
   8000000  		20% down
    100000		56% down

Tim


end of thread, other threads:[~2007-08-01 22:47 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-27 22:01 Volanomark slows by 80% under CFS Tim Chen
2007-07-28  0:31 ` Chris Snook
2007-07-28  0:59   ` Andrea Arcangeli
2007-07-28  3:43     ` pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS) Chris Snook
2007-07-28  5:01       ` pluggable scheduler " Andrea Arcangeli
2007-07-28  6:51         ` Chris Snook
2007-07-30 18:49           ` Tim Chen
2007-07-30 21:07             ` Chris Snook
2007-07-30 21:24               ` Andrea Arcangeli
2007-07-28 13:28   ` Volanomark slows by 80% under CFS Dmitry Adamushko
2007-07-28  2:47 ` Rik van Riel
2007-07-28 20:26   ` Dave Jones
2007-07-28 12:36 ` Dmitry Adamushko
2007-07-28 18:55   ` David Schwartz
2007-07-29 17:37 ` [patch] sched: yield debugging Ingo Molnar
2007-07-30 18:10   ` Tim Chen
2007-07-31 20:33     ` Ingo Molnar
2007-08-01 20:53       ` Tim Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox