* Volanomark slows by 80% under CFS
@ 2007-07-27 22:01 Tim Chen
2007-07-28 0:31 ` Chris Snook
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Tim Chen @ 2007-07-27 22:01 UTC (permalink / raw)
To: mingo; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1141 bytes --]
Ingo,
Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.
Benchmark was run on a 2 socket Core2 machine.
The change in the scheduler's treatment of sched_yield
could play a part in changing Volanomark's behavior.
In CFS, sched_yield is implemented by dequeueing and
requeueing a process. The time the process has spent
running probably reduces the CPU time it is owed by
only a little. The process can therefore be re-queued
pretty close to the head of the queue, and may be
scheduled again quickly if it is still owed a lot of
CPU time.
It may make sense to queue the yielding process a bit
further back in the queue. For experimentation, I made
a slight change that zeroes out wait_runtime (i.e. has
the process give up the CPU time it is owed).
Let's put aside for a second the gripe that Volanomark
should have used a better mechanism than sched_yield to
coordinate its threads. With this change, Volanomark runs
better and is down only 40% (instead of 80%) from the old
scheduler without CFS.
Of course we should not tune for Volanomark; this is
reference data. What is your view on how CFS's
sched_yield should behave?
Regards,
Tim
[-- Attachment #2: patch.sched_yield --]
[-- Type: text/plain, Size: 336 bytes --]
--- linux-2.6.23-rc1/kernel/sched_fair.c.orig 2007-07-27 09:39:11.000000000 -0700
+++ linux-2.6.23-rc1/kernel/sched_fair.c 2007-07-27 09:40:41.000000000 -0700
@@ -841,6 +841,7 @@
* position within the tree:
*/
dequeue_entity(cfs_rq, &p->se, 0, now);
+ p->se.wait_runtime = 0;
enqueue_entity(cfs_rq, &p->se, 0, now);
}
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: Volanomark slows by 80% under CFS 2007-07-27 22:01 Volanomark slows by 80% under CFS Tim Chen @ 2007-07-28 0:31 ` Chris Snook 2007-07-28 0:59 ` Andrea Arcangeli 2007-07-28 13:28 ` Volanomark slows by 80% under CFS Dmitry Adamushko 2007-07-28 2:47 ` Rik van Riel ` (2 subsequent siblings) 3 siblings, 2 replies; 18+ messages in thread From: Chris Snook @ 2007-07-28 0:31 UTC (permalink / raw) To: tim.c.chen; +Cc: mingo, linux-kernel Tim Chen wrote: > Ingo, > > Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1. > Benchmark was run on a 2 socket Core2 machine. > > The change in scheduler treatment of sched_yield > could play a part in changing Volanomark behavior. > In CFS, sched_yield is implemented > by dequeueing and requeueing a process . The time a process > has spent running probably reduced the the cpu time due it > by only a bit. The process could get re-queued pretty close > to head of the queue, and may get scheduled again pretty > quickly if there is still a lot of cpu time due. > > It may make sense to queue the > yielding process a bit further behind in the queue. > I made a slight change by zeroing out wait_runtime > (i.e. have the process gives > up cpu time due for it to run) for experimentation. > Let's put aside gripes that Volanomark should have used a > better mechanism to coordinate threads instead sched_yield for > a second. Volanomark runs better > and is only 40% (instead of 80%) down from old scheduler > without CFS. > > Of course we should not tune for Volanomark and this is > reference data. > What are your view on how CFS's sched_yield should behave? > > Regards, > Tim The primary purpose of sched_yield is for SCHED_FIFO realtime processes. Where nothing else will run, ever, unless the running thread blocks or yields the CPU. Under CFS, the yielding process will still be leftmost in the rbtree, otherwise it would have already been scheduled out. 
Zeroing out wait_runtime on sched_yield strikes me as completely appropriate. If the process wanted to sleep a finite duration, it should actually call a sleep function, but sched_yield is essentially saying "I don't have anything else to do right now", so it's hardly fair to claim you've been waiting for your chance when you just gave it up. As for the remaining 40% degradation, if Volanomark is using it for synchronization, the scheduler is probably cycling through threads until it gets to the one that actually wants to do work. The O(1) scheduler will do this very quickly, whereas CFS has a bit more overhead. Interactivity boosting may have also helped the old scheduler find the right thread faster. I think Volanomark is being pretty stupid, and deserves to run slowly, but there are legitimate reasons to want to call sched_yield in a non-SCHED_FIFO process. If I'm performing multiple different calculations on the same set of data in multiple threads, and accessing the shared data in a linear fashion, I'd like to be able to have one thread give the other some CPU time so they can stay at the same point in the stream and improve cache hit rates, but this is only an optimization if I can do it without wasting CPU or gradually nicing myself into oblivion. Having sched_yield zero out wait_runtime seems like an appropriate way to make this use case work to the extent possible. Any user attempting such an optimization should have the good sense to do real work between sched_yield calls, to avoid calling the scheduler in a tight loop. -- Chris ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Volanomark slows by 80% under CFS 2007-07-28 0:31 ` Chris Snook @ 2007-07-28 0:59 ` Andrea Arcangeli 2007-07-28 3:43 ` pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS) Chris Snook 2007-07-28 13:28 ` Volanomark slows by 80% under CFS Dmitry Adamushko 1 sibling, 1 reply; 18+ messages in thread From: Andrea Arcangeli @ 2007-07-28 0:59 UTC (permalink / raw) To: Chris Snook; +Cc: tim.c.chen, mingo, linux-kernel On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote: > I think Volanomark is being pretty stupid, and deserves to run slowly, but Indeed, any app doing what volanomark does is pretty inefficient. But this is not the point. I/O schedulers are pluggable to help for inefficient apps too. If apps would be extremely smart they would all use async-io for their reads, and there wouldn't be the need of anticipatory scheduler just for an example. The fact is there's no technical explanation for which we're forbidden to be able to choose between CFS and O(1) at least at boot time. ^ permalink raw reply [flat|nested] 18+ messages in thread
* pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS) 2007-07-28 0:59 ` Andrea Arcangeli @ 2007-07-28 3:43 ` Chris Snook 2007-07-28 5:01 ` pluggable scheduler " Andrea Arcangeli 0 siblings, 1 reply; 18+ messages in thread From: Chris Snook @ 2007-07-28 3:43 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: tim.c.chen, mingo, linux-kernel Andrea Arcangeli wrote: > On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote: >> I think Volanomark is being pretty stupid, and deserves to run slowly, but > > Indeed, any app doing what volanomark does is pretty inefficient. > > But this is not the point. I/O schedulers are pluggable to help for > inefficient apps too. If apps would be extremely smart they would all > use async-io for their reads, and there wouldn't be the need of > anticipatory scheduler just for an example. I'm pretty sure the point of posting a patch that triples CFS performance on a certain benchmark and arguably improves the semantics of sched_yield was to improve CFS. You have a point, but it is a point for a different thread. I have taken the liberty of starting this thread for you. > The fact is there's no technical explanation for which we're forbidden > to be able to choose between CFS and O(1) at least at boot time. Sure there is. We can run a fully-functional POSIX OS without using any block devices at all. We cannot run a fully-functional POSIX OS without a scheduler. Any feature without which the OS cannot execute userspace code is sufficiently primitive that somewhere there is a device on which it will be impossible to debug if that feature fails to initialize. It is quite reasonable to insist on only having one implementation of such features in any given kernel build. Whether or not these alternatives belong in the source tree as config-time options is a political question, but preserving boot-time debugging capability is a perfectly reasonable technical motivation. 
-- Chris ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS) 2007-07-28 3:43 ` pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS) Chris Snook @ 2007-07-28 5:01 ` Andrea Arcangeli 2007-07-28 6:51 ` Chris Snook 0 siblings, 1 reply; 18+ messages in thread From: Andrea Arcangeli @ 2007-07-28 5:01 UTC (permalink / raw) To: Chris Snook; +Cc: tim.c.chen, mingo, linux-kernel On Fri, Jul 27, 2007 at 11:43:23PM -0400, Chris Snook wrote: > I'm pretty sure the point of posting a patch that triples CFS performance > on a certain benchmark and arguably improves the semantics of sched_yield > was to improve CFS. You have a point, but it is a point for a different > thread. I have taken the liberty of starting this thread for you. I've no real interest in starting or participating in flamewars (especially the ones not backed by hard numbers). So I adjusted the subject a bit in the hope the discussion will not degenerate as you predicted, hope you don't mind. I'm pretty sure the point of posting that email was to show the remaining performance regression with the sched_yield fix applied too. Given you considered my post both offtopic and inflammatory, I guess you think it's possible and reasonably easy to fix that remaining regression without a pluggable scheduler, right? So please enlighten us on how you intend to achieve it. Also consider that the other numbers likely used nptl, so they shouldn't be affected by sched_yield changes. > Sure there is. We can run a fully-functional POSIX OS without using any > block devices at all. We cannot run a fully-functional POSIX OS without a > scheduler. Any feature without which the OS cannot execute userspace code > is sufficiently primitive that somewhere there is a device on which it will > be impossible to debug if that feature fails to initialize. It is quite > reasonable to insist on only having one implementation of such features in > any given kernel build. Sounds like a red-herring to me...
There aren't just pluggable I/O schedulers in the kernel, there are pluggable packet schedulers too (see `tc qdisc`). And both are switchable at runtime (not just at boot time). Can you run your fully-functional POSIX OS without a packet scheduler and without an I/O scheduler? I wonder where are you going to read/write data without HD and network? Also those pluggable things don't increase the risk of crash much, if compared to the complexity of the schedulers. > Whether or not these alternatives belong in the source tree as config-time > options is a political question, but preserving boot-time debugging > capability is a perfectly reasonable technical motivation. The scheduler is invoked very late in the boot process (printk and serial console, kdb are working for ages when scheduler kicks in), so it's fully debuggable (no debugger depends on the scheduler, they run inside the nmi handler...), I don't really see your point. And even if there would be a subtle bug in the scheduler you'll never trigger it at boot with so few tasks and so few context switches. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS) 2007-07-28 5:01 ` pluggable scheduler " Andrea Arcangeli @ 2007-07-28 6:51 ` Chris Snook 2007-07-30 18:49 ` Tim Chen 0 siblings, 1 reply; 18+ messages in thread From: Chris Snook @ 2007-07-28 6:51 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: tim.c.chen, mingo, linux-kernel Andrea Arcangeli wrote: > On Fri, Jul 27, 2007 at 11:43:23PM -0400, Chris Snook wrote: >> I'm pretty sure the point of posting a patch that triples CFS performance >> on a certain benchmark and arguably improves the semantics of sched_yield >> was to improve CFS. You have a point, but it is a point for a different >> thread. I have taken the liberty of starting this thread for you. > > I've no real interest in starting or participating in flamewars > (especially the ones not backed by hard numbers). So I adjusted the > subject a bit in the hope the discussion will not degenerate as you > predicted, hope you don't mind. Not at all. I clearly misread your tone. > I'm pretty sure the point of posting that email was to show the > remaining performance regression with the sched_yield fix applied > too. Given you considered my post both offtopic and inflammatory, I > guess you think it's possible and reasonably easy to fix that > remaining regression without a pluggable scheduler, right? So please > enlighten us on your intend to achieve it. There are four possibilities that are immediately obvious to me: a) The remaining difference is due mostly to the algorithmic complexity of the rbtree algorithm in CFS. If this is the case, we should be able to vary the test parameters (CPU count, thread count, etc.) graph the results, and see a roughly logarithmic divergence between the schedulers as some parameter(s) vary. If this is the problem, we may be able to fix it with data structure tweaks or optimized base cases, like how quicksort can be optimized by using insertion sort below a certain threshold. 
b) The remaining difference is due mostly to how the scheduler handles volanomark. vmstat can give us a comparison of context switches between O(1), CFS, and CFS+patch. If the decrease in throughput correlates with an increase in context switches, we may be able to induce more O(1)-like behavior by charging tasks for context switch overhead. c) The remaining difference is due mostly to how the scheduler handles something other than volanomark. If context switch count is not the problem, context switch pattern still could be. I doubt we'd see a 40% difference due to cache misses, but it's possible. Fortunately, oprofile can sample based on cache misses, so we can debug this too. d) The remaining difference is due mostly to some implementation detail in CFS. It's possible there's some constant-factor overhead in CFS that is magnified heavily by the context switching volanomark deliberately induces. If this is the case, oprofile sampling on clock cycles should catch it. Tim -- Since you're already set up to do this benchmarking, would you mind varying the parameters a bit and collecting vmstat data? If you want to run oprofile too, that wouldn't hurt. > Also consider the other numbers likely used nptl so they shouldn't be > affected by sched_yield changes. > >> Sure there is. We can run a fully-functional POSIX OS without using any >> block devices at all. We cannot run a fully-functional POSIX OS without a >> scheduler. Any feature without which the OS cannot execute userspace code >> is sufficiently primitive that somewhere there is a device on which it will >> be impossible to debug if that feature fails to initialize. It is quite >> reasonable to insist on only having one implementation of such features in >> any given kernel build. > > Sounds like a red-herring to me... There aren't just pluggable I/O > schedulers in the kernel, there are pluggable packet schedulers too > (see `tc qdisc`). And both are switchable at runtime (not just at boot > time). 
> > Can you run your fully-functional POSIX OS without a packet scheduler > and without an I/O scheduler? I wonder where are you going to > read/write data without HD and network? If I'm missing both, I'm pretty screwed, but if either one is functional, I can send something out. > Also those pluggable things don't increase the risk of crash much, if > compared to the complexity of the schedulers. > >> Whether or not these alternatives belong in the source tree as config-time >> options is a political question, but preserving boot-time debugging >> capability is a perfectly reasonable technical motivation. > > The scheduler is invoked very late in the boot process (printk and > serial console, kdb are working for ages when scheduler kicks in), so > it's fully debuggable (no debugger depends on the scheduler, they run > inside the nmi handler...), I don't really see your point. I'm more concerned about embedded systems. These are the same people who want userspace character drivers to control their custom hardware. Having the robot point to where it hurts is a lot more convenient than hooking up a JTAG debugger. > And even if there would be a subtle bug in the scheduler you'll never > trigger it at boot with so few tasks and so few context switches. Sure, but it's the non-subtle bugs that worry me. These are usually related to low-level hardware setup, so they could miss the mainstream developers and clobber unsuspecting embedded developers. I acknowledge that debugging such problems shouldn't be terribly hard on mainstream systems, but some people are going to want to choose a single scheduler at build time and avoid the hassle. If we can improve CFS to be regression-free, and I think we can if we give ourselves a few percent tolerance and keep tracking down the corner cases, the pluggable scheduler infrastructure will just be another disused feature. -- Chris ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS) 2007-07-28 6:51 ` Chris Snook @ 2007-07-30 18:49 ` Tim Chen 2007-07-30 21:07 ` Chris Snook 0 siblings, 1 reply; 18+ messages in thread From: Tim Chen @ 2007-07-30 18:49 UTC (permalink / raw) To: Chris Snook; +Cc: Andrea Arcangeli, mingo, linux-kernel On Sat, 2007-07-28 at 02:51 -0400, Chris Snook wrote: > > Tim -- > > Since you're already set up to do this benchmarking, would you mind > varying the parameters a bit and collecting vmstat data? If you want to > run oprofile too, that wouldn't hurt. > Here's the vmstat data. The number of runnable processes are fewer and there're more contex switches with CFS. The vmstat for 2.6.22 looks like procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 391 0 0 1722564 14416 95472 0 0 169 25 76 6520 3 3 89 5 0 400 0 0 1722372 14416 95496 0 0 0 0 264 641685 47 53 0 0 0 368 0 0 1721504 14424 95496 0 0 0 7 261 648493 46 51 3 0 0 438 0 0 1721504 14432 95496 0 0 0 2 264 690834 46 54 0 0 0 400 0 0 1721380 14432 95496 0 0 0 0 260 657157 46 53 1 0 0 393 0 0 1719892 14440 95496 0 0 0 6 265 671599 45 53 2 0 0 423 0 0 1719892 14440 95496 0 0 0 15 264 701626 44 56 0 0 0 375 0 0 1720240 14472 95504 0 0 0 72 265 671795 43 53 3 0 0 393 0 0 1720140 14480 95504 0 0 0 7 265 733561 45 55 0 0 0 355 0 0 1716052 14480 95504 0 0 0 0 260 670676 43 54 3 0 0 419 0 0 1718900 14480 95504 0 0 0 4 265 680690 43 55 2 0 0 396 0 0 1719148 14488 95504 0 0 0 3 261 712307 43 56 0 0 0 395 0 0 1719148 14488 95504 0 0 0 2 264 692781 44 54 1 0 0 387 0 0 1719148 14492 95504 0 0 0 41 268 709579 43 57 0 0 0 420 0 0 1719148 14500 95504 0 0 0 3 265 690862 44 54 2 0 0 429 0 0 1719396 14500 95504 0 0 0 0 260 704872 46 54 0 0 0 460 0 0 1719396 14500 95504 0 0 0 0 264 716272 46 54 0 0 0 419 0 0 1719396 14508 95504 0 0 0 3 261 685864 43 55 2 0 0 455 0 0 1719396 14508 95504 0 0 0 0 264 703718 44 56 0 0 0 395 0 0 
1719372 14540 95512 0 0 0 64 265 692785 45 54 1 0 0 424 0 0 1719396 14548 95512 0 0 0 10 265 732866 45 55 0 0 0 While 2.6.23-rc1 look like procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 23 0 0 1705992 17020 95720 0 0 0 0 261 1010016 53 42 5 0 0 7 0 0 1706116 17020 95720 0 0 0 13 267 1060997 52 41 7 0 0 5 0 0 1706116 17020 95720 0 0 0 28 266 1313361 56 41 3 0 0 19 0 0 1706116 17028 95720 0 0 0 8 265 1273669 55 41 4 0 0 18 0 0 1706116 17032 95720 0 0 0 2 262 1403588 55 41 4 0 0 23 0 0 1706116 17032 95720 0 0 0 0 264 1272561 56 40 4 0 0 14 0 0 1706116 17032 95720 0 0 0 0 262 1046795 55 40 5 0 0 16 0 0 1706116 17032 95720 0 0 0 0 260 1361102 58 39 4 0 0 4 0 0 1706224 17120 95724 0 0 0 126 273 1488711 56 41 3 0 0 24 0 0 1706224 17128 95724 0 0 0 6 261 1408432 55 41 4 0 0 3 0 0 1706240 17128 95724 0 0 0 48 273 1299203 54 42 4 0 0 16 0 0 1706240 17132 95724 0 0 0 3 261 1356609 54 42 4 0 0 5 0 0 1706364 17132 95724 0 0 0 0 264 1293198 58 39 3 0 0 9 0 0 1706364 17132 95724 0 0 0 0 261 1555153 56 41 3 0 0 13 0 0 1706364 17132 95724 0 0 0 0 264 1160296 56 40 4 0 0 8 0 0 1706364 17132 95724 0 0 0 0 261 1388909 58 38 4 0 0 18 0 0 1706364 17132 95724 0 0 0 0 264 1236774 56 39 5 0 0 11 0 0 1706364 17136 95724 0 0 0 2 261 1360325 57 40 3 0 0 5 0 0 1706364 17136 95724 0 0 0 1 265 1201912 57 40 3 0 0 8 0 0 1706364 17136 95724 0 0 0 0 261 1104308 57 39 4 0 0 7 0 0 1705976 17232 95724 0 0 0 127 274 1205212 58 39 4 0 0 Tim ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS) 2007-07-30 18:49 ` Tim Chen @ 2007-07-30 21:07 ` Chris Snook 2007-07-30 21:24 ` Andrea Arcangeli 0 siblings, 1 reply; 18+ messages in thread From: Chris Snook @ 2007-07-30 21:07 UTC (permalink / raw) To: tim.c.chen; +Cc: Andrea Arcangeli, mingo, linux-kernel Tim Chen wrote: > On Sat, 2007-07-28 at 02:51 -0400, Chris Snook wrote: > >> Tim -- >> >> Since you're already set up to do this benchmarking, would you mind >> varying the parameters a bit and collecting vmstat data? If you want to >> run oprofile too, that wouldn't hurt. >> > > Here's the vmstat data. The number of runnable processes are > fewer and there're more contex switches with CFS. > > The vmstat for 2.6.22 looks like > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ > r b swpd free buff cache si so bi bo in cs us sy id wa st > 391 0 0 1722564 14416 95472 0 0 169 25 76 6520 3 3 89 5 0 > 400 0 0 1722372 14416 95496 0 0 0 0 264 641685 47 53 0 0 0 > 368 0 0 1721504 14424 95496 0 0 0 7 261 648493 46 51 3 0 0 > 438 0 0 1721504 14432 95496 0 0 0 2 264 690834 46 54 0 0 0 > 400 0 0 1721380 14432 95496 0 0 0 0 260 657157 46 53 1 0 0 > 393 0 0 1719892 14440 95496 0 0 0 6 265 671599 45 53 2 0 0 > 423 0 0 1719892 14440 95496 0 0 0 15 264 701626 44 56 0 0 0 > 375 0 0 1720240 14472 95504 0 0 0 72 265 671795 43 53 3 0 0 > 393 0 0 1720140 14480 95504 0 0 0 7 265 733561 45 55 0 0 0 > 355 0 0 1716052 14480 95504 0 0 0 0 260 670676 43 54 3 0 0 > 419 0 0 1718900 14480 95504 0 0 0 4 265 680690 43 55 2 0 0 > 396 0 0 1719148 14488 95504 0 0 0 3 261 712307 43 56 0 0 0 > 395 0 0 1719148 14488 95504 0 0 0 2 264 692781 44 54 1 0 0 > 387 0 0 1719148 14492 95504 0 0 0 41 268 709579 43 57 0 0 0 > 420 0 0 1719148 14500 95504 0 0 0 3 265 690862 44 54 2 0 0 > 429 0 0 1719396 14500 95504 0 0 0 0 260 704872 46 54 0 0 0 > 460 0 0 1719396 14500 95504 0 0 0 0 264 716272 46 54 0 0 0 > 419 0 0 1719396 14508 95504 0 0 0 3 261 
685864 43 55 2 0 0 > 455 0 0 1719396 14508 95504 0 0 0 0 264 703718 44 56 0 0 0 > 395 0 0 1719372 14540 95512 0 0 0 64 265 692785 45 54 1 0 0 > 424 0 0 1719396 14548 95512 0 0 0 10 265 732866 45 55 0 0 0 > > While 2.6.23-rc1 look like > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ > r b swpd free buff cache si so bi bo in cs us sy id wa st > 23 0 0 1705992 17020 95720 0 0 0 0 261 1010016 53 42 5 0 0 > 7 0 0 1706116 17020 95720 0 0 0 13 267 1060997 52 41 7 0 0 > 5 0 0 1706116 17020 95720 0 0 0 28 266 1313361 56 41 3 0 0 > 19 0 0 1706116 17028 95720 0 0 0 8 265 1273669 55 41 4 0 0 > 18 0 0 1706116 17032 95720 0 0 0 2 262 1403588 55 41 4 0 0 > 23 0 0 1706116 17032 95720 0 0 0 0 264 1272561 56 40 4 0 0 > 14 0 0 1706116 17032 95720 0 0 0 0 262 1046795 55 40 5 0 0 > 16 0 0 1706116 17032 95720 0 0 0 0 260 1361102 58 39 4 0 0 > 4 0 0 1706224 17120 95724 0 0 0 126 273 1488711 56 41 3 0 0 > 24 0 0 1706224 17128 95724 0 0 0 6 261 1408432 55 41 4 0 0 > 3 0 0 1706240 17128 95724 0 0 0 48 273 1299203 54 42 4 0 0 > 16 0 0 1706240 17132 95724 0 0 0 3 261 1356609 54 42 4 0 0 > 5 0 0 1706364 17132 95724 0 0 0 0 264 1293198 58 39 3 0 0 > 9 0 0 1706364 17132 95724 0 0 0 0 261 1555153 56 41 3 0 0 > 13 0 0 1706364 17132 95724 0 0 0 0 264 1160296 56 40 4 0 0 > 8 0 0 1706364 17132 95724 0 0 0 0 261 1388909 58 38 4 0 0 > 18 0 0 1706364 17132 95724 0 0 0 0 264 1236774 56 39 5 0 0 > 11 0 0 1706364 17136 95724 0 0 0 2 261 1360325 57 40 3 0 0 > 5 0 0 1706364 17136 95724 0 0 0 1 265 1201912 57 40 3 0 0 > 8 0 0 1706364 17136 95724 0 0 0 0 261 1104308 57 39 4 0 0 > 7 0 0 1705976 17232 95724 0 0 0 127 274 1205212 58 39 4 0 0 > > Tim > From a scheduler performance perspective, it looks like CFS is doing much better on this workload. It's spending a lot less time in %sys despite the higher context switches, and there are far fewer tasks waiting for CPU time. The real problem seems to be that volanomark is optimized for a particular scheduler behavior. 
That's not to say that we can't improve volanomark performance under CFS, but simply that CFS isn't so fundamentally flawed that this is impossible. When I initially agreed with zeroing out wait time in sched_yield, I didn't realize that it could be negative and that this would actually promote processes in some cases. I still think it's reasonable to zero out positive wait times. Can you test to see if that optimization does better than unconditionally zeroing them out? -- Chris ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS) 2007-07-30 21:07 ` Chris Snook @ 2007-07-30 21:24 ` Andrea Arcangeli 0 siblings, 0 replies; 18+ messages in thread From: Andrea Arcangeli @ 2007-07-30 21:24 UTC (permalink / raw) To: Chris Snook; +Cc: tim.c.chen, mingo, linux-kernel On Mon, Jul 30, 2007 at 05:07:46PM -0400, Chris Snook wrote: > [..] It's spending a lot less time in %sys despite the > higher context switches, [..] The workload takes 40% more so you've to add up that additional 40% too into your math. "A lot less time" sounds an overstatement to me. Also you've to take into account cache effects in executing the scheduler so much etc... > [..] and there are far fewer tasks waiting for CPU > time. The real problem seems to be that volanomark is optimized for a It looks weird that there are a lot less tasks in R state. Could you press SYSRQ+T to see where those hundred tasks are sleeping in the CFS run? > That's not to say that we can't improve volanomark performance under CFS, > but simply that CFS isn't so fundamentally flawed that this is impossible. Given the increase of context switches, it means not all the ctx switches are "userland mandated", so the first thing to try here is to increase the granularity with the new tunable sysctl. Increasing the granularity has to reduce the context switch rate, and in turn it will reduce the slowdown to less than 40%. There's nothing necessarily flawed in CFS even if it's slower than O(1) in this load no matter how you tune it. The higher context switch rate to retain complete fairness is a feature, but fairness vs global performance is generally a tradeoff. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Volanomark slows by 80% under CFS 2007-07-28 0:31 ` Chris Snook 2007-07-28 0:59 ` Andrea Arcangeli @ 2007-07-28 13:28 ` Dmitry Adamushko 1 sibling, 0 replies; 18+ messages in thread From: Dmitry Adamushko @ 2007-07-28 13:28 UTC (permalink / raw) To: Chris Snook; +Cc: tim.c.chen, Ingo Molnar, Linux Kernel On 28/07/07, Chris Snook <csnook@redhat.com> wrote: > [ ... ] > Under CFS, the yielding process will still be leftmost in the rbtree, > otherwise it would have already been scheduled out. Not actually true. The position of the 'current' task within the rb-tree is updated with a timer tick's frequency. Being called somewhere in between 2 ticks, sched_yield() may trigger a reschedule which would otherwise take place upon the next tick. Moreover, 'scheduling granularity' may also take effect. e.g. effectively, the yielding task's 'fair_key' was already != 'left_most' upon the previous timer tick but an actual reschedule has been delayed due to the 'scheduling granularity' taking effect... and sched_yield() may trigger it. > Zeroing out wait_runtime on sched_yield strikes me as completely appropriate. > If the process wanted to sleep a finite duration, it should actually call a > sleep function, but sched_yield is essentially saying "I don't have anything > else to do right now", so it's hardly fair to claim you've been waiting for your > chance when you just gave it up. 'wait_runtime' describes dynamic behavior of a task on the rq. It doesn't matter what the task is about to do as 'wait_runtime' is something it fully deserves. Note, 'wait_runtime' can be both positive and negative, meaning credit/punishment appropriately. When it's negative, 'zeroing it out' effectively means the task gets helped -- in the sense that it doesn't get 'punished' for some amount of time it actually spent running. Which is wrong. One more thing: we don't take time accounted to 'wait_runtime' just from the thin air. e.g.
sleepers get an additional bonus to their 'wait_runtime' upon a wakeup _but_ the amount of "wait_runtime" == "a given bonus" will be additionally subtracted from tasks which happen to run later on (grep for "sleeper_bonus" in sched_fair.c). That said, the sum (of additionally given/taken wait_runtime) is zero. All in all, I doubt the "zeroing out wait_runtime on sched_yield" thing is really appropriate. > > -- Chris > -- Best regards, Dmitry Adamushko ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Volanomark slows by 80% under CFS 2007-07-27 22:01 Volanomark slows by 80% under CFS Tim Chen 2007-07-28 0:31 ` Chris Snook @ 2007-07-28 2:47 ` Rik van Riel 2007-07-28 20:26 ` Dave Jones 2007-07-28 12:36 ` Dmitry Adamushko 2007-07-29 17:37 ` [patch] sched: yield debugging Ingo Molnar 3 siblings, 1 reply; 18+ messages in thread From: Rik van Riel @ 2007-07-28 2:47 UTC (permalink / raw) To: tim.c.chen; +Cc: mingo, linux-kernel Tim Chen wrote: > Ingo, > > Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1. > Benchmark was run on a 2 socket Core2 machine. > > The change in scheduler treatment of sched_yield > could play a part in changing Volanomark behavior. > In CFS, sched_yield is implemented > by dequeueing and requeueing a process . The time a process > has spent running probably reduced the the cpu time due it > by only a bit. The process could get re-queued pretty close > to head of the queue, and may get scheduled again pretty > quickly if there is still a lot of cpu time due. I wonder if this explains the 30% drop in top performance seen with the MySQL sysbench benchmark when the scheduler changed to CFS... See http://people.freebsd.org/~jeff/sysbench.png -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Volanomark slows by 80% under CFS 2007-07-28 2:47 ` Rik van Riel @ 2007-07-28 20:26 ` Dave Jones 0 siblings, 0 replies; 18+ messages in thread From: Dave Jones @ 2007-07-28 20:26 UTC (permalink / raw) To: Rik van Riel; +Cc: tim.c.chen, mingo, linux-kernel On Fri, Jul 27, 2007 at 10:47:21PM -0400, Rik van Riel wrote: > Tim Chen wrote: > > Ingo, > > > > Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1. > > Benchmark was run on a 2 socket Core2 machine. > > > > The change in scheduler treatment of sched_yield > > could play a part in changing Volanomark behavior. > > In CFS, sched_yield is implemented > > by dequeueing and requeueing a process . The time a process > > has spent running probably reduced the the cpu time due it > > by only a bit. The process could get re-queued pretty close > > to head of the queue, and may get scheduled again pretty > > quickly if there is still a lot of cpu time due. > > I wonder if this explains the 30% drop in top performance > seen with the MySQL sysbench benchmark when the scheduler > changed to CFS... > > See http://people.freebsd.org/~jeff/sysbench.png From the authors blog when he did that graph: http://jeffr-tech.livejournal.com/10103.html "So I updated the image for the second time today to include Ingo's cfs scheduler. This kernel is from the rpm on his website. I double checked that it was not using tcmalloc at the time and switching back to a 2.6.21 kernel returned to the expected perf. Basically, it has the same performance as the FreeBSD 4BSD scheduler now. Which is to say the peak is terrible but it has virtually no dropoff and performs better under load than the default 2.6.21 scheduler. " Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Volanomark slows by 80% under CFS

From: Dmitry Adamushko @ 2007-07-28 12:36 UTC (permalink / raw)
To: tim.c.chen; +Cc: Ingo Molnar, Linux Kernel

On 28/07/07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> [ ... ]
> It may make sense to queue the
> yielding process a bit further behind in the queue.
> I made a slight change by zeroing out wait_runtime
> (i.e. have the process gives
> up cpu time due for it to run) for experimentation.

But that's wrong. The 'wait_runtime' might have been negative at this
point (i.e. the task is in a negative 'run-time' balance wrt the
'etalon' nice-0 task). Your change ends up helping such a task to
actually stay closer to the 'leftmost' element of the tree (or to be
it), and not "further behind in the queue" as your intention is.

I don't know Volanomark's details, so I refrain from speculating on
why this change "improves" benchmark results (maybe some affected
tasks have positive 'wait_runtime's on average for this setup).

If you want to make sure (just for a test) that a yielding task is not
the leftmost (at least) for some short interval of time (likely to be
<= 1 tick), take a look at yield_task_fair() in e.g. cfs-v15.

> Volanomark runs better
> and is only 40% (instead of 80%) down from old scheduler
> without CFS.

40 or 80% is still a huge regression.

-- 
Best regards,
Dmitry Adamushko
* RE: Volanomark slows by 80% under CFS

From: David Schwartz @ 2007-07-28 18:55 UTC (permalink / raw)
To: Linux-Kernel@Vger. Kernel. Org

> > Volanomark runs better
> > and is only 40% (instead of 80%) down from old scheduler
> > without CFS.

> 40 or 80% is still a huge regression.
> Dmitry Adamushko

Can anyone explain precisely what Volanomark is doing? If it's
something dumb like "looping on sched_yield until the 'right' thread
runs and finishes what we're waiting for", then I think any regression
can be ignored. This applies if and only if CFS's sched_yield behavior
is sane and Volano's is insane.

A sane sched_yield implementation must do two things:

1) Reward processes that actually do yield most of their CPU time to
another process.

2) Make an effort to run every ready-to-run process at the same or
higher static priority level before re-scheduling this process. (That
won't always be possible due to SMP issues, but a reasonable effort is
needed.)

If CFS is doing these two things, and Volanomark is looping on
sched_yield until the 'right thread' runs, then CFS is doing the right
thing and Volanomark isn't. Volanomark deserves to lose.

If CFS binds processes to processors more tightly, and thus
sched_yield can't yield to a process that was planned to run on
another CPU in the future, that would be a legitimate complaint about
CFS.

DS
* [patch] sched: yield debugging

From: Ingo Molnar @ 2007-07-29 17:37 UTC (permalink / raw)
To: Tim Chen; +Cc: linux-kernel

Tim,

* Tim Chen <tim.c.chen@linux.intel.com> wrote:

> Ingo,
>
> Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1. Benchmark
> was run on a 2 socket Core2 machine.

thanks for testing and reporting this!

> The change in scheduler treatment of sched_yield could play a part in
> changing Volanomark behavior.

Could you try the patch below? It does not change the default behavior
of yield but introduces 2 other yield strategies which you can activate
at runtime (if CONFIG_SCHED_DEBUG=y) via:

  # default one:
  echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

  # always queues the current task next to the next task:
  echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

  # NOP:
  echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

does variant '1' improve Java's VolanoMark performance perhaps?

i'm also wondering, which JDK is this, and where does Java make use of
sys_sched_yield()? It's a woefully badly defined (and thus unreliable)
system call; IMO Java should stop using it ASAP and use a saner locking
model.

	thanks,

		Ingo

------------------------------->
Subject: sched: yield debugging
From: Ingo Molnar <mingo@elte.hu>

introduce various sched_yield implementations:

  # default one:
  echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

  # always queues the current task next to the next task:
  echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

  # NOP:
  echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.

Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    1 
 kernel/sched_fair.c   |   71 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |    8 +++++
 3 files changed, 74 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1401,6 +1401,7 @@ extern unsigned int sysctl_sched_wakeup_
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
 extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -62,6 +62,16 @@ unsigned int sysctl_sched_stat_granulari
 unsigned int sysctl_sched_runtime_limit __read_mostly;

 /*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int sysctl_sched_yield_bug_workaround __read_mostly = 0;
+
+EXPORT_SYMBOL_GPL(sysctl_sched_yield_bug_workaround);
+
+/*
  * Debugging: various feature bits
  */
 enum {
@@ -834,14 +844,63 @@ dequeue_task_fair(struct rq *rq, struct
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct rb_node *curr, *next, *first;
+	struct task_struct *p_next;
 	u64 now = __rq_clock(rq);
+	s64 yield_key;

-	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
-	 */
-	dequeue_entity(cfs_rq, &p->se, 0, now);
-	enqueue_entity(cfs_rq, &p->se, 0, now);
+
+	switch (sysctl_sched_yield_bug_workaround) {
+	default:
+		/*
+		 * Dequeue and enqueue the task to update its
+		 * position within the tree:
+		 */
+		dequeue_entity(cfs_rq, &p->se, 0, now);
+		enqueue_entity(cfs_rq, &p->se, 0, now);
+		break;
+	case 1:
+		curr = &p->se.run_node;
+		first = first_fair(cfs_rq);
+		/*
+		 * Move this task to the second place in the tree:
+		 */
+		if (unlikely(curr != first)) {
+			next = first;
+		} else {
+			next = rb_next(curr);
+			/*
+			 * We were the last one already - nothing to do, return
+			 * and reschedule:
+			 */
+			if (unlikely(!next))
+				return;
+		}
+
+		p_next = rb_entry(next, struct task_struct, se.run_node);
+		/*
+		 * Minimally necessary key value to be the second in the tree:
+		 */
+		yield_key = p_next->se.fair_key + (int)sysctl_sched_granularity;
+
+		dequeue_entity(cfs_rq, &p->se, 0, now);
+
+		/*
+		 * Only update the key if we need to move more backwards
+		 * than the minimally necessary position to be the second:
+		 */
+		if (p->se.fair_key < yield_key)
+			p->se.fair_key = yield_key;
+
+		__enqueue_entity(cfs_rq, &p->se);
+		break;
+	case 2:
+		/*
+		 * Just reschedule, do nothing else:
+		 */
+		resched_task(p);
+		break;
+	}
 }

 /*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -278,6 +278,14 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_bug_workaround",
+		.data		= &sysctl_sched_yield_bug_workaround,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),
* Re: [patch] sched: yield debugging

From: Tim Chen @ 2007-07-30 18:10 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

On Sun, 2007-07-29 at 19:37 +0200, Ingo Molnar wrote:

> Could you try the patch below? It does not change the default behavior
> of yield but introduces 2 other yield strategies which you can activate
> at runtime (if CONFIG_SCHED_DEBUG=y) via:
>
>   # default one:
>   echo 0 > /proc/sys/kernel/sched_yield_bug_workaround
>
>   # always queues the current task next to the next task:
>   echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
>
>   # NOP:
>   echo 2 > /proc/sys/kernel/sched_yield_bug_workaround
>
> does variant '1' improve Java's VolanoMark performance perhaps?

Here's a summary of Volanomark performance numbers:

Variant 0 is 80% down from 2.6.22
Variant 1 is 20% down from 2.6.22 (this indeed helped)
Variant 2 is 89% down from 2.6.22

> i'm also wondering, which JDK is this, and where does Java make use of
> sys_sched_yield()? It's a woefully badly defined (and thus unreliable)
> system call, IMO Java should stop using it ASAP and use a saner locking
> model.

I am using a JRockit JDK.

Thanks.

Tim
* Re: [patch] sched: yield debugging

From: Ingo Molnar @ 2007-07-31 20:33 UTC (permalink / raw)
To: Tim Chen; +Cc: linux-kernel

* Tim Chen <tim.c.chen@linux.intel.com> wrote:

> Here's a summary of Volanomark performance numbers:
>
> Variant 0 is 80% down from 2.6.22
> Variant 1 is 20% down from 2.6.22 (this indeed helped)
> Variant 2 is 89% down from 2.6.22

ok, good! Could you try the updated debug patch below? I've done two
changes: made '1' the default, and added the
/proc/sys/kernel/sched_yield_granularity_ns tunable. (available if
CONFIG_SCHED_DEBUG=y)

Could you try to change the yield-granularity tunable and see which
value gives the best performance? A value of '100000' should in theory
give the current (80% degraded) volanomark performance, the default
value should give the above '20% down' result. The question is, is
'20% down' the best we can get out of it? Does larger/smaller
yield-granularity help perhaps? You can change it to any value between
100 usecs and 1 second.

	Ingo

------------------------------>
Subject: sched: yield debugging
From: Ingo Molnar <mingo@elte.hu>

introduce various sched_yield implementations:

  # default one:
  echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

  # always queues the current task next to the next task:
  echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

  # NOP:
  echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.

Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    2 +
 kernel/sched_fair.c   |   72 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |   19 +++++++++++++
 3 files changed, 87 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1397,10 +1397,12 @@ static inline void idle_task_exit(void)
 extern void sched_idle_next(void);

 extern unsigned int sysctl_sched_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
 extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -32,6 +32,7 @@
  * systems, 4x on 8-way systems, 5x on 16-way systems, etc.)
  */
 unsigned int sysctl_sched_granularity __read_mostly = 2000000000ULL/HZ;
+unsigned int sysctl_sched_yield_granularity __read_mostly = 2000000000ULL/HZ;

 /*
  * SCHED_BATCH wake-up granularity.
@@ -62,6 +63,16 @@ unsigned int sysctl_sched_stat_granulari
 unsigned int sysctl_sched_runtime_limit __read_mostly;

 /*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int sysctl_sched_yield_bug_workaround __read_mostly = 1;
+
+EXPORT_SYMBOL_GPL(sysctl_sched_yield_bug_workaround);
+
+/*
  * Debugging: various feature bits
  */
 enum {
@@ -834,14 +845,63 @@ dequeue_task_fair(struct rq *rq, struct
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct rb_node *curr, *next, *first;
+	struct task_struct *p_next;
 	u64 now = __rq_clock(rq);
+	s64 yield_key;

-	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
-	 */
-	dequeue_entity(cfs_rq, &p->se, 0, now);
-	enqueue_entity(cfs_rq, &p->se, 0, now);
+
+	switch (sysctl_sched_yield_bug_workaround) {
+	default:
+		/*
+		 * Dequeue and enqueue the task to update its
+		 * position within the tree:
+		 */
+		dequeue_entity(cfs_rq, &p->se, 0, now);
+		enqueue_entity(cfs_rq, &p->se, 0, now);
+		break;
+	case 1:
+		curr = &p->se.run_node;
+		first = first_fair(cfs_rq);
+		/*
+		 * Move this task to the second place in the tree:
+		 */
+		if (unlikely(curr != first)) {
+			next = first;
+		} else {
+			next = rb_next(curr);
+			/*
+			 * We were the last one already - nothing to do, return
+			 * and reschedule:
+			 */
+			if (unlikely(!next))
+				return;
+		}
+
+		p_next = rb_entry(next, struct task_struct, se.run_node);
+		/*
+		 * Minimally necessary key value to be the second in the tree:
+		 */
+		yield_key = p_next->se.fair_key + (int)sysctl_sched_yield_granularity;
+
+		dequeue_entity(cfs_rq, &p->se, 0, now);
+
+		/*
+		 * Only update the key if we need to move more backwards
+		 * than the minimally necessary position to be the second:
+		 */
+		if (p->se.fair_key < yield_key)
+			p->se.fair_key = yield_key;
+
+		__enqueue_entity(cfs_rq, &p->se);
+		break;
+	case 2:
+		/*
+		 * Just reschedule, do nothing else:
+		 */
+		resched_task(p);
+		break;
+	}
 }

 /*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -234,6 +234,17 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_granularity_ns",
+		.data		= &sysctl_sched_yield_granularity,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &min_sched_granularity_ns,
+		.extra2		= &max_sched_granularity_ns,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_wakeup_granularity_ns",
 		.data		= &sysctl_sched_wakeup_granularity,
 		.maxlen		= sizeof(unsigned int),
@@ -278,6 +289,14 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_bug_workaround",
+		.data		= &sysctl_sched_yield_bug_workaround,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),
* Re: [patch] sched: yield debugging

From: Tim Chen @ 2007-08-01 20:53 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

On Tue, 2007-07-31 at 22:33 +0200, Ingo Molnar wrote:

> ok, good! Could you try the updated debug patch below? I've done two
> changes: made '1' the default, and added the
> /proc/sys/kernel/sched_yield_granularity_ns tunable. (available if
> CONFIG_SCHED_DEBUG=y)
>
> Could you try to change the yield-granularity tunable and see which
> value gives the best performance? A value of '100000' should in theory
> give the current (80% degraded) volanomark performance, the default
> value should give the above '20% down' result. The question is, is
> '20% down' the best we can get out of it? Does larger/smaller
> yield-granularity help perhaps? You can change it to any value between
> 100 usecs and 1 second.

Turning up the granularity helped. Here's the data I got for
Volanomark performance relative to 2.6.22:

Granularity (ns)       Performance
1000000000 (max)        9% down
 800000000              8% down
  80000000             13% down
   8000000             20% down
    100000             56% down

Tim