* sched_yield proposals/rationale
@ 2007-04-12 8:31 Buytaert_Steven
2007-04-12 9:48 ` Sven-Thorsten Dietrich
2007-04-12 13:31 ` Andi Kleen
0 siblings, 2 replies; 15+ messages in thread
From: Buytaert_Steven @ 2007-04-12 8:31 UTC (permalink / raw)
To: linux-kernel
Since the new 2.6.x O(1) scheduler I'm having latency problems. This is
probably due to excessive use of sched_yield in code in components I don't
have control over. This 'problem'/behavioral change has also been reported
for other applications (e.g. OpenLDAP, GnomeMeeting, PostgreSQL, Google, ...).
I have analysed the sched_yield code in kernel/sched.c (2.6.16 SLES10) and have 3 questions/proposals:
1) Would it be beneficial to give each task one try to be enqueued on the
end of the active queue (by means of a boolean flag, reset each time the
time slices reach 0 and it is put in the expired list by scheduler_tick)?
2) When a task is eventually put in the expired list in sched_yield, give it
back the full time slices round (as done in scheduler_tick), not with the
remaining slices as is done now?
3) Put the task in the expired list at a random position, not at the end as
is done now?
Rationales:
For 1): suppose the resource that the task wants to acquire (if any), and for
which it yields because it is not available, is held by another task that sits
behind this task in the active list. Putting the yielding task on the expired
list would induce a kind of 'tail chasing'. Proposal 1 would break this.
For 2): when a task is put on the expired list with e.g. one time quantum left,
then the next time it runs it may acquire the resource and be moved to the
expired list at the next scheduler_tick occurrence, before it has had the
chance to release the resource, keeping it for longer than necessary.
For 3): when the resource (if any) is requested by many tasks
regularly, always putting the yielding task at the end of the expired list
could result in 'unfairness', since the resource is handed out to the tasks
in the order they get on the CPU. The only 'fairness' that can be applied with
yield is then to insert at a random position.
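The mechanics of proposal 1 can be sketched in a small userspace model. The names and data structures below are illustrative stand-ins, not the real kernel/sched.c code; the point is only the one-shot flag that grants a single free re-enqueue on the active queue per timeslice round.

```c
#include <string.h>

/* Toy model of the O(1) scheduler's active/expired lists, illustrating
 * proposal 1: each task gets one "free" re-enqueue at the tail of the
 * active queue per timeslice round. */

#define MAXQ 8

struct task {
    int slices;        /* remaining timeslices              */
    int yielded_once;  /* proposal 1: one-shot yield flag   */
};

struct runqueue {
    struct task *active[MAXQ];  int n_active;
    struct task *expired[MAXQ]; int n_expired;
};

static void enqueue(struct task **q, int *n, struct task *t) { q[(*n)++] = t; }

static struct task *dequeue_head(struct task **q, int *n)
{
    struct task *t = q[0];
    memmove(q, q + 1, (size_t)(--*n) * sizeof *q);
    return t;
}

/* sched_yield with proposal 1: the first yield per round keeps the task
 * on the active queue; subsequent yields send it to the expired list. */
static void do_yield(struct runqueue *rq, struct task *t)
{
    if (!t->yielded_once) {
        t->yielded_once = 1;
        enqueue(rq->active, &rq->n_active, t);
    } else {
        enqueue(rq->expired, &rq->n_expired, t);
    }
}

/* scheduler_tick path: on slice exhaustion, refill and reset the flag. */
static void tick(struct runqueue *rq, struct task *t, int full_quota)
{
    if (--t->slices <= 0) {
        t->slices = full_quota;
        t->yielded_once = 0;       /* flag reset each round, as proposed */
        enqueue(rq->expired, &rq->n_expired, t);
    }
}
```

In this model the 'tail chasing' stops after one round trip of the active queue: the second yield in a round goes to the expired list as before.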
Any insight on these proposed changes? Thanks
Steven Buytaert
--
Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away. (Antoine de Saint-Exupéry)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sched_yield proposals/rationale
2007-04-12 8:31 sched_yield proposals/rationale Buytaert_Steven
@ 2007-04-12 9:48 ` Sven-Thorsten Dietrich
2007-04-12 13:31 ` Andi Kleen
1 sibling, 0 replies; 15+ messages in thread
From: Sven-Thorsten Dietrich @ 2007-04-12 9:48 UTC (permalink / raw)
To: Buytaert_Steven; +Cc: linux-kernel
On Thu, 2007-04-12 at 04:31 -0400, Buytaert_Steven@emc.com wrote:
> Since the new 2.6.x O(1) scheduler I'm having latency problems.
1. Have you elevated the process priority?
2. Have you tried running SCHED_FIFO, or SCHED_RR?
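Suggestion 2 amounts to something like the sketch below. The priority value is arbitrary, and the call only succeeds with root privileges (or CAP_SYS_NICE / an RLIMIT_RTPRIO allowance):

```c
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Move the calling process to the SCHED_FIFO realtime class.
 * Priority 10 is an arbitrary example value within the valid range. */
int make_me_fifo(void)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = 10;
    printf("SCHED_FIFO priority range: %d..%d\n",
           sched_get_priority_min(SCHED_FIFO),
           sched_get_priority_max(SCHED_FIFO));
    return sched_setscheduler(0, SCHED_FIFO, &sp); /* 0 = this process */
}
```

Note that SCHED_FIFO/SCHED_RR sidestep the sched_yield semantics question entirely: for realtime classes, yield moves the task to the tail of its priority level, the behavior the legacy apps expect.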
* Re: sched_yield proposals/rationale
2007-04-12 8:31 sched_yield proposals/rationale Buytaert_Steven
2007-04-12 9:48 ` Sven-Thorsten Dietrich
@ 2007-04-12 13:31 ` Andi Kleen
2007-04-12 13:05 ` Buytaert_Steven
2007-04-12 21:37 ` William Lee Irwin III
1 sibling, 2 replies; 15+ messages in thread
From: Andi Kleen @ 2007-04-12 13:31 UTC (permalink / raw)
To: Buytaert_Steven; +Cc: linux-kernel
Buytaert_Steven@emc.com writes:
> Since the new 2.6.x O(1) scheduler I'm having latency problems. This is
> probably due to excessive use of sched_yield in code in components I don't
> have control over. This 'problem'/behavioral change has also been reported
> for other applications (e.g. OpenLDAP, GnomeMeeting, PostgreSQL, Google, ...).
On the other hand when they fix their code to not rely on sched_yield
but use some more directed wakeup method they will end up with a much
more robust configuration. sched_yield always assumes the rest of the system
is idle, which is just wrong.
But yes the new sched_yield semantics seem to be definitely unexpected
for a lot of people.
The only way I could think of to make sched_yield work the way they
expect would be to define some way of gang scheduling and give
sched_yield semantics that it preferably yields to other members
of the gang.
But it would still be hard to get these semantics (how to define
the gangs) into your uncontrollable broken applications, and it also
has the risk of either unfairness or incomplete utilization of the
machine. Getting it to scale well on MP systems would also likely be
a challenge.
> I have analysed the sched_yield code in kernel/sched.c (2.6.16 SLES10) and have 3 questions/proposals:
>
> 1) Would it be beneficial to give each task one try to be enqueued on the
> end of the active queue (by means of a boolean flag, reset each time the
> time slices reach 0 and it is put in the expired list by scheduler_tick)?
That would still not unbreak most applications I would suspect -- they
will likely try to yield multiple times before using up a full time slice
unless their event handlers are very compute intensive.
In general, other subsystems (e.g. the VM) have had quite bad experiences
with similar "one more try" hacks -- they tend not to be robust and
actually penalize some loads.
> 2) When a task is eventually put in the expired list in sched_yield, give it
> back the full time slices round (as done in scheduler_tick), not with the
> remaining slices as is done now?
That would likely be unfair and exploitable.
> 3) Put the task in the expired list at a random position, not at the end as
> is done now?
Sounds like an interesting approach, but to do it in O(1) you would
need a new data structure with possibly much larger constant overhead.
-Andi
* RE: sched_yield proposals/rationale
2007-04-12 13:31 ` Andi Kleen
@ 2007-04-12 13:05 ` Buytaert_Steven
2007-04-12 13:27 ` Nick Piggin
2007-04-12 13:31 ` Andi Kleen
2007-04-12 21:37 ` William Lee Irwin III
1 sibling, 2 replies; 15+ messages in thread
From: Buytaert_Steven @ 2007-04-12 13:05 UTC (permalink / raw)
To: andi; +Cc: linux-kernel
> -----Original Message-----
> From: Andi Kleen
> [ ... about use of sched_yield ...]
> On the other hand when they fix their code to not rely on sched_yield
> but use [...]
Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc over
the 2.6.16 kernel yields 105 hits (note: including comments)... An interesting spot is e.g. free_more_memory() in fs/buffer.c.
> The only way I could think of to make sched_yield work the way they
> expect would be to define some way of gang scheduling and give
> sched_yield semantics that it preferably yields to other members
> of the gang. [...]
That would indeed make things too complicated.
> [about question 1]
> That would still not unbreak most applications I would suspect -- they
> will likely try to yield multiple times before using up a full time slice
> unless their event handlers are very compute intensive.
No, they only get one more go on the active queue, until they pass through the proper exit to the expired queue via scheduler_tick. And they are re-enqueued at the end of the active queue with whatever slices they have left, which is fair.
> In general, other subsystems (e.g. the VM) have had quite bad experiences
> with similar "one more try" hacks -- they tend not to be robust and
> actually penalize some loads.
>
> > 2) When a task is eventually put in the expired list in sched_yield,
> > give it back the full time slices round (as done in scheduler_tick), not
> > with the remaining slices as is done now?
>
> That would likely be unfair and exploitable.
I don't understand; how would this be more unfair than passing via scheduler_tick? Grabbing a resource with a single time slice left is IMHO more unfair towards other tasks, since you get moved to the expired list with the resource still in your possession.
>
> > 3) Put the task in the expired list at a random position, not at the end
> > is done now?
>
> Sounds like an interesting approach, but to do it in O(1) you would
> need a new data structure with possibly much larger constant overhead.
Agreed, but not dramatic. Suppose you need to insert at position X; after proper setup you would do, on the linked list:
while (X--) { prev = current; current = current->next; }
You could use a small Duff's device to reduce the X-- checking overhead.
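A Duff's-device variant of that walk might look like the following userspace sketch, with the loop-control check amortized over groups of four steps (`struct node` is an illustrative stand-in for the kernel's list element type):

```c
#include <stddef.h>

struct node { struct node *next; };

/* Advance `cur` by x links, Duff's-device style: the --n test runs
 * once per group of four pointer chases instead of once per chase. */
static struct node *advance(struct node *cur, int x)
{
    if (x <= 0)
        return cur;
    int n = (x + 3) / 4;              /* number of 4-step groups */
    switch (x % 4) {
    case 0: do { cur = cur->next;     /* falls through */
    case 3:      cur = cur->next;
    case 2:      cur = cur->next;
    case 1:      cur = cur->next;
            } while (--n > 0);
    }
    return cur;
}
```

Whether the amortization is worth the interleaved-switch obscurity on a modern branch predictor is debatable; the sketch just shows the shape of the idea.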
Thanks for the insight Andi!
Steven Buytaert
--
Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away. (Antoine de Saint-Exupéry)
* Re: sched_yield proposals/rationale
2007-04-12 13:05 ` Buytaert_Steven
@ 2007-04-12 13:27 ` Nick Piggin
2007-04-12 22:00 ` William Lee Irwin III
2007-04-12 13:31 ` Andi Kleen
1 sibling, 1 reply; 15+ messages in thread
From: Nick Piggin @ 2007-04-12 13:27 UTC (permalink / raw)
To: Buytaert_Steven; +Cc: andi, linux-kernel
Buytaert_Steven@emc.com wrote:
>>-----Original Message-----
>>From: Andi Kleen
>>[ ... about use of sched_yield ...]
>>On the other hand when they fix their code to not rely on sched_yield
>>but use [...]
>
>
> Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc over
> the 2.6.16 kernel yields 105 hits, note including comments...
Most of these (in core code, anyway) seem to use yield when they really don't
care about running for a while.
> An interesting spot is e.g. fs/buffer.c free_more_memory()
This one should be pretty rare (actually I think it is dead code in practice,
due to the way the page allocator works).
Avoiding sched_yield is a really good idea outside realtime scheduling. Since
we have gone this far with the current semantics, I think it would be sad to
back down now.
It would be nice if you could pressure those other components to adapt :)
--
SUSE Labs, Novell Inc.
* Re: sched_yield proposals/rationale
2007-04-12 13:27 ` Nick Piggin
@ 2007-04-12 22:00 ` William Lee Irwin III
0 siblings, 0 replies; 15+ messages in thread
From: William Lee Irwin III @ 2007-04-12 22:00 UTC (permalink / raw)
To: Nick Piggin; +Cc: Buytaert_Steven, andi, linux-kernel
On Thu, Apr 12, 2007 at 11:27:22PM +1000, Nick Piggin wrote:
> This one should be pretty rare (actually I think it is dead code in
> practice, due to the way the page allocator works).
> Avoiding sched_yield is a really good idea outside realtime scheduling.
> Since we have gone this far with the current semantics, I think it
> would be sad to back down now.
> It would be nice if you could pressure those other components to adapt :)
Outside of realtime scheduling there appear to be two desired behaviors:
(1) busywait: demote as aggressively as possible
(2) CPU burn: give other apps a chance to run but demote lightly at most
There is no way for the scheduler to distinguish which of the two
behaviors is desired. A fresh system call taking an argument to describe
which is the desired behavior is my recommended solution. Most unaware
apps should be able to be dealt with via LD_PRELOAD.
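The LD_PRELOAD route for unaware apps could be sketched as below: interpose sched_yield() and replace the busywait spin with a short sleep so the resource owner actually gets to run. The 1 ms interval is an arbitrary assumption, not something from this thread:

```c
#include <sched.h>
#include <time.h>

/* yieldshim.c -- build:  gcc -shared -fPIC -o yieldshim.so yieldshim.c
 * use:                   LD_PRELOAD=./yieldshim.so ./legacy_app
 *
 * Interposes sched_yield() for applications that busywait on it:
 * a short sleep demotes the caller enough for the lock owner to run. */
int sched_yield(void)
{
    struct timespec ts = { 0, 1000000 };  /* 1 ms; tuning is workload-dependent */
    return nanosleep(&ts, NULL);
}
```

A real shim distinguishing the busywait and CPU-burn cases would need the extra argument wli proposes, which LD_PRELOAD alone cannot supply; this version simply picks the busywait interpretation for everything.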
Busywaiters able to be modified could be given more specific scheduling
primitives, in particular "directed yields," which donate timeslice and
possibly dynamic priority to their targets. They would look something
like:
int yield_to(pid_t);
int yield_to_futex(int *);
int yield_to_sem(int);
/* etc. */
as userspace library functions where yielding to a resource is intended
to donate timeslice to its owner or one of its owners, where those
owner(s) are to be determined by the kernel. Directed yields are a more
direct attack on the priority inversion one most desperately wants to
avoid in the case of sched_yield() -based busywaiters on a resource,
namely the resource owner falling behind the busywaiters in priority or
running out of timeslice. They furthermore reduce the competition for
CPU between resource owners and busywaiters on that resource.
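Mainline never grew a yield_to() family, but the inversion described here is the same one that priority-inheritance futexes (merged around 2.6.18) attack: a blocked waiter lends its priority to the lock owner instead of busywaiting behind it. A minimal pthreads setup sketch:

```c
#define _GNU_SOURCE
#include <pthread.h>

/* With PTHREAD_PRIO_INHERIT, the mutex owner inherits the priority of
 * its highest-priority waiter, addressing the inversion directly
 * rather than via sched_yield() busywaiting. */
int make_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t a;
    int rc = pthread_mutexattr_init(&a);
    if (rc)
        return rc;
    rc = pthread_mutexattr_setprotocol(&a, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &a);
    pthread_mutexattr_destroy(&a);
    return rc;
}
```

This covers the mutex-owner case only; donation toward "one of several owners" of a resource, as suggested above for yield_to_sem(), has no direct pthreads equivalent.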
A less direct alternative suggested by Andi Kleen is to have coprocess
groups and an alternative to sched_yield() that directs yielding toward
a member of the same coprocess group as the yielder, possibly using the
standard system call by making that the default behavior when a process
is a member of such a coprocess group.
-- wli
* Re: sched_yield proposals/rationale
2007-04-12 13:05 ` Buytaert_Steven
2007-04-12 13:27 ` Nick Piggin
@ 2007-04-12 13:31 ` Andi Kleen
2007-04-12 14:15 ` Buytaert_Steven
1 sibling, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2007-04-12 13:31 UTC (permalink / raw)
To: Buytaert_Steven; +Cc: andi, linux-kernel
On Thu, Apr 12, 2007 at 09:05:25AM -0400, Buytaert_Steven@emc.com wrote:
> > -----Original Message-----
> > From: Andi Kleen
> > [ ... about use of sched_yield ...]
> > On the other hand when they fix their code to not rely on sched_yield
> > but use [...]
>
> Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc over
> the 2.6.16 kernel yields 105 hits, note including comments... An interesting spot is e.g. fs/buffer.c free_more_memory()
A lot of those are probably broken in some way, agreed.
> >
> > > 2) When a task is eventually put in the expired list in sched_yield,
> > > give it back the full time slices round (as done in scheduler_tick), not
> > > with the remaining slices as is done now?
> >
> > That would likely be unfair and exploitable.
>
> I don't understand; how would this be more unfair than passing via scheduler_tick? Grabbing a resource with a single time slice left is IMHO more unfair towards other tasks, since you get moved to the expired list with the resource still in your possession.
With a particular sleep pattern it could get more CPU time.
> > > 3) Put the task in the expired list at a random position, not at the end
> > > is done now?
> >
> > Sounds like an interesting approach, but to do it in O(1) you would
> > need a new data structure with possibly much larger constant overhead.
>
> Agreed, but not dramatic. Suppose you need to insert at position X; after proper setup you would do, on the linked list:
>
> while (X--) { prev = current; current = current->next; }
>
> You could use a small Duff's device to reduce the X-- checking overhead.
You would need to rename the scheduler to "sometimes O(1)" first ;-)
Besides - but I guess you're aware of it - any randomized algorithms tend
to drive benchmarkers and performance analysts crazy because their performance
cannot be repeated. So it's usually better to avoid them unless there is
really no alternative.
-Andi
* RE: sched_yield proposals/rationale
2007-04-12 13:31 ` Andi Kleen
@ 2007-04-12 14:15 ` Buytaert_Steven
2007-04-12 21:04 ` Thomas Gleixner
2007-04-12 22:32 ` Bill Davidsen
0 siblings, 2 replies; 15+ messages in thread
From: Buytaert_Steven @ 2007-04-12 14:15 UTC (permalink / raw)
To: andi; +Cc: linux-kernel
> -----Original Message-----
> > Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc
> over
> > the 2.6.16 kernel yields 105 hits, note including comments... An
> interesting spot is e.g. fs/buffer.c free_more_memory()
>
> A lot of those are probably broken in some way agreed.
OK, so there are 2 options: 1) fix 100 spots or 2) ... ;-)
> [ about giving a thread its full slice quota after yielding ]
> With a particular sleep pattern it could get more CPU time.
I'm not a Linux scheduler expert, but I don't understand this. Suppose two extreme situations:
1) I have my full time slices and call yield, going to the expired list with, again, full time slices; this is IMHO a no-op, i.e. fair.
2) I have only a single slice left. NOT yielding would get me to the expired list via scheduler_tick at the next timer interrupt, and I would get my full time slice quota. Yielding at that moment would (with my proposal) likewise give me the full time slice quota.
>
> 3) Put the task in the expired list at a random position [...]
> > while (X--) { prev = current; current = current->next; }
>
> You would need to rename the scheduler to "sometimes O(1)" first @)
Again, agreed, but this tight loop would run completely in the code cache. We could call it O(1.1) ;-) With a Duff's device it would cost a few clock cycles per position.
> Besides - but I guess you're aware of it - any randomized algorithms tend
> to drive benchmarkers and performance analysts crazy because their
> performance cannot be repeated. So it's usually better to avoid them
> unless there is really no alternative.
That could already address your concern from above. Statistically speaking, it will give them (the benchmarkers) the smoothest curve they've ever seen.
Please be aware that I'm just exploring options/insight here. It is not something I intend to push into the mainline kernel. I just want to collect reasonable and logical criticism, as you and some others have provided already. Thanks for that!
Steven Buytaert
--
Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away. (Antoine de Saint-Exupéry)
* RE: sched_yield proposals/rationale
2007-04-12 14:15 ` Buytaert_Steven
@ 2007-04-12 21:04 ` Thomas Gleixner
2007-04-12 22:32 ` Bill Davidsen
1 sibling, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2007-04-12 21:04 UTC (permalink / raw)
To: Buytaert_Steven; +Cc: andi, linux-kernel
On Thu, 2007-04-12 at 10:15 -0400, Buytaert_Steven@emc.com wrote:
> > -----Original Message-----
> > > Agreed, but $ find . -name "*.[ch]" | xargs grep -E "yield[ ]*\(" | wc
> > over
> > > the 2.6.16 kernel yields 105 hits, note including comments... An
> > interesting spot is e.g. fs/buffer.c free_more_memory()
> >
> > A lot of those are probably broken in some way agreed.
>
> OK, so there are 2 options: 1) fix 100 spots or 2) ... ;-)
Option 1: fix the 100 spots, simply because at least 95 of them are broken.
tglx
* Re: sched_yield proposals/rationale
2007-04-12 14:15 ` Buytaert_Steven
2007-04-12 21:04 ` Thomas Gleixner
@ 2007-04-12 22:32 ` Bill Davidsen
2007-04-13 5:36 ` Buytaert_Steven
1 sibling, 1 reply; 15+ messages in thread
From: Bill Davidsen @ 2007-04-12 22:32 UTC (permalink / raw)
To: Buytaert_Steven; +Cc: andi, linux-kernel
Buytaert_Steven@emc.com wrote:
>> -----Original Message-----
>> Besides - but I guess you're aware of it - any randomized
>> algorithms tend to drive benchmarkers and performance analysts
>> crazy because their performance cannot be repeated. So it's usually
>> better to avoid them unless there is really no alternative.
>
> That could already solve your concern from above. Statistically
> speaking, it will give them (benchmarkers) the smoothest curve they've
> ever seen.
>
> Please be aware that I'm just exploring options/insight here. It is
> not something I intend to push inside the mainline kernel. I just want
> to find reasonable and logic criticism as you and some others have
> provided already. Thanks for that!
And having gotten same, are you going to code up what appears to be a
solution, based on this feedback?
I'm curious how well it would run poorly written programs, having
recently worked with a company which seemed to have a whole part of
purchasing dedicated to buying same. :-(
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
* RE: sched_yield proposals/rationale
2007-04-12 22:32 ` Bill Davidsen
@ 2007-04-13 5:36 ` Buytaert_Steven
2007-04-13 19:46 ` Mark Lord
0 siblings, 1 reply; 15+ messages in thread
From: Buytaert_Steven @ 2007-04-13 5:36 UTC (permalink / raw)
To: davidsen; +Cc: andi, linux-kernel
> From: Bill Davidsen
>
> And having gotten same, are you going to code up what appears to be a
> solution, based on this feedback?
The feedback was helpful in verifying whether there are any arguments against my approach. The real proof is in the pudding.
I'm running a kernel with these changes as we speak. Overall system throughput is up about 20%. By 'system throughput' I mean the measured performance of a rather large (experimental) system. The patch isn't even 24h old... Application latency has also improved.
Additional settings: my patch is *also* running with a kernel modified to have only 8 default time slices at the 250Hz setting. And no, the overall number of context switches per second hasn't blown up. The kernel was compiled with low latency, in-kernel preemption and BKL preemption enabled. I haven't checked the patch stand-alone yet.
> I'm curious how well it would run poorly written programs, having
> recently worked with a company which seemed to have a whole part of
> purchasing dedicated to buying same. :-(
So the first signs are positive; note that it requires much more run time and a slew of other tests/scrutiny before we can be really sure.
W.r.t. the remarks: I am most interested in possible DoS attacks that could exploit this change in sched_yield. The comments from Andi were therefore interesting, but I haven't heard back from him yet. I'm still not sure how a task could juggle more slices out of the system because of these changes.
A last remark on the O(1)'ness being violated: I think it's a moot point. sched_yield executes on the CPU time of the yielder. Being O(1) matters most for the scheduler proper at each timer tick (interrupt); that being O(1) is crucial.
Steven Buytaert
--
Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away. (Antoine de Saint-Exupéry)
* Re: sched_yield proposals/rationale
2007-04-13 5:36 ` Buytaert_Steven
@ 2007-04-13 19:46 ` Mark Lord
2007-04-17 19:37 ` Bill Davidsen
0 siblings, 1 reply; 15+ messages in thread
From: Mark Lord @ 2007-04-13 19:46 UTC (permalink / raw)
To: Buytaert_Steven; +Cc: davidsen, andi, linux-kernel
Buytaert_Steven@emc.com wrote:
>> From: Bill Davidsen
>>
>> And having gotten same, are you going to code up what appears to be a
>> solution, based on this feedback?
>
> The feedback was helpful in verifying whether there are any arguments against my approach. The real proof is in the pudding.
>
> I'm running a kernel with these changes as we speak. Overall system throughput is up about 20%. By 'system throughput' I mean the measured performance of a rather large (experimental) system. The patch isn't even 24h old... Application latency has also improved.
Cool. You *do know* that there is a brand new CPU scheduler
scheduled to replace the current one for the 2.6.22 Kernel, right?
-ml
* Re: sched_yield proposals/rationale
2007-04-13 19:46 ` Mark Lord
@ 2007-04-17 19:37 ` Bill Davidsen
2007-04-18 5:40 ` Buytaert_Steven
0 siblings, 1 reply; 15+ messages in thread
From: Bill Davidsen @ 2007-04-17 19:37 UTC (permalink / raw)
To: Mark Lord; +Cc: Buytaert_Steven, andi, linux-kernel
Mark Lord wrote:
> Buytaert_Steven@emc.com wrote:
>>> From: Bill Davidsen
>>>
>>> And having gotten same, are you going to code up what appears to be a
>>> solution, based on this feedback?
>>
>> The feedback was helpful in verifying whether there are any arguments
>> against my approach. The real proof is in the pudding.
>>
>> I'm running a kernel with these changes as we speak. Overall system
>> throughput is up about 20%. By 'system throughput' I mean the measured
>> performance of a rather large (experimental) system. The patch isn't
>> even 24h old... Application latency has also improved.
>
> Cool. You *do know* that there is a brand new CPU scheduler
> scheduled to replace the current one for the 2.6.22 Kernel, right?
>
Having tried both nicksched and Con's fair sched on some normal loads,
as opposed to benchmarks, I sure hope Linus changes his mind about
having several schedulers in the kernel. The "one perfect and
self-adjusting scheduler" isn't here yet.
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
* RE: Re: sched_yield proposals/rationale
2007-04-17 19:37 ` Bill Davidsen
@ 2007-04-18 5:40 ` Buytaert_Steven
0 siblings, 0 replies; 15+ messages in thread
From: Buytaert_Steven @ 2007-04-18 5:40 UTC (permalink / raw)
To: davidsen, linux-kernel; +Cc: andi, linux-kernel
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Bill Davidsen
> Sent: Tuesday, 17 April 2007 21:38
> To: linux-kernel@vger.kernel.org
> Cc: Buytaert, Steven; andi@firstfloor.org; linux-kernel@vger.kernel.org
> Subject: Re: sched_yield proposals/rationale
>
> Mark Lord wrote:
> >
> > Cool. You *do know* that there is a brand new CPU scheduler
> > scheduled to replace the current one for the 2.6.22 Kernel, right?
> >
> Having tried both nicksched and Con's fair sched on some normal loads,
> as opposed to benchmarks, I sure hope Linus changes his mind about
> having several schedulers in the kernel. The "one perfect and
> self-adjusting scheduler" isn't here yet.
I have the same opinion, and it is still a long time out, I'm afraid. People probably only read my suggested 'fix' diagonally, let alone my footer. Too bad the archives only go back to '95; I would love to retrieve my posts from '93. Anybody still have these?
Now a bit more on topic:
1) My problem is/was solved by making the default time slice much smaller than the default of 100 in a 250Hz system. But that's only masking it away.
2) I made a schedstat program, sort of like vmstat, that samples /proc/schedstat and prints deltas each second. Just printing the jiffies delta and the number of times schedule is called per CPU is already thought-provoking; a small 12-second example:
2.6.16 vanilla scheduler

Jiffies |  CPU0 |  CPU1 |
Delta   |       |       |
    250 |  1756 |  7301 |
    252 |  1730 |  2638 |
    254 |  1963 |  1663 |
    385 |   868 |   658 | <--- stall starts
    325 |   138 |   112 |
    330 |   184 |   130 |
    339 |   682 |   122 |
    335 |   367 |   159 |
    334 |   653 |   127 |
    345 |   467 |   137 |
    335 |   673 |   128 |
    337 |   471 |   131 |
    334 |   673 |   127 |
    332 |   321 |   144 |
    333 |   523 |   129 |
    332 |    98 |   123 |
    356 |   496 |   124 |
    270 |    96 |    87 |
    277 |  5878 | 26228 | <-- yes 26K
    252 | 18263 | 19130 | ...
    255 |  2024 |  5747 | <-- back to normal
Let's talk about fairness when we get the basics right. And yes, the real physical elapsed time per line IS 1 second, so the jiffies delta jumping up from the normal expected value of 250 to as much as 356 is not normal; it's in fact very, very bad. This box was unusable for 13 seconds in a row.
But this doesn't look serious enough to most people, it seems.
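The per-second delta step such a monitor performs can be sketched as follows. The two-counter sample is a simplification; the real /proc/schedstat field layout varies by kernel version and has many more counters:

```c
#include <stdio.h>

/* One sample per interval: a jiffies counter plus per-CPU counts of
 * schedule() invocations.  A vmstat-style monitor keeps the previous
 * sample and prints successive differences. */
struct sample {
    unsigned long jiffies;
    unsigned long sched_calls[2];   /* CPU0, CPU1 */
};

static struct sample delta(struct sample prev, struct sample cur)
{
    struct sample d;
    d.jiffies = cur.jiffies - prev.jiffies;
    for (int i = 0; i < 2; i++)
        d.sched_calls[i] = cur.sched_calls[i] - prev.sched_calls[i];
    return d;
}

static void print_row(struct sample d)
{
    printf("%7lu | %5lu | %5lu |\n",
           d.jiffies, d.sched_calls[0], d.sched_calls[1]);
}
```

On a healthy 250Hz box the jiffies delta should print as 250 every second; sustained higher values mean lost ticks, which is exactly the stall visible in the table.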
Steven Buytaert
--
Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away. (Antoine de Saint-Exupéry)
* Re: sched_yield proposals/rationale
2007-04-12 13:31 ` Andi Kleen
2007-04-12 13:05 ` Buytaert_Steven
@ 2007-04-12 21:37 ` William Lee Irwin III
1 sibling, 0 replies; 15+ messages in thread
From: William Lee Irwin III @ 2007-04-12 21:37 UTC (permalink / raw)
To: Andi Kleen; +Cc: Buytaert_Steven, linux-kernel
On Thu, Apr 12, 2007 at 03:31:31PM +0200, Andi Kleen wrote:
> The only way I could think of to make sched_yield work the way they
> expect would be to define some way of gang scheduling and give
> sched_yield semantics that it preferably yields to other members
> of the gang.
> But it would still be hard to get these semantics (how to define
> the gangs) into your uncontrollable broken applications, and it also
> has the risk of either unfairness or incomplete utilization of the
> machine. Getting it to scale well on MP systems would also likely be
> a challenge.
Gang scheduling isn't a squishy concept whose name can be arbitrarily
repurposed. Perhaps "group scheduling" or similar would be appropriate
if the standard gang scheduling semantics are not what you have in mind.
Standard gang scheduling would not be appropriate for applications that
don't know what they're doing. All threads of a gang falling asleep
when one sleeps (or more properly, the gang is considered either
runnable or unrunnable as a unit) is not to be taken lightly.
I'd call this something like a "directed yield with a group as a target,"
but I wouldn't actually try to do this. I'd try to provide ways for a
directed yield to donate remaining timeslice and dynamic priority if
possible to a particular task associated with a resource, for instance,
a futex or SysV semaphore owner. The priority inversion one desperately
wants to avoid is the resource owner running out of timeslice or
otherwise losing priority to where it falls behind busywaiters such as
callers of sched_yield for the purposes of multitier locking.
-- wli
Thread overview: 15+ messages
2007-04-12 8:31 sched_yield proposals/rationale Buytaert_Steven
2007-04-12 9:48 ` Sven-Thorsten Dietrich
2007-04-12 13:31 ` Andi Kleen
2007-04-12 13:05 ` Buytaert_Steven
2007-04-12 13:27 ` Nick Piggin
2007-04-12 22:00 ` William Lee Irwin III
2007-04-12 13:31 ` Andi Kleen
2007-04-12 14:15 ` Buytaert_Steven
2007-04-12 21:04 ` Thomas Gleixner
2007-04-12 22:32 ` Bill Davidsen
2007-04-13 5:36 ` Buytaert_Steven
2007-04-13 19:46 ` Mark Lord
2007-04-17 19:37 ` Bill Davidsen
2007-04-18 5:40 ` Buytaert_Steven
2007-04-12 21:37 ` William Lee Irwin III