* Context switch times @ 2001-10-04 21:04 Mike Kravetz 2001-10-04 21:14 ` arjan 0 siblings, 1 reply; 61+ messages in thread From: Mike Kravetz @ 2001-10-04 21:04 UTC (permalink / raw) To: linux-kernel I've been working on a rewrite of our Multi-Queue scheduler and am using the lat_ctx program of LMbench as a benchmark. I'm lucky enough to have access to an 8-CPU system for use during development. One time, I 'accidentally' booted the kernel that came with the distribution installed on this machine. That kernel level is '2.2.16-22'. The results of running lat_ctx on this kernel when compared to 2.4.10 really surprised me. Here is an example: 2.4.10 on 8 CPUs: lat_ctx -s 0 -r 2 results "size=0k ovr=2.27 2 3.86 2.2.16-22 on 8 CPUs: lat_ctx -s 0 -r 2 results "size=0k ovr=1.99 2 1.44 As you can see, the context switch times for 2.4.10 are more than double what they were for 2.2.16-22 in this example. Comments? One observation I did make is that this may be related to CPU affinity/cache warmth. If you increase the number of 'TRIPS' to a very large number, you can run 'top' and observe per-CPU utilization. On 2.2.16-22, the '2 task' benchmark seemed to stay on 3 of the 8 CPUs. On 2.4.10, these 2 tasks were run on all 8 CPUs and utilization was about the same for each CPU. -- Mike Kravetz kravetz@us.ibm.com IBM Peace, Love and Linux Technology Center ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 21:04 Context switch times Mike Kravetz @ 2001-10-04 21:14 ` arjan 2001-10-04 21:25 ` David S. Miller 0 siblings, 1 reply; 61+ messages in thread From: arjan @ 2001-10-04 21:14 UTC (permalink / raw) To: Mike Kravetz; +Cc: linux-kernel In article <20011004140417.C1245@w-mikek2.des.beaverton.ibm.com> you wrote: > 2.4.10 on 8 CPUs: lat_ctx -s 0 -r 2 results > "size=0k ovr=2.27 > 2 3.86 > 2.2.16-22 on 8 CPUS: lat_ctx -s 0 -r 2 results > "size=0k ovr=1.99 > 2 1.44 > As you can see, the context switch times for 2.4.10 are more > than double what they were for 2.2.16-22 in this example. > Comments? 2.4.x supports SSE on pentium III/athlons, so the SSE registers need to be saved/restored on a task switch as well.... that's not exactly free. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 21:14 ` arjan @ 2001-10-04 21:25 ` David S. Miller 2001-10-04 21:39 ` Richard Gooch 0 siblings, 1 reply; 61+ messages in thread From: David S. Miller @ 2001-10-04 21:25 UTC (permalink / raw) To: arjan; +Cc: kravetz, linux-kernel From: arjan@fenrus.demon.nl Date: Thu, 04 Oct 2001 22:14:13 +0100 > Comments? 2.4.x supports SSE on pentium III/athlons, so the SSE registers need to be saved/restored on a taskswitch as well.... that's not exactly free. lat_ctx doesn't execute any FPU ops. So at worst this happens once on GLIBC program startup, but then never again. This assumes I understand how lazy i387 restores work in the kernel :-) Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 21:25 ` David S. Miller @ 2001-10-04 21:39 ` Richard Gooch 2001-10-04 21:52 ` David S. Miller 0 siblings, 1 reply; 61+ messages in thread From: Richard Gooch @ 2001-10-04 21:39 UTC (permalink / raw) To: David S. Miller; +Cc: arjan, kravetz, linux-kernel David S. Miller writes: > From: arjan@fenrus.demon.nl > Date: Thu, 04 Oct 2001 22:14:13 +0100 > > > Comments? > > 2.4.x supports SSE on pentium III/athlons, so the SSE registers need to be > saved/restored on a taskswitch as well.... that's not exactly free. > > lat_ctx doesn't execute any FPU ops. So at worst this happens once > on GLIBC program startup, but then never again. Has something changed? Last I looked, the whole lmbench timing harness was based on using the FPU. That was the cause of the big argument Larry and I had some years back: my context switch benchmark didn't use the FPU, and thus was more sensitive to variations (such as cache misses due to aliasing). Regards, Richard.... Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 21:39 ` Richard Gooch @ 2001-10-04 21:52 ` David S. Miller 2001-10-04 21:55 ` Benjamin LaHaise 2001-10-09 4:55 ` Richard Gooch 0 siblings, 2 replies; 61+ messages in thread From: David S. Miller @ 2001-10-04 21:52 UTC (permalink / raw) To: rgooch; +Cc: arjan, kravetz, linux-kernel From: Richard Gooch <rgooch@ras.ucalgary.ca> Date: Thu, 4 Oct 2001 15:39:05 -0600 David S. Miller writes: > lat_ctx doesn't execute any FPU ops. So at worst this happens once > on GLIBC program startup, but then never again. Has something changed? Last I looked, the whole lmbench timing harness was based on using the FPU. Oops, that's entirely possible... But things are usually laid out like this: capture_start_time(); context_switch_N_times(); capture_end_time(); So the FPU hit is only before/after the runs, not during each and every iteration. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 21:52 ` David S. Miller @ 2001-10-04 21:55 ` Benjamin LaHaise 2001-10-04 22:35 ` Davide Libenzi 2001-10-04 22:42 ` Linus Torvalds 2001-10-09 4:55 ` Richard Gooch 1 sibling, 2 replies; 61+ messages in thread From: Benjamin LaHaise @ 2001-10-04 21:55 UTC (permalink / raw) To: David S. Miller; +Cc: rgooch, arjan, kravetz, linux-kernel On Thu, Oct 04, 2001 at 02:52:39PM -0700, David S. Miller wrote: > So the FPU hit is only before/after the runs, not during each and > every iteration. Right. Plus, the original mail mentioned that it was hitting all 8 CPUs, which is a pretty good example of braindead scheduler behaviour. -ben ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 21:55 ` Benjamin LaHaise @ 2001-10-04 22:35 ` Davide Libenzi 2001-10-04 22:49 ` Mike Kravetz 2001-10-04 22:42 ` Linus Torvalds 1 sibling, 1 reply; 61+ messages in thread From: Davide Libenzi @ 2001-10-04 22:35 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: David S. Miller, rgooch, arjan, kravetz, linux-kernel On Thu, 4 Oct 2001, Benjamin LaHaise wrote: > On Thu, Oct 04, 2001 at 02:52:39PM -0700, David S. Miller wrote: > > So the FPU hit is only before/after the runs, not during each and > > every iteration. > > Right. Plus, the original mail mentioned that it was hitting all 8 > CPUs, which is a pretty good example of braindead scheduler behaviour. There was a discussion about process spinning among idle CPUs a couple of months ago. Mike, did you code the patch that sticks the task to an idle CPU between the send-IPI and the idle wakeup? At that time we simply left the issue unaddressed. - Davide ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 22:35 ` Davide Libenzi @ 2001-10-04 22:49 ` Mike Kravetz 0 siblings, 0 replies; 61+ messages in thread From: Mike Kravetz @ 2001-10-04 22:49 UTC (permalink / raw) To: Davide Libenzi Cc: Benjamin LaHaise, David S. Miller, rgooch, arjan, linux-kernel On Thu, Oct 04, 2001 at 03:35:35PM -0700, Davide Libenzi wrote: > > Right. Plus, the original mail mentioned that it was hitting all 8 > > CPUs, which is a pretty good example of braindead scheduler behaviour. > > There was a discussion about process spinning among idle CPUs a couple of > months ago. > Mike, did you code the patch that stick the task to an idle between the > send-IPI and the idle wakeup ? > At that time we simply left the issue unaddressed. I believe the 'quick and dirty' patches we came up with substantially increased context switch times for this benchmark (doubled them). The reason is that you needed to add IPI time to the context switch time. Therefore, I did not actively pursue getting these accepted. :) It appears that something in the 2.2 scheduler did a much better job of handling this situation. I haven't looked at the 2.2 code. Does anyone know what feature of the 2.2 scheduler was more successful in keeping tasks on the CPUs on which they previously executed? Also, why was that code removed from 2.4? I can research, but I suspect someone here has firsthand knowledge. -- Mike Kravetz kravetz@us.ibm.com IBM Peace, Love and Linux Technology Center ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 21:55 ` Benjamin LaHaise 2001-10-04 22:35 ` Davide Libenzi @ 2001-10-04 22:42 ` Linus Torvalds 2001-10-04 22:53 ` Benjamin LaHaise 2001-10-04 23:41 ` Mike Kravetz 1 sibling, 2 replies; 61+ messages in thread From: Linus Torvalds @ 2001-10-04 22:42 UTC (permalink / raw) To: linux-kernel In article <20011004175526.C18528@redhat.com>, Benjamin LaHaise <bcrl@redhat.com> wrote: >On Thu, Oct 04, 2001 at 02:52:39PM -0700, David S. Miller wrote: >> So the FPU hit is only before/after the runs, not during each and >> every iteration. > >Right. Plus, the original mail mentioned that it was hitting all 8 >CPUs, which is a pretty good example of braindead scheduler behaviour. Careful. That's not actually true (the braindead part, that is). We went through this with Ingo and Larry McVoy, and the sad fact is that to get the best numbers for lmbench, you simply have to do the wrong thing. Could we try to hit just two? Probably, but it doesn't really matter, though: to make the lmbench scheduler benchmark go at full speed, you want to limit it to _one_ CPU, which is not sensible in real-life situations. The amount of concurrency in the context switching benchmark is pretty small, and does not make up for bouncing the locks etc between CPU's. However, that lack of concurrency in lmbench is totally due to the artificial nature of the benchmark, and the bigger-footprint scheduling stuff (that isn't reported very much in the summary) is more realistic. So 2.4.x took the (painful) road of saying that we care less about that particular benchmark than about some other more realistic loads. Linus ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 22:42 ` Linus Torvalds @ 2001-10-04 22:53 ` Benjamin LaHaise 2001-10-05 15:13 ` Alan Cox 2001-10-04 23:41 ` Mike Kravetz 1 sibling, 1 reply; 61+ messages in thread From: Benjamin LaHaise @ 2001-10-04 22:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote: > Could we try to hit just two? Probably, but it doesn't really matter, > though: to make the lmbench scheduler benchmark go at full speed, you > want to limit it to _one_ CPU, which is not sensible in real-life > situations. The amount of concurrency in the context switching > benchmark is pretty small, and does not make up for bouncing the locks > etc between CPU's. I don't quite agree with you that it doesn't matter. A lot of tests (volanomark, other silly things) show that the current scheduler jumps processes from CPU to CPU on SMP boxes far too easily, in addition to the lengthy duration of run queue processing when loaded down. Yes, these applications are leaving too many runnable processes around, but that's the way some large app server loads behave. And right now it makes linux look bad compared to other OSes. Yes, low latency is good, but jumping around cpus adds more latency in cache misses across slow busses than is needed when the working set is already present in the 2MB L2 of your high end server. -ben ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 22:53 ` Benjamin LaHaise @ 2001-10-05 15:13 ` Alan Cox 2001-10-05 17:49 ` george anzinger ` (2 more replies) 0 siblings, 3 replies; 61+ messages in thread From: Alan Cox @ 2001-10-05 15:13 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Linus Torvalds, linux-kernel > I don't quite agree with you that it doesn't matter. A lot of tests > (volanomark, other silly things) show that the current scheduler jumps > processes from CPU to CPU on SMP boxes far too easily, in addition to the > lengthy duration of run queue processing when loaded down. Yes, these > applications are leaving too many runnable processes around, but that's > the way some large app server loads behave. And right now it makes linux > look bad compared to other OSes. The current scheduler is completely hopeless by the time you hit 3 processors and a bit flaky on two. It's partly the algorithm and partly implementation #1 We schedule tasks based on a priority that has no cache consideration #2 We have an O(tasks) loop refreshing priority that isn't needed (Montavista fix covers this) #3 We have an O(runnable) loop which isn't ideal #4 On x86 we are horribly cache pessimal. All the task structs are on the same cache colour. Multiple tasks waiting for the same event put their variables (like the wait queue) on the same cache line. The poor cache performance greatly comes from problem 1. Because we prioritize blindly on CPU usage with a tiny magic constant fudge factor we are executing totally wrong task orders. Instead we need to schedule for cache optimal behaviour, and the moderator is the fairness requirement, not the other way around. The classic example is two steady cpu loads and an occasionally waking client (like an editor) We end up scheduling (A, B are the loads, C is the editor) A C B C A C B C A C B C A whenever a key is hit we swap CPU hog because the priority balance shifted. 
Really we should have scheduled something looking more like A C A C A C B C B C B C B I would argue there are two needed priorities 1. CPU usage based priority - the current one 2. Task priority. Task priority being set to maximum when a task sleeps and then bounded by a function of max(task_priority, fn(cpupriority)) so that tasks fall down priority levels as their cpu usage grows over the set of time slices. This causes tasks that are CPU hogs to sit at the bottom of the pile with the same low task_priority meaning they run for long bursts and don't get pre-empted and switched with other hogs through each cycle of the scheduler. This isn't idle speculation - I've done some minimal playing with this but my initial re-implementation didn't handle SMP at all and I am still not 100% sure how to resolve SMP or how SMP will improve out of the current cunning plan. Alan ^ permalink raw reply [flat|nested] 61+ messages in thread
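The two-priority scheme Alan outlines could be sketched roughly as below. All names, constants, and the scaling in cpu_band() are hypothetical, chosen only to illustrate the idea: a task's priority snaps to maximum on sleep, then decays toward a band determined by its remaining CPU quota, so CPU hogs settle into the bottom band together and run in long bursts.

```c
#define PRIO_TOP 40  /* hypothetical maximum priority band */

struct task {
    int task_priority;  /* boosted to PRIO_TOP when the task sleeps */
    int counter;        /* remaining scheduler ticks in this slice  */
};

/* fn(cpupriority): map remaining quota into a priority band
 * (arbitrary illustrative scaling) */
int cpu_band(int counter)
{
    return counter / 2;
}

/* effective priority: bounded below by the CPU-usage band */
int effective_priority(const struct task *t)
{
    int band = cpu_band(t->counter);
    return t->task_priority > band ? t->task_priority : band;
}

void on_sleep(struct task *t)
{
    t->task_priority = PRIO_TOP;  /* sleepers jump to the top band */
}

void on_tick(struct task *t)
{
    if (t->counter > 0)
        t->counter--;
    /* task_priority decays toward the CPU-usage band, so a hog
     * sinks band by band instead of thrashing with its peers */
    if (t->task_priority > cpu_band(t->counter))
        t->task_priority--;
}
```

With this shape, a CPU hog that never sleeps sits near band 0 with the other hogs, while an editor that just woke outranks it until its own usage accumulates.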
* Re: Context switch times 2001-10-05 15:13 ` Alan Cox @ 2001-10-05 17:49 ` george anzinger 2001-10-05 22:29 ` Alan Cox 2001-10-06 2:24 ` Albert D. Cahalan 2001-10-07 9:57 ` Mika Liljeberg 2 siblings, 1 reply; 61+ messages in thread From: george anzinger @ 2001-10-05 17:49 UTC (permalink / raw) To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, linux-kernel Alan Cox wrote: > > > I don't quite agree with you that it doesn't matter. A lot of tests > > (volanomark, other silly things) show that the current scheduler jumps > > processes from CPU to CPU on SMP boxes far too easily, in addition to the > > lengthy duration of run queue processing when loaded down. Yes, these > > applications are leaving too many runnable processes around, but that's > > the way some large app server loads behave. And right now it makes linux > > look bad compared to other OSes. > > The current scheduler is completely hopeless by the time you hit 3 > processors and a bit flaky on two. Its partly the algorithm and partly > implementation > > #1 We schedule tasks based on a priority that has no cache > consideration > > #2 We have an O(tasks) loop refreshing priority that isnt needed > (Montavista fix covers this) > > #3 We have an O(runnable) loop which isnt ideal > > #4 On x86 we are horribly cache pessimal. All the task structs are > on the same cache colour. Multiple tasks waiting for the same event > put their variables (like the wait queue) on the same cache line. > > The poor cache performance greatly comes from problem 1. Because we prioritize > blindly on CPU usage with a tiny magic constant fudge factor we are > executing totally wrong task orders. Instead we need to schedule for cache > optimal behaviour, and the moderator is the fairness requirement, not the > other way around. 
> > The classic example is two steady cpu loads and an occasionally waking > client (like an editor) > > We end up scheduling [A, B are the loads, C is the editor) > > A C B C A C B C A C B C A > > whenever a key is hit we swap CPU hog because the priority balanced shifted. > Really we should have scheduled something looking more like > > A C A C A C B C B C B C B > > I would argue there are two needed priorities > > 1. CPU usage based priority - the current one > > 2. Task priority. > > Task priority being set to maximum when a task sleeps and then bounded by > a function of max(task_priorty, fn(cpupriority)) so that tasks fall down > priority levels as their cpu usage in the set of time slices. This causes > tasks that are CPU hogs to sit at the bottom of the pile with the same low > task_priority meaning they run for long bursts and don't get pre-empted and > switched with other hogs through each cycle of the scheduler. Let me see if I have this right. Task priority goes to max on any (?) sleep regardless of how long. And to min if it doesn't sleep for some period of time. Where does the time slice counter come into this, if at all? For what its worth I am currently updating the MontaVista scheduler so, I am open to ideas. George > > This isnt idle speculation - I've done some minimal playing with this but > my initial re-implementation didnt handle SMP at all and I am still not 100% > sure how to resolve SMP or how SMP will improve out of the current cunning > plan. > > Alan > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-05 17:49 ` george anzinger @ 2001-10-05 22:29 ` Alan Cox 2001-10-05 22:56 ` Davide Libenzi 2001-10-07 1:20 ` george anzinger 0 siblings, 2 replies; 61+ messages in thread From: Alan Cox @ 2001-10-05 22:29 UTC (permalink / raw) To: george anzinger; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel > Let me see if I have this right. Task priority goes to max on any (?) > sleep regardless of how long. And to min if it doesn't sleep for some > period of time. Where does the time slice counter come into this, if at > all? > > For what its worth I am currently updating the MontaVista scheduler so, > I am open to ideas. The time slice counter is the limit on the amount of time you can execute, the priority determines who runs first. So if you used your cpu quota you will get run reluctantly. If you slept you will get run early and as you use time slice count you will drop priority bands, but without pre-emption until you cross a band and there is another task with higher priority. This damps down task thrashing a bit, and for the cpu hogs it gets the desired behaviour - which is that they all run their full quantum in the background one after another instead of thrashing back and forth. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-05 22:29 ` Alan Cox @ 2001-10-05 22:56 ` Davide Libenzi 2001-10-05 23:04 ` Alan Cox 2001-10-07 1:20 ` george anzinger 1 sibling, 1 reply; 61+ messages in thread From: Davide Libenzi @ 2001-10-05 22:56 UTC (permalink / raw) To: Alan Cox; +Cc: george anzinger, Benjamin LaHaise, Linus Torvalds, linux-kernel On Fri, 5 Oct 2001, Alan Cox wrote: > > Let me see if I have this right. Task priority goes to max on any (?) > > sleep regardless of how long. And to min if it doesn't sleep for some > > period of time. Where does the time slice counter come into this, if at > > all? > > > > For what its worth I am currently updating the MontaVista scheduler so, > > I am open to ideas. > > The time slice counter is the limit on the amount of time you can execute, > the priority determines who runs first. > > So if you used your cpu quota you will get run reluctantly. If you slept > you will get run early and as you use time slice count you will drop > priority bands, but without pre-emption until you cross a band and there > is another task with higher priority. > > This damps down task thrashing a bit, and for the cpu hogs it gets the > desired behaviour - which is that they all run their full quantum in the > background one after another instead of thrashing back and forth What if we give to prev a priority boost P=F(T) where T is the time prev has run before the current schedule? - Davide ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-05 22:56 ` Davide Libenzi @ 2001-10-05 23:04 ` Alan Cox 2001-10-05 23:16 ` Davide Libenzi 2001-10-05 23:43 ` Roger Larsson 0 siblings, 2 replies; 61+ messages in thread From: Alan Cox @ 2001-10-05 23:04 UTC (permalink / raw) To: Davide Libenzi Cc: Alan Cox, george anzinger, Benjamin LaHaise, Linus Torvalds, linux-kernel > > This damps down task thrashing a bit, and for the cpu hogs it gets the > > desired behaviour - which is that the all run their full quantum in the > > background one after another instead of thrashing back and forth > > What if we give to prev a priority boost P=F(T) where T is the time > prev is ran before the current schedule ? That would be the wrong key. You can argue certainly that it is maybe appropriate to use some function based on remaining scheduler ticks, but that already occurs as the scheduler ticks is the upper bound for priority band ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-05 23:04 ` Alan Cox @ 2001-10-05 23:16 ` Davide Libenzi 2001-10-05 23:17 ` Alan Cox 2001-10-05 23:43 ` Roger Larsson 1 sibling, 1 reply; 61+ messages in thread From: Davide Libenzi @ 2001-10-05 23:16 UTC (permalink / raw) To: Alan Cox Cc: Davide Libenzi, george anzinger, Benjamin LaHaise, Linus Torvalds, linux-kernel On Sat, 6 Oct 2001, Alan Cox wrote: > > > This damps down task thrashing a bit, and for the cpu hogs it gets the > > > desired behaviour - which is that the all run their full quantum in the > > > background one after another instead of thrashing back and forth > > > > What if we give to prev a priority boost P=F(T) where T is the time > > prev is ran before the current schedule ? > > That would be the wrong key. You can argue certainly that it is maybe > appropriate to use some function based on remaining scheduler ticks, but > that already occurs as the scheduler ticks is the upper bound for priority > band No, i mean T = (Tstart - Tend) where : Tstart = time the current ( prev ) task has been scheduled Tend = current time ( in schedule() ) Basically it's the total time the current ( prev ) task has had the CPU - Davide ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-05 23:16 ` Davide Libenzi @ 2001-10-05 23:17 ` Alan Cox 2001-10-05 23:21 ` Davide Libenzi 0 siblings, 1 reply; 61+ messages in thread From: Alan Cox @ 2001-10-05 23:17 UTC (permalink / raw) To: Davide Libenzi Cc: Alan Cox, Davide Libenzi, george anzinger, Benjamin LaHaise, Linus Torvalds, linux-kernel > No, i mean T = (Tstart - Tend) where : > > Tstart = time the current ( prev ) task has been scheduled > Tend = current time ( in schedule() ) > > Basically it's the total time the current ( prev ) task has had the CPU Ok let me ask one question - why ? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-05 23:17 ` Alan Cox @ 2001-10-05 23:21 ` Davide Libenzi 0 siblings, 0 replies; 61+ messages in thread From: Davide Libenzi @ 2001-10-05 23:21 UTC (permalink / raw) To: Alan Cox Cc: Davide Libenzi, george anzinger, Benjamin LaHaise, Linus Torvalds, linux-kernel On Sat, 6 Oct 2001, Alan Cox wrote: > > No, i mean T = (Tstart - Tend) where : > > > > Tstart = time the current ( prev ) task has been scheduled > > Tend = current time ( in schedule() ) > > > > Basically it's the total time the current ( prev ) task has had the CPU > > Ok let me ask one question - why ? Because the longer the task has run, the higher its cache footprint is, and the higher the cache footprint, the more we'd like to pick up the exiting task to rerun. - Davide ^ permalink raw reply [flat|nested] 61+ messages in thread
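Davide's P=F(T) boost could look something like the sketch below. The function names, the linear F, and the cap are all hypothetical; the point is only that prev competes for the CPU with a bonus that grows with the time it just spent running, as a rough proxy for cache warmth.

```c
#define MAX_BOOST 8  /* hypothetical cap so a long-running hog
                      * cannot outrank everything forever */

/* F(T): half a point of priority per tick just run (illustrative) */
int cache_warmth_boost(int ran_ticks)
{
    int boost = ran_ticks / 2;
    return boost > MAX_BOOST ? MAX_BOOST : boost;
}

/* goodness()-style value for prev: its base priority plus the
 * warmth bonus, making the scheduler prefer rerunning it */
int prev_goodness(int base_priority, int ran_ticks)
{
    return base_priority + cache_warmth_boost(ran_ticks);
}
```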
* Re: Context switch times 2001-10-05 23:04 ` Alan Cox 2001-10-05 23:16 ` Davide Libenzi @ 2001-10-05 23:43 ` Roger Larsson 1 sibling, 0 replies; 61+ messages in thread From: Roger Larsson @ 2001-10-05 23:43 UTC (permalink / raw) To: Alan Cox, Davide Libenzi Cc: Alan Cox, george anzinger, Benjamin LaHaise, Linus Torvalds, linux-kernel On Saturday 06 October 2001 01:04, Alan Cox wrote: > > > This damps down task thrashing a bit, and for the cpu hogs it gets the > > > desired behaviour - which is that the all run their full quantum in the > > > background one after another instead of thrashing back and forth > > > > What if we give to prev a priority boost P=F(T) where T is the time > > prev is ran before the current schedule ? > > That would be the wrong key. You can argue certainly that it is maybe > appropriate to use some function based on remaining scheduler ticks, but > that already occurs as the scheduler ticks is the upper bound for priority > band How about a LIFO list for each processor where preempted (count != 0) tasks go? When a preemption occurs the current goes to the LIFO. When a process has run the whole of its time slot - it can be moved to a used-up queue. (No point in keeping it on the generic run queue, but remember it when giving out new ticks, could also be on a per CPU basis). Now select what to do next.. The first on the LIFO can be moved in as "current" before checking the rest of the run queue... Pros: + The LIFO will be sorted with highest prio on top - no need to search 'em all + a process that starts a time slot on one CPU stays the whole slot. Cons: - A process might get stuck waiting for the processor in the LIFO while the other CPU is idle (but that was the point, wasn't it...) /RogerL -- Roger Larsson Skellefteå Sweden ^ permalink raw reply [flat|nested] 61+ messages in thread
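Roger's per-CPU LIFO of preempted tasks could be sketched as below (names and the fixed depth are hypothetical). Because preemption only happens in favour of a higher-priority task, the most recently pushed entry is both the highest-priority and the cache-warmest, so popping from the top gives the ordering he describes without searching.

```c
#define MAXDEPTH 32

/* Per-CPU stack of preempted tasks that still have quantum left
 * (count != 0); ints stand in for task struct pointers. */
struct preempt_lifo {
    int task[MAXDEPTH];
    int top;
};

/* Called when 'current' is preempted with quantum remaining. */
void lifo_push(struct preempt_lifo *l, int tid)
{
    if (l->top < MAXDEPTH)
        l->task[l->top++] = tid;
}

/* Pop the most recently preempted task, or -1 if empty; the
 * scheduler consults this before scanning the run queue. */
int lifo_pop(struct preempt_lifo *l)
{
    return l->top ? l->task[--l->top] : -1;
}
```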
* Re: Context switch times 2001-10-05 22:29 ` Alan Cox 2001-10-05 22:56 ` Davide Libenzi @ 2001-10-07 1:20 ` george anzinger 2001-10-07 1:33 ` Bill Davidsen 2001-10-07 9:56 ` Alan Cox 1 sibling, 2 replies; 61+ messages in thread From: george anzinger @ 2001-10-07 1:20 UTC (permalink / raw) To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, linux-kernel Alan Cox wrote: > > > Let me see if I have this right. Task priority goes to max on any (?) > > sleep regardless of how long. And to min if it doesn't sleep for some > > period of time. Where does the time slice counter come into this, if at > > all? > > > > For what its worth I am currently updating the MontaVista scheduler so, > > I am open to ideas. > > The time slice counter is the limit on the amount of time you can execute, > the priority determines who runs first. > > So if you used your cpu quota you will get run reluctantly. If you slept > you will get run early and as you use time slice count you will drop > priority bands, but without pre-emption until you cross a band and there > is another task with higher priority. > > This damps down task thrashing a bit, and for the cpu hogs it gets the > desired behaviour - which is that they all run their full quantum in the > background one after another instead of thrashing back and forth If I understand this, you are decreasing the priority of a task that is running as it consumes its slice. In the current way of doing things this happens and is noticed when the scheduler gets called. A task can lose its place by being preempted by some quick I/O bound task. I.e. A and B are cpu hogs with A having the cpu, if task Z comes along and runs for 100 micro seconds, A finds itself replaced by B, whereas if Z does not come along, A will complete its slice. It seems to me that A should be first in line after Z and not bumped just because it has used some of its slice. 
This would argue for priority being set at slice renewal time and left alone until the task blocks or completes its slice. George ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 1:20 ` george anzinger @ 2001-10-07 1:33 ` Bill Davidsen 0 siblings, 0 replies; 61+ messages in thread From: Bill Davidsen @ 2001-10-07 1:33 UTC (permalink / raw) To: linux-kernel On Sat, 6 Oct 2001, george anzinger wrote: > Alan Cox wrote: > > So if you used your cpu quota you will get run reluctantly. If you slept > > you will get run early and as you use time slice count you will drop > > priority bands, but without pre-emption until you cross a band and there > > is another task with higher priority. > > > > This damps down task thrashing a bit, and for the cpu hogs it gets the > > desired behaviour - which is that they all run their full quantum in the > > background one after another instead of thrashing back and forth > > If I understand this, you are decreasing the priority of a task that is > running as it consumes its slice. In the current way of doing things > this happens and is noticed when the scheduler gets called. A task can > lose its place by being preempted by some quick I/O bound task. I.e. A > and B are cpu hogs with A having the cpu, if task Z comes along and runs > for 100 micro seconds, A finds itself replaced by B, whereas if Z does > not come along, A will complete its slice. It seems to me that A should > be first in line after Z and not bumped just because it has used some of > its slice. This would argue for priority being set at slice renewal > time and left alone until the task blocks or completes its slice. It certainly argues for continuing to run the process with the unexpired timeslice, because it is most likely to have current data in caches of every type. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 1:20 ` george anzinger 2001-10-07 1:33 ` Bill Davidsen @ 2001-10-07 9:56 ` Alan Cox 1 sibling, 0 replies; 61+ messages in thread From: Alan Cox @ 2001-10-07 9:56 UTC (permalink / raw) To: george anzinger; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel > its slice. This would argue for priority being set at slice renewal > time and left alone until the task blocks or completes its slice. That may well be the case. The banding I played with was effectively doubling the task switch rate for more interactive tasks versus the bottom feeders ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-05 15:13 ` Alan Cox 2001-10-05 17:49 ` george anzinger @ 2001-10-06 2:24 ` Albert D. Cahalan 2001-10-06 2:57 ` Davide Libenzi 2001-10-07 9:57 ` Mika Liljeberg 2 siblings, 1 reply; 61+ messages in thread From: Albert D. Cahalan @ 2001-10-06 2:24 UTC (permalink / raw) To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, linux-kernel Alan Cox writes: > [somebody] >> I don't quite agree with you that it doesn't matter. A lot of tests >> (volanomark, other silly things) show that the current scheduler jumps >> processes from CPU to CPU on SMP boxes far too easily Be careful that your viewer doesn't disturb the system. Please don't even consider using "top" for this. > #4 On x86 we are horribly cache pessimal. All the task structs are > on the same cache colour. Multiple tasks waiting for the same event > put their variables (like the wait queue) on the same cache line. If cache problems are bad enough, maybe this pays for itself: /* old */ current = stack_ptr & ~0x1fff; /* new */ hash = (stack_ptr>>8)^(stack_ptr>>12)^(stack_ptr>>16)^(stack_ptr>>20); current = (stack_ptr & ~0x1fff) | (hash & 0x1e0); > The classic example is two steady cpu loads and an occasionally waking > client (like an editor) Like "top"! ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-06 2:24 ` Albert D. Cahalan @ 2001-10-06 2:57 ` Davide Libenzi 0 siblings, 0 replies; 61+ messages in thread From: Davide Libenzi @ 2001-10-06 2:57 UTC (permalink / raw) To: Albert D. Cahalan Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel On Fri, 5 Oct 2001, Albert D. Cahalan wrote: > If cache problems are bad enough, maybe this pays for itself: > > /* old */ > current = stack_ptr & ~0x1fff; > > /* new */ > hash = (stack_ptr>>8)^(stack_ptr>>12)^(stack_ptr>>16)^(stack_ptr>>20); > current = (stack_ptr & ~0x1fff) | (hash & 0x1e0); I suggested something like this ( randomly "moved" struct task_struct * ) a couple of years ago. Never tried to measure the impact of the system though. - Davide ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-05 15:13 ` Alan Cox 2001-10-05 17:49 ` george anzinger 2001-10-06 2:24 ` Albert D. Cahalan @ 2001-10-07 9:57 ` Mika Liljeberg 2001-10-07 13:03 ` Ingo Oeser ` (2 more replies) 2 siblings, 3 replies; 61+ messages in thread From: Mika Liljeberg @ 2001-10-07 9:57 UTC (permalink / raw) To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, linux-kernel Alan Cox wrote: > This isnt idle speculation - I've done some minimal playing with this but > my initial re-implementation didnt handle SMP at all and I am still not 100% > sure how to resolve SMP or how SMP will improve out of the current cunning > plan. Here's some idle speculation on SMP to top it off. :) I tend to think that the load balancing between CPUs should be a completely separate algorithm and should not necessarily be run at every schedule(). The idea is to completely decouple the problem of scheduling a single CPU between tasks and the problem of load balancing between the CPUs, making each problem simpler to solve. Consider the following basic rules: A) When a new task comes along, pick the "least loaded" CPU and lock the new task onto that. B) Whenever the load imbalance between the least loaded CPU and the most loaded CPU becomes too great, move one or more tasks from the most loaded CPU to the least loaded CPU. The rules themselves should be self-explanatory: A provides initial load balancing, while B tries to keep the balance (with a sensible hysteresis to avoid thrashing). However, there are a few minor details to solve: 1) How to determine the load of a CPU? If we can quantify this clearly, we can easily set a hysteresis level to trigger load balancing between two CPUs. 2) When and how often to check for load imbalance? 3) How to select the task(s) that should be moved between two CPUs to correct an imbalance? For problems 1 and 2 I propose the following solution: Insert the load balancing routine itself as a (fake) task on each CPU and run it when the CPU gets around to it. 
The load balancer should behave almost like a CPU-bound task, scheduled on the lowest priority level with other runnable tasks. The last bit is important: the load balancer should not be allowed to starve but should be invoked approximately once every "full rotation" of the scheduler. With the above it is easy to estimate the load of a CPU. We can simply use the elapsed time between two invocations of the load balancer task. When the load balancer task of a particular CPU gets run, it chalks up the elapsed time on a scoreboard somewhere, and checks whether there is a significant imbalance between itself and some other CPU. If there is, it commences to move some tasks between itself and the other CPU (note rule B, though; it should be enough to mess with just two CPU queues at a time to minimize balancing and locking overhead). Problem 3 is tricky. Basically, there should be a cost/benefit function F(tasks to move) that should be minimized. Ideally F(task_i), the cost/benefit of moving a single task, would be calculated as a byproduct of the CPU scheduler algorithm. F(task_i) might be a function of the elapsed time since task_i was last scheduled and the average time slice used by task_i, to account for the probable cache hit. This would leave it up to the load balancer to move as many lowest cost tasks to a new CPU as is needed to correct the imbalance (average time slices used by each task would be needed in order to make this decision). Naturally, some additional rules might be necessary to make a task eligible for moving, e.g., never move the only/last CPU bound task to another CPU. In addition, it might actually make sense to move at most one task at each invocation of the load balancer, to further reduce the probability of thrashing. The load would still converge fairly quickly towards a balanced state. It would also scale fairly well with the number of CPUs. How does that sound? MikaL ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 9:57 ` Mika Liljeberg @ 2001-10-07 13:03 ` Ingo Oeser 2001-10-07 13:48 ` Mika Liljeberg 2001-10-07 18:39 ` Davide Libenzi 2001-10-09 20:37 ` Hubertus Franke 2 siblings, 1 reply; 61+ messages in thread From: Ingo Oeser @ 2001-10-07 13:03 UTC (permalink / raw) To: Mika Liljeberg; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel On Sun, Oct 07, 2001 at 12:57:29PM +0300, Mika Liljeberg wrote: > For problems 1 and 2 I propose the following solution: Insert the the > load balancing routine itself as a (fake) task on each CPU and run it > when the CPU gets around to it. The load balancer should behave almost > like a CPU-bound task, scheduled on the lowest priority level with other > runnable tasks. The idle-task might be (ab-)used for this, because it has perfect data for this. T_SystemElapsed - T_IdleTaskRun = T_CPULoaded Balancing could be done in schedule() itself, after checking this value for each CPU. > The last bit is important: the load balancer should not > be allowed to starve but should be invoked approximately once every > "full rotation" of the scheduler. If a artificial CPU-hog is used for this task, the idle task will never be run and power savings in the CPU are impossible. Other parts sound interesting. Regards Ingo Oeser -- .... Our continuing mission: To seek out knowledge of C, to explore strange UNIX commands, and to boldly code where no one has man page 4. --- Mike A. Harris <mharris@meteng.on.ca> ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 13:03 ` Ingo Oeser @ 2001-10-07 13:48 ` Mika Liljeberg 2001-10-07 14:24 ` Ingo Oeser 0 siblings, 1 reply; 61+ messages in thread From: Mika Liljeberg @ 2001-10-07 13:48 UTC (permalink / raw) To: Ingo Oeser; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel Ingo Oeser wrote: > > On Sun, Oct 07, 2001 at 12:57:29PM +0300, Mika Liljeberg wrote: > > For problems 1 and 2 I propose the following solution: Insert the > > load balancing routine itself as a (fake) task on each CPU and run it > > when the CPU gets around to it. The load balancer should behave almost > > like a CPU-bound task, scheduled on the lowest priority level with other > > runnable tasks. > > The idle-task might be (ab-)used for this, because it has perfect > data for this. > > T_SystemElapsed - T_IdleTaskRun = T_CPULoaded The problem here is that the idle task doesn't get run at all if there are runnable tasks. With a little bit of tweaking, the idle task context could probably be used for this, I guess. However, I was thinking more along the lines that the load balancer would not be a real task but just a routine that would be executed in schedule() when the fake task comes up (i.e. when the scheduler has done a "full rotation"). > Balancing could be done in schedule() itself, after checking this > value for each CPU. The point about separating the load balancer from the scheduler is that you don't need to run it at every task switch. You can make the scheduler very simple if you don't have to scan queues looking for likely switchover candidates. This saves time in context switches. One interesting property of the load balancer tasks would be that the less heavily loaded CPUs would tend to execute the load balancer more often, actively relieving the more heavily loaded CPUs, while those would concentrate more on getting the job done. 
Come to think of it, it could be coded in such a way that only the least loaded CPU would execute the load balancing algorithm, while the others would simply chalk up elapsed times. > > The last bit is important: the load balancer should not > > be allowed to starve but should be invoked approximately once every > > "full rotation" of the scheduler. > > If a artificial CPU-hog is used for this task, the idle task will > never be run and power savings in the CPU are impossible. Right. Obviously, if a CPU has no runnable tasks and the load balancer can't acquire any from other CPUs, it should yield to the idle task rather than hog the CPU. MikaL ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 13:48 ` Mika Liljeberg @ 2001-10-07 14:24 ` Ingo Oeser 2001-10-07 14:33 ` Mika Liljeberg 0 siblings, 1 reply; 61+ messages in thread From: Ingo Oeser @ 2001-10-07 14:24 UTC (permalink / raw) To: Mika Liljeberg; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel On Sun, Oct 07, 2001 at 04:48:30PM +0300, Mika Liljeberg wrote: > Ingo Oeser wrote: > > The idle-task might be (ab-)used for this, because it has perfect > > data for this. > > > > T_SystemElapsed - T_IdleTaskRun = T_CPULoaded > > The problem here is that the idle task doesn't get run at all if there > are runnable tasks. There are idle tasks per CPU. If there are runnable tasks and the idle-task of a CPU is running, it is not fully loaded at this time. No idle task is running if all CPUs are fully loaded, AFAIR. > One interesting property of the load balancer tasks would be that the > less heavily loaded CPUs would tend to execute the load balancer more > often, actively relieving the more heavily loaded CPUs, while those > would concentrate more on getting the job done. So you suggest only one load balancer task jumping from CPU to CPU. I misunderstood it. _Only_ chalking up could certainly be done by the idle tasks. Regards Ingo Oeser ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 14:24 ` Ingo Oeser @ 2001-10-07 14:33 ` Mika Liljeberg 2001-10-07 18:00 ` george anzinger 2001-10-08 15:19 ` Context switch times bill davidsen 0 siblings, 2 replies; 61+ messages in thread From: Mika Liljeberg @ 2001-10-07 14:33 UTC (permalink / raw) To: Ingo Oeser; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel Ingo Oeser wrote: > There are idle tasks per CPU. If there are runnable tasks and the > idle-task of a CPU is running it, it is not fully loaded at this > time. > > No idle task is running, if all CPUs are fully loaded AFAIR. Yes. However, you still want to balance the queues even if all CPUs are 100% utilized. It's a fairness issue. Otherwise you could have 1 task running on one CPU and 49 tasks on another. MikaL ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 14:33 ` Mika Liljeberg @ 2001-10-07 18:00 ` george anzinger 2001-10-07 22:06 ` Mika Liljeberg 2001-10-08 15:19 ` Context switch times bill davidsen 1 sibling, 1 reply; 61+ messages in thread From: george anzinger @ 2001-10-07 18:00 UTC (permalink / raw) To: Mika Liljeberg Cc: Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel Mika Liljeberg wrote: > > Ingo Oeser wrote: > > There are idle tasks per CPU. If there are runnable tasks and the > > idle-task of a CPU is running it, it is not fully loaded at this > > time. > > > > No idle task is running, if all CPUs are fully loaded AFAIR. > > Yes. However, you still want to balance the queues even if all CPUs are > 100% utilized. It's a fairness issue. Otherwise you could have 1 task > running on one CPU and 49 tasks on another. > On the face of it this only seems unfair. I think what we want to do is to allocate the cpu resources among competing tasks as dictated by their NICE values. If each task gets its allotted share it should not matter if one of them is monopolizing one cpu. This is not to say that things will work out this way, but to say that this is not the measure, nor the thing to look at. I suggest that, using the "recalculate tick" as a measure of time (I know it is not very linear) when a cpu finds itself with nothing to do (because all its tasks have completed their slices or blocked) and other cpus have tasks in their queues it is time to "shop" for a new task to run. This would then do load balancing just before each "recalculate tick". Of course, the above assumes that each cpu has its own run queue. George ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 18:00 ` george anzinger @ 2001-10-07 22:06 ` Mika Liljeberg 2001-10-07 22:31 ` Davide Libenzi 2001-10-07 23:49 ` george anzinger 0 siblings, 2 replies; 61+ messages in thread From: Mika Liljeberg @ 2001-10-07 22:06 UTC (permalink / raw) To: george anzinger Cc: Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel, Davide Libenzi george anzinger wrote: > On the face of it this only seems unfair. I think what we want to do is > to allocate the cpu resources among competing tasks as dictated by their > NICE values. If each task gets its allotted share it should not matter > if one of them is monopolizing one cpu. This is not to say that things > will work out this way, but to say that this is not the measure, nor the > thing to look at. I'm not sure I fully understood what you're driving at, but surely "each task getting its allotted share" implies that the tasks are correctly balanced across CPUs? I take your point that if you're only interested in the total completion time of certain set of tasks, it doesn't matter how they are divided among CPUs as long as each CPU is crunching at maximum capacity. However, if you're interested in the completion time of a particular task (or if the task is mostly CPU bound but with occasional interaction with the user), you need to provide fairness all the way through the scheduling process. This, IMHO, implies balancing the tasks across CPUs. How the balance is determined is another issue, though. I basically proposed summing the time slices consumed by tasks executing on a single CPU and using that as the comparison value. Davide Libenzi, on the other hand, simply proposed using the number of tasks. 
I would contend that if you can evenly divide a particular load across multiple CPUs, you can then run a pre-emptive scheduler on each CPUs run queue independently, without regard for the other CPUs, and still come out with high CPU utilization and reasonable completion times for all tasks. This has the major advantage of limiting spinlock contention to the load balancer, instead of risking it at every schedule(). > I suggest that, using the "recalculate tick" as a measure of time (I > know it is not very linear) I'll take your word for it as I'm not very familiar with the current scheduler. The main thing is to do the rebalancing periodically and to have a reliable way of estimating the loading (not utilization) of each CPU. > when a cpu finds itself with nothing to do > (because all its tasks have completed their slices or blocked) and other > cpus have tasks in their queues it is time to "shop" for a new task to > run. This again has the problem that you only look for new tasks if you have nothing to do. While this would work if you only look at the total completion time of a set of tasks, it does nothing to ensure fair scheduling for a particular task. > This would then do load balancing just before each "recalculate > tick". > Of course, the above assumes that each cpu has its own run queue. > > George Regards, MikaL ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 22:06 ` Mika Liljeberg @ 2001-10-07 22:31 ` Davide Libenzi 2001-10-07 22:33 ` Mika Liljeberg 2001-10-07 23:49 ` george anzinger 1 sibling, 1 reply; 61+ messages in thread From: Davide Libenzi @ 2001-10-07 22:31 UTC (permalink / raw) To: Mika Liljeberg Cc: george anzinger, Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel, Davide Libenzi On Mon, 8 Oct 2001, Mika Liljeberg wrote: > How the balance is determined is another issue, though. I basically > proposed summing the time slices consumed by tasks executing on a single > CPU and using that as the comparison value. Davide Libenzi, on the other > hand, simply proposed using the number of tasks. CPU based number of __running__ tasks. - Davide ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 22:31 ` Davide Libenzi @ 2001-10-07 22:33 ` Mika Liljeberg 0 siblings, 0 replies; 61+ messages in thread From: Mika Liljeberg @ 2001-10-07 22:33 UTC (permalink / raw) To: Davide Libenzi Cc: george anzinger, Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel Davide Libenzi wrote: > > On Mon, 8 Oct 2001, Mika Liljeberg wrote: > > > How the balance is determined is another issue, though. I basically > > proposed summing the time slices consumed by tasks executing on a single > > CPU and using that as the comparison value. Davide Libenzi, on the other > > hand, simply proposed using the number of tasks. > > CPU based number of __running__ tasks. That's what I meant, sorry for not making it clear. I suppose in practice this would amount to using the number of _CPU bound_ running tasks per CPU, since the non-CPU bound tasks would likely be waiting rather than running? This would then be essentially equivalent to (and simpler than) my proposal, i.e. using the sum of the time slices. > - Davide MikaL ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 22:06 ` Mika Liljeberg 2001-10-07 22:31 ` Davide Libenzi @ 2001-10-07 23:49 ` george anzinger 2001-10-08 21:07 ` Mika Liljeberg 1 sibling, 1 reply; 61+ messages in thread From: george anzinger @ 2001-10-07 23:49 UTC (permalink / raw) To: Mika Liljeberg Cc: Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel, Davide Libenzi Mika Liljeberg wrote: > > george anzinger wrote: > > On the face of it this only seems unfair. I think what we want to do is > > to allocate the cpu resources among competing tasks as dictated by their > > NICE values. If each task gets its allotted share it should not matter > > if one of them is monopolizing one cpu. This is not to say that things > > will work out this way, but to say that this is not the measure, nor the > > thing to look at. > > I'm not sure I fully understood what you're driving at, but surely "each > task getting its allotted share" implies that the tasks are correctly > balanced across CPUs? Actually I am trying to recast your requirement (for fairness) into something that is clean and easy to code :) Keeping track of the number of tasks a cpu executes over a given period of time, while possible, does not seem to me to give a very good indication of its load. > I take your point that if you're only interested > in the total completion time of certain set of tasks, it doesn't matter > how they are divided among CPUs as long as each CPU is crunching at > maximum capacity. However, if you're interested in the completion time > of a particular task (or if the task is mostly CPU bound but with > occasional interaction with the user), you need to provide fairness all > the way through the scheduling process. This, IMHO, implies balancing > the tasks across CPUs. I don't dispute balance, I am just trying to suggest a way to do it that IMHO is relatively easy. > > How the balance is determined is another issue, though. 
I basically > proposed summing the time slices consumed by tasks executing on a single > CPU and using that as the comparison value. Davide Libenzi, on the other > hand, simply proposed using the number of tasks. In practice, in a given time (say a "recalculate period") a given cpu with a given set of tasks will end up doing a fixed amount of work (provided we don't allow idle tasks). This could be stated as a sum of time slices or CPU jiffies if you like. As long as loads are passed from cpu to cpu to make them all end a "recalculate period" at the same time, I would contend we have balance. Counting tasks is problematic as tasks come and go and present varying loads. The real trick is in figuring which task to move, or if none should be moved. > > I would contend that if you can evenly divide a particular load across > multiple CPUs, you can then run a pre-emptive scheduler on each CPU's run > queue independently, without regard for the other CPUs, and still come > out with high CPU utilization and reasonable completion times for all > tasks. This has the major advantage of limiting spinlock contention to > the load balancer, instead of risking it at every schedule(). Exactly. > > > I suggest that, using the "recalculate tick" as a measure of time (I > > know it is not very linear) > > I'll take your word for it as I'm not very familiar with the current > scheduler. The main thing is to do the rebalancing periodically and to > have a reliable way of estimating the loading (not utilization) of each > CPU. The current scheduler periodically calculates new time slices for each task. It does this when no task in the run queue has any slice time left. The slice size each task gets is related to the NICE value. In practice then (other things being equal) each task will get a portion of the available processing time that is related to its NICE value. I think this is fair. But, of course, other things are not equal. 
When a task is blocked and the recalculation happens, that task gets to keep 1/2 of its remaining slice as well as the new slice. Since, in addition to slice time, the remaining slice value is used as a major component of the task's priority, this means that the blocked task, if it is blocked during a recalculate, will get a boosted priority. I think what Alan is suggesting is that the priority boost is needed for blocked tasks, but not necessarily the increased slice time. > > > when a cpu finds itself with nothing to do > > (because all its tasks have completed their slices or blocked) and other > > cpus have tasks in their queues it is time to "shop" for a new task to > > run. > > This again has the problem that you only look for new tasks if you have > nothing to do. While this would work if you only look at the total > completion time of a set of tasks, it does nothing to ensure fair > scheduling for a particular task. But doesn't the f(NICE) used as a slice do this? The point is that we will not allow a recalculate until all runnable tasks with non-zero slice times have run their slice times to zero. This ensures that all tasks that can use the cpu will get their fair share of time each "recalculate period". This is pretty much what is done today across the whole run queue. The enhancement we are talking about is separating tasks by cpu and pretty much keeping them on one cpu. > > > This would then do load balancing just before each "recalculate > > tick". > > Of course, the above assumes that each cpu has its own run queue. > > > > George > > Regards, > > MikaL ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 23:49 ` george anzinger @ 2001-10-08 21:07 ` Mika Liljeberg 2001-10-08 22:54 ` discontig physical memory Petko Manolov 0 siblings, 1 reply; 61+ messages in thread From: Mika Liljeberg @ 2001-10-08 21:07 UTC (permalink / raw) To: george anzinger Cc: Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel, Davide Libenzi george anzinger wrote: > > Mika Liljeberg wrote: > > I'm not sure I fully understood what you're driving at, but surely "each > > task getting its allotted share" implies that the tasks are correctly > > balanced across CPUs? > > Actually I am trying to recast your requirement (for fairness) into > something that is clean and easy to code :) Fair enough. ;) > The current schedule periodically calculates new time slices for each > task. It does this when no task in the run queue has any slice time > left. The slice size each task gets is related to the NICE value. In > practice then (other things being equal) each task will get a portion of > the available processing time that is related to its NICE value. I > think this is fair. But, of course, other things are not equal. When a > task is blocked and the recalculation happens, that task gets to keep > 1/2 of it remaining slice as well as the new slice. Since, in addition > to slice time, the remaining slice value is used as a major component of > the task's priority, this means that the blocked task, if it is blocked > during a recalculate, will get a boosted priority. Thanks for the concise explanation. I follow you a little better now. > I think what Alan is suggesting is that the priority boost is needed for > blocked tasks, but not necessarily the increased slice time. I have to agree. If the job is I/O bound, it's unlikely to need the extra CPU time. 
> > > when a cpu finds itself with nothing to do > > > (because all its tasks have completed their slices or blocked) and other > > > cpus have tasks in their queues it is time to "shop" for a new task to > > > run. > > > > This again has the problem that you only look for new tasks if you have > > nothing to do. While this would work if you only look at the total > > completion time of a set of tasks, it does nothing to ensure fair > > scheduling for a particular task. > > But doesn't the f(NICE) used as a slice do this? The point is that we > will not allow a recalculate until all runnable tasks with non-zero > slice times have run their slice times to zero. This ensures that all > tasks that can use the cpu will get their fair share of time each > "recalculate period". Ok, I think I get you. I didn't realize that there was already a strict fairness enforcing mechanism built into the scheduler. So, you're proposing to do a rebalance when the first CPU runs out of things to do, which will happen sometime before the recalculate period expires? Sounds reasonable, but I think you will see a flurry of rebalancing when the recalculate period is ending and the CPUs are starting to run out of things to do. This will cause the load balancer to frantically try to spread the last few tasks with remaining time slices across the CPUs. This is exactly the kind of thrashing that we would like to avoid. > This is pretty much what is done today across the whole run queue. The > enhancement we are talking about is separating tasks by cpu and pretty > much keeping them on one cpu. Exactly. > > > This would then do load balancing just before each "recalculate > > > tick". > > > Of course, the above assumes that each cpu has its own run queue. > > > > > > George Cheers, MikaL ^ permalink raw reply [flat|nested] 61+ messages in thread
* discontig physical memory 2001-10-08 21:07 ` Mika Liljeberg @ 2001-10-08 22:54 ` Petko Manolov 2001-10-08 23:05 ` David S. Miller 0 siblings, 1 reply; 61+ messages in thread From: Petko Manolov @ 2001-10-08 22:54 UTC (permalink / raw) To: linux-kernel Hi guys, I ran into this problem with a weird SOC design. The physical memory map looks like this: 0 - 4MB: DMA-able embedded RAM; 4MB - 16MB: nothing here; 16MB - 32MB: external RAM. Embedded controllers (FB/USB) can see only the lowest 4MB and they need almost all of it for buffers. The kernel is living at physical address 16MB. Any ideas how to make the lowest 4MB allocatable through kmalloc(size, GFP_DMA) without breaking the kernel? Petko PS: I've seen CONFIG_DISCONTIGMEM but it is not yet implemented for MIPS and I am not sure if it is what is required in this case. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: discontig physical memory 2001-10-08 22:54 ` discontig physical memory Petko Manolov @ 2001-10-08 23:05 ` David S. Miller 2001-10-08 23:18 ` Petko Manolov 0 siblings, 1 reply; 61+ messages in thread From: David S. Miller @ 2001-10-08 23:05 UTC (permalink / raw) To: pmanolov; +Cc: linux-kernel From: Petko Manolov <pmanolov@Lnxw.COM> Date: Mon, 08 Oct 2001 15:54:30 -0700 Any ideas how to make lowest 4MB allocatable throu kmalloc(size, GFP_DMA) without breaking the kernel? Setup ZONE_DMA correctly at boot time. See the code around the free_area_init() call in arch/i386/mm/init.c for an example of how to do this. Intel does it for the low 16MB, you would just do it for the low 4MB. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: discontig physical memory 2001-10-08 23:05 ` David S. Miller @ 2001-10-08 23:18 ` Petko Manolov 2001-10-08 23:29 ` David S. Miller 0 siblings, 1 reply; 61+ messages in thread From: Petko Manolov @ 2001-10-08 23:18 UTC (permalink / raw) To: David S. Miller; +Cc: linux-kernel I already did that. I fixed MAX_DMA_ADDRESS and the kernel now reports: zone(0): 1024 pages zone(1): 7168 pages /* this should be 4096 */ ... Which is not true, as we have a gap in physical memory between 4MB and 16MB. I also played with add_memory_region(), but the kernel still thinks we have 32MB of RAM instead of the actual 20MB. It also locks solid in __get_free_pages() in mem_init(). Probably this is related to the wrong zone(1) assumption, right? best, Petko "David S. Miller" wrote: > > From: Petko Manolov <pmanolov@Lnxw.COM> > Date: Mon, 08 Oct 2001 15:54:30 -0700 > > Any ideas how to make lowest 4MB allocatable throu > kmalloc(size, GFP_DMA) without breaking the kernel? > > Setup ZONE_DMA correctly at boot time. > > See the code around the free_area_init() call in > arch/i386/mm/init.c for an example of how to do this. > Intel does it for the low 16MB, you would just do it > for the low 4MB. > > Franks a lot, > David S. Miller > davem@redhat.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: discontig physical memory 2001-10-08 23:18 ` Petko Manolov @ 2001-10-08 23:29 ` David S. Miller 2001-10-09 0:34 ` Petko Manolov 2001-10-09 0:36 ` Petko Manolov 0 siblings, 2 replies; 61+ messages in thread From: David S. Miller @ 2001-10-08 23:29 UTC (permalink / raw) To: pmanolov; +Cc: linux-kernel From: Petko Manolov <pmanolov@Lnxw.COM> Date: Mon, 08 Oct 2001 16:18:25 -0700 I already did that. I fixed MAX_DMA_ADDRESS and the kernel now reports: zone(0): 1024 pages zone(1): 7168 pages /* this should be 4096 */ ... Which is not true as we have a gap between 4 - 16MB phys memory. I also played with add_memory_region(), but the kernel still think we have 32MB ram instead of 20MB as is the case. You need to setup bootmem correctly, earlier on, such that you do not tell the kernel that there are any pages in this 4 - 16MB region. Do something like this instead of whatever your bootmem calls are doing now: bootmap_size = init_bootmem(0, (32 * 1024 * 1024)); free_bootmem((4 * 1024 * 1024), ((16 - 4) * 1024 * 1024)); Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: discontig physical memory 2001-10-08 23:29 ` David S. Miller @ 2001-10-09 0:34 ` Petko Manolov 2001-10-09 0:36 ` Petko Manolov 1 sibling, 0 replies; 61+ messages in thread From: Petko Manolov @ 2001-10-09 0:34 UTC (permalink / raw) To: David S. Miller; +Cc: linux-kernel "David S. Miller" wrote: > > You need to setup bootmem correctly, earlier on, such that > you do not tell the kernel that there are any pages in this > 4 - 16MB region. > > Do something like this instead of whatever your bootmem > calls are doing now: > > bootmap_size = init_bootmem(0, (32 * 1024 * 1024)); > free_bootmem((4 * 1024 * 1024), > ((16 - 4) * 1024 * 1024)); ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: discontig physical memory 2001-10-08 23:29 ` David S. Miller 2001-10-09 0:34 ` Petko Manolov @ 2001-10-09 0:36 ` Petko Manolov 2001-10-09 1:37 ` David S. Miller 1 sibling, 1 reply; 61+ messages in thread From: Petko Manolov @ 2001-10-09 0:36 UTC (permalink / raw) To: David S. Miller; +Cc: linux-kernel "David S. Miller" wrote: > > You need to setup bootmem correctly, earlier on, such that > you do not tell the kernel that there are any pages in this > 4 - 16MB region. > > Do something like this instead of whatever your bootmem > calls are doing now: > > bootmap_size = init_bootmem(0, (32 * 1024 * 1024)); > free_bootmem((4 * 1024 * 1024), > ((16 - 4) * 1024 * 1024)); This is supposed to tell the kernel about the gap? Petko ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: discontig physical memory 2001-10-09 0:36 ` Petko Manolov @ 2001-10-09 1:37 ` David S. Miller 2001-10-09 2:43 ` Petko Manolov 0 siblings, 1 reply; 61+ messages in thread From: David S. Miller @ 2001-10-09 1:37 UTC (permalink / raw) To: pmanolov; +Cc: linux-kernel From: Petko Manolov <pmanolov@Lnxw.COM> Date: Mon, 08 Oct 2001 17:36:11 -0700 "David S. Miller" wrote: > Do something like this instead of whatever your bootmem > calls are doing now: > > bootmap_size = init_bootmem(0, (32 * 1024 * 1024)); > free_bootmem((4 * 1024 * 1024), > ((16 - 4) * 1024 * 1024)); This is suppose to tell the kernel about the gap? Precisely. How else did you expect to let the kernel know? Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: discontig physical memory 2001-10-09 1:37 ` David S. Miller @ 2001-10-09 2:43 ` Petko Manolov 0 siblings, 0 replies; 61+ messages in thread From: Petko Manolov @ 2001-10-09 2:43 UTC (permalink / raw) To: David S. Miller; +Cc: linux-kernel "David S. Miller" wrote: > > > bootmap_size = init_bootmem(0, (32 * 1024 * 1024)); > > free_bootmem((4 * 1024 * 1024), > > ((16 - 4) * 1024 * 1024)); > > This is suppose to tell the kernel about the gap? > > Precisely. How else did you expect to let the kernel know? Unfortunately this didn't work. I also tried to declare a 16MB hole starting from offset 0, so the kernel would boot from phys addr 0x01000000. It crashed when it tried to execute kernel_thread(). Do you think there is a mechanism to tell the kernel that the first 16MB is a non-DMA-able area? Petko ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 14:33 ` Mika Liljeberg 2001-10-07 18:00 ` george anzinger @ 2001-10-08 15:19 ` bill davidsen 2001-10-10 6:07 ` Mike Fedyk 1 sibling, 1 reply; 61+ messages in thread From: bill davidsen @ 2001-10-08 15:19 UTC (permalink / raw) To: linux-kernel In article <3BC067BB.73AF1EB5@welho.com> Mika.Liljeberg@welho.com wrote: >Yes. However, you still want to balance the queues even if all CPUs are >100% utilized. It's a fairness issue. Otherwise you could have 1 task >running on one CPU and 49 tasks on another. You say that as if it were a bad thing... I believe that if you have one long running task and many small tasks in the system CPU affinity will make that happen now. Obviously not if all CPUs are 100% loaded, and your 1 vs. 49 is unrealistic, but having a task stay with a CPU while trivia run on other CPU(s) is generally a good thing under certain load conditions, which I guess are no less likely than your example;-) -- bill davidsen <davidsen@tmr.com> "If I were a diplomat, in the best case I'd go hungry. In the worst case, people would die." -- Robert Lipe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-08 15:19 ` Context switch times bill davidsen @ 2001-10-10 6:07 ` Mike Fedyk 0 siblings, 0 replies; 61+ messages in thread From: Mike Fedyk @ 2001-10-10 6:07 UTC (permalink / raw) To: linux-kernel On Mon, Oct 08, 2001 at 11:19:49AM -0400, bill davidsen wrote: > In article <3BC067BB.73AF1EB5@welho.com> Mika.Liljeberg@welho.com wrote: > > >Yes. However, you still want to balance the queues even if all CPUs are > >100% utilized. It's a fairness issue. Otherwise you could have 1 task > >running on one CPU and 49 tasks on another. > > You say that as if it were a bad thing... I believe that if you have > one long running task and many small tasks in the system CPU affinity > will make that happen now. Obviously not if all CPUs are 100% loaded, > and your 1 vs. 49 is unrealistic, but having a task stay with a CPU > while trivia run on other CPU(s) is generally a good thing under certain > load conditions, which I guess are no less likely than your example;-) > Let's take an actual load example (which I just happen to be doing now): On my 2x366 celeron smp box, I have a kernel compile running (-j15), and a bzip2 compressing several large files... Right now (2.4.10-ac4, compiling 2.4.10-ac10+preempt), both of my L2 processor caches are moot (128KB) because of the several gcc processes running, and the very loose affinity. In this case, bzip2 should probably get a processor (CPU0) to itself, and my kernel compile another (CPU1). It would be interesting to see how a system with a kind of hierarchy among the processors would behave. All new processes would always start on cpu0, and long running processes would get slowly moved up the list of processors the longer they run. Thus, CPU0 would have abysmal L2 cache locality, and CPU7 would have the best caching possible with its L2. What do you guys think? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 9:57 ` Mika Liljeberg 2001-10-07 13:03 ` Ingo Oeser @ 2001-10-07 18:39 ` Davide Libenzi 2001-10-09 20:37 ` Hubertus Franke 2 siblings, 0 replies; 61+ messages in thread From: Davide Libenzi @ 2001-10-07 18:39 UTC (permalink / raw) To: Mika Liljeberg; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel On Sun, 7 Oct 2001, Mika Liljeberg wrote: > Alan Cox wrote: > > This isnt idle speculation - I've done some minimal playing with this but > > my initial re-implementation didnt handle SMP at all and I am still not 100% > > sure how to resolve SMP or how SMP will improve out of the current cunning > > plan. > > Here's some idle speculation on SMP to top it off. :) I tend to think > that the load balancing between CPUs should be a completely separate > algorithim and should not necessarily be run at every schedule(). The > idea is to compeletely decouple the problem of scheduling a single CPU > between tasks and the problem of load balancing between the CPUs, making > each problem simpler to solve. > > Consider the following basic rules: > > A) When a new task comes along, pick the "least loaded" CPU and lock the > new task onto that. > B) Whenever the load imbalance between least loaded CPU and most loaded > CPU becomes too great, move one or more tasks from most loaded CPU to > the least loaded CPU. > > The rules themselves should be self-explanatory: A provides initial load > balancing, while B tries to keep the balance (with a sensible hysteresis > to avoid thrashing). However, there are a few minor details to solve: > > 1) How to determine the load of a CPU? If we can quantify this clearly, > we can easily set a hysteresis level to trigger load balancing between > two CPUs. > 2) When and how often to check for load imbalance? > 3) How to select the task(s) that should be moved between two CPUs to > correct an imbalance? 
> > For problems 1 and 2 I propose the following solution: Insert the the > load balancing routine itself as a (fake) task on each CPU and run it > when the CPU gets around to it. The load balancer should behave almost > like a CPU-bound task, scheduled on the lowest priority level with other > runnable tasks. The last bit is important: the load balancer should not > be allowed to starve but should be invoked approximately once every > "full rotation" of the scheduler. > > With the above it is easy to estimate the load of a CPU. We can simply > use the elapsed time between two invokations of the load balancer task. > When the load balancer task of a particular CPU gets run, it chalks up > the elapsed time on a score board somewhere, and checks whether there is > a significant imbalance between itself and some other CPU. If there is, > it commences to move some tasks between itself and the other CPU (note > rule B, though, it should be enough to mess with just two CPU queues at > a time to minimize balancing and locking overhead). > > Problem 3 is tricky. Basically, there should be a cost/benefit function > F(tasks to move) that should be minimized. Ideally F(task_i), the > cost/benefit of moving a single task, would be calculated as a byproduct > of the CPU scheduler algorithm. > > F(task_i) might be function of elapsed time since task_i was last > scheduled and the average time slice used by task_i, to account for the > probable cache hit. This would leave it up to the load balancer to move > as many lowest cost tasks to a new CPU as is needed to correct the > imbalance (average time slices used by each task would be needed in > order to make this decision). > > Naturally, some additional rules might be necessary to make a task > eligible for moving, e.g., never move the only/last CPU bound task to > another CPU. 
In addition, it might actually make sense to move at most > one task at each invocation of the load balancer, to further reduce the > probability of thrashing. The load would still converge fairly quickly > towards a balanced state. It would also scale fairly well with the > number of CPUs. > > How does that sound? To measure the load of a CPU, "nr_running" (per CPU) is probably the best choice. Anyway, there's some work already done: http://lse.sourceforge.net/scheduling/LB/poolMQ.html - Davide ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-07 9:57 ` Mika Liljeberg 2001-10-07 13:03 ` Ingo Oeser 2001-10-07 18:39 ` Davide Libenzi @ 2001-10-09 20:37 ` Hubertus Franke 2001-10-09 23:50 ` george anzinger 2 siblings, 1 reply; 61+ messages in thread From: Hubertus Franke @ 2001-10-09 20:37 UTC (permalink / raw) To: Mika Liljeberg; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel * Mika Liljeberg <Mika.Liljeberg@welho.com> [20011007 05;57]:" > Alan Cox wrote: > > This isnt idle speculation - I've done some minimal playing with this but > > my initial re-implementation didnt handle SMP at all and I am still not 100% > > sure how to resolve SMP or how SMP will improve out of the current cunning > > plan. > > Here's some idle speculation on SMP to top it off. :) I tend to think > that the load balancing between CPUs should be a completely separate > algorithim and should not necessarily be run at every schedule(). The > idea is to compeletely decouple the problem of scheduling a single CPU > between tasks and the problem of load balancing between the CPUs, making > each problem simpler to solve. > This is what we implemented as an extension to our MQ scheduler. I will present the results for this during the ALS technical session: "CPU Pooling and Load Balancing in Linux MultiQueue Scheduling". If interested, I can already put an earlier version on lse.sourceforge.net. It basically does (1) and (2). (3) is done unintelligently right now. > Consider the following basic rules: > > A) When a new task comes along, pick the "least loaded" CPU and lock the > new task onto that. > B) Whenever the load imbalance between least loaded CPU and most loaded > CPU becomes too great, move one or more tasks from most loaded CPU to > the least loaded CPU. > > The rules themselves should be self-explanatory: A provides initial load > balancing, while B tries to keep the balance (with a sensible hysteresis > to avoid thrashing). 
However, there are a few minor details to solve: > > 1) How to determine the load of a CPU? If we can quantify this clearly, > we can easily set a hysteresis level to trigger load balancing between > two CPUs. > 2) When and how often to check for load imbalance? > 3) How to select the task(s) that should be moved between two CPUs to > correct an imbalance? > > For problems 1 and 2 I propose the following solution: Insert the the > load balancing routine itself as a (fake) task on each CPU and run it > when the CPU gets around to it. The load balancer should behave almost > like a CPU-bound task, scheduled on the lowest priority level with other > runnable tasks. The last bit is important: the load balancer should not > be allowed to starve but should be invoked approximately once every > "full rotation" of the scheduler. > > With the above it is easy to estimate the load of a CPU. We can simply > use the elapsed time between two invokations of the load balancer task. > When the load balancer task of a particular CPU gets run, it chalks up > the elapsed time on a score board somewhere, and checks whether there is > a significant imbalance between itself and some other CPU. If there is, > it commences to move some tasks between itself and the other CPU (note > rule B, though, it should be enough to mess with just two CPU queues at > a time to minimize balancing and locking overhead). > > Problem 3 is tricky. Basically, there should be a cost/benefit function > F(tasks to move) that should be minimized. Ideally F(task_i), the > cost/benefit of moving a single task, would be calculated as a byproduct > of the CPU scheduler algorithm. > > F(task_i) might be function of elapsed time since task_i was last > scheduled and the average time slice used by task_i, to account for the > probable cache hit. 
This would leave it up to the load balancer to move > as many lowest cost tasks to a new CPU as is needed to correct the > imbalance (average time slices used by each task would be needed in > order to make this decision). > > Naturally, some additional rules might be necessary to make a task > eligible for moving, e.g., never move the only/last CPU bound task to > another CPU. In addition, it might actually make sense to move at most > one task at each invocation of the load balancer, to further reduce the > probability of thrashing. The load would still converge fairly quickly > towards a balanced state. It would also scale fairly well with the > number of CPUs. > > How does that sound? > > MikaL > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-09 20:37 ` Hubertus Franke @ 2001-10-09 23:50 ` george anzinger 2001-10-11 10:52 ` Hubertus Franke 0 siblings, 1 reply; 61+ messages in thread From: george anzinger @ 2001-10-09 23:50 UTC (permalink / raw) To: Hubertus Franke Cc: Mika Liljeberg, Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel Hubertus Franke wrote: > > * Mika Liljeberg <Mika.Liljeberg@welho.com> [20011007 05;57]:" > > Alan Cox wrote: > > > This isnt idle speculation - I've done some minimal playing with this but > > > my initial re-implementation didnt handle SMP at all and I am still not 100% > > > sure how to resolve SMP or how SMP will improve out of the current cunning > > > plan. > > > > Here's some idle speculation on SMP to top it off. :) I tend to think > > that the load balancing between CPUs should be a completely separate > > algorithim and should not necessarily be run at every schedule(). The > > idea is to compeletely decouple the problem of scheduling a single CPU > > between tasks and the problem of load balancing between the CPUs, making > > each problem simpler to solve. > > > > This is what we implemented as an extension to our MQ scheduler. > I will present on the results for this during ALS technical session: > "CPU Pooling and Load Balancing in Linux MultiQueue Scheduling". > If interested I can already put an earlier version on lse.sourceforge.net. Please do. It should be a good read. George > > It basically does (1) and (2). (3) is done unintelligently right now. > > > Consider the following basic rules: > > > > A) When a new task comes along, pick the "least loaded" CPU and lock the > > new task onto that. > > B) Whenever the load imbalance between least loaded CPU and most loaded > > CPU becomes too great, move one or more tasks from most loaded CPU to > > the least loaded CPU. 
> > > > The rules themselves should be self-explanatory: A provides initial load > > balancing, while B tries to keep the balance (with a sensible hysteresis > > to avoid thrashing). However, there are a few minor details to solve: > > > > 1) How to determine the load of a CPU? If we can quantify this clearly, > > we can easily set a hysteresis level to trigger load balancing between > > two CPUs. > > 2) When and how often to check for load imbalance? > > 3) How to select the task(s) that should be moved between two CPUs to > > correct an imbalance? > > > > For problems 1 and 2 I propose the following solution: Insert the the > > load balancing routine itself as a (fake) task on each CPU and run it > > when the CPU gets around to it. The load balancer should behave almost > > like a CPU-bound task, scheduled on the lowest priority level with other > > runnable tasks. The last bit is important: the load balancer should not > > be allowed to starve but should be invoked approximately once every > > "full rotation" of the scheduler. > > > > With the above it is easy to estimate the load of a CPU. We can simply > > use the elapsed time between two invokations of the load balancer task. > > When the load balancer task of a particular CPU gets run, it chalks up > > the elapsed time on a score board somewhere, and checks whether there is > > a significant imbalance between itself and some other CPU. If there is, > > it commences to move some tasks between itself and the other CPU (note > > rule B, though, it should be enough to mess with just two CPU queues at > > a time to minimize balancing and locking overhead). > > > > Problem 3 is tricky. Basically, there should be a cost/benefit function > > F(tasks to move) that should be minimized. Ideally F(task_i), the > > cost/benefit of moving a single task, would be calculated as a byproduct > > of the CPU scheduler algorithm. 
> > > > F(task_i) might be function of elapsed time since task_i was last > > scheduled and the average time slice used by task_i, to account for the > > probable cache hit. This would leave it up to the load balancer to move > > as many lowest cost tasks to a new CPU as is needed to correct the > > imbalance (average time slices used by each task would be needed in > > order to make this decision). > > > > Naturally, some additional rules might be necessary to make a task > > eligible for moving, e.g., never move the only/last CPU bound task to > > another CPU. In addition, it might actually make sense to move at most > > one task at each invocation of the load balancer, to further reduce the > > probability of thrashing. The load would still converge fairly quickly > > towards a balanced state. It would also scale fairly well with the > > number of CPUs. > > > > How does that sound? > > > > MikaL > > - > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-09 23:50 ` george anzinger @ 2001-10-11 10:52 ` Hubertus Franke 0 siblings, 0 replies; 61+ messages in thread From: Hubertus Franke @ 2001-10-11 10:52 UTC (permalink / raw) To: george anzinger; +Cc: Mika Liljeberg, Linus Torvalds, linux-kernel Done. I put the ALS2001 paper on CPU pooling and load balancing on the lse.sourceforge.net/scheduling website. Here is a direct shortcut http://lse.sourceforge.net/scheduling/als2001/pmqs.ps The patches are a bit behind. Will be updated soon. If you have some suggestions on what is a good metric to move tasks please let me know. We can incorporate them. -- Hubertus * george anzinger <george@mvista.com> [20011009 19;50]:" > Hubertus Franke wrote: > > > > * Mika Liljeberg <Mika.Liljeberg@welho.com> [20011007 05;57]:" > > > Alan Cox wrote: > > > > This isnt idle speculation - I've done some minimal playing with this but > > > > my initial re-implementation didnt handle SMP at all and I am still not 100% > > > > sure how to resolve SMP or how SMP will improve out of the current cunning > > > > plan. > > > > > > Here's some idle speculation on SMP to top it off. :) I tend to think > > > that the load balancing between CPUs should be a completely separate > > > algorithim and should not necessarily be run at every schedule(). The > > > idea is to compeletely decouple the problem of scheduling a single CPU > > > between tasks and the problem of load balancing between the CPUs, making > > > each problem simpler to solve. > > > > > > > This is what we implemented as an extension to our MQ scheduler. > > I will present on the results for this during ALS technical session: > > "CPU Pooling and Load Balancing in Linux MultiQueue Scheduling". > > If interested I can already put an earlier version on lse.sourceforge.net. > > Please do. It should be a good read. > > George > > > > > It basically does (1) and (2). (3) is done unintelligently right now. 
> > > > > Consider the following basic rules: > > > > > > A) When a new task comes along, pick the "least loaded" CPU and lock the > > > new task onto that. > > > B) Whenever the load imbalance between least loaded CPU and most loaded > > > CPU becomes too great, move one or more tasks from most loaded CPU to > > > the least loaded CPU. > > > > > > The rules themselves should be self-explanatory: A provides initial load > > > balancing, while B tries to keep the balance (with a sensible hysteresis > > > to avoid thrashing). However, there are a few minor details to solve: > > > > > > 1) How to determine the load of a CPU? If we can quantify this clearly, > > > we can easily set a hysteresis level to trigger load balancing between > > > two CPUs. > > > 2) When and how often to check for load imbalance? > > > 3) How to select the task(s) that should be moved between two CPUs to > > > correct an imbalance? > > > > > > For problems 1 and 2 I propose the following solution: Insert the the > > > load balancing routine itself as a (fake) task on each CPU and run it > > > when the CPU gets around to it. The load balancer should behave almost > > > like a CPU-bound task, scheduled on the lowest priority level with other > > > runnable tasks. The last bit is important: the load balancer should not > > > be allowed to starve but should be invoked approximately once every > > > "full rotation" of the scheduler. > > > > > > With the above it is easy to estimate the load of a CPU. We can simply > > > use the elapsed time between two invokations of the load balancer task. > > > When the load balancer task of a particular CPU gets run, it chalks up > > > the elapsed time on a score board somewhere, and checks whether there is > > > a significant imbalance between itself and some other CPU. 
If there is, > > > it commences to move some tasks between itself and the other CPU (note > > > rule B, though, it should be enough to mess with just two CPU queues at > > > a time to minimize balancing and locking overhead). > > > > > > Problem 3 is tricky. Basically, there should be a cost/benefit function > > > F(tasks to move) that should be minimized. Ideally F(task_i), the > > > cost/benefit of moving a single task, would be calculated as a byproduct > > > of the CPU scheduler algorithm. > > > > > > F(task_i) might be function of elapsed time since task_i was last > > > scheduled and the average time slice used by task_i, to account for the > > > probable cache hit. This would leave it up to the load balancer to move > > > as many lowest cost tasks to a new CPU as is needed to correct the > > > imbalance (average time slices used by each task would be needed in > > > order to make this decision). > > > > > > Naturally, some additional rules might be necessary to make a task > > > eligible for moving, e.g., never move the only/last CPU bound task to > > > another CPU. In addition, it might actually make sense to move at most > > > one task at each invocation of the load balancer, to further reduce the > > > probability of thrashing. The load would still converge fairly quickly > > > towards a balanced state. It would also scale fairly well with the > > > number of CPUs. > > > > > > How does that sound? 
> > > > > > MikaL > > > - > > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > Please read the FAQ at http://www.tux.org/lkml/ > > - > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 22:42 ` Linus Torvalds 2001-10-04 22:53 ` Benjamin LaHaise @ 2001-10-04 23:41 ` Mike Kravetz 2001-10-04 23:50 ` Linus Torvalds ` (2 more replies) 1 sibling, 3 replies; 61+ messages in thread From: Mike Kravetz @ 2001-10-04 23:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote: > Could we try to hit just two? Probably, but it doesn't really matter, > though: to make the lmbench scheduler benchmark go at full speed, you > want to limit it to _one_ CPU, which is not sensible in real-life > situations. Can you clarify? I agree that tuning the system for the best LMbench performance is not a good thing to do! However, in general on an 8 CPU system with only 2 'active' tasks I would think limiting the tasks to 2 CPUs would be desirable for cache effects. I know that running LMbench with 2 active tasks on an 8 CPU system results in those 2 tasks being 'round-robined' among all 8 CPUs. Prior analysis leads me to believe that this is due to IPI latency. reschedule_idle() chooses the 'best/correct' CPU for a task to run on, but before schedule() runs on that CPU, another CPU runs schedule() and the result is that the task runs on a ?less desirable? CPU. The nature of the LMbench scheduler benchmark makes this occur frequently. The real question is: how often does this happen in real-life situations? -- Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 23:41 ` Mike Kravetz @ 2001-10-04 23:50 ` Linus Torvalds 2001-10-05 15:15 ` Eric W. Biederman 0 siblings, 1 reply; 61+ messages in thread From: Linus Torvalds @ 2001-10-04 23:50 UTC (permalink / raw) To: Mike Kravetz; +Cc: linux-kernel On Thu, 4 Oct 2001, Mike Kravetz wrote: > On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote: > > Could we try to hit just two? Probably, but it doesn't really matter, > > though: to make the lmbench scheduler benchmark go at full speed, you > > want to limit it to _one_ CPU, which is not sensible in real-life > > situations. > > Can you clarify? I agree that tuning the system for the best LMbench > performance is not a good thing to do! However, in general on an > 8 CPU system with only 2 'active' tasks I would think limiting the > tasks to 2 CPUs would be desirable for cache effects. Yes, limiting to 2 CPU's probably gets better cache behaviour, and it might be worth looking into why it doesn't. The CPU affinity _should_ prioritize it down to two, but I haven't thought through your theory about IPI latency. However, the reason 2.2.x does so well is that in 2.2.x it will stay on _one_ CPU if I remember correctly. We basically tuned the scheduler for lmbench, and not much else. Linus ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 23:50 ` Linus Torvalds @ 2001-10-05 15:15 ` Eric W. Biederman 0 siblings, 0 replies; 61+ messages in thread From: Eric W. Biederman @ 2001-10-05 15:15 UTC (permalink / raw) To: Linus Torvalds; +Cc: Mike Kravetz, linux-kernel Linus Torvalds <torvalds@transmeta.com> writes: > On Thu, 4 Oct 2001, Mike Kravetz wrote: > > > On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote: > > > Could we try to hit just two? Probably, but it doesn't really matter, > > > though: to make the lmbench scheduler benchmark go at full speed, you > > > want to limit it to _one_ CPU, which is not sensible in real-life > > > situations. > > > > Can you clarify? I agree that tuning the system for the best LMbench > > performance is not a good thing to do! However, in general on an > > 8 CPU system with only 2 'active' tasks I would think limiting the > > tasks to 2 CPUs would be desirable for cache effects. > > Yes, limiting to 2 CPU's probably gets better cache behaviour, and it > might be worth looking into why it doesn't. The CPU affinity _should_ > prioritize it down to two, but I haven't thought through your theory about > IPI latency. I don't know what it is, but I have seen this excessive cpu switching in the wild. In particular, on a dual processor machine I ran 2 cpu intensive jobs and a handful of daemons, and the cpu intensive jobs would switch cpus every couple of seconds. I was investigating it because on the Athlon I was running on, a customer was getting a super-linear speedup. With one process it would take 8 minutes, and with 2 processes one would take 8 minutes and the other would take 6 minutes. Very strange. These processes, except at their very beginning, did no I/O and were pure cpu hogs until they spit out their results. Very puzzling. I can't see why we would ever want to take the cache miss penalty of switching cpus, in this case. Eric ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times 2001-10-04 23:41 ` Mike Kravetz 2001-10-04 23:50 ` Linus Torvalds @ 2001-10-04 23:56 ` Davide Libenzi 2001-10-05 0:45 ` Andrea Arcangeli 2 siblings, 0 replies; 61+ messages in thread From: Davide Libenzi @ 2001-10-04 23:56 UTC (permalink / raw) To: Mike Kravetz; +Cc: Linus Torvalds, lkml On Thu, 4 Oct 2001, Mike Kravetz wrote: > On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote: > > Could we try to hit just two? Probably, but it doesn't really matter, > > though: to make the lmbench scheduler benchmark go at full speed, you > > want to limit it to _one_ CPU, which is not sensible in real-life > > situations. > > Can you clarify? I agree that tuning the system for the best LMbench > performance is not a good thing to do! However, in general on an > 8 CPU system with only 2 'active' tasks I would think limiting the > tasks to 2 CPUs would be desirable for cache effects. > > I know that running LMbench with 2 active tasks on an 8 CPU system > results in those 2 tasks being 'round-robined' among all 8 CPUs. > Prior analysis leads me to believe the reason for this is due to > IPI latency. reschedule_idle() chooses the 'best/correct' CPU for > a task to run on, but before schedule() runs on that CPU another > CPU runs schedule() and the result is that the task runs on a > ?less desirable? CPU. The nature of the LMbench scheduler benchmark > makes this occur frequently. The real question is: how often > does this happen in real-life situations? Well, if you remember the first time this issue was discussed on the mailing list was due a real life situation not due a bench run. - Davide ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Context switch times
From: Andrea Arcangeli @ 2001-10-05 0:45 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Linus Torvalds, linux-kernel

On Thu, Oct 04, 2001 at 04:41:02PM -0700, Mike Kravetz wrote:
> I know that running LMbench with 2 active tasks on an 8 CPU system
> results in those 2 tasks being 'round-robined' among all 8 CPUs.
> Prior analysis leads me to believe the reason for this is due to
> IPI latency. reschedule_idle() chooses the 'best/correct' CPU for
> a task to run on, but before schedule() runs on that CPU another
> CPU runs schedule() and the result is that the task runs on a
> ?less desirable? CPU. The nature of the LMbench scheduler benchmark

Doesn't lmbench wake up only via pipes? Linux uses the sync wakeup, which
avoids reschedule_idle() in that case, to serialize the pipe load on the
same CPU.

Andrea
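The sync-wakeup idea Andrea mentions can be sketched as a tiny model: a
writer that is itself about to block can wake the reader "synchronously",
leaving it on the local CPU, because the writer is giving the CPU up
anyway. The names and structure below are illustrative, not the actual 2.4
pipe code:

```c
/* Toy model of sync vs. async wakeup on a pipe; not the 2.4 pipe code. */
struct waitqueue { int woken_sync; int woken_async; };

void wake_up_async(struct waitqueue *wq) { wq->woken_async++; /* may IPI an idle CPU */ }
void wake_up_sync(struct waitqueue *wq)  { wq->woken_sync++;  /* wakee stays on this CPU */ }

/* Model of the wakeup choice on a pipe write: if the buffer just filled,
 * the writer is about to sleep, so the sync wakeup is cache-friendly. */
void pipe_write_wakeup(struct waitqueue *readers, int buffer_now_full)
{
    if (buffer_now_full)
        wake_up_sync(readers);   /* writer is about to block anyway */
    else
        wake_up_async(readers);  /* writer keeps running: the reader
                                    may want a different CPU */
}
```

As Mike's profiling in the next message shows, the 2.4.10 pipe code took
the async branch for lat_ctx's small tokens, which is why the serializing
effect Andrea expected did not kick in.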
* Re: Context switch times
From: Mike Kravetz @ 2001-10-05 4:35 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel

On Fri, Oct 05, 2001 at 02:45:26AM +0200, Andrea Arcangeli wrote:
> doesn't lmbench wakeup only via pipes? Linux uses the sync-wakeup that
> avoids reschedule_idle in such case, to serialize the pipe load in the
> same cpu.

That's what I thought too. However, kernel profile data of a lmbench run
on 2.4.10 reveals that the pipe routines only call the non-synchronous
form of wake_up. I believe I reached the same conclusion in the 2.4.7
time frame by instrumenting this code.

-- Mike
* Re: Context switch times
From: Andrea Arcangeli @ 2001-10-07 17:59 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Linus Torvalds, linux-kernel

On Thu, Oct 04, 2001 at 09:35:07PM -0700, Mike Kravetz wrote:
> [..] the pipe routines only call
> the non-synchronous form of wake_up. [..]

Ok, I see. I overlooked that when we don't need to block we wake up
async.

So first of all it would be interesting to change the token passed
through the pipe so that it always fills the PAGE_SIZE pipe buffer, so
that the task goes to sleep before reading from the next pipe in the
ring.

And if we want to optimize the current benchmark (with the small token
that triggers the async wakeup and always goes to sleep in read() rather
than write()), we'd need to invalidate a basic point of the scheduler
that has nothing to do with the reschedule IPI. The point to invalidate
is that if the current task (the one that is running wake_up()) has a
small avg_slice, we'd better not reschedule the woken-up task on a new
idle CPU, but rather wait for the current task to go away instead.
Unless I'm missing something, this should fix lmbench in its current
implementation.

Andrea
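The token-ring structure Andrea wants to change can be sketched in user
space with two processes and two pipes. With token_size = 1 this models
lat_ctx's small token (the async-wakeup path); with token_size = 4096 it
models filling the old one-page pipe buffer so the writer itself blocks.
This is a rough sketch, not lmbench's code, and on a modern kernel the
pipe buffer is larger than a page, so it only illustrates the message flow:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Loop until exactly n bytes are moved; pipes may do short transfers. */
static void read_all(int fd, char *buf, size_t n)
{
    size_t done = 0;
    while (done < n) {
        ssize_t r = read(fd, buf + done, n - done);
        if (r <= 0)
            _exit(1);  /* sketch: bail out on any error */
        done += r;
    }
}

static void write_all(int fd, const char *buf, size_t n)
{
    size_t done = 0;
    while (done < n) {
        ssize_t w = write(fd, buf + done, n - done);
        if (w <= 0)
            _exit(1);
        done += w;
    }
}

/* Pass a token of token_size bytes parent->child->parent `trips` times.
 * Returns the number of completed round trips (== trips on success). */
int token_ring_roundtrips(size_t token_size, int trips)
{
    int p2c[2], c2p[2];
    char *buf = malloc(token_size);
    if (!buf || pipe(p2c) || pipe(c2p))
        return -1;
    memset(buf, 't', token_size);

    pid_t pid = fork();
    if (pid == 0) {                    /* child: echo the token back */
        for (int i = 0; i < trips; i++) {
            read_all(p2c[0], buf, token_size);
            write_all(c2p[1], buf, token_size);
        }
        _exit(0);
    }
    int done = 0;
    for (int i = 0; i < trips; i++) {  /* parent: send, wait for echo */
        write_all(p2c[1], buf, token_size);
        read_all(c2p[0], buf, token_size);
        done++;
    }
    waitpid(pid, NULL, 0);
    free(buf);
    return done;
}
```

Every read here blocks until the peer writes, so each round trip forces
two sleep/wakeup pairs — exactly the path whose wakeup flavor (sync vs.
async) this subthread is arguing about.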
* Re: Context switch times
From: george anzinger @ 2001-10-07 19:54 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Mike Kravetz, Linus Torvalds, linux-kernel

Andrea Arcangeli wrote:
>
> So first of all it would be interesting to change the token passed
> through the pipe so that it always fills the PAGE_SIZE pipe buffer, so
> that the task goes to sleep before reading from the next pipe in the
> ring.
>
> And if we want to optimize the current benchmark (with the small token
> that triggers the async wakeup and always goes to sleep in read() rather
> than write()), we'd need to invalidate a basic point of the scheduler
> that has nothing to do with the reschedule IPI. The point to invalidate
> is that if the current task (the one that is running wake_up()) has a
> small avg_slice, we'd better not reschedule the woken-up task on a new
> idle CPU, but rather wait for the current task to go away instead.

A couple of questions:

1.) Do you want to qualify that the wake_up is not from an interrupt?

2.) Having done this AND the task doesn't block THIS time, do we wait
for the slice to end or is some other "dead man" timer needed?

George

> Unless I'm missing something, this should fix lmbench in its current
> implementation.
>
> Andrea
* Re: Context switch times
From: Andrea Arcangeli @ 2001-10-07 20:24 UTC (permalink / raw)
To: george anzinger; +Cc: Mike Kravetz, Linus Torvalds, linux-kernel

On Sun, Oct 07, 2001 at 12:54:08PM -0700, george anzinger wrote:
> A couple of questions:
>
> 1.) Do you want to qualify that the wake_up is not from an interrupt?

If the avg_slice is very small, we know the task will probabilistically
go away soon regardless of where the wakeup comes from. I don't think the
logic has any connection to the kind of wakeup; it only has a connection
to the kind of "running task". In fact, waiting for "bash" or "ksoftirqd"
to go away is the right thing to do to fix the spurious migrations of
CPU-intensive tasks too, no matter where the wakeup comes from.

The only risk here is that we never know from the scheduler whether the
running task with the small avg_slice will really go away this time too
or not. And if it doesn't go away this time, we may not make the best
global decision by not migrating the woken-up task to the idle CPUs, but
the probability information provided by avg_slice should cover this
problem. Of course this is theory; I didn't attempt to test it.

Btw, I remember Ingo said on l-k some years ago that immediately
rescheduling every idle CPU isn't the best thing to do.

> 2.) Having done this AND the task doesn't block THIS time, do we wait
> for the slice to end or is some other "dead man" timer needed?

Of course, waiting for the slice to end anyway is the natural first
step. As said above, the point of the probability information of
avg_slice is to render the "task doesn't block THIS time" case very
unlikely.

Andrea
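The heuristic Andrea is proposing boils down to a small decision function
at wakeup time. The sketch below states the policy only; the threshold
value, the microsecond units, and the function name are all invented for
illustration, and real code would also need George's "dead man" fallback
for when the waker fails to block:

```c
/* Sketch of Andrea's proposed wakeup policy: if the waker's average
 * timeslice usage is tiny, it will probably block soon, so keep the
 * wakee on the waker's CPU instead of IPI-ing an idle one.
 * AVG_SLICE_THRESHOLD_US is an invented illustrative value. */
#define AVG_SLICE_THRESHOLD_US 100

/* Returns 1 if the woken-up task should be migrated to an idle CPU. */
int should_migrate_wakee(unsigned long waker_avg_slice_us,
                         int idle_cpu_available)
{
    if (!idle_cpu_available)
        return 0;   /* nowhere better to go */
    if (waker_avg_slice_us < AVG_SLICE_THRESHOLD_US)
        return 0;   /* waker will likely block soon: wait it out */
    return 1;       /* waker is a CPU hog: use the idle CPU */
}
```

Under this policy a short-lived waker like "bash" keeps its wakee local
(preserving cache warmth), while a CPU hog still pushes the wakee to an
idle CPU as reschedule_idle() does today.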
* Re: Context switch times
From: Richard Gooch @ 2001-10-09 4:55 UTC (permalink / raw)
To: David S. Miller; +Cc: arjan, kravetz, linux-kernel

David S. Miller writes:
>    From: Richard Gooch <rgooch@ras.ucalgary.ca>
>    Date: Thu, 4 Oct 2001 15:39:05 -0600
>
>    David S. Miller writes:
>    > lat_ctx doesn't execute any FPU ops. So at worst this happens once
>    > on GLIBC program startup, but then never again.
>
>    Has something changed? Last I looked, the whole lmbench timing harness
>    was based on using the FPU.
>
> Oops, that's entirely possible...
>
> But things are usually layed out like this:
>
>     capture_start_time();
>     context_switch_N_times();
>     capture_end_time();
>
> So the FPU hit is only before/after the runs, not during each and
> every iteration.

Hm. Perhaps when I did my tests (where I noticed a penalty), we didn't
have lazy FPU saving. Now we disable the FPU, and restore state when we
trap, right?

I do note this comment in arch/i386/kernel/process.c:

 * We fsave/fwait so that an exception goes off at the right time
 * (as a call from the fsave or fwait in effect) rather than to
 * the wrong process. Lazy FP saving no longer makes any sense
 * with modern CPU's, and this simplifies a lot of things (SMP
 * and UP become the same).

So what exactly is the difference between our "delayed FPU restore upon
trap" (which I think of as lazy FPU saving), and the "lazy FP" saving in
the comments?

Regards,

Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca
* Re: Context switch times
From: David S. Miller @ 2001-10-09 5:00 UTC (permalink / raw)
To: rgooch; +Cc: arjan, kravetz, linux-kernel

   From: Richard Gooch <rgooch@ras.ucalgary.ca>
   Date: Mon, 8 Oct 2001 22:55:23 -0600

   So what exactly is the difference between our "delayed FPU restore
   upon trap" (which I think of as lazy FPU saving), and the "lazy FP"
   saving in the comments?

Save always on switching OUT from a task vs. save only when some
different task asks for the FPU.

Franks a lot,
David S. Miller
davem@redhat.com
* Re: Context switch times
From: bill davidsen @ 2001-10-09 13:49 UTC (permalink / raw)
To: linux-kernel

In article <200110090455.f994tNB22322@vindaloo.ras.ucalgary.ca>
rgooch@ras.ucalgary.ca asked:

| Hm. Perhaps when I did my tests (where I noticed a penalty), we didn't
| have lazy FPU saving. Now we disable the FPU, and restore state when
| we trap, right?
|
| I do note this comment in arch/i386/kernel/process.c:
| * We fsave/fwait so that an exception goes off at the right time
| * (as a call from the fsave or fwait in effect) rather than to
| * the wrong process. Lazy FP saving no longer makes any sense
| * with modern CPU's, and this simplifies a lot of things (SMP
| * and UP become the same).
|
| So what exactly is the difference between our "delayed FPU restore
| upon trap" (which I think of as lazy FPU saving), and the "lazy FP"
| saving in the comments?

We always save the FPU, but only restore it when/if it is going to be
used. And obviously we don't want to save it if it hasn't been used,
since it wasn't restored...

--
bill davidsen <davidsen@tmr.com>
  "If I were a diplomat, in the best case I'd go hungry. In the worst
   case, people would die." -- Robert Lipe
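The save-eagerly/restore-lazily scheme davem and Bill describe can be
modeled as a tiny state machine: on switch-out, save the FPU state if the
outgoing task used it and set the TS (task-switched) flag; the incoming
task's first FPU instruction then takes the device-not-available trap,
whose handler restores that task's state. The struct fields and function
names below are invented for this user-space model, not the kernel's:

```c
/* Toy model of 2.4-style x86 FPU context handling; names are invented. */
struct task {
    int used_fpu;        /* touched the FPU since last switch-in */
    char fpu_state[108]; /* stand-in for the fsave image */
};

struct cpu {
    int ts_flag;         /* models CR0.TS: next FPU insn will trap */
    struct task *curr;
};

/* Switch-out: save eagerly if the outgoing task used the FPU,
 * then arm the trap so the incoming task's first FPU use faults. */
void switch_to(struct cpu *c, struct task *next)
{
    if (c->curr && c->curr->used_fpu) {
        /* a real kernel would fsave into c->curr->fpu_state here */
        c->curr->used_fpu = 0;
    }
    c->ts_flag = 1;
    c->curr = next;
}

/* device_not_available trap handler: restore lazily on first FPU use. */
void fpu_trap(struct cpu *c)
{
    /* a real kernel would frstor from c->curr->fpu_state here */
    c->ts_flag = 0;
    c->curr->used_fpu = 1;
}

/* A task issuing an FPU instruction: traps only while TS is set. */
int fpu_insn(struct cpu *c)
{
    int trapped = c->ts_flag;
    if (trapped)
        fpu_trap(c);
    return trapped;
}
```

The payoff for lat_ctx is visible in the model: a task that never issues
an FPU instruction never takes the trap, so it pays neither the save nor
the restore cost on a context switch — only FPU users do.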