* scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29  9:53 UTC  Peter Zijlstra
To: linux-kernel
Cc: Ingo Molnar, vatsa, Dhaval Giani, Paul Jackson, Nick Piggin, Eric W. Biederman, Andrew Morton, Steve Grubb, Steven Rostedt, Gregory Haskins, Dmitry Adamushko, Li, Tong N, Thomas Gleixner, Paul Menage, David Rientjes

Hi All,

Some of the fancy new scheduler features, such as the cgroup load
balancer (load_balance_monitor) and the real-time load balancer, are a
bit of a scalability issue. They all seem to want a rather strong
global bound to keep global fairness (which is quite understandable).

[ my own interest is currently real-time group scheduling on multiple
  cpus, and that seems to require _very_ strong bounds ]

I think the current stuff would scale up to 8, maybe 16 cpus, but after
that I'd be real worried. Now we want distributions to enable most of
these features. Distros seem to want containers, but distros also need
to support 128+ cpu machines, so how are we going to solve this?

My thoughts were to make stronger use of disjoint cpu-sets. cgroups and
cpusets are related, in that cpusets provide a property to a cgroup.
However, load_balance_monitor()'s interaction with sched domains
confuses me - it might DTRT, but I can't tell.

[ It looks to me it balances a group over the largest SD the current
  cpu has access to, even though that might be larger than the SD
  associated with the cpuset of that particular cgroup. ]

Also the RT load-balancer needs to become aware of such sets; I think
Paul J and Steven once talked about it, but I can't quite remember
where that ended. From my POV there should be sched-domain based
balance information, not global.

By cutting the problem into smaller pieces, and adding tunables to
weaken the global fairness, I think we can give administrators enough
freedom to make use of these features, even on the largest of machines.

[ so I'd move the load_balance_monitor() tunables into cpusets as well;
  I can imagine a smaller cpuset wanting stronger fairness than a much
  larger cpuset. ]

I understand it's a somewhat hand-wavy email, but I wanted to start
discussion on the issue, or have someone show me I'm wrong so I can
stop worrying :-).

	Peter

^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 10:01 UTC  Paul Jackson
To: Peter Zijlstra

Peter wrote:
> Also the RT load-balancer needs to become aware of such sets; I think
> Paul J and Steven once talked about it, but I can't quite remember
> where that ended

See further the thread:

  http://lkml.org/lkml/2007/10/22/400

(I don't remember where it ended up either; probably nowhere.
I'm just passing on the link, before doing any reading or thinking.)

-- 
I won't rest till it's the best ...
	Programmer, Linux Scalability
	Paul Jackson <pj@sgi.com> 1.940.382.4214
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 10:50 UTC  Peter Zijlstra
To: Paul Jackson

On Tue, 2008-01-29 at 04:01 -0600, Paul Jackson wrote:
> See further the thread:
>
>   http://lkml.org/lkml/2007/10/22/400
>
> (I don't remember where it ended up either; probably nowhere.
> I'm just passing on the link, before doing any reading or thinking.)

Thanks for the link. Yes, I think your last suggestion of creating
rt-domains ( http://lkml.org/lkml/2007/10/23/419 ) is a good one.

Upon cpuset changes we could then look for the largest disjoint set and
hang the rt balance code from that.
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 11:13 UTC  Paul Jackson
To: Peter Zijlstra

Peter wrote:
> Thanks for the link. Yes, I think your last suggestion of creating
> rt-domains ( http://lkml.org/lkml/2007/10/23/419 ) is a good one.

We now have a per-cpuset Boolean flag file called 'sched_load_balance'.

In the default case, this flag is set on, and the kernel does its usual
load balancing across all CPUs in that cpuset. This means, under the
covers, that there exists some sched domain such that all CPUs in that
cpuset are in that same sched domain. That sched domain might contain
additional CPUs from outside that cpuset as well. Indeed, in the
default vanilla configuration, that sched domain contains all CPUs in
the system.

If we turn the sched_load_balance flag off for some cpuset, we are
telling the kernel it's ok not to load balance on the CPUs in that
cpuset (unless those CPUs are in some other cpuset that needed load
balancing anyway.)

This 'sched_load_balance' flag is, thus far, "the" cpuset hook
supporting realtime. One can use it to configure a system so that the
kernel does not do normal load balancing on select CPUs, such as those
CPUs dedicated to realtime use.

It sounds like Peter is reminding us that we really have three choices
for handling a given CPU's load balancing:
 1) normal kernel scheduler load balancing,
 2) RT load balancing, or
 3) no load balancing whatsoever.

If that's the case (if we really need choice 3) then a single Boolean
flag, such as sched_load_balance, is not sufficient to select from the
three choices, and it might make sense to add a second per-cpuset
Boolean flag, say "sched_rt_balance", default off, which if turned on
enables choice 2.

If that's not the case (we only need choices 1 and 2) then -logically-
we could overload the meaning of the current sched_load_balance to
mean, if turned off, not only to stop doing normal balancing, but
further that we should commence RT balancing. However bits aren't
-that- precious here, and this sounds unnecessarily confusing.

So ... would a new per-cpuset Boolean flag such as sched_rt_balance be
appropriate and sufficient to mark those cpusets whose set of CPUs
require RT balancing?
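[ Editorial note: the two-flag scheme discussed above can be sketched as
a small decision function. 'sched_rt_balance' is only a proposal in this
thread, and the choice that sched_load_balance takes precedence when
both flags are set is an assumption of this sketch, not anything the
thread settles. ]

```c
#include <assert.h>

enum balance_mode {
	BALANCE_NONE,	/* choice 3: no load balancing whatsoever */
	BALANCE_NORMAL,	/* choice 1: normal kernel load balancing */
	BALANCE_RT,	/* choice 2: RT load balancing */
};

/* Map the existing per-cpuset flag plus the proposed one onto the
 * three balancing choices. Precedence of sched_load_balance over
 * sched_rt_balance is assumed here. */
static enum balance_mode cpuset_balance_mode(int sched_load_balance,
					     int sched_rt_balance)
{
	if (sched_load_balance)
		return BALANCE_NORMAL;
	if (sched_rt_balance)
		return BALANCE_RT;
	return BALANCE_NONE;
}
```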
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 11:31 UTC  Peter Zijlstra
To: Paul Jackson

On Tue, 2008-01-29 at 05:13 -0600, Paul Jackson wrote:
> We now have a per-cpuset Boolean flag file called 'sched_load_balance'.

SD_LOAD_BALANCE, right?

> This 'sched_load_balance' flag is, thus far, "the" cpuset hook
> supporting realtime. One can use it to configure a system so that
> the kernel does not do normal load balancing on select CPUs, such
> as those CPUs dedicated to realtime use.

Ah, here I disagree: it is possible to do (hard) realtime scheduling
over multiple cpus; the only drawback is that it requires a very strong
load-balancer, making it unsuitable for large numbers of cpus.

( of course, having a strong rt load balancer on a large cpuset doesn't
  harm, as long as there are no rt tasks to balance )

So if we have a system like so:

        __A__
       /  |  \
      B1  B2  B3
     /  \
    C1   C2

  A  comprises cpus 0-127, !SD_LOAD_BALANCE
  B1 comprises cpus 0-63,  !SD_LOAD_BALANCE
  B2 comprises cpus 64-119
  B3 comprises cpus 120-127
  C1 comprises cpus 0-3
  C2 comprises cpus 5-63

we end up with 4 disjoint load-balanced sets. I would then attach the
rt balance information to: C1, C2, B2, B3.

If, for example, B1 were load-balanced, we'd only have 3 disjoint sets
left: B1, B2 and B3, and the rt balance data would live there.

> It sounds like Peter is reminding us that we really have three choices
> for handling a given CPU's load balancing:
>  1) normal kernel scheduler load balancing,
>  2) RT load balancing, or
>  3) no load balancing whatsoever.
>
> So ... would a new per-cpuset Boolean flag such as sched_rt_balance be
> appropriate and sufficient to mark those cpusets whose set of CPUs
> require RT balancing?

So, I don't think we need that; I think we can do with the single flag,
we just need to find these disjoint sets and stick our rt-domain there.
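[ Editorial note: the disjoint-set computation described in this message
can be illustrated with a toy model. This is not the actual
kernel/cpuset.c code; it just merges the cpu masks of all cpusets that
still want balancing until no two overlap, which yields the disjoint
load-balanced sets of the example. ]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MASK_WORDS 2	/* enough bits for 128 cpus */

struct cpumask { uint64_t w[MASK_WORDS]; };

static void mask_set_range(struct cpumask *m, int lo, int hi)
{
	for (int c = lo; c <= hi; c++)
		m->w[c / 64] |= 1ULL << (c % 64);
}

static int mask_intersects(const struct cpumask *a, const struct cpumask *b)
{
	for (int i = 0; i < MASK_WORDS; i++)
		if (a->w[i] & b->w[i])
			return 1;
	return 0;
}

static void mask_or(struct cpumask *a, const struct cpumask *b)
{
	for (int i = 0; i < MASK_WORDS; i++)
		a->w[i] |= b->w[i];
}

/* Merge overlapping balanced masks until a fixpoint is reached; the
 * survivors are the disjoint load-balanced sets. Returns how many
 * disjoint sets remain (masks[] is compacted in place). */
static int build_partitions(struct cpumask *masks, int n)
{
	int merged = 1;

	while (merged) {
		merged = 0;
		for (int i = 0; i < n; i++)
			for (int j = i + 1; j < n; j++)
				if (mask_intersects(&masks[i], &masks[j])) {
					mask_or(&masks[i], &masks[j]);
					masks[j] = masks[--n]; /* drop j */
					merged = 1;
				}
	}
	return n;
}
```

With the balanced cpusets of the example (C1, C2, B2, B3) this yields 4
disjoint sets; adding B1 (0-63) as a balanced mask absorbs C1 and C2
and leaves 3, matching the text above. Note cpu 4 ends up in no set at
all.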
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 11:53 UTC  Paul Jackson
To: Peter Zijlstra

Peter wrote:
> So, I don't think we need that; I think we can do with the single flag,
> we just need to find these disjoint sets and stick our rt-domain there.

Ah - perhaps you don't need that flag - but my other cpuset users do ;).

You see, there are two very different ways that 'sched_load_balance' is
used in practice.

The other way is by big batch schedulers. They may be placed in charge
of managing a few hundred CPUs on a system, and might be running a mix
of many small jobs each covering only a few CPUs. They routinely set up
one cpuset for each job, to contain that job to the CPUs and memory
nodes assigned to it. This is actually the original motivating use for
cpusets.

As a bit of an optimization, batch schedulers desire to tell the normal
kernel scheduler -not- to bother load balancing across the big set of
CPUs controlled by the batch scheduler, but only to load balance within
each of the smaller per-job cpusets. Load balancing across hundreds of
CPUs when the batch scheduler knows such efforts would be fruitless is
a waste of good CPU cycles in kernel/sched.c.

I really doubt we'd want to have such systems triggering the hard RT
scheduler on whatever CPUs were in the batch scheduler's big cpuset
that didn't happen to have an active job currently assigned to them.
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 12:07 UTC  Peter Zijlstra
To: Paul Jackson

On Tue, 2008-01-29 at 05:53 -0600, Paul Jackson wrote:
> I really doubt we'd want to have such systems triggering the hard RT
> scheduler on whatever CPUs were in the batch scheduler's big cpuset
> that didn't happen to have an active job currently assigned to them.

My turn to be confused...

If SD_LOAD_BALANCE is only set on the smaller, per-job sets, how will
the RT balancer trigger on the large set?
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 12:36 UTC  Paul Jackson
To: Peter Zijlstra

Peter, responding to Paul:
> If SD_LOAD_BALANCE is only set on the smaller, per-job sets, how will
> the RT balancer trigger on the large set?

What 'sched_load_balance' does now is help you set up a -partial-
covering of non-overlapping sched domains. In the batch scheduler
example, those CPUs that were:
 1) being managed by the batch scheduler, but
 2) not assigned to any active job at the moment
would -not- be in any sched domain. It's not a question of the
SD_LOAD_BALANCE flag; it's a question of whether a given CPU is even
included in any sched domain.

If we did as you are suggesting (if I understand), then instead of
leaving these CPUs out of any sched domain, we'd set up a new kind of
sched domain for these CPUs, marked for hard real time load balancing,
rather than the somewhat more scalable, but softer, normal load
balancing.

We want no load balancing on those CPUs, not realtime load balancing.
Indeed, I suspect we especially do not want realtime load balancing on
those CPUs, as that kind of load balancing is (I'm suspecting) more
expensive and less scalable than normal load balancing.
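[ Editorial note: Paul's point that such CPUs simply fall outside every
sched domain can be shown with a one-line mask computation. This is a
sketch with hypothetical names, limited to 64 cpus for brevity. ]

```c
#include <assert.h>
#include <stdint.h>

/* CPUs managed by the batch scheduler but covered by no per-job
 * cpuset end up in no sched domain at all: neither normally balanced
 * nor RT balanced. */
static uint64_t uncovered_cpus(uint64_t batch_set,
			       const uint64_t *jobs, int njobs)
{
	uint64_t covered = 0;

	for (int i = 0; i < njobs; i++)
		covered |= jobs[i];
	return batch_set & ~covered;
}
```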
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 12:03 UTC  Paul Jackson
To: Peter Zijlstra

Paul, responding to Peter:
> > We now have a per-cpuset Boolean flag file called 'sched_load_balance'.
>
> SD_LOAD_BALANCE, right?

No. SD_LOAD_BALANCE is an attribute of sched domains.

The 'sched_load_balance' flag is an attribute of cpusets.

The mapping of cpusets to sched domains required several pages of 'fun
to write' code, which had to go through a couple of years of fixing and
one major rewrite before it (knock on wood) worked correctly. It's not
a one-to-one relation, in other words. See my earlier messages for
further explanation of how this works.

I'm not sure what SD_LOAD_BALANCE does ... I guess from a quick read
that it just optimizes the recognition of singleton sched domains for
which load balancing would be a wasted effort.

> > This 'sched_load_balance' flag is, thus far, "the" cpuset hook
> > supporting realtime.
>
> Ah, here I disagree: it is possible to do (hard) realtime scheduling
> over multiple cpus; the only drawback is that it requires a very
> strong load-balancer, making it unsuitable for large numbers of cpus.

I don't think we are disagreeing. I was speaking of "normal" load
balancing (what the mainline kernel/sched.c code normally does).
You're speaking of hard realtime load balancing.

I think we agree that these both exist, and require different load
balancing code, the latter 'very strong'.
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 12:30 UTC  Peter Zijlstra
To: Paul Jackson

On Tue, 2008-01-29 at 06:03 -0600, Paul Jackson wrote:
> The mapping of cpusets to sched domains required several pages of 'fun
> to write' code, which had to go through a couple of years of fixing
> and one major rewrite before it (knock on wood) worked correctly.
> It's not a one-to-one relation, in other words.

Ok, I'll take a stab at understanding that code. It seems to me a lot
of the confusion could be solved by getting a more level playing
field :-)

> I don't think we are disagreeing. I was speaking of "normal" load
> balancing (what the mainline kernel/sched.c code normally does).
> You're speaking of hard realtime load balancing.
>
> I think we agree that these both exist, and require different load
> balancing code, the latter 'very strong'.

Great :-)
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 12:52 UTC  Paul Jackson
To: Peter Zijlstra

> Ok, I'll take a stab at understanding that code.

See also the section:

  1.7 What is sched_load_balance ?

in Documentation/cpusets.txt.

Good luck ;).
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 13:38 UTC  Peter Zijlstra
To: Paul Jackson

On Tue, 2008-01-29 at 06:52 -0600, Paul Jackson wrote:
> See also the section:
>
>   1.7 What is sched_load_balance ?
>
> in Documentation/cpusets.txt.
>
> Good luck ;).

It seems Gregory tricked us both:

  57d885fea0da0e9541d7730a9e1dcf734981a173
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 10:57 UTC  Peter Zijlstra
To: linux-kernel

Here I go, talking to myself...

On Tue, 2008-01-29 at 10:53 +0100, Peter Zijlstra wrote:
> My thoughts were to make stronger use of disjoint cpu-sets. cgroups
> and cpusets are related, in that cpusets provide a property to a
> cgroup. However, load_balance_monitor()'s interaction with sched
> domains confuses me - it might DTRT, but I can't tell.
>
> [ It looks to me it balances a group over the largest SD the current
>   cpu has access to, even though that might be larger than the SD
>   associated with the cpuset of that particular cgroup. ]

Hmm, with a bit more thought I think that does indeed DTRT. Because, if
the cpu belongs to a disjoint cpuset, the highest sd (with
load-balancing enabled) would be that one. Right?

[ Just a bit of a shame we have all cgroups represented on each cpu. ]

Also, it might be a nice idea to split the daemon up if there are
indeed disjoint sets - currently there is a single daemon which touches
the whole system.
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 11:30 UTC  Paul Jackson
To: Peter Zijlstra

Peter wrote, in reply to Peter ;):
> > [ It looks to me it balances a group over the largest SD the current
> >   cpu has access to, even though that might be larger than the SD
> >   associated with the cpuset of that particular cgroup. ]
>
> Hmm, with a bit more thought I think that does indeed DTRT. Because,
> if the cpu belongs to a disjoint cpuset, the highest sd (with
> load-balancing enabled) would be that one. Right?

The code that defines sched domains, kernel/sched.c
partition_sched_domains(), as called from the cpuset code in
kernel/cpuset.c rebuild_sched_domains(), does not make use of the full
range of sched_domain possibilities.

In particular, it only sets up some non-overlapping set of sched
domains. Every CPU ends up in at most a single sched domain.

The original reason that one can't define overlapping sched domains via
this cpuset interface (based off the cpuset 'sched_load_balance' flag)
is that I didn't realize it was even possible to overlap sched domains
when I wrote the cpuset code defining sched domains. And then when I
later realized one could overlap sched domains, I (a) didn't see a need
to do so, and (b) couldn't see how to do so via the cpuset interface
without causing my brain to explode.

Now, back to Peter's question: being a bit pedantic, CPUs don't belong
to disjoint cpusets, except in the most minimal situation where there
is only one cpuset covering all CPUs.

Rather what happens, when you have need for some realtime CPUs, is
that:
 1) you turn off sched_load_balance on the top cpuset,
 2) you set up your realtime cpuset as a child of the top cpuset, such
    that its CPUs don't overlap those of any of its siblings, and
 3) you turn off sched_load_balance in that realtime cpuset.

At that point, sched domains are rebuilt, including providing a sched
domain that just contains the CPUs in that realtime cpuset, and normal
scheduler load balancing ceases on the CPUs in that realtime cpuset.

> [ Just a bit of a shame we have all cgroups represented on each cpu. ]

Could you restate this -- I suspect it's obvious, but I'm oblivious ;).
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 11:34 UTC  Paul Jackson
To: Paul Jackson

Paul, talking to himself:
> At that point, sched domains are rebuilt, including providing a sched
> domain that just contains the CPUs in that realtime cpuset, and
> normal scheduler load balancing ceases on the CPUs in that realtime
> cpuset.

Oops - correction - at that point sched domains are rebuilt, and the
CPUs in that realtime cpuset are not included in any sched domain at
all.
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 11:50 UTC  Peter Zijlstra
To: Paul Jackson

On Tue, 2008-01-29 at 05:30 -0600, Paul Jackson wrote:
> In particular, it only sets up some non-overlapping set of sched
> domains. Every CPU ends up in at most a single sched domain.

Ah, good to know. I thought it would reflect the hierarchy of the sets
themselves.

> The original reason that one can't define overlapping sched domains
> via this cpuset interface (based off the cpuset 'sched_load_balance'
> flag) is that I didn't realize it was even possible to overlap sched
> domains when I wrote the cpuset code defining sched domains. And then
> when I later realized one could overlap sched domains, I (a) didn't
> see a need to do so, and (b) couldn't see how to do so via the cpuset
> interface without causing my brain to explode.

Good reason :-), this code needs all the reasons it can grasp to not
grow more complexity.

> Rather what happens, when you have need for some realtime CPUs, is
> that:
>  1) you turn off sched_load_balance on the top cpuset,
>  2) you set up your realtime cpuset as a child of the top cpuset,
>     such that its CPUs don't overlap those of any of its siblings, and
>  3) you turn off sched_load_balance in that realtime cpuset.

Ah, I don't think 3 is needed. Quite to the contrary: there is quite a
large body of research covering the scheduling of (hard and soft)
realtime tasks on multiple cpus.

> At that point, sched domains are rebuilt, including providing a sched
> domain that just contains the CPUs in that realtime cpuset, and
> normal scheduler load balancing ceases on the CPUs in that realtime
> cpuset.

Right, which would also disable the realtime load-balancing we do want.
Hence my suggestion to stick the rt balance data in this sched domain.

> > [ Just a bit of a shame we have all cgroups represented on each
> >   cpu. ]
>
> Could you restate this -- I suspect it's obvious, but I'm oblivious ;).

Ah, sure. struct task_group creates cfs_rq/rt_rq entities for each
cpu's runqueue. So an iteration like for_each_leaf_{cfs,rt}_rq() will
touch all task_groups/cgroups, not only those that are actually
schedulable on that cpu.

Now, I think that could be easily solved by adding/removing
{cfs,rt}_rq->leaf_{cfs,rt}_rq_list to/from rq->leaf_{cfs,rt}_rq_list on
enqueue of the first/dequeue of the last entity of its tg on that rq.
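[ Editorial note: a minimal sketch of that last idea, with hypothetical
names (the real lists and enqueue paths live in kernel/sched.c): a
group's per-cpu runqueue joins the rq's leaf list when its first entity
is enqueued and leaves it when the last one is dequeued, so iteration
only visits groups actually runnable on that cpu. ]

```c
#include <assert.h>
#include <stddef.h>

/* Tiny intrusive circular list, in the style of the kernel's. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

struct cfs_rq { int nr_running; struct list_head leaf; };
struct rq { struct list_head leaf_cfs_rq_list; };

/* Link the group rq into the per-cpu leaf list only while it has
 * queued entities. */
static void enqueue_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
	if (cfs_rq->nr_running++ == 0)
		list_add(&cfs_rq->leaf, &rq->leaf_cfs_rq_list);
}

static void dequeue_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
	(void)rq;
	if (--cfs_rq->nr_running == 0)
		list_del(&cfs_rq->leaf);
}

/* How many group runqueues would a leaf-list walk touch? */
static int leaf_count(struct rq *rq)
{
	int n = 0;

	for (struct list_head *p = rq->leaf_cfs_rq_list.next;
	     p != &rq->leaf_cfs_rq_list; p = p->next)
		n++;
	return n;
}
```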
* Re: scheduler scalability - cgroups, cpusets and load-balancing
@ 2008-01-29 12:12 UTC  Paul Jackson
To: Peter Zijlstra

Peter, replying to Paul:
> > 3) you turn off sched_load_balance in that realtime cpuset.
>
> Ah, I don't think 3 is needed. Quite to the contrary: there is quite
> a large body of research covering the scheduling of (hard and soft)
> realtime tasks on multiple cpus.

Well, the way it's coded now, the user space code needs to do (3),
because that's the only way to get the system to have anything other
than one big fat sched domain covering all the CPUs in the system.

Actually ... I need a picture of a bunny with a pancake hat here, as I
have no idea what you just said ;).
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 12:12 ` Paul Jackson @ 2008-01-29 15:57 ` Gregory Haskins 2008-01-29 16:33 ` Paul Jackson 0 siblings, 1 reply; 36+ messages in thread From: Gregory Haskins @ 2008-01-29 15:57 UTC (permalink / raw) To: Peter Zijlstra, Paul Jackson Cc: mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin >>> On Tue, Jan 29, 2008 at 7:12 AM, in message <20080129061202.95b66041.pj@sgi.com>, Paul Jackson <pj@sgi.com> wrote: > Peter, replying to Paul: >> > 3) you turn off sched_load_balance in that realtime cpuset. >> >> Ah, I don't think 3 is needed. Quite to the contrary, there is quite a >> large body of research work covering the scheduling of (hard and soft) >> realtime tasks on multiple cpus. > > Well, the way it's coded now, the user space code needs to do (3), > because that's the only way they get the system to have anything > other than one big fat sched domain covering the all the CPUs in > the system. What about exclusive cpusets? Don't they create a new sched-domain or did I misunderstand there? -Greg > > Actually ... I need a picture of a bunny with a pancake hat here, > as I have no idea what you just said ;). ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 15:57 ` Gregory Haskins @ 2008-01-29 16:33 ` Paul Jackson 0 siblings, 0 replies; 36+ messages in thread From: Paul Jackson @ 2008-01-29 16:33 UTC (permalink / raw) To: Gregory Haskins Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin Gregory wrote: > What about exclusive cpusets? Don't they create a > new sched-domain or did I misunderstand there? cpu_exclusive cpusets no longer determine sched domains. I just said more on this in an earlier reply. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.940.382.4214 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 11:50 ` Peter Zijlstra 2008-01-29 12:12 ` Paul Jackson @ 2008-01-29 15:50 ` Gregory Haskins 2008-01-29 16:51 ` Paul Jackson 1 sibling, 1 reply; 36+ messages in thread From: Gregory Haskins @ 2008-01-29 15:50 UTC (permalink / raw) To: Peter Zijlstra, Paul Jackson Cc: mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin >>> On Tue, Jan 29, 2008 at 6:50 AM, in message <1201607401.28547.124.camel@lappy>, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Tue, 2008-01-29 at 05:30 -0600, Paul Jackson wrote: >> Peter wrote, in reply to Peter ;): >> > > [ It looks to me it balances a group over the largest SD the current cpu >> > > has access to, even though that might be larger than the SD associated >> > > with the cpuset of that particular cgroup. ] >> > >> > Hmm, with a bit more thought I think that does indeed DTRT. Because, if >> > the cpu belongs to a disjoint cpuset, the highest sd (with >> > load-balancing enabled) would be that. Right? >> >> The code that defines sched domains, kernel/sched.c > partition_sched_domains(), >> as called from the cpuset code in kernel/cpuset.c rebuild_sched_domains(), >> does not make use of the full range of sched_domain possibilities. >> >> In particular, it only sets up some non-overlapping set of sched domains. >> Every CPU ends up in at most a single sched domain. > > Ah, good to know. I thought it would reflect the hierarchy of the sets > themselves. > >> The original reason that one can't define overlapping sched domains via >> this cpuset interface (based off the cpuset 'sched_load_balance' flag) >> is that I didn't realize it was even possible to overlap sched domains >> when I wrote the cpuset code defining sched domains. 
And then when I >> later realized one could overlap sched domains, I (a) didn't see a need >> to do so, and (b) couldn't see how to do so via the cpuset interface >> without causing my brain to explode. > > Good reason :-), this code needs all the reasons it can grasp to not > grow more complexity. > >> Now, back to Peter's question, being a bit pedantic, CPUs don't belong >> to disjoint cpusets, except in the most minimal situation that there is >> only one cpuset covering all CPUs. >> >> Rather what happens, when you have need for some realtime CPUs, is that: >> 1) you turn off sched_load_balance on the top cpuset, >> 2) you setup your realtime cpuset as a child cpuset of the top cpuset >> such that its CPUs doesn't overlap any of its siblings, and >> 3) you turn off sched_load_balance in that realtime cpuset. > > Ah, I don't think 3 is needed. Quite to the contrary, there is quite a > large body of research work covering the scheduling of (hard and soft) > realtime tasks on multiple cpus. This is correct. We have the balance policy polymorphically associated with each sched_class, and the CFS load-balancer and RT "load" (really, priority) balancer can coexist together at the same time and across arbitrary #s of cores. From an RT perspective, this works great. It's a little trickier (and I don't think we have this quite right, yet) for the CFS side, since that interface deals strictly in terms of load. As such, it gets a little perturbed by these "rude" RT tasks that arbitrarily preempt its tasks. :) I think Steven may have done some work in that area by playing with the associated weight of RT tasks, etc so that the CFS balancer can more accurately account for the externally managed RT load on the system. But AFAIK, it's not in the tree yet. > >> At that point, sched domains are rebuilt, including providing a >> sched domain that just contains the CPUs in that realtime cpuset, and >> normal scheduler load balancing ceases on the CPUs in that realtime >> cpuset. 
> > Right, which would also disable the realtime load-balancing we do want. > Hence my suggestion to stick the rt balance data in this sched domain. > >> > [ Just a bit of a shame we have all cgroups represented on each cpu. ] >> >> Could you restate this -- I suspect it's obvious, but I'm oblivious ;). > > Ah, sure. struct task_group creates cfs_rq/rt_rq entities for each cpu's > runqueue. So an iteration like for_each_leaf_{cfs,rt}_rq() will touch > all task_groups/cgroups, not only those that are actually schedulable on > that cpu. > > Now, I think that could be easily solved by adding/removing > {cfs,rt}_rq->leaf_{cfs,rt}_rq_list to/from rq->leaf_{cfs,rt}_rq_list on > enqueue of the first/dequeue of the last entity of its tg on that rq. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 15:50 ` Gregory Haskins @ 2008-01-29 16:51 ` Paul Jackson 2008-01-29 17:21 ` Gregory Haskins 0 siblings, 1 reply; 36+ messages in thread From: Paul Jackson @ 2008-01-29 16:51 UTC (permalink / raw) To: Gregory Haskins Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin Gregory wrote: > This is correct. We have the balance policy polymorphically associated > with each sched_class, and the CFS load-balancer and RT "load" (really, > priority) balancer can coexist together at the same time and across > arbitrary #s of cores So ... we have the option of having all sched_classes coexist polymorphically. That I didn't realize until this thread. Now ... do we -want- to? That is, what is the easiest kernel-user API to work with and understand? Is it one where we essentially expose sched_class to user space, and let them pick their sched_class, or pick none of the above (don't balance)? Or is it one where, other than the special case my batch schedulers need to not balance at all, we expose nothing more to user space, and provide all sched_class load balancers to all sched_domains (other than those not balanced at all)? -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.940.382.4214 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 16:51 ` Paul Jackson @ 2008-01-29 17:21 ` Gregory Haskins 2008-01-29 19:04 ` Paul Jackson 0 siblings, 1 reply; 36+ messages in thread From: Gregory Haskins @ 2008-01-29 17:21 UTC (permalink / raw) To: Paul Jackson Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin >>> On Tue, Jan 29, 2008 at 11:51 AM, in message <20080129105104.d70f36ef.pj@sgi.com>, Paul Jackson <pj@sgi.com> wrote: > Gregory wrote: >> This is correct. We have the balance policy polymorphically associated >> with each sched_class, and the CFS load-balancer and RT "load" (really, >> priority) balancer can coexist together at the same time and across >> arbitrary #s of cores > > So ... we have the option of having all sched_classes coexist > polymorphically. > > That I didn't realize until this thread. It's on a per-task basis when the task elects SCHED_FIFO/RR/BATCH/OTHER, etc. If the task is on a particular RQ, the RQ operates under the policy of that class. There are some cases where the RQ consults the policy of all classes, but they are still influenced by whether there are actual tasks running within the scope of the current cpuset (or root-domain). > > Now ... do we -want- to ?) I think so, yes. But I will give the disclaimer that I don't fully understand your world ;) You could certainly create a group of cpus with homogeneous policy by creating a cpuset with only tasks of a single class as members. But likewise, if you populate a cpuset with tasks from mixed classes, you have mixed balance policy affecting those cpus. > > That is, what is the easiest kernel-user API to work with and understand? > > Is it one where we essentially expose sched_class to user space, and let > them pick their sched_class, or pick none of the above (don't balance)? 
IMHO it works well the way it is: The user selects the class for a particular task using sched_setscheduler(), and they select the cpuset (or inherit it) that defines its execution scope. If that scope has balancing enabled, the policy for the member classes is in effect. (on this topic, note that I do not know if the RT-balancer will respect the cpuset concept of "balance-enabled" anyway. That might have to be fixed) Again, the disclaimer that I do not have expertise in your area, so perhaps this is naive. > > Or is it one where, other than the special case my batch schedulers need > to not balance at all, we expose nothing more to user space, and provide > all sched_class load balancers to all sched_domains (other than those > not balanced at all)? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 17:21 ` Gregory Haskins @ 2008-01-29 19:04 ` Paul Jackson 2008-01-29 20:36 ` Gregory Haskins 0 siblings, 1 reply; 36+ messages in thread From: Paul Jackson @ 2008-01-29 19:04 UTC (permalink / raw) To: Gregory Haskins Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin Gregory wrote: > IMHO it works well the way it is: The user selects the class for a > particular task using sched_setscheduler(), and they select the cpuset > (or inherit it) that defines its execution scope. If that scope has > balancing enabled, the policy for the member classes is in effect. Ok. For the various classes of schedulers (sched_class's), it's fine by me if sched domains are polymorphic, supporting all classes, and it is left to each task to self-select the scheduling class of its preference. For the batch scheduler case, this -must- be imposable from outside the task, by the batch scheduler that is overseeing the job, and it must support the batch scheduler being able to disable all the balancers in selected cpusets (selected sched_domains). We have that now. Each of us only knew of part of the solution, but we managed to arrive at the desired answer even so ... amazing. The batch scheduler just has to arrange to get 'sched_load_balance' turned off in a cpuset and all overlapping cpusets, and then the CPUs in that cpuset will not belong to -any- sched_domain, and hence (could you verify I'm right in this detail?) won't be balanced by any sched_class. 
I should update the documentation for sched_load_balance, changing it from saying that you get realtime by turning off sched_load_balance in the RT cpuset, to saying that you get realtime by (1) turning off sched_load_balance in any overlapping cpusets, including all encompassing parent cpusets, (2) leaving sched_load_balance on in the RT cpuset itself, and (3) having those realtime tasks each self-select (elect) the desired SCHED_* using sched_setscheduler(). Condition (1) above is a tad difficult to understand, but serviceable, I guess. The combination of (1) and (2) results in a separate sched_domain just for the CPUs in the RT cpuset. > (on this topic, note that I do not know if the RT-balancer will > respect the cpuset concept of "balance-enabled" anyway. That might > have to be fixed) Er eh ... it has no choice. If the user space code has configured a cpuset with 'sched_load_balance' turned off in that cpuset and all overlapping cpusets, then there will not even be a sched_domain covering those CPUs, and hence no balancer, RT or other class, will even see those CPUs. Unless I really don't understand the kernel/sched.c sched_domain code (a distinct possibility), if some CPU is not in any sched_domain, then it won't get balanced, RT or otherwise. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.940.382.4214 ^ permalink raw reply [flat|nested] 36+ messages in thread
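Paul's revised recipe can be sketched from user space roughly as follows. This assumes the 2.6.24-era cpuset pseudo-filesystem interface; the mount point, file names, CPU numbers, and `$RT_PID` are all illustrative, and details should be checked against the cpuset documentation of the kernel in question:

```sh
# Assumed: cpuset fs mounted at /dev/cpuset
mount -t cpuset none /dev/cpuset

# (1) stop load balancing across the top cpuset (and any overlapping ones)
echo 0 > /dev/cpuset/sched_load_balance

# (2) a child cpuset for the RT CPUs, not overlapping its siblings,
#     with sched_load_balance left on so these CPUs get their own
#     sched domain when the domains are rebuilt
mkdir /dev/cpuset/rtset
echo 2-3 > /dev/cpuset/rtset/cpus
echo 0   > /dev/cpuset/rtset/mems
echo 1   > /dev/cpuset/rtset/sched_load_balance

# (3) move the realtime task in; the task itself elects SCHED_FIFO/RR
#     via sched_setscheduler()
echo $RT_PID > /dev/cpuset/rtset/tasks
```

Step (1) is the condition Paul calls "a tad difficult to understand": it must hold for every cpuset overlapping the RT one, not just the parent.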
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 19:04 ` Paul Jackson @ 2008-01-29 20:36 ` Gregory Haskins 2008-01-29 21:02 ` Paul Jackson 0 siblings, 1 reply; 36+ messages in thread From: Gregory Haskins @ 2008-01-29 20:36 UTC (permalink / raw) To: Paul Jackson Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin >>> On Tue, Jan 29, 2008 at 2:04 PM, in message <20080129130403.92d0a1fe.pj@sgi.com>, Paul Jackson <pj@sgi.com> wrote: > Gregory wrote: >> IMHO it works well the way it is: The user selects the class for a >> particular task using sched_setscheduler(), and they select the cpuset >> (or inherit it) that defines its execution scope. If that scope has >> balancing enabled, the policy for the member classes is in effect. > > Ok. > > For the various classes of schedulers (sched_class's), it's fine by me > if sched domains are polymorphic, supporting all classes, and it is > left to each task to self-select the scheduling class of its preference. > > For the batch scheduler case, this -must- be imposable from outside > the task, by the batch scheduler that is overseeing the job, and it > must support the batch scheduler being able to disable all the > balancers in selected cpusets (selected sched_domains). > > We have that now. Each of us only knew of part of the solution, > but we managed to arrive at the desired answer even so ... amazing. > > The batch scheduler just has to arrange to get 'sched_load_balance' > turned off in a cpuset and all overlapping cpusets, and then the > CPUS in that cpuset will not belong to -any- sched_domain, and hence > (could you verify I'm right in this detail?) won't be balanced by any > sched_class. I am a little fuzzy on how this would work, so I can't say for certain. :) But it seems like that is accurate. 
> > I should update the documentation for sched_load_balance, changing it > from saying that you get realtime by turning off sched_load_balance in > the RT cpuset, to saying that you get realtime by (1) turning off > sched_load_balance in any overlapping cpusets, including all > encompassing parent cpusets, (2) leaving sched_load_balance on in the > RT cpuset itself, and (3) having those realtime tasks each self-select > (elect) the desired SCHED_* using sched_setscheduler(). > > Condition (1) above is a tad difficult to understand, but servicable, > I guess. The combination of (1) and (2) results in a separate > sched_domain just for the CPUs in the RT cpuset. Technically you only need (2). I run my 4-8 core development systems in the single default global cpuset, normally. Customers typically do use multiple sets, but we only use the vanilla balanced variety. > >> (on this topic, note that I do not know if the RT-balancer will >> respect the cpuset concept of "balance-enabled" anyway. That might >> have to be fixed) > > Er eh ... it has no choice. If the user space code has configured a > cpuset with 'sched_load_balance' turned off in that cpuset and all > overlapping cpusets, then there will not even be a sched_domain > covering those CPUs, and hence no balancer, RT or other class, will > even see those CPUs. > > Unless I really don't understand the kernel/sched.c sched_domain code > (a distinct possibility), if some CPU is not in any sched_domain, then > it won't get balanced, RT or otherwise. Heh... I can't quite wrap my head around that, but it sounds like you are correct. The only thing I was really pointing out is that the RT code doesn't necessarily look at sched-domain flags before making balancing decisions. So as long as that is not a requirement, I think we are all set. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 20:36 ` Gregory Haskins @ 2008-01-29 21:02 ` Paul Jackson 2008-01-29 21:07 ` Gregory Haskins 0 siblings, 1 reply; 36+ messages in thread From: Paul Jackson @ 2008-01-29 21:02 UTC (permalink / raw) To: Gregory Haskins Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin Gregory wrote: > > ... (1) turning off > > sched_load_balance in any overlapping cpusets, including all > > encompassing parent cpusets, (2) leaving sched_load_balance on in the > > RT cpuset itself, and ... > > Technically you only need (2). I run my 4-8 core development systems > in the single default global cpuset, normally. Well, if you're running in the default cpuset, then you automatically get (1), because sched_load_balance is turned off in all overlapping cpusets (there aren't any overlapping cpusets!) So, yes, you -do- need both (1) and (2). In your normal system, you just happen to get (1) effortlessly. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.940.382.4214 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 21:02 ` Paul Jackson @ 2008-01-29 21:07 ` Gregory Haskins 0 siblings, 0 replies; 36+ messages in thread From: Gregory Haskins @ 2008-01-29 21:07 UTC (permalink / raw) To: Paul Jackson Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin >>> On Tue, Jan 29, 2008 at 4:02 PM, in message <20080129150234.b57ce988.pj@sgi.com>, Paul Jackson <pj@sgi.com> wrote: > Gregory wrote: >> > ... (1) turning off >> > sched_load_balance in any overlapping cpusets, including all >> > encompassing parent cpusets, (2) leaving sched_load_balance on in the >> > RT cpuset itself, and ... >> >> Technically you only need (2). I run my 4-8 core development systems >> in the single default global cpuset, normally. > > Well, if you're running in the default cpuset, then you automatically get > (1), > because sched_load_balance is turned off in all overlapping cpusets (there > aren't any overlapping cpusets!) > > So, yes, you -do- need both (1) and (2). In your normal system, you > just happen to get (1) effortlessly. Ah. Well see, I am just showing my ignorance of this area of the cpuset code then. I stand corrected, and sorry for the noise. :) -Greg ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 11:30 ` Paul Jackson 2008-01-29 11:34 ` Paul Jackson 2008-01-29 11:50 ` Peter Zijlstra @ 2008-01-29 15:36 ` Gregory Haskins 2008-01-29 16:28 ` Paul Jackson 2 siblings, 1 reply; 36+ messages in thread From: Gregory Haskins @ 2008-01-29 15:36 UTC (permalink / raw) To: Peter Zijlstra, Paul Jackson Cc: mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin >>> On Tue, Jan 29, 2008 at 6:30 AM, in message <20080129053005.bc7a11d7.pj@sgi.com>, Paul Jackson <pj@sgi.com> wrote: > Peter wrote, in reply to Peter ;): >> > [ It looks to me it balances a group over the largest SD the current cpu >> > has access to, even though that might be larger than the SD associated >> > with the cpuset of that particular cgroup. ] >> >> Hmm, with a bit more thought I think that does indeed DTRT. Because, if >> the cpu belongs to a disjoint cpuset, the highest sd (with >> load-balancing enabled) would be that. Right? > > The code that defines sched domains, kernel/sched.c > partition_sched_domains(), > as called from the cpuset code in kernel/cpuset.c rebuild_sched_domains(), > does not make use of the full range of sched_domain possibilities. > > In particular, it only sets up some non-overlapping set of sched domains. > Every CPU ends up in at most a single sched domain. > > The original reason that one can't define overlapping sched domains via > this cpuset interface (based off the cpuset 'sched_load_balance' flag) > is that I didn't realize it was even possible to overlap sched domains > when I wrote the cpuset code defining sched domains. And then when I > later realized one could overlap sched domains, I (a) didn't see a need > to do so, and (b) couldn't see how to do so via the cpuset interface > without causing my brain to explode. 
> > Now, back to Peter's question, being a bit pedantic, CPUs don't belong > to disjoint cpusets, except in the most minimal situation that there is > only one cpuset covering all CPUs. > > Rather what happens, when you have need for some realtime CPUs, is that: > 1) you turn off sched_load_balance on the top cpuset, > 2) you setup your realtime cpuset as a child cpuset of the top cpuset > such that its CPUs doesn't overlap any of its siblings, and > 3) you turn off sched_load_balance in that realtime cpuset. > > At that point, sched domains are rebuilt, including providing a > sched domain that just contains the CPUs in that realtime cpuset, and > normal scheduler load balancing ceases on the CPUs in that realtime > cpuset. Hi Paul, I am a bit confused as to why you disable load-balancing in the RT cpuset? It shouldn't be strictly necessary in order for the RT scheduler to do its job (unless I am misunderstanding what you are trying to accomplish?). Do you do this because you *have* to in order to make real-time deadlines, or because its just a further optimization? -Greg > >> [ Just a bit of a shame we have all cgroups represented on each cpu. ] > > Could you restate this -- I suspect it's obvious, but I'm oblivious ;). ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 15:36 ` Gregory Haskins @ 2008-01-29 16:28 ` Paul Jackson 2008-01-29 16:42 ` Gregory Haskins 0 siblings, 1 reply; 36+ messages in thread From: Paul Jackson @ 2008-01-29 16:28 UTC (permalink / raw) To: Gregory Haskins Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin Gregory wrote: > I am a bit confused as to why you disable load-balancing in the > RT cpuset? It shouldn't be strictly necessary in order for the > RT scheduler to do its job (unless I am misunderstanding what you > are trying to accomplish?). Do you do this because you *have* > to in order to make real-time deadlines, or because its just a > further optimization? My primary motivation for cpusets originally, and for the sched_load_balance flag now, was not realtime, but "soft partitioning" of big NUMA systems, especially for batch schedulers. They sometimes have large cpusets which are only being used to hold smaller, per-job, cpusets. It is a waste of time (CPU cycles in the kernel sched code) to load balance those large cpusets. Load balancing doesn't scale easily to high CPU counts, and it's nice to avoid doing that where not needed. See the following lkml message for a fuller explanation: http://lkml.org/lkml/2008/1/29/85 As a secondary motivation, I thought that disabling load balancing on the RT cpuset was the right thing to do for RT needs, but I make no claim to knowing much about RT. I just now realized that you added a 'root_domain' in a patch in late Nov and early Dec. I was on the road then, moving from California to Texas, and not paying much attention to Linux. A couple of questions on that patch, both involving a comment it adds to kernel/sched.c: /* * We add the notion of a root-domain which will be used to define per-domain * variables. 
Each exclusive cpuset essentially defines an island domain by * fully partitioning the member cpus from any other cpuset. Whenever a new * exclusive cpuset is created, we also create and attach a new root-domain * object. */ 1) What are 'per-domain' variables? 2) The mention of 'exclusive cpuset' is no longer correct. With the patch 'remove sched domain hooks from cpusets' cpusets no longer defines sched domains using the cpu_exclusive flag. With the subsequent sched_load_balance patch (see http://lkml.org/lkml/2007/10/6/19) cpusets uses a new per-cpuset flag 'sched_load_balance' to define sched domains. The following revised comment might be more accurate: /* * We add the notion of a root-domain which will be used to define per-domain * variables. Each non-overlapping sched domain defines an island domain by * fully partitioning the member cpus from any other cpuset. Whenever a new * such sched domain is created, we also create and attach a new root-domain * object. These non-overlapping sched domains are determined by the cpuset * configuration, via a call to partition_sched_domains(). */ It sounds like you (Gregory, others) want your RT CPUs to be in a sched domain, unlike the current way things are, where my cpuset code carefully avoids setting up a sched domain for those CPUs. However I still have need, in the batch scheduler case explained above, to have some CPUs not in any sched domain. If you require these RT sched domains to be set up differently somehow, in some way that is visible to partition_sched_domains, then that apparently means we need a per-cpuset flag to mark those RT cpusets. If you just want an ordinary sched domain setup (just so long as it contains only the intended RT CPUs, not others) then I guess we don't technically need any more per-cpuset flags, but I'm worried, because the API we're presenting to users for this has just gone from subtle to bizarre. 
I suspect I'll want to add a flag anyway, if by doing so, I can make the kernel-user API, via cpusets, easier to understand. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.940.382.4214 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing 2008-01-29 16:28 ` Paul Jackson @ 2008-01-29 16:42 ` Gregory Haskins 2008-01-29 19:37 ` Paul Jackson 0 siblings, 1 reply; 36+ messages in thread From: Gregory Haskins @ 2008-01-29 16:42 UTC (permalink / raw) To: Paul Jackson Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel, ebiederm, nickpiggin >>> On Tue, Jan 29, 2008 at 11:28 AM, in message <20080129102836.be614579.pj@sgi.com>, Paul Jackson <pj@sgi.com> wrote: > Gregory wrote: >> I am a bit confused as to why you disable load-balancing in the >> RT cpuset? It shouldn't be strictly necessary in order for the >> RT scheduler to do its job (unless I am misunderstanding what you >> are trying to accomplish?). Do you do this because you *have* >> to in order to make real-time deadlines, or because its just a >> further optimization? > > My primary motivation for cpusets originally, and for the > sched_load_balance flag now, was not realtime, but "soft partitioning" > of big NUMA systems, especially for batch schedulers. They sometimes > have large cpusets which are only being used to hold smaller, per-job, > cpusets. It is a waste of time (CPU cycles in the kernel sched code) > to load balance those large cpusets. Load balancing doesn't scale > easily to high CPU counts, and it's nice to avoid doing that where > not needed. Understood, and that makes tons of sense. > > See the following lkml message for a fuller explanation: > > http://lkml.org/lkml/2008/1/29/85 > > As a secondary motivation, I thought that disabling load balancing on > the RT cpuset was the right thing to do for RT needs, but I make no > claim to knowing much about RT. Well, I make no claim to understand the large batch systems you work on either ;) Everything you said made a ton of sense other than the RT/load-balance thing, but I think we are on the same page now. 
> > I just now realized that you added a 'root_domain' in a patch in > late Nov and early Dec. I was on the road then, moving from > California to Texas, and not paying much attention to Linux. np (though I was wondering why you had no comment before ;) > > A couple of questions on that patch, both involving a comment it adds > to kernel/sched.c: > > /* > * We add the notion of a root-domain which will be used to define per-domain > * variables. Each exclusive cpuset essentially defines an island domain by > * fully partitioning the member cpus from any other cpuset. Whenever a new > * exclusive cpuset is created, we also create and attach a new root-domain > * object. > */ > > 1) What are 'per-domain' variables? s/per-domain/per-root-domain > > 2) The mention of 'exclusive cpuset' is no longer correct. > > With the patch 'remove sched domain hooks from cpusets' cpusets > no longer defines sched domains using the cpu_exclusive flag. > > With the subsequent sched_load_balance patch (see > http://lkml.org/lkml/2007/10/6/19) cpusets uses a new per-cpuset > flag 'sched_load_balance' to define sched domains. Doh! Thanks for the heads up. > > The following revised comment might be more accurate: > > /* > * We add the notion of a root-domain which will be used to define per-domain > * variables. Each non-overlapping sched domain defines an island domain by > * fully partitioning the member cpus from any other cpuset. Whenever a new > * such a sched domain is created, we also create and attach a new > root-domain > * object. These non-overlapping sched domains are determined by the cpuset > * configuration, via a call to partition_sched_domains(). > */ > > It sounds like you (Gregory, others) want your RT CPUs to be in a sched > domain, unlike the current way things are, where my cpuset code > carefully avoids setting up a sched domain for those CPUs. However I > still have need, in the batch scheduler case explained above, to have > some CPUs not in any sched domain. 
> > If you require these RT sched domains to be setup differently somehow, > in some way that is visible to partition_sched_domains, then that > apparently means we need a per-cpuset flag to mark those RT cpusets. I think we only need a plain-vanilla partition, so no flags should be necessary. -Greg > > If you just want an ordinary sched domain setup (just so long as it > contains only the intended RT CPUs, not others) then I guess we don't > technically need any more per-cpuset flags, but I'm worried, because > the API we're presenting to users for this has just gone from subtle to > bizarre. I suspect I'll want to add a flag anyway, if by doing so, I > can make the kernel-user API, via cpusets, easier to understand. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: scheduler scalability - cgroups, cpusets and load-balancing
  2008-01-29 16:42   ` Gregory Haskins
@ 2008-01-29 19:37     ` Paul Jackson
  2008-01-29 20:28       ` Gregory Haskins
  0 siblings, 1 reply; 36+ messages in thread
From: Paul Jackson @ 2008-01-29 19:37 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes,
	tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel,
	ebiederm, nickpiggin

Gregory wrote:
> > 1) What are 'per-domain' variables?
>
> s/per-domain/per-root-domain

Oh dear - now I've got more questions, not fewer.

 1) "variables" ... what variables?

 2) Is a 'root-domain' just the RT specific portion
    of a sched_domain, or is it something else?

-- 
                 I won't rest till it's the best ...
           Programmer, Linux Scalability
           Paul Jackson <pj@sgi.com> 1.940.382.4214
* Re: scheduler scalability - cgroups, cpusets and load-balancing
  2008-01-29 19:37     ` Paul Jackson
@ 2008-01-29 20:28       ` Gregory Haskins
  2008-01-29 20:56         ` Paul Jackson
  0 siblings, 1 reply; 36+ messages in thread
From: Gregory Haskins @ 2008-01-29 20:28 UTC (permalink / raw)
  To: Paul Jackson
  Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes,
	tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel,
	ebiederm, nickpiggin

>>> On Tue, Jan 29, 2008 at  2:37 PM, in message
<20080129133700.7f1ab444.pj@sgi.com>, Paul Jackson <pj@sgi.com> wrote:
> Gregory wrote:
>> > 1) What are 'per-domain' variables?
>>
>> s/per-domain/per-root-domain
>
> Oh dear - now I've got more questions, not fewer.
>
> 1) "variables" ... what variables?

Well, anything that is declared in "struct root_domain" in
kernel/sched.c.  For instance, today in mainline we have:

struct root_domain {
	atomic_t refcount;
	cpumask_t span;
	cpumask_t online;

	/*
	 * The "RT overload" flag: it gets set if a CPU has more than
	 * one runnable RT task.
	 */
	cpumask_t rto_mask;
	atomic_t rto_count;
};

The first three are just related to general root-domain infrastructure
code.  The last two in this case are related specifically to the
rt-overload feature.

In earlier versions of rt-balance, the rt-overload bitmap was a global
variable.  By moving it into the root_domain structure, there is now an
instance per (um, for lack of a better, more up to date word)
"exclusive" cpuset.  That way, disparate cpusets will not bother each
other with overload notifications, etc.

Note that in -rt, we have more variables in this structure (RQ priority
info) but that patch hasn't been pulled into sched-devel/linux-2.6 yet.

>
> 2) Is a 'root-domain' just the RT specific portion
>    of a sched_domain, or is it something else?

It's meant to be general, but the only current client is the RT
sched_class.
Reading back through the links you guys have been sending, it's very
similar in concept to the "rt-domain" stuff that you, Peter, and Steven
were discussing a while back.

When I was originally putting this stuff together, I wanted to piggy
back this data onto the sched-domain code.  But I soon realized that
the sched-domain trees are per-cpu structures.  What I needed was an
"umbrella" structure that would allow cpus in a common cpuset to share
arbitrary state data, yet were still associated with the sched-domains
that the cpuset code set up.

The first pass had the structures associated with the sched-domain
hierarchy, but I soon realized that it was really a per-rq association,
so I could simplify the design.  I.e., rather than have the code walk
the sched-domain to find the common "root", I just hung the root
directly on the rq itself.

But anyway, to answer the question: the concept is meant to be generic.
For instance, if it made sense for Peter's cgroup work to sit here as
well, we could just add new fields to the struct root_domain and Peter
could access them via rq->rd.

I realize that it could possibly have been designed to abstract away
the type of objects that the root-domain manages, but I wanted to keep
the initial code as simple as possible.  We can always
complicate^h^h^h^h^hcleanup the code later ;)

Regards,
-Greg
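The per-root-domain rt-overload tracking Gregory describes can be sketched in a stand-alone user-space model. Field names mirror the struct root_domain quoted above, but the atomics and locking are omitted and the cpumasks are simplified to a 64-bit word, so treat it as illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the kernel's root_domain: one instance shared
 * by all CPUs in a disjoint partition. */
struct root_domain {
	uint64_t span;     /* CPUs covered by this partition */
	uint64_t rto_mask; /* CPUs with more than one runnable RT task */
	int rto_count;
};

struct rq {
	int cpu;
	struct root_domain *rd; /* rq->rd, shared across the partition */
};

/* Mark this CPU as overloaded; only peers in the same root_domain
 * will ever see the bit. */
static void rt_set_overload(struct rq *rq)
{
	uint64_t bit = 1ULL << rq->cpu;
	if (!(rq->rd->rto_mask & bit)) {
		rq->rd->rto_mask |= bit;
		rq->rd->rto_count++;
	}
}

static void rt_clear_overload(struct rq *rq)
{
	uint64_t bit = 1ULL << rq->cpu;
	if (rq->rd->rto_mask & bit) {
		rq->rd->rto_mask &= ~bit;
		rq->rd->rto_count--;
	}
}
```

Because each disjoint cpuset gets its own root_domain, an overload on a CPU in one partition never shows up in another partition's rto_mask; that is the whole point of un-globalizing the bitmap.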
* Re: scheduler scalability - cgroups, cpusets and load-balancing
  2008-01-29 20:28       ` Gregory Haskins
@ 2008-01-29 20:56         ` Paul Jackson
  2008-01-29 21:02           ` Gregory Haskins
  0 siblings, 1 reply; 36+ messages in thread
From: Paul Jackson @ 2008-01-29 20:56 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes,
	tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel,
	ebiederm, nickpiggin

Gregory wrote:
> By moving it into the root_domain structure, there is now an instance
> per (um, for lack of a better, more up to date word) "exclusive"
> cpuset.  That way, disparate cpusets will not bother each other with
> overload notifications, etc.

So the root_domain structure is meant to be the portions of the
sched_domains that are shared across all CPUs in that sched_domain?

And the word 'cpuset', occurring in the above quote twice, should
be 'sched_domain', right?  Surely these aren't cpusets ;).

And 'exclusive cpuset' really means 'non-overlapping sched_domain'?

Or am I still confused?

I would like to get our concepts clear, and terms consistent.  That's
important for those others who would try to understand this.

-- 
                 I won't rest till it's the best ...
           Programmer, Linux Scalability
           Paul Jackson <pj@sgi.com> 1.940.382.4214
* Re: scheduler scalability - cgroups, cpusets and load-balancing
  2008-01-29 20:56         ` Paul Jackson
@ 2008-01-29 21:02           ` Gregory Haskins
  2008-01-29 22:23             ` Steven Rostedt
  0 siblings, 1 reply; 36+ messages in thread
From: Gregory Haskins @ 2008-01-29 21:02 UTC (permalink / raw)
  To: Paul Jackson
  Cc: a.p.zijlstra, mingo, dmitry.adamushko, rostedt, menage, rientjes,
	tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb, linux-kernel,
	ebiederm, nickpiggin

>>> On Tue, Jan 29, 2008 at  3:56 PM, in message
<20080129145647.579b7d53.pj@sgi.com>, Paul Jackson <pj@sgi.com> wrote:
> Gregory wrote:
>> By moving it into the root_domain structure, there is now an instance
>> per (um, for lack of a better, more up to date word) "exclusive"
>> cpuset.  That way, disparate cpusets will not bother each other with
>> overload notifications, etc.
>
> So the root_domain structure is meant to be the portions of the
> sched_domains that are shared across all CPUs in that sched_domain?

That's exactly right.

>
> And the word 'cpuset', occurring in the above quote twice, should
> be 'sched_domain', right?  Surely these aren't cpusets ;).

Yeah, I think I am taking shortcuts in the language ;).  I wanted the
root_domain to be an object of shared data that sits at the "root
sched_domain", or in other terms, the terminating parent in the
hierarchy.  And there is one of these suckers created every time a
non-overlapping cpuset is created (which was called "exclusive" at the
time I wrote it, I believe, but I keep forgetting what you said they
are called now ;).  So because the non-overlapping cpuset configuration
begat the sched_domain hierarchy, I started using them interchangeably.
Sorry for the confusion :)

>
> And 'exclusive cpuset' really means 'non-overlapping sched_domain'?
>
> Or am I still confused?

No, I think you nailed it.

>
> I would like to get our concepts clear, and terms consistent.  That's
> important for those others who would try to understand this.

Very good idea.  Thanks for doing this!
-Greg
* Re: scheduler scalability - cgroups, cpusets and load-balancing
  2008-01-29 21:02           ` Gregory Haskins
@ 2008-01-29 22:23             ` Steven Rostedt
  0 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2008-01-29 22:23 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Paul Jackson, a.p.zijlstra, mingo, dmitry.adamushko, menage,
	rientjes, tong.n.li, tglx, akpm, dhaval, vatsa, sgrubb,
	linux-kernel, ebiederm, nickpiggin

On Tue, 29 Jan 2008, Gregory Haskins wrote:
> >
> > I would like to get our concepts clear, and terms consistent.  That's
> > important for those others who would try to understand this.
>
> Very good idea.  Thanks for doing this!

Sorry for coming in so late, I've been banging my head on different
bugs all day.

Just to clear up what our goal for the RT balancer was, and how simple
it is ;-)

Basically, any task that has an RT priority needs to run ASAP from the
time it wakes up, if there's a CPU available that it can run on and it
has a higher priority than what is currently running on that CPU.

If an RT task wakes up and there's a CPU available somewhere for it to
run on, we want the RT task to jump to that CPU and run.  RT tasks
should not be waiting around for nice load balancing that optimizes
cache usage.

But we also have a problem.  We don't want to kill the cache on large
NUMA architectures by looking for places for one RT task to run.  With
domains, we first look for a CPU in the local node that the RT task can
run on; if we find one, we place the task there, otherwise we look at
other nodes.

Note that the RT balancing is aggressive and not passive.  That means
the balancing takes place at the time the RT task is awoken (perhaps by
the task that is waking it) or at the time a task changes priority.  It
is not passive, in that it does not wait for something else to migrate
it (i.e. the migration thread).

Paul, I think you now understand that we don't have some scheduler
domain that is specific to RT.
The scheduling class is specific to the priority of the process and not
to what domain it is in.  But if you keep a domain invisible to an RT
task, that domain never needs to worry about RT tasks migrating to it.

The code that Gregory and I have been adding was to try to migrate RT
tasks to CPUs they can run on as quickly as possible, without
algorithms that cause cacheline bouncing.

-- Steve
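The push policy Steven describes — preempt somewhere immediately, but prefer the local node to limit cache and NUMA damage — can be sketched roughly as follows. Every name here is made up for illustration; this is not the actual kernel interface:

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

struct cpu_state {
	int node;      /* NUMA node of this CPU */
	int curr_prio; /* priority of the currently running task
	                * (higher value = more important) */
};

/* Return the first CPU in `allowed_mask` running something of lower
 * priority than `prio`, preferring CPUs on `this_node`.  Returns -1
 * if the task cannot preempt anywhere. */
static int find_cpu_for_rt_task(const struct cpu_state cpus[NR_CPUS],
				uint32_t allowed_mask, int prio,
				int this_node)
{
	int best = -1, best_local = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(allowed_mask & (1u << cpu)))
			continue;
		if (cpus[cpu].curr_prio >= prio)
			continue; /* would not preempt this CPU */
		if (cpus[cpu].node == this_node && best_local < 0)
			best_local = cpu;
		if (best < 0)
			best = cpu;
	}
	return best_local >= 0 ? best_local : best;
}
```

The aggressive part is simply that something like this is invoked at wakeup or priority-change time, rather than waiting for a periodic balancer pass.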
* Re: scheduler scalability - cgroups, cpusets and load-balancing
  2008-01-29 10:57 ` Peter Zijlstra
  2008-01-29 11:30   ` Paul Jackson
@ 2008-01-29 12:32   ` Srivatsa Vaddagiri
  2008-01-29 12:21     ` Paul Jackson
  1 sibling, 1 reply; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2008-01-29 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Dhaval Giani, Paul Jackson,
	Nick Piggin, Eric W. Biederman, Andrew Morton, Steve Grubb,
	Steven Rostedt, Gregory Haskins, Dmitry Adamushko, Li, Tong N,
	Thomas Gleixner, Paul Menage, David Rientjes

On Tue, Jan 29, 2008 at 11:57:22AM +0100, Peter Zijlstra wrote:
> On Tue, 2008-01-29 at 10:53 +0100, Peter Zijlstra wrote:
>
> > My thoughts were to make stronger use of disjoint cpu-sets. cgroups and
> > cpusets are related, in that cpusets provide a property to a cgroup.
> > However, load_balance_monitor()'s interaction with sched domains
> > confuses me - it might DTRT, but I can't tell.
> >
> > [ It looks to me it balances a group over the largest SD the current cpu
> >   has access to, even though that might be larger than the SD associated
> >   with the cpuset of that particular cgroup. ]
>
> Hmm, with a bit more thought I think that does indeed DTRT. Because, if
> the cpu belongs to a disjoint cpuset, the highest sd (with
> load-balancing enabled) would be that. Right?

Hi Peter,
	Yes, I had this in mind when I wrote the load_balance_monitor()
function - to only balance across cpus that form a disjoint cpuset in
the system.

> [ Just a bit of a shame we have all cgroups represented on each cpu. ]

After reading your explanation in the other mail about what you mean
here, I agree.  Your suggestion to remove/add cfs_rq from/to the
leaf_cfs_rq_list upon dequeue_of_last_task/enqueue_of_first_task AND

> Also, might be a nice idea to split the daemon up if there are indeed
> disjoint sets - currently there is only a single daemon which touches
> the whole system.

the above suggestions seem like good ideas.
I can also look at reducing the frequency at which the thread runs.

-- 
Regards,
vatsa
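A per-partition balance pass — as opposed to the current single global daemon — could look roughly like this: redistribute a group's total shares across only the CPUs of one disjoint partition, in proportion to the group's load there. Purely illustrative; the real load_balance_monitor() logic differs in detail:

```c
#include <assert.h>

/* Split `total_shares` for one task group across the `ncpus` CPUs of
 * a single disjoint partition, proportionally to the group's per-cpu
 * load in that partition.  An idle group gets an even split. */
static void balance_group_in_partition(const unsigned long load[],
				       int ncpus,
				       unsigned long total_shares,
				       unsigned long shares_out[])
{
	unsigned long total_load = 0;

	for (int i = 0; i < ncpus; i++)
		total_load += load[i];

	for (int i = 0; i < ncpus; i++)
		shares_out[i] = total_load
			? total_shares * load[i] / total_load
			: total_shares / ncpus;
}
```

One daemon (or timer) per disjoint partition would then call something like this for its own CPUs only, so a pass never touches the whole machine.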
* Re: scheduler scalability - cgroups, cpusets and load-balancing
  2008-01-29 12:32   ` Srivatsa Vaddagiri
@ 2008-01-29 12:21     ` Paul Jackson
  0 siblings, 0 replies; 36+ messages in thread
From: Paul Jackson @ 2008-01-29 12:21 UTC (permalink / raw)
  To: vatsa
  Cc: a.p.zijlstra, linux-kernel, mingo, dhaval, nickpiggin, ebiederm,
	akpm, sgrubb, rostedt, ghaskins, dmitry.adamushko, tong.n.li,
	tglx, menage, rientjes

vatsa wrote to Peter:
> After reading your explanation in the other mail about what you mean
> here, I agree.

Ah good - glad someone understood that.

-- 
                 I won't rest till it's the best ...
           Programmer, Linux Scalability
           Paul Jackson <pj@sgi.com> 1.940.382.4214
end of thread, other threads:[~2008-01-29 22:24 UTC | newest]

Thread overview: 36+ messages (links below jump to the message on this page):
2008-01-29  9:53 scheduler scalability - cgroups, cpusets and load-balancing Peter Zijlstra
2008-01-29 10:01 ` Paul Jackson
2008-01-29 10:50 ` Peter Zijlstra
2008-01-29 11:13 ` Paul Jackson
2008-01-29 11:31 ` Peter Zijlstra
2008-01-29 11:53 ` Paul Jackson
2008-01-29 12:07 ` Peter Zijlstra
2008-01-29 12:36 ` Paul Jackson
2008-01-29 12:03 ` Paul Jackson
2008-01-29 12:30 ` Peter Zijlstra
2008-01-29 12:52 ` Paul Jackson
2008-01-29 13:38 ` Peter Zijlstra
2008-01-29 10:57 ` Peter Zijlstra
2008-01-29 11:30 ` Paul Jackson
2008-01-29 11:34 ` Paul Jackson
2008-01-29 11:50 ` Peter Zijlstra
2008-01-29 12:12 ` Paul Jackson
2008-01-29 15:57 ` Gregory Haskins
2008-01-29 16:33 ` Paul Jackson
2008-01-29 15:50 ` Gregory Haskins
2008-01-29 16:51 ` Paul Jackson
2008-01-29 17:21 ` Gregory Haskins
2008-01-29 19:04 ` Paul Jackson
2008-01-29 20:36 ` Gregory Haskins
2008-01-29 21:02 ` Paul Jackson
2008-01-29 21:07 ` Gregory Haskins
2008-01-29 15:36 ` Gregory Haskins
2008-01-29 16:28 ` Paul Jackson
2008-01-29 16:42 ` Gregory Haskins
2008-01-29 19:37 ` Paul Jackson
2008-01-29 20:28 ` Gregory Haskins
2008-01-29 20:56 ` Paul Jackson
2008-01-29 21:02 ` Gregory Haskins
2008-01-29 22:23 ` Steven Rostedt
2008-01-29 12:32 ` Srivatsa Vaddagiri
2008-01-29 12:21 ` Paul Jackson