* exclusive cpusets broken with cpu hotplug @ 2006-10-18 2:25 Siddha, Suresh B 2006-10-18 7:14 ` Paul Jackson 2006-10-18 17:54 ` Dinakar Guniguntala 0 siblings, 2 replies; 24+ messages in thread From: Siddha, Suresh B @ 2006-10-18 2:25 UTC (permalink / raw) To: pj, dino, menage Cc: Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar, nickpiggin Whenever a cpu hotplug happens, the current kernel calls build_sched_domains() with cpu_online_map. That will destroy all the domain partitions (done by partition_sched_domains()) set up so far by exclusive cpusets. And it's not just cpu hotplug; this happens even if someone changes the multi-core sched power-savings policy. Would anyone like to fix it up? In the presence of cpusets, we basically need to traverse all the exclusive sets and set up the sched domains accordingly. If no one does :( then I will do that when I get some time... thanks, suresh ^ permalink raw reply [flat|nested] 24+ messages in thread
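For readers who have not looked at this path: the hotplug notifier being described has roughly the following shape (a simplified sketch from memory of the 2.6.18-era kernel/sched.c, not a verbatim quote). The final call rebuilds one flat domain spanning cpu_online_map, which is what discards any partitioning cpusets had requested:

	/* Simplified sketch of the current hotplug path (not verbatim). */
	static int update_sched_domains(struct notifier_block *nfb,
					unsigned long action, void *hcpu)
	{
		switch (action) {
		case CPU_UP_PREPARE:
		case CPU_DOWN_PREPARE:
			/* detach everything while cpus come and go */
			detach_destroy_domains(&cpu_online_map);
			return NOTIFY_OK;
		case CPU_UP_CANCELED:
		case CPU_DOWN_FAILED:
		case CPU_ONLINE:
		case CPU_DEAD:
			break;		/* fall through and rebuild */
		default:
			return NOTIFY_DONE;
		}

		/*
		 * One flat partition over all online cpus; anything that
		 * exclusive cpusets had set up via partition_sched_domains()
		 * is lost at this point.
		 */
		arch_init_sched_domains(&cpu_online_map);
		return NOTIFY_OK;
	}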
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 2:25 exclusive cpusets broken with cpu hotplug Siddha, Suresh B @ 2006-10-18 7:14 ` Paul Jackson 2006-10-18 9:56 ` Robin Holt 2006-10-18 17:54 ` Dinakar Guniguntala 1 sibling, 1 reply; 24+ messages in thread From: Paul Jackson @ 2006-10-18 7:14 UTC (permalink / raw) To: Siddha, Suresh B Cc: dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar, nickpiggin > Would anyone like to fix it up? Hotplug is not high on my priority list. I do what I can in my spare time to avoid having cpusets or hotplug break each other. Besides, I'm not sure I'd be able. I've gotten to the point where I am confident I can make simple changes at the edges, such as mimicking the sched domain side effects of the cpu_exclusive flag with my new sched_domain flag. But that's near the current limit of my sched domain writing skills. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 7:14 ` Paul Jackson @ 2006-10-18 9:56 ` Robin Holt 2006-10-18 10:10 ` Paul Jackson 0 siblings, 1 reply; 24+ messages in thread From: Robin Holt @ 2006-10-18 9:56 UTC (permalink / raw) To: Paul Jackson Cc: Siddha, Suresh B, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar, nickpiggin On Wed, Oct 18, 2006 at 12:14:24AM -0700, Paul Jackson wrote: > > Would anyone like to fix it up? > > Hotplug is not high on my priority list. > > I do what I can in my spare time to avoid having cpusets or hotplug > break each other. > > Besides, I'm not sure I'd be able. I've gotten to the point where I am > confident I can make simple changes at the edges, such as mimicking the > sched domain side effects of the cpu_exclusive flag with my new > sched_domain flag. But that's near the current limit of my sched domain > writing skills. Paul and Suresh, Could this be as simple as a CPU_UP_PREPARE or CPU_DOWN_PREPARE removing all the cpu_exclusive cpuset partitions, and a CPU_UP_CANCELLED, CPU_DOWN_CANCELLED, CPU_ONLINE, CPU_DEAD going through and re-partitioning all the cpu_exclusive cpusets? Thanks, Robin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 9:56 ` Robin Holt @ 2006-10-18 10:10 ` Paul Jackson 2006-10-18 10:53 ` Robin Holt 2006-10-18 12:16 ` Nick Piggin 0 siblings, 2 replies; 24+ messages in thread From: Paul Jackson @ 2006-10-18 10:10 UTC (permalink / raw) To: Robin Holt Cc: suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar, nickpiggin Robin wrote: > Could this be as simple as a CPU_UP_PREPARE or CPU_DOWN_PREPARE > removing all the cpu_exclusive cpuset partitions, and a CPU_UP_CANCELLED, > CPU_DOWN_CANCELLED, CPU_ONLINE, CPU_DEAD going through and > re-partitioning all the cpu_exclusive cpusets? Perhaps. The somewhat related problems, in my book, are: 1) I don't know how to tell what sched domains/groups a system has, nor how to tell my customers how to see what sched domains they have, and 2) I suspect that Mr. Cpusets doesn't understand sched domains and that Mr. Sched Domain doesn't understand cpusets, and that we've ended up with some inscrutable and likely unsuitable interactions between the two as a result, which in particular don't result in cpusets driving the sched domain configuration in the desired ways for some of the less trivial configs. Well ... at least the first suspicion above is a near certainty ;). -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 10:10 ` Paul Jackson @ 2006-10-18 10:53 ` Robin Holt 2006-10-18 21:07 ` Paul Jackson 0 siblings, 1 reply; 24+ messages in thread From: Robin Holt @ 2006-10-18 10:53 UTC (permalink / raw) To: Paul Jackson Cc: Robin Holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar, nickpiggin On Wed, Oct 18, 2006 at 03:10:21AM -0700, Paul Jackson wrote: > 2) I suspect that Mr. Cpusets doesn't understand sched domains and that > Mr. Sched Domain doesn't understand cpusets, and that we've ended > up with some inscrutable and likely unsuitable interactions between > the two as a result, which in particular don't result in cpusets > driving the sched domain configuration in the desired ways for some > of the less trivial configs. You do, however, hopefully have enough information to create the calls you would make to partition_sched_domains() if each had its cpu_exclusive flag cleared. Essentially, what I am proposing is making all the calls as if the user had cleared each as the remove/add starts, and then behaving as if each was set again. Thanks, Robin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 10:53 ` Robin Holt @ 2006-10-18 21:07 ` Paul Jackson 2006-10-19 5:56 ` Paul Jackson 0 siblings, 1 reply; 24+ messages in thread From: Paul Jackson @ 2006-10-18 21:07 UTC (permalink / raw) To: Robin Holt Cc: suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar, nickpiggin > You do, however, hopefully have enough information to create the > calls you would make to partition_sched_domains() if each had its > cpu_exclusive flag cleared. Essentially, what I am proposing is > making all the calls as if the user had cleared each as the > remove/add starts, and then behaving as if each was set again. Yes - hopefully we have enough information to rebuild the sched domains each time, consistently. And your proposal is probably an improvement for that reason. However, I'm afraid that only solves half the problem. It makes the sched domains more repeatable and predictable. But I'm worried that the cpuset control over sched domains is still broken ... see the example below. I've half a mind to prepare a patch to just rip out the sched domain defining code from kernel/cpuset.c, completely uncoupling the cpu_exclusive flag, and any other cpuset flags, from sched domains. Example: As best as I can tell (which is not very far ;), if some hapless user does the following: /dev/cpuset cpu_exclusive == 1; cpus == 0-7 /dev/cpuset/a cpu_exclusive == 1; cpus == 0-3 /dev/cpuset/b cpu_exclusive == 1; cpus == 4-7 and then runs a big job in the top cpuset (/dev/cpuset), then that big job will not load balance correctly, with whatever threads in the big job got stuck on cpus 0-3 isolated from whatever threads got stuck on cpus 4-7. Is this correct? If so, there is no practical way that I can see on a production system for the system admin to realize they have messed up their system this way. If we can't make this work properly automatically, then we either need to provide users the visibility and control to make it work by explicit manual control (meaning my 'sched_domain' flag patch, plus some way of exporting the sched domain topology in /sys), or we need to stop doing this. If the above example is not correct, then I'm afraid my education in sched domains is in need of another lesson. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 21:07 ` Paul Jackson @ 2006-10-19 5:56 ` Paul Jackson 0 siblings, 0 replies; 24+ messages in thread From: Paul Jackson @ 2006-10-19 5:56 UTC (permalink / raw) To: Paul Jackson Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar, nickpiggin Earlier today I wrote: > I've half a mind to prepare a patch to just rip out the sched domain > defining code from kernel/cpuset.c, completely uncoupling the > cpu_exclusive flag, and any other cpuset flags, from sched domains. > > Example: > > As best as I can tell (which is not very far ;), if some hapless > user does the following: > > /dev/cpuset cpu_exclusive == 1; cpus == 0-7 > /dev/cpuset/a cpu_exclusive == 1; cpus == 0-3 > /dev/cpuset/b cpu_exclusive == 1; cpus == 4-7 > > and then runs a big job in the top cpuset (/dev/cpuset), then that > big job will not load balance correctly, with whatever threads > in the big job got stuck on cpus 0-3 isolated from whatever > threads got stuck on cpus 4-7. > > Is this correct? > > If so, there is no practical way that I can see on a production system for > the system admin to realize they have messed up their system this way. > > If we can't make this work properly automatically, then we either need > to provide users the visibility and control to make it work by explicit > manual control (meaning my 'sched_domain' flag patch, plus some way of > exporting the sched domain topology in /sys), or we need to stop doing > this. I am now more certain - the above gives an example of serious breakage with the current mechanism of connecting cpusets to sched domains via the cpu_exclusive flag. We should either fix it (perhaps with my patch to add sched_domain flags to cpusets, plus a yet-to-be-written patch to make sched domains visible via /sys or some such place), or we should nuke it. I am now 90% certain we should nuke the entire mechanism connecting cpusets to sched domains via the cpu_exclusive flag. The only useful thing to be done, which is much simpler, is to provide some way to manipulate the cpu_isolated_map at runtime. I have a pair of patches ready to ship out that do this. Coming soon to a mailing list near you ... -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
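To make the idea concrete, one possible shape for runtime manipulation of cpu_isolated_map is sketched below. This is purely illustrative and is not the pending patch being referred to: the name sched_isolate_cpu() is invented for the example, and a real implementation would also need locking and would have to migrate any tasks already running on the cpu.

	/* Illustrative sketch only. cpu_isolated_map already exists in
	 * kernel/sched.c and is normally filled in at boot from the
	 * "isolcpus=" option; the idea here is to let it change at runtime
	 * and then rebuild one domain partition excluding isolated cpus. */
	int sched_isolate_cpu(int cpu)		/* invented name */
	{
		cpumask_t non_isolated, empty = CPU_MASK_NONE;

		if (!cpu_online(cpu))
			return -EINVAL;

		cpu_set(cpu, cpu_isolated_map);
		cpus_andnot(non_isolated, cpu_online_map, cpu_isolated_map);

		/* One balanced partition over everything not isolated. */
		partition_sched_domains(&non_isolated, &empty);
		return 0;
	}

The corresponding un-isolate operation would simply clear the bit and rebuild the partition the same way.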
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 10:10 ` Paul Jackson 2006-10-18 10:53 ` Robin Holt @ 2006-10-18 12:16 ` Nick Piggin 2006-10-18 14:14 ` Siddha, Suresh B 2006-10-19 6:15 ` Paul Jackson 1 sibling, 2 replies; 24+ messages in thread From: Nick Piggin @ 2006-10-18 12:16 UTC (permalink / raw) To: Paul Jackson Cc: Robin Holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Paul Jackson wrote: > Robin wrote: > >>Could this be as simple as a CPU_UP_PREPARE or CPU_DOWN_PREPARE >>removing all the cpu_exclusive cpuset partitions, and a CPU_UP_CANCELLED, >>CPU_DOWN_CANCELLED, CPU_ONLINE, CPU_DEAD going through and >>re-partitioning all the cpu_exclusive cpusets? > > > Perhaps. > > The somewhat related problems, in my book, are: > > 1) I don't know how to tell what sched domains/groups a system has, nor > how to tell my customers how to see what sched domains they have, and I don't know if you want customers to know what domains they have. I think you should avoid having explicit control over sched-domains in your cpusets completely, and just have the cpusets create partitioned domains whenever it can. > > 2) I suspect that Mr. Cpusets doesn't understand sched domains and that > Mr. Sched Domain doesn't understand cpusets, and that we've ended > up with some inscrutable and likely unsuitable interactions between > the two as a result, which in particular don't result in cpusets > driving the sched domain configuration in the desired ways for some > of the less trivial configs. > > Well ... at least the first suspicion above is a near certainty ;). cpusets is the only thing that messes with sched-domains (excluding the isolcpus -- that seems to require a small change to partition_sched_domains, but forget that for now). And so you should know what partitioning to build at any point when asked. So we could have a call to cpusets at the end of arch_init_sched_domains, which asks for the domains to be partitioned, no? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 12:16 ` Nick Piggin @ 2006-10-18 14:14 ` Siddha, Suresh B 2006-10-18 14:51 ` Nick Piggin 1 sibling, 1 reply; 24+ messages in thread From: Siddha, Suresh B @ 2006-10-18 14:14 UTC (permalink / raw) To: Nick Piggin Cc: Paul Jackson, Robin Holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar On Wed, Oct 18, 2006 at 10:16:50PM +1000, Nick Piggin wrote: > Paul Jackson wrote: > > 1) I don't know how to tell what sched domains/groups a system has, nor Paul, at least for debugging one can know that by defining SCHED_DOMAIN_DEBUG > > how to tell my customers how to see what sched domains they have, and > > I don't know if you want customers to know what domains they have. I think At first glance, I have to agree with Nick. All the customer wants is a mechanism to say "group these cpus together for scheduling"... But looking at how cpusets interact with sched-domains and especially for large systems, it will probably be useful if we export the topology through /sys > cpusets is the only thing that messes with sched-domains (excluding the > isolcpus -- that seems to require a small change to partition_sched_domains, > but forget that for now). > > And so you should know what partitioning to build at any point when asked. > So we could have a call to cpusets at the end of arch_init_sched_domains, > which asks for the domains to be partitioned, no? yes. Robin, Right now everyone is calling arch_init_sched_domains() with cpu_online_map. We can remove this argument and in the presence of cpusets, this routine can go through exclusive cpusets and partition the domains accordingly. Otherwise we can simply build one domain partition with cpu_online_map. thanks, suresh ^ permalink raw reply [flat|nested] 24+ messages in thread
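In code, the suggestion amounts to something like the sketch below (illustrative only; cpuset_rebuild_sched_domains() is an invented name for the hook that would walk the cpu_exclusive cpusets and issue the appropriate partition_sched_domains() calls):

	/* Sketch of the proposed rebuild path after a hotplug event. */
	static void rebuild_sched_domains(void)
	{
	#ifdef CONFIG_CPUSETS
		/* Hypothetical hook into kernel/cpuset.c: re-apply whatever
		 * partitioning the exclusive cpusets currently describe. */
		cpuset_rebuild_sched_domains();
	#else
		/* No cpusets: one domain partition over all online cpus. */
		arch_init_sched_domains(&cpu_online_map);
	#endif
	}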
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 14:14 ` Siddha, Suresh B @ 2006-10-18 14:51 ` Nick Piggin 0 siblings, 0 replies; 24+ messages in thread From: Nick Piggin @ 2006-10-18 14:51 UTC (permalink / raw) To: Siddha, Suresh B Cc: Paul Jackson, Robin Holt, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Siddha, Suresh B wrote: > On Wed, Oct 18, 2006 at 10:16:50PM +1000, Nick Piggin wrote: > >>Paul Jackson wrote: >> >>> 1) I don't know how to tell what sched domains/groups a system has, nor > > > Paul, at least for debugging one can know that by defining SCHED_DOMAIN_DEBUG Yep. This is meant to be useful precisely for things like making cpusets partition the domains properly or ensuring a system's topology is built correctly. >>> how to tell my customers how to see what sched domains they have, and >> >>I don't know if you want customers to know what domains they have. I think > > > At first glance, I have to agree with Nick. All the customer wants is a > mechanism to say "group these cpus together for scheduling"... > > But looking at how cpusets interact with sched-domains and especially for > large systems, it will probably be useful if we export the topology through /sys I'll concede that point. It would probably be useful for a sysadmin to be able to look at how they can better make cpuset placements such that they get the best partitioning. I would still prefer not to say "use an exclusive domain for this cpuset". cpusets should be able to do the optimal thing with the data it has, so this is one less complication to deal with. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 12:16 ` Nick Piggin 2006-10-18 14:14 ` Siddha, Suresh B @ 2006-10-19 6:15 ` Paul Jackson 2006-10-19 6:35 ` Nick Piggin 1 sibling, 1 reply; 24+ messages in thread From: Paul Jackson @ 2006-10-19 6:15 UTC (permalink / raw) To: Nick Piggin Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar > I don't know if you want customers to know what domains they have. I think > you should avoid having explicit control over sched-domains in your cpusets > completely, and just have the cpusets create partitioned domains whenever > it can. We have a choice to make. I am increasingly convinced that the current mechanism linking cpusets with sched domains is busted, allowing people to easily and unsuspectingly set up broken sched domain configs, without even being able to see what they are doing. Certainly that linkage has been confusing to some of us who are not kernel/sched.c experts. Certainly users on production systems cannot see what sched domains they have ended up with. We should either make this linkage explicit and understandable, giving users direct means to construct sched domains and probe what they have done, or we should remove this linkage. My patch to add sched_domain flags to cpusets was an attempt to make this control explicit. I am now 90% convinced that this is the wrong direction, and that the entire chunk of code linking cpu_exclusive cpusets to sched domains should be nuked. The one thing I found so far today that people actually needed from this was that my real-time people needed to be able to do something like marking a cpu isolated. So I think we should have runtime support for manipulating the cpu_isolated_map. I will be sending in a pair of patches shortly to: 1) nuke the cpu_exclusive - sched_domain linkage, and 2) support runtime marking of isolated cpus. Does that sound better to you? -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 6:15 ` Paul Jackson @ 2006-10-19 6:35 ` Nick Piggin 2006-10-19 6:57 ` Paul Jackson 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2006-10-19 6:35 UTC (permalink / raw) To: Paul Jackson Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Paul Jackson wrote: >>I don't know if you want customers to know what domains they have. I think >>you should avoid having explicit control over sched-domains in your cpusets >>completely, and just have the cpusets create partitioned domains whenever >>it can. > > We have a choice to make. I am increasingly convinced that the > current mechanism linking cpusets with sched domains is busted, > allowing people to easily and unsuspectingly set up broken sched domain > configs, without even being able to see what they are doing. > Certainly that linkage has been confusing to some of us who are > not kernel/sched.c experts. Certainly users on production systems > cannot see what sched domains they have ended up with. > > We should either make this linkage explicit and understandable, giving > users direct means to construct sched domains and probe what they have > done, or we should remove this linkage. > > My patch to add sched_domain flags to cpusets was an attempt to > make this control explicit. > > I am now 90% convinced that this is the wrong direction, and that > the entire chunk of code linking cpu_exclusive cpusets to sched > domains should be nuked. > > The one thing I found so far today that people actually needed from > this was that my real-time people needed to be able to do something like > marking a cpu isolated. So I think we should have runtime support for > manipulating the cpu_isolated_map. > > I will be sending in a pair of patches shortly to: > 1) nuke the cpu_exclusive - sched_domain linkage, and > 2) support runtime marking of isolated cpus. > > Does that sound better to you? > I don't understand why you think the "implicit" (as in, not directly user controlled?) linkage is wrong. If it is allowing people to set up busted domains, then the cpusets code is asking for the wrong partitions. Having them explicitly control it is wrong because it is really an implementation detail that could change in the future. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 6:35 ` Nick Piggin @ 2006-10-19 6:57 ` Paul Jackson 2006-10-19 7:04 ` Nick Piggin 0 siblings, 1 reply; 24+ messages in thread From: Paul Jackson @ 2006-10-19 6:57 UTC (permalink / raw) To: Nick Piggin Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Nick wrote: > I don't understand why you think the "implicit" (as in, not directly user > controlled?) linkage is wrong. Twice now I've given the following specific example. I am not yet confident that I have it right, and welcome feedback. However, Suresh has apparently agreed with my conclusion that one can use the current linkage between cpu_exclusive cpusets and sched domains to get unexpected and perhaps undesirable sched domain setups. What's your take on this example: > Example: > > As best as I can tell (which is not very far ;), if some hapless > user does the following: > > /dev/cpuset cpu_exclusive == 1; cpus == 0-7 > /dev/cpuset/a cpu_exclusive == 1; cpus == 0-3 > /dev/cpuset/b cpu_exclusive == 1; cpus == 4-7 > > and then runs a big job in the top cpuset (/dev/cpuset), then that > big job will not load balance correctly, with whatever threads > in the big job got stuck on cpus 0-3 isolated from whatever > threads got stuck on cpus 4-7. > > Is this correct? If I have concluded incorrectly what happens in the above example (good chance) then please educate me on how this stuff works. I should warn you that I have demonstrated a remarkable resistance to being educated on this subject ;). If this interface has no material effect on users' programs, then implicit may well be ok. But if it has a material effect on the behaviour, such as CPU placement or scope of load balancing, of user programs, then I am strongly in favor of making that effect explicit, understandable, and visible at runtime, on production systems. That, or getting rid of the effect, and replacing it with something that is simple, understandable, explicit and visible ... my current plan. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 6:57 ` Paul Jackson @ 2006-10-19 7:04 ` Nick Piggin 2006-10-19 7:33 ` Paul Jackson 2006-10-19 7:34 ` Paul Jackson 0 siblings, 2 replies; 24+ messages in thread From: Nick Piggin @ 2006-10-19 7:04 UTC (permalink / raw) To: Paul Jackson Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Paul Jackson wrote: > Nick wrote: > >>I don't understand why you think the "implicit" (as in, not directly user >>controlled?) linkage is wrong. > > Twice now I've given the following specific example. I am not yet > confident that I have it right, and welcome feedback. Sorry, I skimmed over that. > > However, Suresh has apparently agreed with my conclusion that one > can use the current linkage between cpu_exclusive cpusets and sched > domains to get unexpected and perhaps undesirable sched domain setups. > > What's your take on this example: > > >>Example: >> >> As best as I can tell (which is not very far ;), if some hapless >> user does the following: >> >> /dev/cpuset cpu_exclusive == 1; cpus == 0-7 >> /dev/cpuset/a cpu_exclusive == 1; cpus == 0-3 >> /dev/cpuset/b cpu_exclusive == 1; cpus == 4-7 >> >> and then runs a big job in the top cpuset (/dev/cpuset), then that >> big job will not load balance correctly, with whatever threads >> in the big job got stuck on cpus 0-3 isolated from whatever >> threads got stuck on cpus 4-7. >> >>Is this correct? > > If I have concluded incorrectly what happens in the above example > (good chance) then please educate me on how this stuff works. So that depends on what cpusets asks for. If, when setting up a and b, it asks to partition the domains, then yes, the parent cpuset gets broken. > I should warn you that I have demonstrated a remarkable resistance > to being educated on this subject ;). Don't worry about the whole sched-domains implementation if you just consider that partitioning the domains creates a hard partition among the system's CPUs (but the upshot is that within the partitions, balancing works pretty nicely). So in your above example, cpusets should only ask for a partition of the 0-7 CPUs. If you wanted to get fancy and detect that there are no jobs in the root cpuset, then you could make the two smaller partitions, and revert back to the one bigger one if something gets assigned to it. But that's all a matter of how you want cpusets to manage it; I really don't think a user should control this (we simply shouldn't allow situations where we put a partition in the middle of a cpuset). Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 24+ messages in thread
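A rough sketch of that "get fancy" policy, for concreteness (cpuset_tasks_empty() and add_partition() are invented helpers, and locking, deeper nesting and non-exclusive siblings are all ignored): split a parent's cpus into child partitions only when no tasks are attached to the parent itself, otherwise keep one partition spanning the parent.

	static void choose_partitions(struct cpuset *parent)
	{
		struct cpuset *child;

		if (cpuset_tasks_empty(parent)) {	/* invented helper */
			/* No tasks balance across the parent: safe to split. */
			list_for_each_entry(child, &parent->children, sibling)
				if (is_cpu_exclusive(child))
					add_partition(&child->cpus_allowed);
		} else {
			/* Tasks still live in the parent: keep one partition
			 * spanning all of its cpus so they balance properly. */
			add_partition(&parent->cpus_allowed);
		}
	}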
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 7:04 ` Nick Piggin @ 2006-10-19 7:33 ` Paul Jackson 2006-10-19 8:16 ` Nick Piggin 1 sibling, 1 reply; 24+ messages in thread From: Paul Jackson @ 2006-10-19 7:33 UTC (permalink / raw) To: Nick Piggin Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar > So that depends on what cpusets asks for. If, when setting up a and > b, it asks to partition the domains, then yes, the parent > cpuset gets broken. That probably makes good sense from the sched domain side of things. It is insanely counterintuitive from the cpuset side of things. Using hierarchical cpuset properties to drive this is the wrong approach. In the general case, looking at it (as best I can) from the sched domain side of things, it seems that the sched domains could be defined on a system as follows. Partition the CPUs on the system - into one or more subsets (partitions), non-overlapping, and covering. Each of those partitions can either have a sched domain setup on it, to support load balancing across the CPUs in that partition, or can be isolated, with no load balancing occurring within that partition. No load balancing occurs across partitions. Using cpu_exclusive cpusets for this is next to impossible. It could be approximated perhaps by having just the immediate children of the root cpuset, /dev/cpuset/*, define the partition. But if any lower level cpusets have any effect on the partitioning, by setting their cpu_exclusive flag in the current implementation, it is -always- the case, by the basic structure of the cpuset hierarchy, that the lower level cpuset is a subset of its parent's cpus, and that that parent also has cpu_exclusive set. The resulting partitioning, even in such simple examples as above, is not obvious. If you look back a couple days, when I first presented essentially this example, I got the resulting sched domain partitioning entirely wrong. The essential detail in my big patch of yesterday, to add new specific sched_domain flags to cpusets, is that it -removed- the requirement to mark a parent as defining a sched domain anytime a child defined one. That requirement is one of the defining properties of the cpu_exclusive flag, and makes that flag -outrageously- unsuited for defining sched domain partitions. My new sched_domain flags at least had the right properties, defaults and rules, that they perhaps could have been used to sanely define sched domain partitions. One could mark a few select cpusets, at any depth in the hierarchy, as defining sched domain partitions, without being forced to mark a whole bunch more ancestor cpusets the same way, slicing and dicing the sched domain partitions into hamburger. However, fortunately, ... so far as I can tell ... no one needs the general case described above, of multiple sched domain partitions. So far as I know, the only essential special case that user land needs to deal with is to isolate one partition (one subset of CPUs) from any scheduler load balancing. Every CPU is either load balanced however kernel/sched.c chooses to load balance it, with potentially every other non-isolated CPU, or is in the isolated partition (cpu_isolated_map) and not considered for load balancing. Have I missed any case requiring explicit user intervention? -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 7:33 ` Paul Jackson @ 2006-10-19 8:16 ` Nick Piggin 2006-10-19 8:31 ` Paul Jackson 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2006-10-19 8:16 UTC (permalink / raw) To: Paul Jackson Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Paul Jackson wrote: >>So that depends on what cpusets asks for. If, when setting up a and >>b, it asks to partition the domains, then yes, the parent >>cpuset gets broken. > > That probably makes good sense from the sched domain side of things. > > It is insanely counterintuitive from the cpuset side of things. > > Using hierarchical cpuset properties to drive this is the wrong > approach. > > In the general case, looking at it (as best I can) from the sched > domain side of things, it seems that the sched domains could be > defined on a system as follows. > > Partition the CPUs on the system - into one or more subsets > (partitions), non-overlapping, and covering. > > Each of those partitions can either have a sched domain setup on > it, to support load balancing across the CPUs in that partition, > or can be isolated, with no load balancing occurring within that > partition. > > No load balancing occurs across partitions. Correct. But you don't have to treat isolated CPUs differently - they are just the degenerate case of a partition of 1 CPU. I assume cpusets could create similar "isolated" domains where no balancing takes place. > Using cpu_exclusive cpusets for this is next to impossible. It could > be approximated perhaps by having just the immediate children of the > root cpuset, /dev/cpuset/*, define the partition. Fine. > But if any lower level cpusets have any effect on the partitioning, > by setting their cpu_exclusive flag in the current implementation, > it is -always- the case, by the basic structure of the cpuset > hierarchy, that the lower level cpuset is a subset of its parent's > cpus, and that that parent also has cpu_exclusive set. > > The resulting partitioning, even in such simple examples as above, is > not obvious. If you look back a couple days, when I first presented > essentially this example, I got the resulting sched domain partitioning > entirely wrong. > > The essential detail in my big patch of yesterday, to add new specific > sched_domain flags to cpusets, is that it -removed- the requirement to > mark a parent as defining a sched domain anytime a child defined one. > > That requirement is one of the defining properties of the cpu_exclusive > flag, and makes that flag -outrageously- unsuited for defining sched > domain partitions. So make the new rule "cpu_exclusive && direct-child-of-root-cpuset". Your problems go away, and they haven't been pushed to userspace. If a user wants to, for some crazy reason, have a set of cpu_exclusive sets deep in the cpuset hierarchy, such that no load balancing happens between them... just tell them they can't; they should just make those cpusets children of the root. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 24+ messages in thread
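As a sketch (illustrative only, not a tested patch), the rule being proposed could be expressed as a single predicate in kernel/cpuset.c, with the partitioning code acting only on cpusets for which it returns true:

	/* A cpuset defines a sched domain partition only if it is
	 * cpu_exclusive AND a direct child of the root cpuset, so a
	 * partition can never be carved out of the middle of a deeper
	 * hierarchy. */
	static int cpuset_defines_partition(const struct cpuset *cs)
	{
		return is_cpu_exclusive(cs) && cs->parent == &top_cpuset;
	}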
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 8:16 ` Nick Piggin @ 2006-10-19 8:31 ` Paul Jackson 0 siblings, 0 replies; 24+ messages in thread From: Paul Jackson @ 2006-10-19 8:31 UTC (permalink / raw) To: Nick Piggin Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar > So make the new rule "cpu_exclusive && direct-child-of-root-cpuset". > Your problems go away, and they haven't been pushed to userspace. I don't know of anyone who has need for this feature. Do you? If you do - good - let's consider them anew. If such needs arise, I doubt I would recommend meeting them with the cpu_exclusive flag, in any way, shape or form. That would probably not be a particularly clear and intuitive interface for whatever it was we needed. > If a user wants to, for some crazy reason, have a set of cpu_exclusive > sets deep in the cpuset hierarchy, such that no load balancing happens > between them... just tell them they can't; they should just make those > cpusets children of the root. I have no problem telling users what the limits are on mechanisms. I have serious problems trying to push mechanisms on them that I couldn't understand until after repeated attempts over many months, that are counterintuitive and dangerous (at least unless such odd rules are imposed) to use, and that provide no useful feedback to the user as to what they are doing. It doesn't increase my sympathy for this code that, of all the cpuset code, it has been my biggest source of customer maintenance costs, due to a couple of serious bugs. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 7:04 ` Nick Piggin 2006-10-19 7:33 ` Paul Jackson @ 2006-10-19 7:34 ` Paul Jackson 2006-10-19 8:07 ` Nick Piggin 1 sibling, 1 reply; 24+ messages in thread From: Paul Jackson @ 2006-10-19 7:34 UTC (permalink / raw) To: Nick Piggin Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Nick wrote: > (we simply shouldn't allow > situations where we put a partition in the middle of a cpuset). Could you explain to me what you mean by "put a partition in the middle of a cpuset?" -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 7:34 ` Paul Jackson @ 2006-10-19 8:07 ` Nick Piggin 2006-10-19 8:11 ` Paul Jackson 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2006-10-19 8:07 UTC (permalink / raw) To: Paul Jackson Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Paul Jackson wrote: > Nick wrote: > >>(we simply shouldn't allow >>situations where we put a partition in the middle of a cpuset). > > > Could you explain to me what you mean by "put a partition in the > middle of a cpuset?" > Your example, if a partition is created for each of the sub cpusets. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 8:07 ` Nick Piggin @ 2006-10-19 8:11 ` Paul Jackson 2006-10-19 8:22 ` Nick Piggin 0 siblings, 1 reply; 24+ messages in thread From: Paul Jackson @ 2006-10-19 8:11 UTC (permalink / raw) To: Nick Piggin Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar > Paul Jackson wrote: > > Nick wrote: > > > >>(we simply shouldn't allow > >>situations where we put a partition in the middle of a cpuset). > > > > > > Could you explain to me what you mean by "put a partition in the > > middle of a cpuset?" > > > > Your example, if a partition is created for each of the sub cpusets. The thing "we simply shouldn't allow", then, is the bread and butter of cpusets. I am convinced that we are trying to pound nails with toothpicks. The cpu_exclusive flag was the wrong flag to overload to define sched domains. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 8:11 ` Paul Jackson @ 2006-10-19 8:22 ` Nick Piggin 2006-10-19 8:42 ` Paul Jackson 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2006-10-19 8:22 UTC (permalink / raw) To: Paul Jackson Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Paul Jackson wrote: >>Paul Jackson wrote: >> >>>Nick wrote: >>> >>> >>>>(we simply shouldn't allow >>>>situations where we put a partition in the middle of a cpuset). >>> >>> >>>Could you explain to me what you mean by "put a partition in the >>>middle of a cpuset?" >>> >> >>Your example, if a partition is created for each of the sub cpusets. > > > The thing "we simply shouldn't allow", then, is the bread and > butter of cpusets. No. They can put a cpuset there all they like. But the cpuset code should *not* put a partition there. That is all. > > I am convinced that we are trying to pound nails with toothpicks. > > The cpu_exclusive flag was the wrong flag to overload to define > sched domains. Well it is the correct flag if we only create the domain for the oldest ancestor with the cpu_exclusive flag set. From the documentation: "A cpuset may be marked exclusive, which ensures that no other cpuset (except direct ancestors and descendents) may contain any overlapping CPUs or Memory Nodes." It is this non overlapping property that we can take advantage of, and partition the scheduler. Obviously, the exception (from the POV of the oldest ancestor) is its descendents, which can be overlapping. So just don't create partitions for those guys. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-19 8:22 ` Nick Piggin @ 2006-10-19 8:42 ` Paul Jackson 0 siblings, 0 replies; 24+ messages in thread From: Paul Jackson @ 2006-10-19 8:42 UTC (permalink / raw) To: Nick Piggin Cc: holt, suresh.b.siddha, dino, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar Nick wrote: > It is this non overlapping property that we can take advantage of, and > partition the scheduler. You want non-overlapping versus all other CPUs on the system. You want to partition the system's CPUs, in the mathematical sense of the word 'partition', a non-overlapping cover. Fine. That's an honorable goal. But cpu_exclusive gives you non-overlapping versus sibling cpusets. Wrong tool for the job. Close - sounded right - has that nice long word 'exclusive' in there somewhere. Wrong one, however. It made good sense to anyone who came at this from the kernel/sched.c side, as it was obvious to them what was needed. To myself and my cpuset users, it made no bleeping sense whatsoever. What actual needs do we have here? Let's figure that out, then if that leads to adding mechanism of the right shape to fit the needs, fine. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 2:25 exclusive cpusets broken with cpu hotplug Siddha, Suresh B 2006-10-18 7:14 ` Paul Jackson @ 2006-10-18 17:54 ` Dinakar Guniguntala 2006-10-18 18:05 ` Paul Jackson 1 sibling, 1 reply; 24+ messages in thread From: Dinakar Guniguntala @ 2006-10-18 17:54 UTC (permalink / raw) To: Siddha, Suresh B Cc: pj, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar, nickpiggin On Tue, Oct 17, 2006 at 07:25:48PM -0700, Siddha, Suresh B wrote: > Whenever a cpu hotplug happens, the current kernel calls build_sched_domains() > with cpu_online_map. That will destroy all the domain partitions (done by > partition_sched_domains()) set up so far by exclusive cpusets. > > And it's not just cpu hotplug; this happens even if someone changes the multi-core > sched power-savings policy. > > Would anyone like to fix it up? In the presence of cpusets, we basically > need to traverse all the exclusive sets and set up the sched domains > accordingly. > > If no one does :( then I will do that when I get some time... Suresh, I have a patch (though a very old one...) for handling hotplug and cpusets. However there were some ugly locking issues and nesting of locks that I ran into and I never got the time to sort them out. Also there didn't seem to be any users for it and so I had no motivation to further complicate the cpusets code/sched domains code. However, I can dust off the patches if there is a need. -Dinakar ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: exclusive cpusets broken with cpu hotplug 2006-10-18 17:54 ` Dinakar Guniguntala @ 2006-10-18 18:05 ` Paul Jackson 0 siblings, 0 replies; 24+ messages in thread From: Paul Jackson @ 2006-10-18 18:05 UTC (permalink / raw) To: dino Cc: suresh.b.siddha, menage, Simon.Derr, linux-kernel, mbligh, rohitseth, dipankar, nickpiggin Dinakar wrote: > I have a patch (though a very old one...) for handling hotplug and cpusets. > However there were some ugly locking issues and nesting of locks ... The interaction of cpusets and hotplug should be in good shape. Look in kernel/cpuset.c for CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG, and you will see the two routines to call for cpu and memory hotplug events, cpuset_handle_cpuhp() and cpuset_track_online_nodes(). The problem area is the interaction of dynamic sched domains and cpusets with hot plug events. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 24+ messages in thread