* RT sched: cpupri_vec lock contention with def_root_domain and no load balance
@ 2008-11-03 21:07 Dimitri Sivanich
2008-11-03 22:33 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Dimitri Sivanich @ 2008-11-03 21:07 UTC (permalink / raw)
To: linux-kernel; +Cc: Ingo Molnar
When load balancing gets switched off for a set of cpus via the
sched_load_balance flag in cpusets, those cpus wind up with the
globally defined def_root_domain attached. The def_root_domain is
attached when partition_sched_domains calls detach_destroy_domains().
A new root_domain is never allocated or attached, since
__build_sched_domains() will never attach a sched domain to the
non-load-balanced processors.
The problem with this scenario is that on systems with a large number
of processors with load balancing switched off, we start to see the
cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
This starts to become much more apparent above 8 waking RT threads
(with each RT thread running on its own cpu, blocking and waking up
continuously).
I'm wondering if this is, in fact, the way things were meant to work,
or whether we should have a root domain allocated for each cpu that is
not to be part of a sched domain. Note that the def_root_domain spans
all of the non-load-balanced cpus in this case. Having it attached to
cpus that should not be load balancing doesn't quite make sense to me.
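For reference, the kind of cpuset configuration that produces this situation can be sketched roughly as follows (the mount point, cpu range, and memory node below are illustrative only, not taken from the actual test setup):

```shell
# Sketch only: carve out a set of cpus and turn off load balancing for
# them via the cpuset sched_load_balance flag (requires CONFIG_CPUSETS
# and root; cpu/node numbers are made up for illustration).
mkdir -p /dev/cpuset
mount -t cpuset none /dev/cpuset

mkdir /dev/cpuset/isolated
echo 4-15 > /dev/cpuset/isolated/cpus   # the non-balanced cpus
echo 0    > /dev/cpuset/isolated/mems

# Load balancing off in both the root set and the child set; the cpus
# are then detached from their sched domains and, as described above,
# all end up attached to the shared def_root_domain.
echo 0 > /dev/cpuset/sched_load_balance
echo 0 > /dev/cpuset/isolated/sched_load_balance
```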
Here's where we've often seen this lock contention occur:
0xa0000001006df1e0 _spin_lock_irqsave+0x40
args (0xa000000101f8e1c8)
0xa00000010014b150 cpupri_set+0x290
args (0x16, 0x2c, 0x16, 0xa000000101f8e1c8, 0xa000000101f8b518, 0x1, 0x2c,
0xa000000100092ee0, 0x48c)
0xa000000100092ee0 __enqueue_rt_entity+0x300
args (0xe00000b4730401a0, 0xe0000b300316b510, 0xe0000b300316ba10, 0x500,
0xe0000b300316b518, 0x50, 0xa000000100093bc0, 0x286, 0x4f)
0xa000000100093bc0 enqueue_rt_entity+0xe0
args (0xe00000b4730401a0, 0x0, 0xa000000100093c50, 0x307, 0xe00000b4730401a0)
0xa000000100093c50 enqueue_task_rt+0x30
args (0xe0000b300316b400, 0xe00000b473040000, 0x1, 0xa0000001000848d0, 0x309,
0xa000000101122134)
0xa0000001000848d0 enqueue_task+0xd0
args (0xe0000b300316b400, 0xe00000b473040000, 0x1, 0xa000000100084ba0, 0x309,
0xa0000001013079b0)
0xa000000100084ba0 activate_task+0x60
args (0xe0000b300316b400, 0xe00000b473040000, 0x1, 0xa00000010009a270, 0x58e,
0xa000000100099ec0)
0xa00000010009a270 try_to_wake_up+0x530
args (0xe00000b473040000, 0x1, 0xe0000b300316b400, 0x49c6, 0xe0000b300316bc10,
0xe0000b300316bcac, 0xe00000b473040078, 0xe0000b300316bc38, 0xa00000010009a4d0)
0xa00000010009a4d0 wake_up_process+0x30
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-03 21:07 RT sched: cpupri_vec lock contention with def_root_domain and no load balance Dimitri Sivanich @ 2008-11-03 22:33 ` Peter Zijlstra 2008-11-04 1:29 ` Dimitri Sivanich 2008-11-04 3:53 ` Gregory Haskins 0 siblings, 2 replies; 29+ messages in thread From: Peter Zijlstra @ 2008-11-03 22:33 UTC (permalink / raw) To: Dimitri Sivanich; +Cc: linux-kernel, Ingo Molnar, Gregory Haskins On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote: > When load balancing gets switched off for a set of cpus via the > sched_load_balance flag in cpusets, those cpus wind up with the > globally defined def_root_domain attached. The def_root_domain is > attached when partition_sched_domains calls detach_destroy_domains(). > A new root_domain is never allocated or attached as a sched domain > will never be attached by __build_sched_domains() for the non-load > balanced processors. > > The problem with this scenario is that on systems with a large number > of processors with load balancing switched off, we start to see the > cpupri->pri_to_cpu->lock in the def_root_domain becoming contended. > This starts to become much more apparent above 8 waking RT threads > (with each RT thread running on it's own cpu, blocking and waking up > continuously). > > I'm wondering if this is, in fact, the way things were meant to work, > or should we have a root domain allocated for each cpu that is not to > be part of a sched domain? Note the the def_root_domain spans all of > the non-load-balanced cpus in this case. Having it attached to cpus > that should not be load balancing doesn't quite make sense to me. It shouldn't be like that, each load-balance domain (in your case a single cpu) should get its own root domain. Gregory? > Here's where we've often seen this lock contention occur: what's this horrible output from? 
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-03 22:33 ` Peter Zijlstra @ 2008-11-04 1:29 ` Dimitri Sivanich 2008-11-04 3:53 ` Gregory Haskins 1 sibling, 0 replies; 29+ messages in thread From: Dimitri Sivanich @ 2008-11-04 1:29 UTC (permalink / raw) To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Gregory Haskins On Mon, Nov 03, 2008 at 11:33:23PM +0100, Peter Zijlstra wrote: > On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote: > > When load balancing gets switched off for a set of cpus via the > > sched_load_balance flag in cpusets, those cpus wind up with the > > globally defined def_root_domain attached. The def_root_domain is > > attached when partition_sched_domains calls detach_destroy_domains(). > > A new root_domain is never allocated or attached as a sched domain > > will never be attached by __build_sched_domains() for the non-load > > balanced processors. > > > > The problem with this scenario is that on systems with a large number > > of processors with load balancing switched off, we start to see the > > cpupri->pri_to_cpu->lock in the def_root_domain becoming contended. > > This starts to become much more apparent above 8 waking RT threads > > (with each RT thread running on it's own cpu, blocking and waking up > > continuously). > > > > I'm wondering if this is, in fact, the way things were meant to work, > > or should we have a root domain allocated for each cpu that is not to > > be part of a sched domain? Note the the def_root_domain spans all of > > the non-load-balanced cpus in this case. Having it attached to cpus > > that should not be load balancing doesn't quite make sense to me. > > It shouldn't be like that, each load-balance domain (in your case a > single cpu) should get its own root domain. Gregory? > > > Here's where we've often seen this lock contention occur: > > what's this horrible output from? This output is a stack backtrace from KDB. 
KDB entry is triggered after too much time elapses prior to thread wakeup. The traces pointed to this lock. To further test that theory, we hacked up a change to create root_domains for each cpu and the max thread wakeup times improved. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-03 22:33 ` Peter Zijlstra 2008-11-04 1:29 ` Dimitri Sivanich @ 2008-11-04 3:53 ` Gregory Haskins 2008-11-04 14:34 ` Gregory Haskins 1 sibling, 1 reply; 29+ messages in thread From: Gregory Haskins @ 2008-11-04 3:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Dimitri Sivanich, linux-kernel, Ingo Molnar, Gregory Haskins [-- Attachment #1: Type: text/plain, Size: 2063 bytes --] Peter Zijlstra wrote: > On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote: > >> When load balancing gets switched off for a set of cpus via the >> sched_load_balance flag in cpusets, those cpus wind up with the >> globally defined def_root_domain attached. The def_root_domain is >> attached when partition_sched_domains calls detach_destroy_domains(). >> A new root_domain is never allocated or attached as a sched domain >> will never be attached by __build_sched_domains() for the non-load >> balanced processors. >> >> The problem with this scenario is that on systems with a large number >> of processors with load balancing switched off, we start to see the >> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended. >> This starts to become much more apparent above 8 waking RT threads >> (with each RT thread running on it's own cpu, blocking and waking up >> continuously). >> >> I'm wondering if this is, in fact, the way things were meant to work, >> or should we have a root domain allocated for each cpu that is not to >> be part of a sched domain? Note the the def_root_domain spans all of >> the non-load-balanced cpus in this case. Having it attached to cpus >> that should not be load balancing doesn't quite make sense to me. >> > > It shouldn't be like that, each load-balance domain (in your case a > single cpu) should get its own root domain. Gregory? > Yeah, this sounds broken. 
I know that the root-domain code was being developed coincident to some upheaval with the cpuset code, so I suspect something may have been broken from the original intent. I will take a look. -Greg > >> Here's where we've often seen this lock contention occur: >> > > what's this horrible output from? > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-04 3:53 ` Gregory Haskins @ 2008-11-04 14:34 ` Gregory Haskins 2008-11-04 14:36 ` Peter Zijlstra 0 siblings, 1 reply; 29+ messages in thread From: Gregory Haskins @ 2008-11-04 14:34 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Dimitri Sivanich, linux-kernel, Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 2433 bytes --] Gregory Haskins wrote: > Peter Zijlstra wrote: > >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote: >> >> >>> When load balancing gets switched off for a set of cpus via the >>> sched_load_balance flag in cpusets, those cpus wind up with the >>> globally defined def_root_domain attached. The def_root_domain is >>> attached when partition_sched_domains calls detach_destroy_domains(). >>> A new root_domain is never allocated or attached as a sched domain >>> will never be attached by __build_sched_domains() for the non-load >>> balanced processors. >>> >>> The problem with this scenario is that on systems with a large number >>> of processors with load balancing switched off, we start to see the >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended. >>> This starts to become much more apparent above 8 waking RT threads >>> (with each RT thread running on it's own cpu, blocking and waking up >>> continuously). >>> >>> I'm wondering if this is, in fact, the way things were meant to work, >>> or should we have a root domain allocated for each cpu that is not to >>> be part of a sched domain? Note the the def_root_domain spans all of >>> the non-load-balanced cpus in this case. Having it attached to cpus >>> that should not be load balancing doesn't quite make sense to me. >>> >>> >> It shouldn't be like that, each load-balance domain (in your case a >> single cpu) should get its own root domain. Gregory? >> >> > > Yeah, this sounds broken. 
I know that the root-domain code was being > developed coincident to some upheaval with the cpuset code, so I suspect > something may have been broken from the original intent. I will take a > look. > > -Greg > > After thinking about it some more, I am not quite sure what to do here. The root-domain code was really designed to be 1:1 with a disjoint cpuset. In this case, it sounds like all the non-balanced cpus are still in one default cpuset. In that case, the code is correct to place all those cores in the singleton def_root_domain. The question really is: How do we support the sched_load_balance flag better? I suppose we could go through the scheduler code and have it check that flag before consulting the root-domain. Another alternative is to have the sched_load_balance=false flag create a disjoint cpuset. Any thoughts? -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-04 14:34 ` Gregory Haskins @ 2008-11-04 14:36 ` Peter Zijlstra 2008-11-04 14:40 ` Dimitri Sivanich ` (2 more replies) 0 siblings, 3 replies; 29+ messages in thread From: Peter Zijlstra @ 2008-11-04 14:36 UTC (permalink / raw) To: Gregory Haskins; +Cc: Dimitri Sivanich, linux-kernel, Ingo Molnar On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote: > Gregory Haskins wrote: > > Peter Zijlstra wrote: > > > >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote: > >> > >> > >>> When load balancing gets switched off for a set of cpus via the > >>> sched_load_balance flag in cpusets, those cpus wind up with the > >>> globally defined def_root_domain attached. The def_root_domain is > >>> attached when partition_sched_domains calls detach_destroy_domains(). > >>> A new root_domain is never allocated or attached as a sched domain > >>> will never be attached by __build_sched_domains() for the non-load > >>> balanced processors. > >>> > >>> The problem with this scenario is that on systems with a large number > >>> of processors with load balancing switched off, we start to see the > >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended. > >>> This starts to become much more apparent above 8 waking RT threads > >>> (with each RT thread running on it's own cpu, blocking and waking up > >>> continuously). > >>> > >>> I'm wondering if this is, in fact, the way things were meant to work, > >>> or should we have a root domain allocated for each cpu that is not to > >>> be part of a sched domain? Note the the def_root_domain spans all of > >>> the non-load-balanced cpus in this case. Having it attached to cpus > >>> that should not be load balancing doesn't quite make sense to me. > >>> > >>> > >> It shouldn't be like that, each load-balance domain (in your case a > >> single cpu) should get its own root domain. Gregory? 
> >> > >> > > > > Yeah, this sounds broken. I know that the root-domain code was being > > developed coincident to some upheaval with the cpuset code, so I suspect > > something may have been broken from the original intent. I will take a > > look. > > > > -Greg > > > > > > After thinking about it some more, I am not quite sure what to do here. > The root-domain code was really designed to be 1:1 with a disjoint > cpuset. In this case, it sounds like all the non-balanced cpus are > still in one default cpuset. In that case, the code is correct to place > all those cores in the singleton def_root_domain. The question really > is: How do we support the sched_load_balance flag better? > > I suppose we could go through the scheduler code and have it check that > flag before consulting the root-domain. Another alternative is to have > the sched_load_balance=false flag create a disjoint cpuset. Any thoughts? Hmm, but you cannot disable load-balance on a cpu without placing it in a cpuset first, right? Or are folks disabling load-balance bottom-up, instead of top-down? In that case, I think we should dis-allow that. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-04 14:36 ` Peter Zijlstra @ 2008-11-04 14:40 ` Dimitri Sivanich 2008-11-04 14:59 ` Gregory Haskins 2008-11-04 14:45 ` Dimitri Sivanich 2008-11-06 9:13 ` Nish Aravamudan 2 siblings, 1 reply; 29+ messages in thread From: Dimitri Sivanich @ 2008-11-04 14:40 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Gregory Haskins, linux-kernel, Ingo Molnar On Tue, Nov 04, 2008 at 03:36:33PM +0100, Peter Zijlstra wrote: > On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote: > > Gregory Haskins wrote: > > > Peter Zijlstra wrote: > > > > > >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote: > > >> > > >> > > >>> When load balancing gets switched off for a set of cpus via the > > >>> sched_load_balance flag in cpusets, those cpus wind up with the > > >>> globally defined def_root_domain attached. The def_root_domain is > > >>> attached when partition_sched_domains calls detach_destroy_domains(). > > >>> A new root_domain is never allocated or attached as a sched domain > > >>> will never be attached by __build_sched_domains() for the non-load > > >>> balanced processors. > > >>> > > >>> The problem with this scenario is that on systems with a large number > > >>> of processors with load balancing switched off, we start to see the > > >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended. > > >>> This starts to become much more apparent above 8 waking RT threads > > >>> (with each RT thread running on it's own cpu, blocking and waking up > > >>> continuously). > > >>> > > >>> I'm wondering if this is, in fact, the way things were meant to work, > > >>> or should we have a root domain allocated for each cpu that is not to > > >>> be part of a sched domain? Note the the def_root_domain spans all of > > >>> the non-load-balanced cpus in this case. Having it attached to cpus > > >>> that should not be load balancing doesn't quite make sense to me. 
> > >>> > > >>> > > >> It shouldn't be like that, each load-balance domain (in your case a > > >> single cpu) should get its own root domain. Gregory? > > >> > > >> > > > > > > Yeah, this sounds broken. I know that the root-domain code was being > > > developed coincident to some upheaval with the cpuset code, so I suspect > > > something may have been broken from the original intent. I will take a > > > look. > > > > > > -Greg > > > > > > > > > > After thinking about it some more, I am not quite sure what to do here. > > The root-domain code was really designed to be 1:1 with a disjoint > > cpuset. In this case, it sounds like all the non-balanced cpus are > > still in one default cpuset. In that case, the code is correct to place > > all those cores in the singleton def_root_domain. The question really > > is: How do we support the sched_load_balance flag better? > > > > I suppose we could go through the scheduler code and have it check that > > flag before consulting the root-domain. Another alternative is to have > > the sched_load_balance=false flag create a disjoint cpuset. Any thoughts? > > Hmm, but you cannot disable load-balance on a cpu without placing it in > an cpuset first, right? > > Or are folks disabling load-balance bottom-up, instead of top-down? > > In that case, I think we should dis-allow that. When I see this behavior, I am creating cpusets containing these non load balancing cpus. Whether I create a single cpuset for each one, or one cpuset for all of them, the root domain ends up being the def_root_domain with no sched domain attached once I set both the root cpuset and created cpuset's sched_load_balance flags to 0. ^ permalink raw reply [flat|nested] 29+ messages in thread
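The per-cpu variant of the setup described here might look roughly like this (purely illustrative paths and cpu numbers; one single-cpu cpuset per non-balanced cpu):

```shell
# Sketch: one cpuset per isolated cpu, sched_load_balance cleared in
# the root and in every child. Per the report above, the cpus
# nevertheless all wind up sharing def_root_domain.
mount -t cpuset none /dev/cpuset 2>/dev/null || true
echo 0 > /dev/cpuset/sched_load_balance
for cpu in 4 5 6 7; do
    mkdir /dev/cpuset/cpu$cpu
    echo $cpu > /dev/cpuset/cpu$cpu/cpus
    echo 0    > /dev/cpuset/cpu$cpu/mems
    echo 0    > /dev/cpuset/cpu$cpu/sched_load_balance
done
```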
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-04 14:40 ` Dimitri Sivanich @ 2008-11-04 14:59 ` Gregory Haskins 2008-11-19 19:49 ` Max Krasnyansky 0 siblings, 1 reply; 29+ messages in thread From: Gregory Haskins @ 2008-11-04 14:59 UTC (permalink / raw) To: Dimitri Sivanich; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 3621 bytes --] Dimitri Sivanich wrote: > On Tue, Nov 04, 2008 at 03:36:33PM +0100, Peter Zijlstra wrote: > >> On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote: >> >>> Gregory Haskins wrote: >>> >>>> Peter Zijlstra wrote: >>>> >>>> >>>>> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote: >>>>> >>>>> >>>>> >>>>>> When load balancing gets switched off for a set of cpus via the >>>>>> sched_load_balance flag in cpusets, those cpus wind up with the >>>>>> globally defined def_root_domain attached. The def_root_domain is >>>>>> attached when partition_sched_domains calls detach_destroy_domains(). >>>>>> A new root_domain is never allocated or attached as a sched domain >>>>>> will never be attached by __build_sched_domains() for the non-load >>>>>> balanced processors. >>>>>> >>>>>> The problem with this scenario is that on systems with a large number >>>>>> of processors with load balancing switched off, we start to see the >>>>>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended. >>>>>> This starts to become much more apparent above 8 waking RT threads >>>>>> (with each RT thread running on it's own cpu, blocking and waking up >>>>>> continuously). >>>>>> >>>>>> I'm wondering if this is, in fact, the way things were meant to work, >>>>>> or should we have a root domain allocated for each cpu that is not to >>>>>> be part of a sched domain? Note the the def_root_domain spans all of >>>>>> the non-load-balanced cpus in this case. Having it attached to cpus >>>>>> that should not be load balancing doesn't quite make sense to me. 
>>>>>> >>>>>> >>>>>> >>>>> It shouldn't be like that, each load-balance domain (in your case a >>>>> single cpu) should get its own root domain. Gregory? >>>>> >>>>> >>>>> >>>> Yeah, this sounds broken. I know that the root-domain code was being >>>> developed coincident to some upheaval with the cpuset code, so I suspect >>>> something may have been broken from the original intent. I will take a >>>> look. >>>> >>>> -Greg >>>> >>>> >>>> >>> After thinking about it some more, I am not quite sure what to do here. >>> The root-domain code was really designed to be 1:1 with a disjoint >>> cpuset. In this case, it sounds like all the non-balanced cpus are >>> still in one default cpuset. In that case, the code is correct to place >>> all those cores in the singleton def_root_domain. The question really >>> is: How do we support the sched_load_balance flag better? >>> >>> I suppose we could go through the scheduler code and have it check that >>> flag before consulting the root-domain. Another alternative is to have >>> the sched_load_balance=false flag create a disjoint cpuset. Any thoughts? >>> >> Hmm, but you cannot disable load-balance on a cpu without placing it in >> an cpuset first, right? >> >> Or are folks disabling load-balance bottom-up, instead of top-down? >> >> In that case, I think we should dis-allow that. >> > > When I see this behavior, I am creating cpusets containing these non load balancing cpus. Whether I create a single cpuset for each one, or one cpuset for all of them, the root domain ends up being the def_root_domain with no sched domain attached once I set both the root cpuset and created cpuset's sched_load_balance flags to 0. > > If you tried creating different cpusets and it still had them all end up in the def_root_domain, something is very broken indeed. I will take a look. 
-Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-04 14:59 ` Gregory Haskins @ 2008-11-19 19:49 ` Max Krasnyansky 2008-11-19 19:55 ` Dimitri Sivanich 2008-11-19 20:25 ` Gregory Haskins 0 siblings, 2 replies; 29+ messages in thread From: Max Krasnyansky @ 2008-11-19 19:49 UTC (permalink / raw) To: Gregory Haskins Cc: Dimitri Sivanich, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar Gregory Haskins wrote: > If you tried creating different cpusets and it still had them all end up > in the def_root_domain, something is very broken indeed. I will take a > look. I believe that's the intended behaviour. We always put cpus that are not balanced into null sched domains. This was done since day one (ie when cpuisol= option was introduced) and cpusets just followed the same convention. I think the idea is that we want to make balancer a noop on those processors. We could change cpusets code to create a root sched domain for each cpu I guess. But can we maybe scale cpupri some other way ? Max ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-19 19:49 ` Max Krasnyansky @ 2008-11-19 19:55 ` Dimitri Sivanich 2008-11-19 20:17 ` Max Krasnyansky 0 siblings, 1 reply; 29+ messages in thread From: Dimitri Sivanich @ 2008-11-19 19:55 UTC (permalink / raw) To: Max Krasnyansky Cc: Gregory Haskins, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar On Wed, Nov 19, 2008 at 11:49:36AM -0800, Max Krasnyansky wrote: > I think the idea is that we want to make balancer a noop on those processors. Ultimately, making the balancer a noop on processors with load balancing turned off would be the best solution. > We could change cpusets code to create a root sched domain for each cpu I > guess. But can we maybe scale cpupri some other way ? It doesn't make sense to me that they'd have a root domain attached that spans more of the system than that cpu. > > Max ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-19 19:55 ` Dimitri Sivanich @ 2008-11-19 20:17 ` Max Krasnyansky 2008-11-19 20:21 ` Dimitri Sivanich 0 siblings, 1 reply; 29+ messages in thread From: Max Krasnyansky @ 2008-11-19 20:17 UTC (permalink / raw) To: Dimitri Sivanich Cc: Gregory Haskins, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar Dimitri Sivanich wrote: > On Wed, Nov 19, 2008 at 11:49:36AM -0800, Max Krasnyansky wrote: >> I think the idea is that we want to make balancer a noop on those processors. > > Ultimately, making the balancer a noop on processors with load balancing turned off would be the best solution. Yes. I forgot to point out that if we do change cpusets to generate sched domain per cpu we want to make sure that balancer is still a noop just like it is today with the null sched domain. >> We could change cpusets code to create a root sched domain for each cpu I >> guess. But can we maybe scale cpupri some other way ? > > It doesn't make sense to me that they'd have a root domain attached that spans more of the the system than that cpu. I think 'root' in this case is a bit of a misnomer. What I meant is that each non-balanced cpu would be in a separate sched domain. Max ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-19 20:17 ` Max Krasnyansky @ 2008-11-19 20:21 ` Dimitri Sivanich 0 siblings, 0 replies; 29+ messages in thread From: Dimitri Sivanich @ 2008-11-19 20:21 UTC (permalink / raw) To: Max Krasnyansky Cc: Gregory Haskins, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar On Wed, Nov 19, 2008 at 12:17:38PM -0800, Max Krasnyansky wrote: > > > Dimitri Sivanich wrote: > > On Wed, Nov 19, 2008 at 11:49:36AM -0800, Max Krasnyansky wrote: > >> I think the idea is that we want to make balancer a noop on those processors. > > > > Ultimately, making the balancer a noop on processors with load balancing turned off would be the best solution. > Yes. I forgot to point out that if we do change cpusets to generate sched > domain per cpu we want to make sure that balancer is still a noop just like it > is today with the null sched domain. Sorry, I meant root_domain per cpu, not sched domain. Having NULL sched domains for these cpus is fine. > > >> We could change cpusets code to create a root sched domain for each cpu I > >> guess. But can we maybe scale cpupri some other way ? > > > > It doesn't make sense to me that they'd have a root domain attached that spans more of the the system than that cpu. > I think 'root' in this case is a bit of a misnomer. What I meant is that each > non-balanced cpu would be in a separate sched domain. I think a NULL sched domain, as it is now, is fine. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance 2008-11-19 19:49 ` Max Krasnyansky 2008-11-19 19:55 ` Dimitri Sivanich @ 2008-11-19 20:25 ` Gregory Haskins 2008-11-19 20:33 ` Dimitri Sivanich 2008-11-20 2:12 ` Max Krasnyansky 1 sibling, 2 replies; 29+ messages in thread From: Gregory Haskins @ 2008-11-19 20:25 UTC (permalink / raw) To: Max Krasnyansky Cc: Dimitri Sivanich, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 2061 bytes --] Max Krasnyansky wrote: > Gregory Haskins wrote: > >> If you tried creating different cpusets and it still had them all end up >> in the def_root_domain, something is very broken indeed. I will take a >> look. >> > > I beleive that's the intended behaviour. Heh...well, as the guy that wrote root-domains, I can definitively say that is not the behavior that I personally intended ;) > We always put cpus that are not > balanced into null sched domains. This was done since day one (ie when > cpuisol= option was introduced) and cpusets just followed the same convention. > It sounds like the problem with my code is that "null sched domain" translates into "default root-domain" which is understandably unexpected by Dimitri (and myself). Really I intended root-domains to become associated with each exclusive/disjoint cpuset that is created. In a way, non-balanced/isolated cpus could be modeled as an exclusive cpuset with one member, but that is somewhat beyond the scope of the root-domain code as it stands today. My primary concern was that Dimitri reports that even creating a disjoint cpuset per cpu does not yield an isolated root-domain per cpu. Rather they all end up in the default root-domain, and this is not what I intended at all. However, as a secondary goal it would be nice to somehow directly support the "no-load-balance" option without requiring explicit exclusive per-cpu cpusets to do it.
The proper mechanism (IMHO) to scope the scheduler to a subset of cpus (including only "self") is root-domains so I would prefer to see the solution based on that. However, today there is a rather tight coupling of root-domains and cpusets, so this coupling would likely have to be relaxed a little bit to get there. There are certainly other ways to solve the problem as well. But seeing as how I intended root-domains to represent the effective partition scope of the scheduler, this seems like a natural fit in my mind until it's proven to me otherwise. Regards, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
2008-11-19 20:25 ` Gregory Haskins
@ 2008-11-19 20:33 ` Dimitri Sivanich
2008-11-19 21:30 ` Gregory Haskins
2008-11-19 22:25 ` Gregory Haskins
2008-11-20 2:12 ` Max Krasnyansky
1 sibling, 2 replies; 29+ messages in thread
From: Dimitri Sivanich @ 2008-11-19 20:33 UTC (permalink / raw)
To: Gregory Haskins
Cc: Max Krasnyansky, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

On Wed, Nov 19, 2008 at 03:25:15PM -0500, Gregory Haskins wrote:
> It sounds like the problem with my code is that "null sched domain"
> translates into "default root-domain", which is understandably unexpected
> by Dimitri (and myself). Really I intended root-domains to become
> associated with each exclusive/disjoint cpuset that is created. In a
> way, non-balanced/isolated cpus could be modeled as an exclusive cpuset
> with one member, but that is somewhat beyond the scope of the

Actually, at one time, that is how things were set up. Setting the
cpu_exclusive bit on a single cpu cpuset would isolate that cpu from
load balancing.

> root-domain code as it stands today. My primary concern was that
> Dimitri reports that even creating a disjoint cpuset per cpu does not
> yield an isolated root-domain per cpu. Rather they all end up in the
> default root-domain, and this is not what I intended at all.
>
> SNIP
>
> There are certainly other ways to solve the problem as well. But seeing
> as how I intended root-domains to represent the effective partition
> scope of the scheduler, this seems like a natural fit in my mind until
> it's proven to me otherwise.

Agreed.
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
2008-11-19 20:33 ` Dimitri Sivanich
@ 2008-11-19 21:30 ` Gregory Haskins
2008-11-19 21:47 ` Dimitri Sivanich
2008-11-19 22:25 ` Gregory Haskins
1 sibling, 1 reply; 29+ messages in thread
From: Gregory Haskins @ 2008-11-19 21:30 UTC (permalink / raw)
To: Dimitri Sivanich
Cc: Max Krasnyansky, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Dimitri Sivanich wrote:
> On Wed, Nov 19, 2008 at 03:25:15PM -0500, Gregory Haskins wrote:
>> It sounds like the problem with my code is that "null sched domain"
>> translates into "default root-domain", which is understandably unexpected
>> by Dimitri (and myself).
>>
>> SNIP
>
> Actually, at one time, that is how things were set up. Setting the
> cpu_exclusive bit on a single cpu cpuset would isolate that cpu from
> load balancing.

Do you know if this was pre or post the root-domain code? Here is a
reference to the commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=57d885fea0da0e9541d7730a9e1dcf734981a173

A bisection that shows when this last worked for you would be much
appreciated if you have the time, Dimitri.

Regards,
-Greg
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
2008-11-19 21:30 ` Gregory Haskins
@ 2008-11-19 21:47 ` Dimitri Sivanich
0 siblings, 0 replies; 29+ messages in thread
From: Dimitri Sivanich @ 2008-11-19 21:47 UTC (permalink / raw)
To: Gregory Haskins
Cc: Max Krasnyansky, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

On Wed, Nov 19, 2008 at 04:30:08PM -0500, Gregory Haskins wrote:
> Dimitri Sivanich wrote:
>> Actually, at one time, that is how things were set up. Setting the
>> cpu_exclusive bit on a single cpu cpuset would isolate that cpu from
>> load balancing.
>
> Do you know if this was pre or post the root-domain code? Here is a
> reference to the commit:

It was pre root-domain. That behavior was replaced by the addition of the
sched_load_balance flag with the following commit (though it was actually
removed even earlier):

http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=029190c515f15f512ac85de8fc686d4dbd0ae731

> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=57d885fea0da0e9541d7730a9e1dcf734981a173
>
> A bisection that shows when this last worked for you would be very
> appreciated if you have the time, Dimitri.
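The two generations of the interface discussed above can be sketched in shell form. This is a hedged sketch rather than anything taken from the thread: the "iso3" cpuset name is made up, CPUSET_ROOT defaults to a scratch directory so the commands are safe to try (on a live system it would point at the cpuset mount), and the cpu_exclusive behavior shown only isolated a cpu on kernels before commit 029190c5.

```shell
# Hypothetical walkthrough; CPUSET_ROOT defaults to a scratch directory
# so nothing here touches a real cpuset hierarchy.
CPUSET_ROOT="${CPUSET_ROOT:-$(mktemp -d)}"

# Old style (pre-029190c5): marking a single-cpu cpuset exclusive
# also pulled that cpu out of load balancing.
mkdir -p "$CPUSET_ROOT/iso3"
echo 3 > "$CPUSET_ROOT/iso3/cpuset.cpus"
echo 1 > "$CPUSET_ROOT/iso3/cpuset.cpu_exclusive"

# New style: load balancing is controlled explicitly by the
# sched_load_balance flag on the parent cpuset.
echo 0 > "$CPUSET_ROOT/cpuset.sched_load_balance"
```

On a real cpuset mount the kernel provides these control files; in the scratch directory they are just plain files, which is enough to see the shape of both interfaces.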
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
2008-11-19 20:33 ` Dimitri Sivanich
2008-11-19 21:30 ` Gregory Haskins
@ 2008-11-19 22:25 ` Gregory Haskins
1 sibling, 0 replies; 29+ messages in thread
From: Gregory Haskins @ 2008-11-19 22:25 UTC (permalink / raw)
To: Dimitri Sivanich
Cc: Max Krasnyansky, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Dimitri Sivanich wrote:
> On Wed, Nov 19, 2008 at 03:25:15PM -0500, Gregory Haskins wrote:
>> It sounds like the problem with my code is that "null sched domain"
>> translates into "default root-domain", which is understandably unexpected
>> by Dimitri (and myself). Really I intended root-domains to become
>> associated with each exclusive/disjoint cpuset that is created. In a
>> way, non-balanced/isolated cpus could be modeled as an exclusive cpuset
>> with one member, but that is somewhat beyond the scope of the
>
> Actually, at one time, that is how things were set up. Setting the
> cpu_exclusive bit on a single cpu cpuset would isolate that cpu from
> load balancing.

Re-reading my post made me realize what I said above was confusing. The
"that" in "but that is somewhat beyond the scope" was meant to be
"explicit/direct support for the no-balance flag". However, it perhaps
sounded like I was talking about exclusive cpusets with singleton
membership. Exclusive cpusets are the original raison d'être for
root-domains. ;)

Therefore I agree that the exclusive cpuset portion should work (but seems
to be broken, thus the bug report). My primary goal is to fix this issue.
However, I would also like to *add* support for the no-balance flag as a
secondary goal. It's just that this is a new feature from my perspective,
so it may take some additional work to figure out what needs to be done.

HTH, and sorry for the confusion.

-Greg
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
2008-11-19 20:25 ` Gregory Haskins
2008-11-19 20:33 ` Dimitri Sivanich
@ 2008-11-20 2:12 ` Max Krasnyansky
2008-11-21 1:57 ` Gregory Haskins
1 sibling, 1 reply; 29+ messages in thread
From: Max Krasnyansky @ 2008-11-20 2:12 UTC (permalink / raw)
To: Gregory Haskins
Cc: Dimitri Sivanich, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Gregory Haskins wrote:
> Max Krasnyansky wrote:
>> We always put cpus that are not balanced into null sched domains. This
>> was done since day one (ie when the cpuisol= option was introduced) and
>> cpusets just followed the same convention.
>
> It sounds like the problem with my code is that "null sched domain"
> translates into "default root-domain", which is understandably unexpected
> by Dimitri (and myself).
>
> SNIP
>
> But seeing as how I intended root-domains to represent the effective
> partition scope of the scheduler, this seems like a natural fit in my
> mind until it's proven to me otherwise.

Since I was working on cpuisol updates, I decided to stick some debug
printks around and test a few scenarios. I'm basically printing the
cpumasks generated for each cpuset and the address of the root domain. My
conclusion is that everything is working as expected; I do not think we
need to fix anything in this area.

btw The cpu_exclusive flag has no impact on the sched domains stuff. I'm
not sure why it was mentioned in this context.

Here comes a long set of traces based on different cpuset setups. This is
an 8-core dual Xeon (L5410) box running a 2.6.27.6 kernel. All scenarios
assume:

mount -t cgroup -ocpuset cpuset /cpusets
cd /cpusets

----
Trace 1
$ echo 0 > cpuset.sched_load_balance

[ 1674.811610] cpusets: rebuild ndoms 0
[ 1674.811627] CPU0 root domain default
[ 1674.811629] CPU0 attaching NULL sched-domain.
[ 1674.811633] CPU1 root domain default
[ 1674.811635] CPU1 attaching NULL sched-domain.
[ 1674.811638] CPU2 root domain default
[ 1674.811639] CPU2 attaching NULL sched-domain.
[ 1674.811642] CPU3 root domain default
[ 1674.811643] CPU3 attaching NULL sched-domain.
[ 1674.811646] CPU4 root domain default
[ 1674.811647] CPU4 attaching NULL sched-domain.
[ 1674.811649] CPU5 root domain default
[ 1674.811651] CPU5 attaching NULL sched-domain.
[ 1674.811653] CPU6 root domain default
[ 1674.811655] CPU6 attaching NULL sched-domain.
[ 1674.811657] CPU7 root domain default
[ 1674.811659] CPU7 attaching NULL sched-domain.

Looks fine.
----
Trace 2
$ echo 1 > cpuset.sched_load_balance

[ 1748.260637] cpusets: rebuild ndoms 1
[ 1748.260648] cpuset: domain 0 cpumask ff
[ 1748.260650] CPU0 root domain ffff88025884a000
[ 1748.260652] CPU0 attaching sched-domain:
[ 1748.260654]  domain 0: span 0-7 level CPU
[ 1748.260656]   groups: 0 1 2 3 4 5 6 7
[ 1748.260665] CPU1 root domain ffff88025884a000
[ 1748.260666] CPU1 attaching sched-domain:
[ 1748.260668]  domain 0: span 0-7 level CPU
[ 1748.260670]   groups: 1 2 3 4 5 6 7 0
[ 1748.260677] CPU2 root domain ffff88025884a000
[ 1748.260679] CPU2 attaching sched-domain:
[ 1748.260681]  domain 0: span 0-7 level CPU
[ 1748.260683]   groups: 2 3 4 5 6 7 0 1
[ 1748.260690] CPU3 root domain ffff88025884a000
[ 1748.260692] CPU3 attaching sched-domain:
[ 1748.260693]  domain 0: span 0-7 level CPU
[ 1748.260696]   groups: 3 4 5 6 7 0 1 2
[ 1748.260703] CPU4 root domain ffff88025884a000
[ 1748.260705] CPU4 attaching sched-domain:
[ 1748.260706]  domain 0: span 0-7 level CPU
[ 1748.260708]   groups: 4 5 6 7 0 1 2 3
[ 1748.260715] CPU5 root domain ffff88025884a000
[ 1748.260717] CPU5 attaching sched-domain:
[ 1748.260718]  domain 0: span 0-7 level CPU
[ 1748.260720]   groups: 5 6 7 0 1 2 3 4
[ 1748.260727] CPU6 root domain ffff88025884a000
[ 1748.260729] CPU6 attaching sched-domain:
[ 1748.260731]  domain 0: span 0-7 level CPU
[ 1748.260733]   groups: 6 7 0 1 2 3 4 5
[ 1748.260740] CPU7 root domain ffff88025884a000
[ 1748.260742] CPU7 attaching sched-domain:
[ 1748.260743]  domain 0: span 0-7 level CPU
[ 1748.260745]   groups: 7 0 1 2 3 4 5 6

Looks perfect.
----
Trace 3
$ for i in 0 1 2 3 4 5 6 7; do mkdir par$i; echo $i > par$i/cpuset.cpus; done
$ echo 0 > cpuset.sched_load_balance

[ 1803.485838] cpusets: rebuild ndoms 1
[ 1803.485843] cpuset: domain 0 cpumask ff
[ 1803.486953] cpusets: rebuild ndoms 1
[ 1803.486957] cpuset: domain 0 cpumask ff
[ 1803.488039] cpusets: rebuild ndoms 1
[ 1803.488044] cpuset: domain 0 cpumask ff
[ 1803.489046] cpusets: rebuild ndoms 1
[ 1803.489056] cpuset: domain 0 cpumask ff
[ 1803.490306] cpusets: rebuild ndoms 1
[ 1803.490312] cpuset: domain 0 cpumask ff
[ 1803.491464] cpusets: rebuild ndoms 1
[ 1803.491474] cpuset: domain 0 cpumask ff
[ 1803.492617] cpusets: rebuild ndoms 1
[ 1803.492622] cpuset: domain 0 cpumask ff
[ 1803.493758] cpusets: rebuild ndoms 1
[ 1803.493763] cpuset: domain 0 cpumask ff
[ 1835.135245] cpusets: rebuild ndoms 8
[ 1835.135249] cpuset: domain 0 cpumask 80
[ 1835.135251] cpuset: domain 1 cpumask 40
[ 1835.135253] cpuset: domain 2 cpumask 20
[ 1835.135254] cpuset: domain 3 cpumask 10
[ 1835.135256] cpuset: domain 4 cpumask 08
[ 1835.135259] cpuset: domain 5 cpumask 04
[ 1835.135261] cpuset: domain 6 cpumask 02
[ 1835.135263] cpuset: domain 7 cpumask 01
[ 1835.135279] CPU0 root domain default
[ 1835.135281] CPU0 attaching NULL sched-domain.
[ 1835.135286] CPU1 root domain default
[ 1835.135288] CPU1 attaching NULL sched-domain.
[ 1835.135291] CPU2 root domain default
[ 1835.135294] CPU2 attaching NULL sched-domain.
[ 1835.135297] CPU3 root domain default
[ 1835.135299] CPU3 attaching NULL sched-domain.
[ 1835.135303] CPU4 root domain default
[ 1835.135305] CPU4 attaching NULL sched-domain.
[ 1835.135308] CPU5 root domain default
[ 1835.135311] CPU5 attaching NULL sched-domain.
[ 1835.135314] CPU6 root domain default
[ 1835.135316] CPU6 attaching NULL sched-domain.
[ 1835.135319] CPU7 root domain default
[ 1835.135322] CPU7 attaching NULL sched-domain.
[ 1835.192509] CPU7 root domain ffff88025884a000
[ 1835.192512] CPU7 attaching NULL sched-domain.
[ 1835.192518] CPU6 root domain ffff880258849000
[ 1835.192521] CPU6 attaching NULL sched-domain.
[ 1835.192526] CPU5 root domain ffff880258848800
[ 1835.192530] CPU5 attaching NULL sched-domain.
[ 1835.192536] CPU4 root domain ffff88025884c000
[ 1835.192539] CPU4 attaching NULL sched-domain.
[ 1835.192544] CPU3 root domain ffff88025884c800
[ 1835.192547] CPU3 attaching NULL sched-domain.
[ 1835.192553] CPU2 root domain ffff88025884f000
[ 1835.192556] CPU2 attaching NULL sched-domain.
[ 1835.192561] CPU1 root domain ffff88025884d000
[ 1835.192565] CPU1 attaching NULL sched-domain.
[ 1835.192570] CPU0 root domain ffff88025884b000
[ 1835.192573] CPU0 attaching NULL sched-domain.

Looks perfectly fine too. Notice how each cpu ended up in a different
root_domain.

----
Trace 4
$ rmdir par*
$ echo 1 > cpuset.sched_load_balance

This trace looks the same as #2. Again all is fine.

----
Trace 5
$ mkdir par0
$ echo 0-3 > par0/cpuset.cpus
$ echo 0 > cpuset.sched_load_balance

[ 2204.382352] cpusets: rebuild ndoms 1
[ 2204.382358] cpuset: domain 0 cpumask ff
[ 2213.142995] cpusets: rebuild ndoms 1
[ 2213.143000] cpuset: domain 0 cpumask 0f
[ 2213.143005] CPU0 root domain default
[ 2213.143006] CPU0 attaching NULL sched-domain.
[ 2213.143011] CPU1 root domain default
[ 2213.143013] CPU1 attaching NULL sched-domain.
[ 2213.143017] CPU2 root domain default
[ 2213.143021] CPU2 attaching NULL sched-domain.
[ 2213.143026] CPU3 root domain default
[ 2213.143030] CPU3 attaching NULL sched-domain.
[ 2213.143035] CPU4 root domain default
[ 2213.143039] CPU4 attaching NULL sched-domain.
[ 2213.143044] CPU5 root domain default
[ 2213.143048] CPU5 attaching NULL sched-domain.
[ 2213.143053] CPU6 root domain default
[ 2213.143057] CPU6 attaching NULL sched-domain.
[ 2213.143062] CPU7 root domain default
[ 2213.143066] CPU7 attaching NULL sched-domain.
[ 2213.181261] CPU0 root domain ffff8802589eb000
[ 2213.181265] CPU0 attaching sched-domain:
[ 2213.181267]  domain 0: span 0-3 level CPU
[ 2213.181275]   groups: 0 1 2 3
[ 2213.181293] CPU1 root domain ffff8802589eb000
[ 2213.181297] CPU1 attaching sched-domain:
[ 2213.181302]  domain 0: span 0-3 level CPU
[ 2213.181309]   groups: 1 2 3 0
[ 2213.181327] CPU2 root domain ffff8802589eb000
[ 2213.181332] CPU2 attaching sched-domain:
[ 2213.181336]  domain 0: span 0-3 level CPU
[ 2213.181343]   groups: 2 3 0 1
[ 2213.181366] CPU3 root domain ffff8802589eb000
[ 2213.181370] CPU3 attaching sched-domain:
[ 2213.181373]  domain 0: span 0-3 level CPU
[ 2213.181384]   groups: 3 0 1 2

Looks perfectly fine too. CPU0-3 are in root domain ffff8802589eb000. The
rest are in def_root_domain.

-----
Trace 6
$ mkdir par1
$ echo 4-5 > par1/cpuset.cpus

[ 2752.979008] cpusets: rebuild ndoms 2
[ 2752.979014] cpuset: domain 0 cpumask 30
[ 2752.979016] cpuset: domain 1 cpumask 0f
[ 2752.979024] CPU4 root domain ffff8802589ec800
[ 2752.979028] CPU4 attaching sched-domain:
[ 2752.979032]  domain 0: span 4-5 level CPU
[ 2752.979039]   groups: 4 5
[ 2752.979052] CPU5 root domain ffff8802589ec800
[ 2752.979056] CPU5 attaching sched-domain:
[ 2752.979060]  domain 0: span 4-5 level CPU
[ 2752.979071]   groups: 5 4

Looks correct too. CPUs 4 and 5 got added to a new root domain
ffff8802589ec800 and nothing else changed.

-----

So. I think the only action item is for me to update 'syspart' to create a
cpuset for each isolated cpu to avoid putting a bunch of cpus into the
default root domain. Everything else looks perfectly fine.

btw We should probably rename 'root_domain' to something else to avoid
confusion, i.e. most people assume that there should be only one
root_domain. Maybe something like 'base_domain'?

Also we should probably commit those prints that I added and enable them
under SCHED_DEBUG. Right now we're just printing sched_domains and it's
not clear which root_domain they belong to.
Max
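Max's action item (one cpuset per isolated cpu, so the cpus stop sharing def_root_domain) can be sketched as a small loop mirroring the par$i commands from Trace 3 above. This is not the actual syspart code; CPUSET_ROOT and NCPUS are illustrative knobs, and CPUSET_ROOT defaults to a scratch directory so the sketch can be tried without a cpuset mount.

```shell
# Sketch of a syspart-like setup step: one cpuset per cpu, then drop
# top-level balancing so each par$i lands in its own root domain
# instead of all cpus sharing def_root_domain.
CPUSET_ROOT="${CPUSET_ROOT:-$(mktemp -d)}"
NCPUS="${NCPUS:-8}"

i=0
while [ "$i" -lt "$NCPUS" ]; do
    mkdir -p "$CPUSET_ROOT/par$i"
    echo "$i" > "$CPUSET_ROOT/par$i/cpuset.cpus"
    i=$((i + 1))
done

# Disabling top-level balancing last, as in Trace 3, triggers the
# domain rebuild that attaches each single-cpu cpuset separately.
echo 0 > "$CPUSET_ROOT/cpuset.sched_load_balance"
```

On a live system, point CPUSET_ROOT at the real cpuset mount (e.g. /cpusets) and run as root.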
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
2008-11-20 2:12 ` Max Krasnyansky
@ 2008-11-21 1:57 ` Gregory Haskins
2008-11-21 20:04 ` Max Krasnyansky
0 siblings, 1 reply; 29+ messages in thread
From: Gregory Haskins @ 2008-11-21 1:57 UTC (permalink / raw)
To: Max Krasnyansky
Cc: Dimitri Sivanich, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Hi Max,

Max Krasnyansky wrote:
> Here comes a long set of traces based on different cpuset setups. This is
> an 8-core dual Xeon (L5410) box running a 2.6.27.6 kernel. All scenarios
> assume:
>
> mount -t cgroup -ocpuset cpuset /cpusets
> cd /cpusets

Thank you for doing this. Comments inline...

> ----
> Trace 1
> $ echo 0 > cpuset.sched_load_balance
>
> SNIP
>
> Looks fine.

I have to agree. The code is working "as designed" here, since I do not
support the sched_load_balance=0 mode yet. While technically not a bug, a
new feature to add support for it would be nice. :)

> ----
> Trace 2
> $ echo 1 > cpuset.sched_load_balance
>
> SNIP
>
> Looks perfect.

Yep.
> ----
> Trace 3
> $ for i in 0 1 2 3 4 5 6 7; do mkdir par$i; echo $i > par$i/cpuset.cpus; done
> $ echo 0 > cpuset.sched_load_balance
>
> SNIP
>
> Looks perfectly fine too. Notice how each cpu ended up in a different
> root_domain.

Yep, I concur. This is how I intended it to work. However, Dimitri reports
that this is not working for him, and this is what piqued my interest and
drove the creation of a BZ report.

Dimitri, can you share your cpuset configuration with us, and also re-run
both it and Max's approach (assuming they differ) on your end to confirm
the problem still exists? Max, perhaps you can post the patch with your
debugging instrumentation so we can equally see what happens on Dimitri's
side?

> ----
> Trace 4
> $ rmdir par*
> $ echo 1 > cpuset.sched_load_balance
>
> This trace looks the same as #2. Again all is fine.
>
> ----
> Trace 5
> $ mkdir par0
> $ echo 0-3 > par0/cpuset.cpus
> $ echo 0 > cpuset.sched_load_balance
>
> SNIP
>
> Looks perfectly fine too. CPU0-3 are in root domain ffff8802589eb000. The
> rest are in def_root_domain.
>
> -----
> Trace 6
> $ mkdir par1
> $ echo 4-5 > par1/cpuset.cpus
>
> SNIP
>
> Looks correct too. CPUs 4 and 5 got added to a new root domain
> ffff8802589ec800 and nothing else changed.
>
> -----
>
> So. I think the only action item is for me to update 'syspart' to create a
> cpuset for each isolated cpu to avoid putting a bunch of cpus into the
> default root domain. Everything else looks perfectly fine.

I agree. We just need to make sure Dimitri can reproduce these findings on
his side, to make sure it is not something like a different cpuset
configuration that causes the problem. If you can, Max, could you also add
the rd->span to the instrumentation, just so we can verify that it is
scoped appropriately?

> btw We should probably rename 'root_domain' to something else to avoid
> confusion, i.e. most people assume that there should be only one
> root_domain. Maybe something like 'base_domain'?

Agreed, but that is already true (depending on your perspective ;). I
chose "root-domain" as short for root-sched-domain (meaning the top-most
sched-domain in the hierarchy). There is only one root-domain per
run-queue. There can be multiple root-domains per system. The former is
how I intended it to be considered, and I think in this context "root" is
appropriate. Just as you could consider that every Linux box has a root
filesystem, there can be multiple root filesystems that exist on, say, a
single HDD. It's simply a context to govern/scope the rq behavior.

Early iterations of my patches actually had the rd pointer hanging off the
top sched-domain structure. This perhaps reinforced the concept of "root"
and made the reasoning for the chosen name more apparent. However, I
quickly realized that there was no advantage to walking up the sd
hierarchy to find "root" and thus the rd pointer...you could effectively
hang the pointer on the rq directly for the same result and with less
overhead. So I moved it in the later patches, which were ultimately
accepted.

I don't feel strongly about the name either way, however.
So if people have a name they prefer and the consensus is that it's less
confusing, I am fine with that.

> Also we should probably commit those prints that I added and enable them
> under SCHED_DEBUG. Right now we're just printing sched_domains and it's
> not clear which root_domain they belong to.

Yes, please do! (And please add the rd->span as indicated earlier, if you
would be so kind. ;)

If Dimitri can reproduce your findings, we can close out the bug as FAD
(functions as designed) and create a new-feature request for the
sched_load_balance flag. In the meantime, the workaround for the new
feature is to use per-cpu exclusive cpusets, which it sounds like can be
supported by your syspart tool.

Thanks Max,
-Greg
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
2008-11-21 1:57 ` Gregory Haskins
@ 2008-11-21 20:04 ` Max Krasnyansky
2008-11-21 21:18 ` Dimitri Sivanich
0 siblings, 1 reply; 29+ messages in thread
From: Max Krasnyansky @ 2008-11-21 20:04 UTC (permalink / raw)
To: Gregory Haskins
Cc: Dimitri Sivanich, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Hi Greg,

I attached a debug instrumentation patch for Dimitri to try. I'll clean it
up, add the things you requested, and resubmit properly some time next
week. More comments inline.

Gregory Haskins wrote:
> Max Krasnyansky wrote:
>> ----
>> Trace 1
>> $ echo 0 > cpuset.sched_load_balance
>>
>> SNIP
>>
>> Looks fine.
>
> I have to agree. The code is working "as designed" here, since I do not
> support the sched_load_balance=0 mode yet. While technically not a bug,
> a new feature to add support for it would be nice :)

Hmm, I'm not sure what would be a better way to support this. I see it as
a transitional state where CPUs are not assigned to any cpuset and/or
domain.
So it seems to be perfectly acceptable to put them into the default root domain. You could allocate an rd for each of those CPUs, but it seems that in most cases they won't be useful, because as the very next action cpusets will create some real domains. In other words, as long as people and tools are aware of the possible lock contention in this case, we're ok. I already updated my 'syspart' tool to create a cpuset for each isolated cpu.

>> ----
>> Trace 3
>> $ for i in 0 1 2 3 4 5 6 7; do mkdir par$i; echo $i > par$i/cpuset.cpus; done
>> $ echo 0 > cpuset.sched_load_balance
>>
>> SNIP
>>
>> Looks perfectly fine too. Notice how each cpu ended up in a different root_domain.
>>
>
> Yep, I concur. This is how I intended it to work. However, Dimitri
> reports that this is not working for him and this is what piqued my
> interest and drove the creation of a BZ report.
>
> Dimitri, can you share your cpuset configuration with us, and also
> re-run both it and Max's approach (assuming they differ) on your end to
> confirm the problem still exists? Max, perhaps you can post the patch
> with your debugging instrumentation so we can equally see what happens
> on Dimitri's side?

Attached.

>> So. I think the only action item is for me to update 'syspart' to create a
>> cpuset for each isolated cpu to avoid putting a bunch of cpus into the default
>> root domain. Everything else looks perfectly fine.
>>
>
> I agree. We just need to make sure Dimitri can reproduce these findings
> on his side to make sure it is not something like a different cpuset
> configuration that causes the problem. If you can, Max, could you also
> add the rd->span to the instrumentation just so we can verify that it is
> scoped appropriately?

Will do.

>> btw We should probably rename 'root_domain' to something else to avoid
>> confusion. ie Most people assume that there should be only one root_domain.
>>
>
> Agreed, but that is already true (depending on your perspective ;) I
> chose "root-domain" as short for root-sched-domain (meaning the top-most
> sched-domain in the hierarchy). There is only one root-domain per
> run-queue. There can be multiple root-domains per system. The former
> is how I intended it to be considered, and I think in this context
> "root" is appropriate. Just as you could consider that every Linux box
> has a root filesystem, but there can be multiple root filesystems that
> exist on, say, a single HDD for example. It's simply a context to
> govern/scope the rq behavior.
>
> Early iterations of my patches actually had the rd pointer hanging off
> the top sched-domain structure. This perhaps reinforced the
> concept of "root" and thus allowed the reasoning for the chosen name to
> be more apparent. However, I quickly realized that there was no
> advantage to walking up the sd hierarchy to find "root" and thus the rd
> pointer... you could effectively hang the pointer on the rq directly for
> the same result and with less overhead. So I moved it in the later
> patches which were ultimately accepted.
>
> I don't feel strongly about the name either way, however. So if people
> have a name they prefer and the consensus is that it's less confusing, I
> am fine with that.

I do not feel strongly about this one either :)

>> Also we should probably commit those prints that I added and enable them under
>> SCHED_DEBUG. Right now we're just printing sched_domains and it's not clear
>> which root_domain they belong to.
>>
>
> Yes, please do! (and please add the rd->span as indicated earlier, if
> you would be so kind ;)

btw I'm also thinking of adding CONFIG_SCHED_VERBOSE_DOMAIN_MSG and moving those printks under that option. CONFIG_SCHED_DEBUG brings in a lot of other things that add overhead in the fast path, whereas this stuff does not. It is actually very useful to see those messages in general.
They are useful even on laptops and stuff, during suspend for example, because cpusets and domains are rebuilt as we bring CPUs up and down.

> If Dimitri can reproduce your findings, we can close out the bug as FAD
> and create a new-feature request for the sched_load_balance flag. In
> the meantime, the workaround for the new feature is to use per-cpu
> exclusive cpusets which it sounds can be supported by your syspart tool.

Yes, I added that to 'syspart' already, and as I explained above I think that's where it belongs (ie userspace tools). I guess we could change cpuset.c:generate_sched_domains() to generate a domain for each cpu that is not in any cpuset, but since nothing really breaks (no crashes or anything) if it does not, I'd leave it up to userspace.

Max

[-- Attachment #2: sched_domain_debug.patch --]
[-- Type: text/plain, Size: 1910 bytes --]

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 827cd9a..b94a6de 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -760,6 +760,16 @@ static void do_rebuild_sched_domains(struct work_struct *unused)
 	ndoms = generate_sched_domains(&doms, &attr);
 	cgroup_unlock();
 
+	printk(KERN_INFO "cpusets: rebuild ndoms %u\n", ndoms);
+	if (doms) {
+		char str[128];
+		int i;
+		for (i=0; i < ndoms; i++) {
+			cpumask_scnprintf(str, sizeof(str), *(doms + i));
+			printk(KERN_INFO "cpuset: domain %u cpumask %s\n", i, str);
+		}
+	}
+
 	/* Have scheduler rebuild the domains */
 	partition_sched_domains(ndoms, doms, attr);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index ad1962d..7833224 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6647,11 +6647,16 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 	return 0;
 }
 
-static void sched_domain_debug(struct sched_domain *sd, int cpu)
+static void sched_domain_debug(struct root_domain *rd, struct sched_domain *sd, int cpu)
 {
 	cpumask_t *groupmask;
 	int level = 0;
 
+	if (rd == &def_root_domain)
+		printk(KERN_DEBUG "CPU%d root domain default\n", cpu);
+	else
+		printk(KERN_DEBUG "CPU%d root domain %p\n", cpu, rd);
+
 	if (!sd) {
 		printk(KERN_DEBUG "CPU%d attaching NULL sched-domain.\n", cpu);
 		return;
@@ -6676,7 +6681,7 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu)
 	kfree(groupmask);
 }
 #else /* !CONFIG_SCHED_DEBUG */
-# define sched_domain_debug(sd, cpu) do { } while (0)
+# define sched_domain_debug(rd, sd, cpu) do { } while (0)
 #endif /* CONFIG_SCHED_DEBUG */
 
 static int sd_degenerate(struct sched_domain *sd)
@@ -6819,7 +6824,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 		sd->child = NULL;
 	}
 
-	sched_domain_debug(sd, cpu);
+	sched_domain_debug(rd, sd, cpu);
 
 	rq_attach_root(rq, rd);
 	rcu_assign_pointer(rq->sd, sd);

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
  2008-11-21 20:04             ` Max Krasnyansky
@ 2008-11-21 21:18               ` Dimitri Sivanich
  2008-11-22  7:03                 ` Max Krasnyansky
  0 siblings, 1 reply; 29+ messages in thread
From: Dimitri Sivanich @ 2008-11-21 21:18 UTC (permalink / raw)
To: Gregory Haskins, Max Krasnyansky
Cc: Derek Fults, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Hi Greg and Max,

On Fri, Nov 21, 2008 at 12:04:25PM -0800, Max Krasnyansky wrote:
> Hi Greg,
>
> I attached debug instrumentation patch for Dmitri to try. I'll clean it up and
> add things you requested and will resubmit properly some time next week.
>

We added Max's debug patch to our kernel and have run Max's Trace 3 scenario, but we do not see a NULL sched-domain remain attached; see my comments below.

mount -t cgroup cpuset -ocpuset /cpusets/

for i in 0 1 2 3; do mkdir par$i; echo $i > par$i/cpuset.cpus; done

kernel: cpusets: rebuild ndoms 1
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpusets: rebuild ndoms 1
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpusets: rebuild ndoms 1
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpusets: rebuild ndoms 1
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0

echo 0 > cpuset.sched_load_balance

kernel: cpusets: rebuild ndoms 4
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 1 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 2 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 3 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: CPU0 root domain default
kernel: CPU0 attaching NULL sched-domain.
kernel: CPU1 root domain default
kernel: CPU1 attaching NULL sched-domain.
kernel: CPU2 root domain default
kernel: CPU2 attaching NULL sched-domain.
kernel: CPU3 root domain default
kernel: CPU3 attaching NULL sched-domain.
kernel: CPU3 root domain e0000069ecb20000
kernel: CPU3 attaching sched-domain:
kernel:  domain 0: span 3 level NODE
kernel:   groups: 3
kernel: CPU2 root domain e000006884a00000
kernel: CPU2 attaching sched-domain:
kernel:  domain 0: span 2 level NODE
kernel:   groups: 2
kernel: CPU1 root domain e000006884a20000
kernel: CPU1 attaching sched-domain:
kernel:  domain 0: span 1 level NODE
kernel:   groups: 1
kernel: CPU0 root domain e000006884a40000
kernel: CPU0 attaching sched-domain:
kernel:  domain 0: span 0 level NODE
kernel:   groups: 0

Which is the way sched_load_balance is supposed to work. You need to set sched_load_balance=0 for all cpusets containing any cpu you want to disable balancing on, otherwise some balancing will happen. So in addition to the top (root) cpuset, we need to set it to '0' in the parX cpusets. That will turn off load balancing for the cpus in question (thereby attaching a NULL sched domain).
So when we do that for just par3, we get the following:

echo 0 > par3/cpuset.sched_load_balance

kernel: cpusets: rebuild ndoms 3
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 1 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 2 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: CPU3 root domain default
kernel: CPU3 attaching NULL sched-domain.

So the def_root_domain is now attached for CPU 3. And we do have a NULL sched-domain, which we expect for a cpu with load balancing turned off. If we turn sched_load_balance off ('0') on each of the other cpusets (par0-2), each of those cpus would also have a NULL sched-domain attached.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
  2008-11-21 21:18               ` Dimitri Sivanich
@ 2008-11-22  7:03                 ` Max Krasnyansky
  2008-11-22  8:18                   ` Li Zefan
  0 siblings, 1 reply; 29+ messages in thread
From: Max Krasnyansky @ 2008-11-22  7:03 UTC (permalink / raw)
To: Dimitri Sivanich
Cc: Gregory Haskins, Derek Fults, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Dimitri Sivanich wrote:
> Hi Greg and Max,
>
> On Fri, Nov 21, 2008 at 12:04:25PM -0800, Max Krasnyansky wrote:
>> Hi Greg,
>>
>> I attached debug instrumentation patch for Dmitri to try. I'll clean it up and
>> add things you requested and will resubmit properly some time next week.
>>
>
> We added Max's debug patch to our kernel and have run Max's Trace 3 scenario, but we do not see a NULL sched-domain remain attached, see my comments below.
>
> mount -t cgroup cpuset -ocpuset /cpusets/
>
> for i in 0 1 2 3; do mkdir par$i; echo $i > par$i/cpuset.cpus; done
>
> kernel: cpusets: rebuild ndoms 1
> kernel: cpuset: domain 0 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0

Oops. I did not realize your NR_CPUS is so large. Unfortunately all your masks got truncated. I'll update the patch to print a cpu list instead of the masks.
> echo 0 > cpuset.sched_load_balance
> kernel: cpusets: rebuild ndoms 4
> kernel: cpuset: domain 0 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 1 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 2 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 3 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: CPU0 root domain default
> kernel: CPU0 attaching NULL sched-domain.
> kernel: CPU1 root domain default
> kernel: CPU1 attaching NULL sched-domain.
> kernel: CPU2 root domain default
> kernel: CPU2 attaching NULL sched-domain.
> kernel: CPU3 root domain default
> kernel: CPU3 attaching NULL sched-domain.
> kernel: CPU3 root domain e0000069ecb20000
> kernel: CPU3 attaching sched-domain:
> kernel:  domain 0: span 3 level NODE
> kernel:   groups: 3
> kernel: CPU2 root domain e000006884a00000
> kernel: CPU2 attaching sched-domain:
> kernel:  domain 0: span 2 level NODE
> kernel:   groups: 2
> kernel: CPU1 root domain e000006884a20000
> kernel: CPU1 attaching sched-domain:
> kernel:  domain 0: span 1 level NODE
> kernel:   groups: 1
> kernel: CPU0 root domain e000006884a40000
> kernel: CPU0 attaching sched-domain:
> kernel:  domain 0: span 0 level NODE
> kernel:   groups: 0
>
> Which is the way sched_load_balance is supposed to work. You need to set
> sched_load_balance=0 for all cpusets containing any cpu you want to disable
> balancing on, otherwise some balancing will happen.

It won't be much of a balancing in this case because there is just one cpu per domain.
In other words, no, that's not how it's supposed to work. There is code in cpu_attach_domain() that is supposed to remove redundant levels (the sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.

btw The reason you got a different result than I did is because you have a NUMA box whereas mine is UMA. I was able to reproduce the problem, though, by enabling the multi-core scheduler. In that case I also get one redundant domain level, CPU, with a single CPU in it.

So we definitely need to fix this. I'll try to poke around tomorrow and figure out why the redundant level is not dropped.

> So in addition to the top (root) cpuset, we need to set it to '0' in the
> parX cpusets. That will turn off load balancing to the cpus in question
> (thereby attaching a NULL sched domain).

As I explained above, we should not have to disable load balancing in cpusets with a single CPU.

> So when we do that for just par3, we get the following:
> echo 0 > par3/cpuset.sched_load_balance
> kernel: cpusets: rebuild ndoms 3
> kernel: cpuset: domain 0 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 1 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 2 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: CPU3 root domain default
> kernel: CPU3 attaching NULL sched-domain.
>
> So the def_root_domain is now attached for CPU 3. And we do have a NULL
> sched-domain, which we expect for a cpu with load balancing turned off. If
> we turn sched_load_balance off ('0') on each of the other cpusets (par0-2),
> each of those cpus would also have a NULL sched-domain attached.

Ok. This one is a bug in cpuset.c:generate_sched_domains().
The sched domain generator in cpusets should not drop domains with a single cpu in them when sched_load_balance==0. I'll look at that tomorrow too.

Max

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
  2008-11-22  7:03                 ` Max Krasnyansky
@ 2008-11-22  8:18                   ` Li Zefan
  2008-11-24 15:11                     ` Dimitri Sivanich
  2008-11-24 21:46                     ` Max Krasnyansky
  0 siblings, 2 replies; 29+ messages in thread
From: Li Zefan @ 2008-11-22  8:18 UTC (permalink / raw)
To: Max Krasnyansky
Cc: Dimitri Sivanich, Gregory Haskins, Derek Fults, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Max Krasnyansky wrote:
>
> Dimitri Sivanich wrote:
>> Hi Greg and Max,
>>
>> On Fri, Nov 21, 2008 at 12:04:25PM -0800, Max Krasnyansky wrote:
>>> Hi Greg,
>>>
>>> I attached debug instrumentation patch for Dmitri to try. I'll clean it up and
>>> add things you requested and will resubmit properly some time next week.
>>>
>> We added Max's debug patch to our kernel and have run Max's Trace 3 scenario, but we do not see a NULL sched-domain remain attached, see my comments below.
>>
>> mount -t cgroup cpuset -ocpuset /cpusets/
>>
>> for i in 0 1 2 3; do mkdir par$i; echo $i > par$i/cpuset.cpus; done
>>
>> kernel: cpusets: rebuild ndoms 1
>> kernel: cpuset: domain 0 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
> Oops. I did not realize your NR_CPUS is so large. Unfortunately all your masks
> got truncated.
> I'll update the patch to print cpu list instead of the masks.
>
>> echo 0 > cpuset.sched_load_balance
>> kernel: cpusets: rebuild ndoms 4
>> kernel: cpuset: domain 0 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 1 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 2 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 3 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: CPU0 root domain default
>> kernel: CPU0 attaching NULL sched-domain.
>> kernel: CPU1 root domain default
>> kernel: CPU1 attaching NULL sched-domain.
>> kernel: CPU2 root domain default
>> kernel: CPU2 attaching NULL sched-domain.
>> kernel: CPU3 root domain default
>> kernel: CPU3 attaching NULL sched-domain.
>
>> kernel: CPU3 root domain e0000069ecb20000
>> kernel: CPU3 attaching sched-domain:
>> kernel:  domain 0: span 3 level NODE
>> kernel:   groups: 3
>> kernel: CPU2 root domain e000006884a00000
>> kernel: CPU2 attaching sched-domain:
>> kernel:  domain 0: span 2 level NODE
>> kernel:   groups: 2
>> kernel: CPU1 root domain e000006884a20000
>> kernel: CPU1 attaching sched-domain:
>> kernel:  domain 0: span 1 level NODE
>> kernel:   groups: 1
>> kernel: CPU0 root domain e000006884a40000
>> kernel: CPU0 attaching sched-domain:
>> kernel:  domain 0: span 0 level NODE
>> kernel:   groups: 0
>>
>> Which is the way sched_load_balance is supposed to work. You need to set
>> sched_load_balance=0 for all cpusets containing any cpu you want to disable
>> balancing on, otherwise some balancing will happen.
> It won't be much of a balancing in this case because this just one cpu per
> domain.
> In other words no that's not how it supposed to work. There is code in
> cpu_attach_domain() that is supposed to remove redundant levels
> (sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
> btw The reason you got a different result that I did is because you have a
> NUMA box where is mine is UMA. I was able to reproduce the problem though by
> enabling multi-core scheduler. In which case I also get one redundant domain
> level CPU, with a single CPU in it.
> So we definitely need to fix this. I'll try to poke around tomorrow and figure
> out why redundant level is not dropped.
>

You were not using the latest kernel, were you?

There was a bug in the sd degenerate code, and it has already been fixed:
http://lkml.org/lkml/2008/11/8/10

>> So in addition to the top (root) cpuset, we need to set it to '0' in the
>> parX cpusets. That will turn off load balancing to the cpus in question
>> (thereby attaching a NULL sched domain).
> As I explained above we should not have to disable load balancing in cpusets
> with a single CPU.
>

Yes, and please try the latest kernel. ;)

>> So when we do that for just par3, we get the following:
>> echo 0 > par3/cpuset.sched_load_balance
>> kernel: cpusets: rebuild ndoms 3
>> kernel: cpuset: domain 0 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 1 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 2 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: CPU3 root domain default
>> kernel: CPU3 attaching NULL sched-domain.
>>
>> So the def_root_domain is now attached for CPU 3. And we do have a NULL
>> sched-domain, which we expect for a cpu with load balancing turned off. If
>> we turn sched_load_balance off ('0') on each of the other cpusets (par0-2),
>> each of those cpus would also have a NULL sched-domain attached.
> Ok. This one is a bug in cpuset.c:generate_sched_domains(). Sched domain
> generator in cpusets should not drop domains with single cpu in them when
> sched_load_balance==0. I'll look at that tomorrow too.
>

Do you mean the correct behavior should be as follows?

kernel: cpusets: rebuild ndoms 4

But why do you think this is a bug? In generate_sched_domains(), cpusets with sched_load_balance==0 will be skipped:

	list_add(&top_cpuset.stack_list, &q);
	while (!list_empty(&q)) {
		...
		if (is_sched_load_balance(cp)) {
			csa[csn++] = cp;
			continue;
		}
		...
	}

Correct me if I misunderstood your point.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
  2008-11-22  8:18                   ` Li Zefan
@ 2008-11-24 15:11                     ` Dimitri Sivanich
  2008-11-24 21:47                       ` Max Krasnyansky
  2008-11-24 21:46                     ` Max Krasnyansky
  1 sibling, 1 reply; 29+ messages in thread
From: Dimitri Sivanich @ 2008-11-24 15:11 UTC (permalink / raw)
To: Li Zefan, Gregory Haskins
Cc: Max Krasnyansky, Derek Fults, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

On Sat, Nov 22, 2008 at 04:18:29PM +0800, Li Zefan wrote:
> Max Krasnyansky wrote:
> >
> > Dimitri Sivanich wrote:
> >>
> >> Which is the way sched_load_balance is supposed to work. You need to set
> >> sched_load_balance=0 for all cpusets containing any cpu you want to disable
> >> balancing on, otherwise some balancing will happen.
> > It won't be much of a balancing in this case because this just one cpu per
> > domain.
> > In other words no that's not how it supposed to work. There is code in
> > cpu_attach_domain() that is supposed to remove redundant levels
> > (sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
> > btw The reason you got a different result that I did is because you have a
> > NUMA box where is mine is UMA. I was able to reproduce the problem though by
> > enabling multi-core scheduler. In which case I also get one redundant domain
> > level CPU, with a single CPU in it.
> > So we definitely need to fix this. I'll try to poke around tomorrow and figure
> > out why redundant level is not dropped.
> >
> You were not using latest kernel, were you?
>
> There was a bug in sd degenerate code, and it has already been fixed:
> http://lkml.org/lkml/2008/11/8/10

With the above patch added, we now see the results that Max is showing as far as individual root domains being created with a span of just their own cpu when sched_load_balance is turned off.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
  2008-11-24 15:11                     ` Dimitri Sivanich
@ 2008-11-24 21:47                       ` Max Krasnyansky
  0 siblings, 0 replies; 29+ messages in thread
From: Max Krasnyansky @ 2008-11-24 21:47 UTC (permalink / raw)
To: Dimitri Sivanich
Cc: Li Zefan, Gregory Haskins, Derek Fults, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Dimitri Sivanich wrote:
> On Sat, Nov 22, 2008 at 04:18:29PM +0800, Li Zefan wrote:
>> Max Krasnyansky wrote:
>>> Dimitri Sivanich wrote:
>>>> Which is the way sched_load_balance is supposed to work. You need to set
>>>> sched_load_balance=0 for all cpusets containing any cpu you want to disable
>>>> balancing on, otherwise some balancing will happen.
>>> It won't be much of a balancing in this case because this just one cpu per
>>> domain.
>>> In other words no that's not how it supposed to work. There is code in
>>> cpu_attach_domain() that is supposed to remove redundant levels
>>> (sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
>>> btw The reason you got a different result that I did is because you have a
>>> NUMA box where is mine is UMA. I was able to reproduce the problem though by
>>> enabling multi-core scheduler. In which case I also get one redundant domain
>>> level CPU, with a single CPU in it.
>>> So we definitely need to fix this. I'll try to poke around tomorrow and figure
>>> out why redundant level is not dropped.
>>>
>> You were not using latest kernel, were you?
>>
>> There was a bug in sd degenerate code, and it has already been fixed:
>> http://lkml.org/lkml/2008/11/8/10
>
> With the above patch added, we now see the results that Max is
> showing as far as individual root domains being created with a span
> of just their own cpu when sched_load_balance is turned off.

Nice.

Max

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
  2008-11-22  8:18                   ` Li Zefan
  2008-11-24 15:11                     ` Dimitri Sivanich
@ 2008-11-24 21:46                     ` Max Krasnyansky
  1 sibling, 0 replies; 29+ messages in thread
From: Max Krasnyansky @ 2008-11-24 21:46 UTC (permalink / raw)
To: Li Zefan
Cc: Dimitri Sivanich, Gregory Haskins, Derek Fults, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar

Li Zefan wrote:
> Max Krasnyansky wrote:
>> Dimitri Sivanich wrote:
>>> kernel: CPU3 root domain e0000069ecb20000
>>> kernel: CPU3 attaching sched-domain:
>>> kernel:  domain 0: span 3 level NODE
>>> kernel:   groups: 3
>>> kernel: CPU2 root domain e000006884a00000
>>> kernel: CPU2 attaching sched-domain:
>>> kernel:  domain 0: span 2 level NODE
>>> kernel:   groups: 2
>>> kernel: CPU1 root domain e000006884a20000
>>> kernel: CPU1 attaching sched-domain:
>>> kernel:  domain 0: span 1 level NODE
>>> kernel:   groups: 1
>>> kernel: CPU0 root domain e000006884a40000
>>> kernel: CPU0 attaching sched-domain:
>>> kernel:  domain 0: span 0 level NODE
>>> kernel:   groups: 0
>>>
>>> Which is the way sched_load_balance is supposed to work. You need to set
>>> sched_load_balance=0 for all cpusets containing any cpu you want to disable
>>> balancing on, otherwise some balancing will happen.
>> It won't be much of a balancing in this case because this just one cpu per
>> domain.
>> In other words no that's not how it supposed to work. There is code in
>> cpu_attach_domain() that is supposed to remove redundant levels
>> (sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
>> btw The reason you got a different result that I did is because you have a
>> NUMA box where is mine is UMA. I was able to reproduce the problem though by
>> enabling multi-core scheduler. In which case I also get one redundant domain
>> level CPU, with a single CPU in it.
>> So we definitely need to fix this. I'll try to poke around tomorrow and figure
>> out why redundant level is not dropped.
>>
> You were not using latest kernel, were you?
>
> There was a bug in sd degenerate code, and it has already been fixed:
> http://lkml.org/lkml/2008/11/8/10

Ah, makes sense. The funny part is that I did see the patch before but completely forgot about it :).

>>> So when we do that for just par3, we get the following:
>>> echo 0 > par3/cpuset.sched_load_balance
>>> kernel: cpusets: rebuild ndoms 3
>>> kernel: cpuset: domain 0 cpumask
>>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>>> 0000000,00000000,00000000,00000000,0
>>> kernel: cpuset: domain 1 cpumask
>>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>>> 0000000,00000000,00000000,00000000,0
>>> kernel: cpuset: domain 2 cpumask
>>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>>> 0000000,00000000,00000000,00000000,0
>>> kernel: CPU3 root domain default
>>> kernel: CPU3 attaching NULL sched-domain.
>>>
>>> So the def_root_domain is now attached for CPU 3. And we do have a NULL
>>> sched-domain, which we expect for a cpu with load balancing turned off. If
>>> we turn sched_load_balance off ('0') on each of the other cpusets (par0-2),
>>> each of those cpus would also have a NULL sched-domain attached.
>> Ok. This one is a bug in cpuset.c:generate_sched_domains(). Sched domain
>> generator in cpusets should not drop domains with single cpu in them when
>> sched_load_balance==0. I'll look at that tomorrow too.
>>
> Do you mean the correct behavior should be as following?
> kernel: cpusets: rebuild ndoms 4

Yes.

> But why do you think this is a bug? In generate_sched_domains(), cpusets with
> sched_load_balance==0 will be skipped:
>
>	list_add(&top_cpuset.stack_list, &q);
>	while (!list_empty(&q)) {
>		...
>		if (is_sched_load_balance(cp)) {
>			csa[csn++] = cp;
>			continue;
>		}
>		...
>	}
>
> Correct me if I misunderstood your point.

The problem is that all cpus in cpusets with sched_load_balance==0 end up in the default root_domain, which causes lock contention. We can fix it either in sched.c:partition_sched_domains() or in cpuset.c:generate_sched_domains(). I'd rather fix cpusets, because a sched.c fix would be sub-optimal. See my answer to Greg on the same thread. Basically, the scheduler code would have to allocate a root_domain for each CPU even in transitional states. So I'd rather fix cpusets to generate a domain for each non-overlapping cpuset regardless of the sched_load_balance flag.

Max

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
  2008-11-04 14:36   ` Peter Zijlstra
  2008-11-04 14:40     ` Dimitri Sivanich
@ 2008-11-04 14:45     ` Dimitri Sivanich
  2008-11-06  9:13     ` Nish Aravamudan
  2 siblings, 0 replies; 29+ messages in thread
From: Dimitri Sivanich @ 2008-11-04 14:45 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Gregory Haskins, linux-kernel, Ingo Molnar

On Tue, Nov 04, 2008 at 03:36:33PM +0100, Peter Zijlstra wrote:
> Or are folks disabling load-balance bottom-up, instead of top-down?
>
> In that case, I think we should dis-allow that.

If what you mean by "disabling load-balance bottom-up" is disabling load-balance in the root cpuset before disabling it in the leaves, in the end it does not matter which way you do it; the setup winds up being the same.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
  2008-11-04 14:36             ` Peter Zijlstra
  2008-11-04 14:40               ` Dimitri Sivanich
  2008-11-04 14:45               ` Dimitri Sivanich
@ 2008-11-06  9:13               ` Nish Aravamudan
  2008-11-06 13:32                 ` Dimitri Sivanich
  2 siblings, 1 reply; 29+ messages in thread
From: Nish Aravamudan @ 2008-11-06 9:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Gregory Haskins, Dimitri Sivanich, linux-kernel, Ingo Molnar

On Tue, Nov 4, 2008 at 6:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote:
>> Gregory Haskins wrote:
>> > Peter Zijlstra wrote:
>> >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
>> >>> When load balancing gets switched off for a set of cpus via the
>> >>> sched_load_balance flag in cpusets, those cpus wind up with the
>> >>> globally defined def_root_domain attached. The def_root_domain is
>> >>> attached when partition_sched_domains calls detach_destroy_domains().
>> >>> A new root_domain is never allocated or attached, as a sched domain
>> >>> will never be attached by __build_sched_domains() for the non-load
>> >>> balanced processors.
>> >>>
>> >>> The problem with this scenario is that on systems with a large number
>> >>> of processors with load balancing switched off, we start to see the
>> >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
>> >>> This starts to become much more apparent above 8 waking RT threads
>> >>> (with each RT thread running on its own cpu, blocking and waking up
>> >>> continuously).
>> >>>
>> >>> I'm wondering if this is, in fact, the way things were meant to work,
>> >>> or should we have a root domain allocated for each cpu that is not to
>> >>> be part of a sched domain? Note that the def_root_domain spans all of
>> >>> the non-load-balanced cpus in this case. Having it attached to cpus
>> >>> that should not be load balancing doesn't quite make sense to me.
>> >>>
>> >>>
>> >> It shouldn't be like that; each load-balance domain (in your case a
>> >> single cpu) should get its own root domain. Gregory?
>> >>
>> >>
>> >
>> > Yeah, this sounds broken. I know that the root-domain code was being
>> > developed coincident with some upheaval in the cpuset code, so I suspect
>> > something may have been broken from the original intent. I will take a
>> > look.
>> >
>> > -Greg
>> >
>>
>> After thinking about it some more, I am not quite sure what to do here.
>> The root-domain code was really designed to be 1:1 with a disjoint
>> cpuset. In this case, it sounds like all the non-balanced cpus are
>> still in one default cpuset. In that case, the code is correct to place
>> all those cores in the singleton def_root_domain. The question really
>> is: how do we support the sched_load_balance flag better?
>>
>> I suppose we could go through the scheduler code and have it check that
>> flag before consulting the root-domain. Another alternative is to have
>> the sched_load_balance=false flag create a disjoint cpuset. Any thoughts?
>
> Hmm, but you cannot disable load-balance on a cpu without placing it in
> a cpuset first, right?
>
> Or are folks disabling load-balance bottom-up, instead of top-down?
>
> In that case, I think we should dis-allow that.

I don't have a lot of insight into the technical discussion, but will
say that (if I understand you right), the "bottom-up" approach was
recommended on LKML by Max K. in the (long) thread from earlier this
year with Subject "Inquiry: Should we remove "isolcpus=" kernel boot
option? (may have realtime uses)":

"Just to complete the example above. Let's say you want to isolate cpu2
(assuming that cpusets are already mounted).
# Bring cpu2 offline
echo 0 > /sys/devices/system/cpu/cpu2/online

# Disable system wide load balancing
echo 0 > /dev/cpuset/cpuset.sched_load_balance

# Bring cpu2 online
echo 1 > /sys/devices/system/cpu/cpu2/online

Now if you want to un-isolate cpu2 you do

# Re-enable system wide load balancing
echo 1 > /dev/cpuset/cpuset.sched_load_balance

Of course this is not a complete isolation. There are also irqs (see my
"default irq affinity" patch), workqueues and the stop machine. I'm working on
those too and will release a .25-based cpuisol tree when I'm done."

Would you recommend instead, then, that a new cpuset be created with
only cpu 2 in it (should one set cpuset.cpu_exclusive then?) and then
disabling load balancing in that cpuset?

Thanks,
Nish
* Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance
  2008-11-06  9:13               ` Nish Aravamudan
@ 2008-11-06 13:32                 ` Dimitri Sivanich
  0 siblings, 0 replies; 29+ messages in thread
From: Dimitri Sivanich @ 2008-11-06 13:32 UTC (permalink / raw)
To: Nish Aravamudan
Cc: Peter Zijlstra, Gregory Haskins, linux-kernel, Ingo Molnar

On Thu, Nov 06, 2008 at 01:13:48AM -0800, Nish Aravamudan wrote:
> On Tue, Nov 4, 2008 at 6:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote:
> >> Gregory Haskins wrote:
> >> > Peter Zijlstra wrote:
> >> >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
> >> >>> When load balancing gets switched off for a set of cpus via the
> >> >>> sched_load_balance flag in cpusets, those cpus wind up with the
> >> >>> globally defined def_root_domain attached. The def_root_domain is
> >> >>> attached when partition_sched_domains calls detach_destroy_domains().
> >> >>> A new root_domain is never allocated or attached, as a sched domain
> >> >>> will never be attached by __build_sched_domains() for the non-load
> >> >>> balanced processors.
> >> >>>
> >> >>> The problem with this scenario is that on systems with a large number
> >> >>> of processors with load balancing switched off, we start to see the
> >> >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
> >> >>> This starts to become much more apparent above 8 waking RT threads
> >> >>> (with each RT thread running on its own cpu, blocking and waking up
> >> >>> continuously).
> >> >>>
> >> >>> I'm wondering if this is, in fact, the way things were meant to work,
> >> >>> or should we have a root domain allocated for each cpu that is not to
> >> >>> be part of a sched domain? Note that the def_root_domain spans all of
> >> >>> the non-load-balanced cpus in this case.
> >> >>> Having it attached to cpus
> >> >>> that should not be load balancing doesn't quite make sense to me.
> >> >>>
> >> >>>
> >> >> It shouldn't be like that; each load-balance domain (in your case a
> >> >> single cpu) should get its own root domain. Gregory?
> >> >>
> >> >>
> >> >
> >> > Yeah, this sounds broken. I know that the root-domain code was being
> >> > developed coincident with some upheaval in the cpuset code, so I suspect
> >> > something may have been broken from the original intent. I will take a
> >> > look.
> >> >
> >> > -Greg
> >> >
> >> >
> >>
> >> After thinking about it some more, I am not quite sure what to do here.
> >> The root-domain code was really designed to be 1:1 with a disjoint
> >> cpuset. In this case, it sounds like all the non-balanced cpus are
> >> still in one default cpuset. In that case, the code is correct to place
> >> all those cores in the singleton def_root_domain. The question really
> >> is: how do we support the sched_load_balance flag better?
> >>
> >> I suppose we could go through the scheduler code and have it check that
> >> flag before consulting the root-domain. Another alternative is to have
> >> the sched_load_balance=false flag create a disjoint cpuset. Any thoughts?
> >
> > Hmm, but you cannot disable load-balance on a cpu without placing it in
> > a cpuset first, right?
> >
> > Or are folks disabling load-balance bottom-up, instead of top-down?
> >
> > In that case, I think we should dis-allow that.
>
> I don't have a lot of insight into the technical discussion, but will
> say that (if I understand you right), the "bottom-up" approach was
> recommended on LKML by Max K. in the (long) thread from earlier this
> year with Subject "Inquiry: Should we remove "isolcpus=" kernel boot
> option? (may have realtime uses)":
>
> "Just to complete the example above. Let's say you want to isolate cpu2
> (assuming that cpusets are already mounted).
>
> # Bring cpu2 offline
> echo 0 > /sys/devices/system/cpu/cpu2/online
>
> # Disable system wide load balancing
> echo 0 > /dev/cpuset/cpuset.sched_load_balance
>
> # Bring cpu2 online
> echo 1 > /sys/devices/system/cpu/cpu2/online
>
> Now if you want to un-isolate cpu2 you do
>
> # Re-enable system wide load balancing
> echo 1 > /dev/cpuset/cpuset.sched_load_balance
>
> Of course this is not a complete isolation. There are also irqs (see my
> "default irq affinity" patch), workqueues and the stop machine. I'm working on
> those too and will release a .25-based cpuisol tree when I'm done."
>
> Would you recommend instead, then, that a new cpuset be created with
> only cpu 2 in it (should one set cpuset.cpu_exclusive then?) and then
> disabling load balancing in that cpuset?
>

This is exactly the primary scenario that I've been trying (as well as
having multiple cpus in that cpuset). Regardless of the setup, the same
problem occurs: the default root domain is what gets attached, and it spans
all of the other cpus with load balancing switched off. The lock in the
def_root_domain's cpupri_vec therefore becomes contended, and that slows
down thread wakeup.
end of thread, other threads:[~2008-11-24 21:47 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-03 21:07 RT sched: cpupri_vec lock contention with def_root_domain and no load balance Dimitri Sivanich
2008-11-03 22:33 ` Peter Zijlstra
2008-11-04  1:29 ` Dimitri Sivanich
2008-11-04  3:53 ` Gregory Haskins
2008-11-04 14:34 ` Gregory Haskins
2008-11-04 14:36 ` Peter Zijlstra
2008-11-04 14:40 ` Dimitri Sivanich
2008-11-04 14:59 ` Gregory Haskins
2008-11-19 19:49 ` Max Krasnyansky
2008-11-19 19:55 ` Dimitri Sivanich
2008-11-19 20:17 ` Max Krasnyansky
2008-11-19 20:21 ` Dimitri Sivanich
2008-11-19 20:25 ` Gregory Haskins
2008-11-19 20:33 ` Dimitri Sivanich
2008-11-19 21:30 ` Gregory Haskins
2008-11-19 21:47 ` Dimitri Sivanich
2008-11-19 22:25 ` Gregory Haskins
2008-11-20  2:12 ` Max Krasnyansky
2008-11-21  1:57 ` Gregory Haskins
2008-11-21 20:04 ` Max Krasnyansky
2008-11-21 21:18 ` Dimitri Sivanich
2008-11-22  7:03 ` Max Krasnyansky
2008-11-22  8:18 ` Li Zefan
2008-11-24 15:11 ` Dimitri Sivanich
2008-11-24 21:47 ` Max Krasnyansky
2008-11-24 21:46 ` Max Krasnyansky
2008-11-04 14:45 ` Dimitri Sivanich
2008-11-06  9:13 ` Nish Aravamudan
2008-11-06 13:32 ` Dimitri Sivanich