public inbox for linux-kernel@vger.kernel.org
* [patch] sched: add locking when update the task_group's cfs_rq[]  array.
@ 2008-11-19  6:48 Ken Chen
  2008-11-19  7:53 ` Ingo Molnar
  2008-11-19 16:54 ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Ken Chen @ 2008-11-19  6:48 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra; +Cc: Linux Kernel Mailing List

Add locking when updating the task_group's cfs_rq[] array.  tg_shares_up()
can potentially execute concurrently on multiple CPUs with overlapping
cpu masks, depending on where task_cpu() was when a task got woken up.
The lack of any locking while redistributing tg->shares over the cfs_rq[]
array opens up a large window for conflicting updates and ultimately
corrupts the integrity of the per-cpu cfs_rq shares.  Add a tg_lock to
protect the operations.


Signed-off-by: Ken Chen <kenchen@google.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index 1ff78b6..907a44e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -267,6 +267,8 @@ struct task_group {
 	/* runqueue "owned" by this group on each cpu */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+	/* protect integrity of per-cpu cfs_rq[i]->shares */
+	spinlock_t tg_lock;
 #endif

 #ifdef CONFIG_RT_GROUP_SCHED
@@ -1493,13 +1495,11 @@ update_group_shares_cpu
 	if (abs(shares - tg->se[cpu]->load.weight) >
 			sysctl_sched_shares_thresh) {
 		struct rq *rq = cpu_rq(cpu);
-		unsigned long flags;

-		spin_lock_irqsave(&rq->lock, flags);
+		spin_lock(&rq->lock);
 		tg->cfs_rq[cpu]->shares = shares;
-
 		__set_se_shares(tg->se[cpu], shares);
-		spin_unlock_irqrestore(&rq->lock, flags);
+		spin_unlock(&rq->lock);
 	}
 }

@@ -1513,8 +1513,12 @@ static int tg_shares_up
 	unsigned long weight, rq_weight = 0;
 	unsigned long shares = 0;
 	struct sched_domain *sd = data;
+	unsigned long flags;
 	int i;

+	if (!spin_trylock_irqsave(&tg->tg_lock, flags))
+		return 0;
+
 	for_each_cpu_mask(i, sd->span) {
 		/*
 		 * If there are currently no tasks on the cpu pretend there
@@ -1539,6 +1543,7 @@ static int tg_shares_up
 	for_each_cpu_mask(i, sd->span)
 		update_group_shares_cpu(tg, i, shares, rq_weight);

+	spin_unlock_irqrestore(&tg->tg_lock, flags);
 	return 0;
 }

@@ -8195,6 +8200,10 @@ void __init sched_init(void)
 	list_add(&init_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&init_task_group.children);

+#ifdef CONFIG_FAIR_GROUP_SCHED
+	spin_lock_init(&init_task_group.tg_lock);
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_USER_SCHED
 	INIT_LIST_HEAD(&root_task_group.children);
 	init_task_group.parent = &root_task_group;
@@ -8491,6 +8500,10 @@ int alloc_fair_sched_group

 	tg->shares = NICE_0_LOAD;

+#ifdef CONFIG_FAIR_GROUP_SCHED
+	spin_lock_init(&tg->tg_lock);
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 	for_each_possible_cpu(i) {
 		rq = cpu_rq(i);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [patch] sched: add locking when update the task_group's cfs_rq[] array.
  2008-11-19  6:48 [patch] sched: add locking when update the task_group's cfs_rq[] array Ken Chen
@ 2008-11-19  7:53 ` Ingo Molnar
  2008-11-19  8:22   ` Ken Chen
  2008-11-19 16:54 ` Peter Zijlstra
  1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-11-19  7:53 UTC (permalink / raw)
  To: Ken Chen; +Cc: Peter Zijlstra, Linux Kernel Mailing List


* Ken Chen <kenchen@google.com> wrote:

> @@ -1513,8 +1513,12 @@ static int tg_shares_up
>  	unsigned long weight, rq_weight = 0;
>  	unsigned long shares = 0;
>  	struct sched_domain *sd = data;
> +	unsigned long flags;
>  	int i;
> 
> +	if (!spin_trylock_irqsave(&tg->tg_lock, flags))
> +		return 0;

hm, why trylock?

	Ingo


* Re: [patch] sched: add locking when update the task_group's cfs_rq[]  array.
  2008-11-19  7:53 ` Ingo Molnar
@ 2008-11-19  8:22   ` Ken Chen
  0 siblings, 0 replies; 8+ messages in thread
From: Ken Chen @ 2008-11-19  8:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Linux Kernel Mailing List

On Tue, Nov 18, 2008 at 11:53 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ken Chen <kenchen@google.com> wrote:
>
>> @@ -1513,8 +1513,12 @@ static int tg_shares_up
>>       unsigned long weight, rq_weight = 0;
>>       unsigned long shares = 0;
>>       struct sched_domain *sd = data;
>> +     unsigned long flags;
>>       int i;
>>
>> +     if (!spin_trylock_irqsave(&tg->tg_lock, flags))
>> +             return 0;
>
> hm, why trylock?

I'm paranoid about potential lock contention.  Since calls to
tg_shares_up() are more or less sample based, I opted to skip the update
when there is lock contention.  The kernel only walks the tg tree once
every sysctl_sched_shares_ratelimit, though, so the chance of running
into lock contention probably isn't that high anyway, in which case the
trylock will usually succeed in getting the lock.

- Ken


* Re: [patch] sched: add locking when update the task_group's cfs_rq[]  array.
  2008-11-19  6:48 [patch] sched: add locking when update the task_group's cfs_rq[] array Ken Chen
  2008-11-19  7:53 ` Ingo Molnar
@ 2008-11-19 16:54 ` Peter Zijlstra
  2008-11-19 17:21   ` Ken Chen
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2008-11-19 16:54 UTC (permalink / raw)
  To: Ken Chen; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Tue, 2008-11-18 at 22:48 -0800, Ken Chen wrote:
> Add locking when updating the task_group's cfs_rq[] array.  tg_shares_up()
> can potentially execute concurrently on multiple CPUs with overlapping
> cpu masks, depending on where task_cpu() was when a task got woken up.
> The lack of any locking while redistributing tg->shares over the cfs_rq[]
> array opens up a large window for conflicting updates and ultimately
> corrupts the integrity of the per-cpu cfs_rq shares.  Add a tg_lock to
> protect the operations.

I see why you want to do this, but introducing a global lock makes me
sad :/

Let me ponder this a while...

> Signed-off-by: Ken Chen <kenchen@google.com>
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1ff78b6..907a44e 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -267,6 +267,8 @@ struct task_group {
>  	/* runqueue "owned" by this group on each cpu */
>  	struct cfs_rq **cfs_rq;
>  	unsigned long shares;
> +	/* protect integrity of per-cpu cfs_rq[i]->shares */
> +	spinlock_t tg_lock;
>  #endif
> 
>  #ifdef CONFIG_RT_GROUP_SCHED
> @@ -1493,13 +1495,11 @@ update_group_shares_cpu
>  	if (abs(shares - tg->se[cpu]->load.weight) >
>  			sysctl_sched_shares_thresh) {
>  		struct rq *rq = cpu_rq(cpu);
> -		unsigned long flags;
> 
> -		spin_lock_irqsave(&rq->lock, flags);
> +		spin_lock(&rq->lock);
>  		tg->cfs_rq[cpu]->shares = shares;
> -
>  		__set_se_shares(tg->se[cpu], shares);
> -		spin_unlock_irqrestore(&rq->lock, flags);
> +		spin_unlock(&rq->lock);
>  	}
>  }
> 
> @@ -1513,8 +1513,12 @@ static int tg_shares_up
>  	unsigned long weight, rq_weight = 0;
>  	unsigned long shares = 0;
>  	struct sched_domain *sd = data;
> +	unsigned long flags;
>  	int i;
> 
> +	if (!spin_trylock_irqsave(&tg->tg_lock, flags))
> +		return 0;
> +
>  	for_each_cpu_mask(i, sd->span) {
>  		/*
>  		 * If there are currently no tasks on the cpu pretend there
> @@ -1539,6 +1543,7 @@ static int tg_shares_up
>  	for_each_cpu_mask(i, sd->span)
>  		update_group_shares_cpu(tg, i, shares, rq_weight);
> 
> +	spin_unlock_irqrestore(&tg->tg_lock, flags);
>  	return 0;
>  }
> 
> @@ -8195,6 +8200,10 @@ void __init sched_init(void)
>  	list_add(&init_task_group.list, &task_groups);
>  	INIT_LIST_HEAD(&init_task_group.children);
> 
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +	spin_lock_init(&init_task_group.tg_lock);
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
>  #ifdef CONFIG_USER_SCHED
>  	INIT_LIST_HEAD(&root_task_group.children);
>  	init_task_group.parent = &root_task_group;
> @@ -8491,6 +8500,10 @@ int alloc_fair_sched_group
> 
>  	tg->shares = NICE_0_LOAD;
> 
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +	spin_lock_init(&tg->tg_lock);
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
>  	for_each_possible_cpu(i) {
>  		rq = cpu_rq(i);



* Re: [patch] sched: add locking when update the task_group's cfs_rq[]  array.
  2008-11-19 16:54 ` Peter Zijlstra
@ 2008-11-19 17:21   ` Ken Chen
  2008-11-19 20:58     ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Ken Chen @ 2008-11-19 17:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Wed, Nov 19, 2008 at 8:54 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2008-11-18 at 22:48 -0800, Ken Chen wrote:
>> Add locking when updating the task_group's cfs_rq[] array.  tg_shares_up()
>> can potentially execute concurrently on multiple CPUs with overlapping
>> cpu masks, depending on where task_cpu() was when a task got woken up.
>> The lack of any locking while redistributing tg->shares over the cfs_rq[]
>> array opens up a large window for conflicting updates and ultimately
>> corrupts the integrity of the per-cpu cfs_rq shares.  Add a tg_lock to
>> protect the operations.
>
> I see why you want to do this, but introducing a global lock makes me
> sad :/

I wholly agree on the scalability concern.  The bigger the system, the
more it needs to protect the integrity of cfs_rq[]->shares so that the
sum still adds up to tg->shares.  Otherwise, the share distributed to
each CPU's cfs_rq might swing wildly and indirectly lead to fluctuation
of the effective total tg->shares.  However, I have the same doubt that
this will scale on a large CPU system.  Does CFS really have to iterate
the whole task_group tree?

- Ken


* Re: [patch] sched: add locking when update the task_group's cfs_rq[]  array.
  2008-11-19 17:21   ` Ken Chen
@ 2008-11-19 20:58     ` Peter Zijlstra
  2008-11-19 21:50       ` Chris Friesen
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2008-11-19 20:58 UTC (permalink / raw)
  To: Ken Chen; +Cc: Ingo Molnar, Linux Kernel Mailing List, Dhaval Giani

On Wed, 2008-11-19 at 09:21 -0800, Ken Chen wrote:
> On Wed, Nov 19, 2008 at 8:54 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Tue, 2008-11-18 at 22:48 -0800, Ken Chen wrote:
> >> Add locking when updating the task_group's cfs_rq[] array.  tg_shares_up()
> >> can potentially execute concurrently on multiple CPUs with overlapping
> >> cpu masks, depending on where task_cpu() was when a task got woken up.
> >> The lack of any locking while redistributing tg->shares over the cfs_rq[]
> >> array opens up a large window for conflicting updates and ultimately
> >> corrupts the integrity of the per-cpu cfs_rq shares.  Add a tg_lock to
> >> protect the operations.
> >
> > I see why you want to do this, but introducing a global lock makes me
> > sad :/
> 
> I wholly agree on the scalability concern.  The bigger the system, the
> more it needs to protect the integrity of cfs_rq[]->shares so that the
> sum still adds up to tg->shares.  Otherwise, the share distributed to
> each CPU's cfs_rq might swing wildly and indirectly lead to fluctuation
> of the effective total tg->shares.  However, I have the same doubt that
> this will scale on a large CPU system.  Does CFS really have to iterate
> the whole task_group tree?

Yes, sadly.  The weight of the per-cpu super-task representation depends
on the group's task distribution over all cpus :/

(Dhaval, could you send Ken a copy of the paper we did on this?)

The idea was that we balance the stuff using the sched-domain tree and
update it incrementally, and at the top-level sched domain we fix it all
up.

Will it scale?  Half-way, I'd say.  It races a little, but should
converge.  The biggest issue is that we're running with 10-bit fixed
point math, and on large cpu machines you get into granularity problems.





* Re: [patch] sched: add locking when update the task_group's cfs_rq[] array.
  2008-11-19 20:58     ` Peter Zijlstra
@ 2008-11-19 21:50       ` Chris Friesen
  2008-11-22 10:09         ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Friesen @ 2008-11-19 21:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ken Chen, Ingo Molnar, Linux Kernel Mailing List, Dhaval Giani

Peter Zijlstra wrote:

> Will it scale?  Half-way, I'd say.  It races a little, but should
> converge.  The biggest issue is that we're running with 10-bit fixed
> point math, and on large cpu machines you get into granularity problems.

Is there any way to get higher resolution math on 64-bit machines?  (I'm 
assuming that most larger SMP boxes will be 64-bit capable.)

Chris


* Re: [patch] sched: add locking when update the task_group's cfs_rq[] array.
  2008-11-19 21:50       ` Chris Friesen
@ 2008-11-22 10:09         ` Peter Zijlstra
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2008-11-22 10:09 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ken Chen, Ingo Molnar, Linux Kernel Mailing List, Dhaval Giani

On Wed, 2008-11-19 at 15:50 -0600, Chris Friesen wrote:
> Peter Zijlstra wrote:
> 
> > Will it scale?  Half-way, I'd say.  It races a little, but should
> > converge.  The biggest issue is that we're running with 10-bit fixed
> > point math, and on large cpu machines you get into granularity problems.
> 
> Is there any way to get higher resolution math on 64-bit machines?  (I'm 
> assuming that most larger SMP boxes will be 64-bit capable.)

Yes, if we do the same thing we do for 64-bit math on 32-bit boxen to
obtain 128-bit math on 64-bit boxen.

Trouble seems to be gcc, which imho f*'ed up by not providing a long
long long (or whatever); there does seem to be something, but all the
gcc people tell me it's fragile and not to be used.

Anyway, I've been meaning to send an RFC to linux-arch about adding
something like this, which we'd then have to do ourselves with C
functions instead of nice integer types :/


