* Q: select_fallback_rq() && cpuset_lock()
@ 2010-03-09 18:06 Oleg Nesterov
2010-03-10 16:40 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2010-03-09 18:06 UTC (permalink / raw)
To: Ingo Molnar, Lai Jiangshan, Peter Zijlstra, Tejun Heo; +Cc: linux-kernel
Hello.
I tried to remove the deadlockable cpuset_lock() many times, but my
attempts were ignored by cpuset maintainers ;)
In particular, see http://marc.info/?l=linux-kernel&m=125261083613103
But now I have another question. Since 5da9a0fb673a0ea0a093862f95f6b89b3390c31e
cpuset_cpus_allowed_locked() is called without callback_mutex held by
try_to_wake_up().
And, without callback_mutex held, isn't it possible to race with, say,
update_cpumask() which changes cpuset->cpus_allowed? Yes, update_tasks_cpumask()
should fixup task->cpus_allowed later. But isn't it possible (at least
in theory) that try_to_wake_up() gets, say, all-zeroes in task->cpus_allowed
after select_fallback_rq()->cpuset_cpus_allowed_locked() if we race with
update_cpumask()->cpumask_copy() ?
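To illustrate the interleaving I have in mind (pseudo-code only, not the
real call chain):
	/* CPU0: update_cpumask(), holding only callback_mutex */
	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
		/* plain word-by-word copy: there is a window where the old
		   bits are already gone and the new ones are not yet visible */
	/* CPU1: try_to_wake_up() -> select_fallback_rq(), no callback_mutex */
	cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
		/* guarantee_online_cpus() reads cs->cpus_allowed in that
		   window and can, at least in theory, hand back an empty mask */
	dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
		/* >= nr_cpu_ids, and p->cpus_allowed is left all-zeroes */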
Oleg.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-09 18:06 Q: select_fallback_rq() && cpuset_lock() Oleg Nesterov
@ 2010-03-10 16:40 ` Peter Zijlstra
2010-03-10 17:30 ` Oleg Nesterov
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2010-03-10 16:40 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On Tue, 2010-03-09 at 19:06 +0100, Oleg Nesterov wrote:
> Hello.
>
> I tried to remove the deadlockable cpuset_lock() many times, but my
> attempts were ignored by cpuset maintainers ;)
Yeah, this appears to be an issue, there's no real maintainer atm, parts
are done by the sched folks, parts by the cgroup folks, and I guess
neither really knows everything..
> In particular, see http://marc.info/?l=linux-kernel&m=125261083613103
/me puts it on the to-review stack.
> But now I have another question. Since 5da9a0fb673a0ea0a093862f95f6b89b3390c31e
> cpuset_cpus_allowed_locked() is called without callback_mutex held by
> try_to_wake_up().
>
> And, without callback_mutex held, isn't it possible to race with, say,
> update_cpumask() which changes cpuset->cpus_allowed? Yes, update_tasks_cpumask()
> should fixup task->cpus_allowed later. But isn't it possible (at least
> in theory) that try_to_wake_up() gets, say, all-zeroes in task->cpus_allowed
> after select_fallback_rq()->cpuset_cpus_allowed_locked() if we race with
> update_cpumask()->cpumask_copy() ?
Hurmm,.. good point,.. yes I think that might be possible.
p->cpus_allowed is synchronized properly, but cs->cpus_allowed is not,
bugger.
I guess the quick fix is to really bail and always use cpu_online_mask
in select_fallback_rq().
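Roughly (untested sketch, not an actual patch):
	/* select_fallback_rq(), the "No more Mr. Nice Guy." case */
	if (dest_cpu >= nr_cpu_ids) {
		/* don't look at the cpuset at all, it can change under us */
		cpumask_copy(&p->cpus_allowed, cpu_online_mask);
		dest_cpu = cpumask_any(cpu_online_mask);
		...
	}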
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-10 16:40 ` Peter Zijlstra
@ 2010-03-10 17:30 ` Oleg Nesterov
2010-03-10 18:01 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2010-03-10 17:30 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1350 bytes --]
On 03/10, Peter Zijlstra wrote:
>
> On Tue, 2010-03-09 at 19:06 +0100, Oleg Nesterov wrote:
> > In particular, see http://marc.info/?l=linux-kernel&m=125261083613103
>
> /me puts it on the to-review stack.
Great, thanks. In fact, you already acked it before ;)
> > But now I have another question. Since 5da9a0fb673a0ea0a093862f95f6b89b3390c31e
> > cpuset_cpus_allowed_locked() is called without callback_mutex held by
> > try_to_wake_up().
> >
> > And, without callback_mutex held, isn't it possible to race with, say,
> > update_cpumask() which changes cpuset->cpus_allowed? Yes, update_tasks_cpumask()
> > should fixup task->cpus_allowed later. But isn't it possible (at least
> > in theory) that try_to_wake_up() gets, say, all-zeroes in task->cpus_allowed
> > after select_fallback_rq()->cpuset_cpus_allowed_locked() if we race with
> > update_cpumask()->cpumask_copy() ?
>
> Hurmm,.. good point,.. yes I think that might be possible.
> p->cpus_allowed is synchronized properly, but cs->cpus_allowed is not,
> bugger.
>
> I guess the quick fix is to really bail and always use cpu_online_mask
> in select_fallback_rq().
Yes, but this breaks cpusets.
Peter, please see the attached mbox with the last discussion and patches.
Of course, the changes in sched.c need the trivial fixups. I'll resend
if you think these changes make sense.
Oleg.
[-- Attachment #2: cpuset_lock.m --]
[-- Type: text/plain, Size: 46633 bytes --]
>From oleg@redhat.com Thu Sep 10 21:21:53 2009
Date: Thu, 10 Sep 2009 21:21:53 +0200
From: Oleg Nesterov <oleg@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Gautham Shenoy <ego@in.ibm.com>, Ingo Molnar <mingo@elte.hu>,
Jiri Slaby <jirislaby@gmail.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Li Zefan <lizf@cn.fujitsu.com>, Miao Xie <miaox@cn.fujitsu.com>,
Paul Menage <menage@google.com>,
Peter Zijlstra <peterz@infradead.org>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Subject: [PATCH 0/3] resend, cpuset/hotplug fixes
Message-ID: <20090910192153.GA584@redhat.com>
(my apologies to those who see this twice).
The 1st patch is preparation. 2-3 fix different problems, the 3rd one
depends on 2nd.
These patches change the code which I don't really understand, please
review.
As for the 3rd patch, it replaces
cpu_hotplug-dont-affect-current-tasks-affinity.patch
in -mm tree. Imho the new patch is more simple and clean, but of course
this is subjective and I am biased.
Oleg.
>From rjw@sisk.pl Thu Sep 10 22:18:12 2009
From: "Rafael J. Wysocki" <rjw@sisk.pl>
To: Oleg Nesterov <oleg@redhat.com>
Subject: Re: [PATCH 0/3] resend, cpuset/hotplug fixes
Date: Thu, 10 Sep 2009 22:18:40 +0200
Cc: Andrew Morton <akpm@linux-foundation.org>, Gautham Shenoy <ego@in.ibm.com>,
Ingo Molnar <mingo@elte.hu>, Jiri Slaby <jirislaby@gmail.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>, Li Zefan <lizf@cn.fujitsu.com>,
Miao Xie <miaox@cn.fujitsu.com>, Paul Menage <menage@google.com>,
Peter Zijlstra <peterz@infradead.org>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Message-Id: <200909102218.40143.rjw@sisk.pl>
Hi Oleg,
On Thursday 10 September 2009, Oleg Nesterov wrote:
> (my apologies to those who see this twice).
>
>
> The 1st patch is preparation. 2-3 fix different problems, the 3rd one
> depends on 2nd.
>
> These patches change the code which I don't really understand, please
> review.
>
>
> As for the 3rd patch, it replaces
>
> cpu_hotplug-dont-affect-current-tasks-affinity.patch
>
> in -mm tree. Imho the new patch is more simple and clean, but of course
> this is subjective and I am biased.
The patches look good to me FWIW.
Thanks a lot for taking care of this!
Rafael
>From peterz@infradead.org Thu Sep 10 22:53:49 2009
Subject: Re: [PATCH 0/3] resend, cpuset/hotplug fixes
From: Peter Zijlstra <peterz@infradead.org>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, Gautham Shenoy <ego@in.ibm.com>,
Ingo Molnar <mingo@elte.hu>, Jiri Slaby <jirislaby@gmail.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>, Li Zefan <lizf@cn.fujitsu.com>,
Miao Xie <miaox@cn.fujitsu.com>, Paul Menage <menage@google.com>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Date: Thu, 10 Sep 2009 22:53:16 +0200
Message-Id: <1252615996.7205.99.camel@laptop>
On Thu, 2009-09-10 at 21:21 +0200, Oleg Nesterov wrote:
> (my apologies to those who see this twice).
>
>
> The 1st patch is preparation. 2-3 fix different problems, the 3rd one
> depends on 2nd.
>
> These patches change the code which I don't really understand, please
> review.
>
>
> As for the 3rd patch, it replaces
>
> cpu_hotplug-dont-affect-current-tasks-affinity.patch
>
> in -mm tree. Imho the new patch is more simple and clean, but of course
> this is subjective and I am biased.
Look good to me.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Ingo, will you stick them in -tip?
>From laijs@cn.fujitsu.com Fri Sep 11 09:16:19 2009
Message-ID: <4AA9F902.4030306@cn.fujitsu.com>
Date: Fri, 11 Sep 2009 15:15:14 +0800
From: Lai Jiangshan <laijs@cn.fujitsu.com>
To: Peter Zijlstra <peterz@infradead.org>, Oleg Nesterov <oleg@redhat.com>
CC: Andrew Morton <akpm@linux-foundation.org>, Gautham Shenoy <ego@in.ibm.com>,
Ingo Molnar <mingo@elte.hu>, Jiri Slaby <jirislaby@gmail.com>,
Li Zefan <lizf@cn.fujitsu.com>, Miao Xie <miaox@cn.fujitsu.com>,
Paul Menage <menage@google.com>, "Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/3] resend, cpuset/hotplug fixes
Peter Zijlstra wrote:
> On Thu, 2009-09-10 at 21:21 +0200, Oleg Nesterov wrote:
>> (my apologies to those who see this twice).
>>
>>
>> The 1st patch is preparation. 2-3 fix different problems, the 3rd one
>> depends on 2nd.
>>
>> These patches change the code which I don't really understand, please
>> review.
>>
>>
>> As for the 3rd patch, it replaces
>>
>> cpu_hotplug-dont-affect-current-tasks-affinity.patch
>>
>> in -mm tree. Imho the new patch is more simple and clean, but of course
>> this is subjective and I am biased.
>
> Look good to me.
>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> Ingo, will you stick them in -tip?
>
>
Sorry. I was taken ill for weeks and lost track of these discussions.
I should especially say sorry to Oleg.
I have a different view. cpuset_cpus_allowed() is not called from atomic
context, non-preemptible context, or any other critical context,
so it should be allowed to use mutexes. That's what I think.
There is a bug where migration_call() acquires a mutex
before migration has finished on cpu offline, as Oleg described.
But this bug only happens on cpu offline, and cpu offline is rare and
a slowpath. I think we should fix cpu offline and ensure it acquires
the mutex safely.
Oleg's patch moves all the dirty things into the CPUSET subsystem so that
cpuset_cpus_allowed() does not require any mutex, and it increases CPUSET's
coupling. I don't feel that's good.
Anyway, Oleg's patch works well.
> > cpuset_cpus_allowed() is not only used for CPU offline.
> > >
> > > sched_setaffinity() also uses it.
>
> Sure. And it must take get_online_cpus() to avoid the races with hotplug.
Oleg hasn't answered that
"is it safe when pdflush() calls cpuset_cpus_allowed()?".
A patch may be needed to ensure pdflush() calls cpuset_cpus_allowed() safely.
One other minor thing:
Oleg's patch may introduce trouble in the PREEMPT_RT tree, because
a spinlock in RT is also a mutex. Likely I'm wrong.
- Lai
>From peterz@infradead.org Fri Sep 11 09:34:21 2009
Subject: Re: [PATCH 0/3] resend, cpuset/hotplug fixes
From: Peter Zijlstra <peterz@infradead.org>
To: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>, Andrew Morton <akpm@linux-foundation.org>,
Gautham Shenoy <ego@in.ibm.com>, Ingo Molnar <mingo@elte.hu>,
Jiri Slaby <jirislaby@gmail.com>, Li Zefan <lizf@cn.fujitsu.com>,
Miao Xie <miaox@cn.fujitsu.com>, Paul Menage <menage@google.com>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Date: Fri, 11 Sep 2009 09:33:30 +0200
Message-Id: <1252654410.7126.1.camel@laptop>
On Fri, 2009-09-11 at 15:15 +0800, Lai Jiangshan wrote:
> One other minor thing:
> Oleg's patch may introduce trouble in the PREEMPT_RT tree, because
> a spinlock in RT is also a mutex. Likely I'm wrong.
But they have priority-inheritance, hence you cannot create a deadlock
through preemption. If the kstopmachine thread blocks on a mutex, the
owner gets boosted to kstopmachine's prio and runs to completion, after
that kstopmachine continues.
>From laijs@cn.fujitsu.com Fri Sep 11 09:54:21 2009
Message-ID: <4AAA01F4.3050502@cn.fujitsu.com>
Date: Fri, 11 Sep 2009 15:53:24 +0800
From: Lai Jiangshan <laijs@cn.fujitsu.com>
To: Peter Zijlstra <peterz@infradead.org>
CC: Oleg Nesterov <oleg@redhat.com>, Andrew Morton <akpm@linux-foundation.org>,
Gautham Shenoy <ego@in.ibm.com>, Ingo Molnar <mingo@elte.hu>,
Jiri Slaby <jirislaby@gmail.com>, Li Zefan <lizf@cn.fujitsu.com>,
Miao Xie <miaox@cn.fujitsu.com>, Paul Menage <menage@google.com>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/3] resend, cpuset/hotplug fixes
Peter Zijlstra wrote:
> On Fri, 2009-09-11 at 15:15 +0800, Lai Jiangshan wrote:
>> One other minor thing:
>> Oleg's patch may introduce trouble in the PREEMPT_RT tree, because
>> a spinlock in RT is also a mutex. Likely I'm wrong.
>
> But they have priority-inheritance, hence you cannot create a deadlock
> through preemption. If the kstopmachine thread blocks on a mutex, the
> owner gets boosted to kstopmachine's prio and runs to completion, after
> that kstopmachine continues.
>
The deadlock happens because the owner is on the dead cpu, not because
the owner's prio is low; priority inheritance can't help here.
I think we need to use a raw spinlock.
>From peterz@infradead.org Fri Sep 11 09:58:14 2009
Subject: Re: [PATCH 0/3] resend, cpuset/hotplug fixes
From: Peter Zijlstra <peterz@infradead.org>
To: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>, Andrew Morton <akpm@linux-foundation.org>,
Gautham Shenoy <ego@in.ibm.com>, Ingo Molnar <mingo@elte.hu>,
Jiri Slaby <jirislaby@gmail.com>, Li Zefan <lizf@cn.fujitsu.com>,
Miao Xie <miaox@cn.fujitsu.com>, Paul Menage <menage@google.com>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Date: Fri, 11 Sep 2009 09:57:52 +0200
Message-Id: <1252655872.7126.5.camel@laptop>
On Fri, 2009-09-11 at 15:53 +0800, Lai Jiangshan wrote:
> Peter Zijlstra wrote:
> > On Fri, 2009-09-11 at 15:15 +0800, Lai Jiangshan wrote:
> >> One other minor thing:
> >> Oleg's patch may introduce trouble in the PREEMPT_RT tree, because
> >> a spinlock in RT is also a mutex. Likely I'm wrong.
> >
> > But they have priority-inheritance, hence you cannot create a deadlock
> > through preemption. If the kstopmachine thread blocks on a mutex, the
> > owner gets boosted to kstopmachine's prio and runs to completion, after
> > that kstopmachine continues.
> >
>
> The deadlock happens because the owner is on the dead cpu, not because
> the owner's prio is low; priority inheritance can't help here.
>
> I think we need to use a raw spinlock.
Not in mainline you don't.
>From peterz@infradead.org Fri Sep 11 09:38:03 2009
Subject: Re: [PATCH 0/3] resend, cpuset/hotplug fixes
From: Peter Zijlstra <peterz@infradead.org>
To: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>, Andrew Morton <akpm@linux-foundation.org>,
Gautham Shenoy <ego@in.ibm.com>, Ingo Molnar <mingo@elte.hu>,
Jiri Slaby <jirislaby@gmail.com>, Li Zefan <lizf@cn.fujitsu.com>,
Miao Xie <miaox@cn.fujitsu.com>, Paul Menage <menage@google.com>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Date: Fri, 11 Sep 2009 09:37:39 +0200
Message-Id: <1252654659.7126.2.camel@laptop>
On Fri, 2009-09-11 at 15:15 +0800, Lai Jiangshan wrote:
>
> I have a different view. cpuset_cpus_allowed() is not called from atomic
> context, non-preemptible context, or any other critical context,
> so it should be allowed to use mutexes. That's what I think.
It hardly does any work at all, so why bother with a mutex?
>From oleg@redhat.com Fri Sep 11 20:03:04 2009
Date: Fri, 11 Sep 2009 20:03:04 +0200
From: Oleg Nesterov <oleg@redhat.com>
To: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Gautham Shenoy <ego@in.ibm.com>, Ingo Molnar <mingo@elte.hu>,
Jiri Slaby <jirislaby@gmail.com>, Li Zefan <lizf@cn.fujitsu.com>,
Miao Xie <miaox@cn.fujitsu.com>, Paul Menage <menage@google.com>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/3] resend, cpuset/hotplug fixes
Message-ID: <20090911180304.GA3477@redhat.com>
On 09/11, Lai Jiangshan wrote:
>
> I have a different view. cpuset_cpus_allowed() is not called from atomic
> context, non-preemptible context, or any other critical context,
> so it should be allowed to use mutexes. That's what I think.
Well, it is called from non-preemptible context: move_task_off_dead_cpu().
That is why, before this patch, we had cpuset_cpus_allowed_locked(). And this
imho adds unneeded complications.
And I can't understand why the sched_setaffinity() path should take the
global mutex instead of the per-cpuset spinlock.
> There is a bug where migration_call() acquires a mutex
> before migration has finished on cpu offline, as Oleg described.
>
> But this bug only happens on cpu offline, and cpu offline is rare and
> a slowpath. I think we should fix cpu offline and ensure it acquires
> the mutex safely.
This is subjective, but I can't agree. I think we should fix cpusets
instead. We should try to avoid the dependencies between different
subsystems as much as possible.
> Oleg's patch moves all the dirty things into the CPUSET subsystem so that
> cpuset_cpus_allowed() does not require any mutex, and it increases CPUSET's
> coupling. I don't feel that's good.
Again, subjective... But I can't understand "increases CPUSET's coupling".
From my pov, this patch cleans up and simplifies the code. This was the
main motivation; the bugfix is just a "side effect". I don't understand
how this subtle cpuset_lock() can make things better. I can't understand
why we need the global lock to calculate cpus_allowed.
> > > cpuset_cpus_allowed() is not only used for CPU offline.
> > > >
> > > > sched_setaffinity() also uses it.
> >
> > Sure. And it must take get_online_cpus() to avoid the races with hotplug.
>
> Oleg hasn't answered that
> "is it safe when pdflush() calls cpuset_cpus_allowed()?".
Because I didn't see such a question ;) perhaps I missed it previously...
> A patch may be needed to ensure pdflush() calls cpuset_cpus_allowed() safely.
What is wrong with pdflush()->cpuset_cpus_allowed()? Why is this not safe?
This
	cpuset_cpus_allowed(current, cpus_allowed);
	set_cpus_allowed_ptr(current, cpus_allowed);
looks equally racy, with or without the patch. But this is a bit off-topic;
mm/pdflush.c has gone away.
> One other minor thing:
> Oleg's patch may introduce trouble in the PREEMPT_RT tree, because
> a spinlock in RT is also a mutex. Likely I'm wrong.
Yes, probably -rt needs a raw_lock (as you pointed out).
Oleg.
>From oleg@redhat.com Thu Sep 10 21:22:16 2009
Date: Thu, 10 Sep 2009 21:22:16 +0200
From: Oleg Nesterov <oleg@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Gautham Shenoy <ego@in.ibm.com>, Ingo Molnar <mingo@elte.hu>,
Jiri Slaby <jirislaby@gmail.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Li Zefan <lizf@cn.fujitsu.com>, Miao Xie <miaox@cn.fujitsu.com>,
Paul Menage <menage@google.com>,
Peter Zijlstra <peterz@infradead.org>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Subject: [PATCH 1/3] cpusets: introduce cpuset->cpumask_lock
Message-ID: <20090910192216.GA603@redhat.com>
Preparation for the next patch.
Introduce cpuset->cpumask_lock. From now on, ->cpus_allowed of the "active"
cpuset is always changed under this spinlock_t.
This is a separate patch to simplify review and fixing, in case I missed some
places where ->cpus_allowed is updated.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/cpuset.c | 9 +++++++++
1 file changed, 9 insertions(+)
--- CPUHP/kernel/cpuset.c~1_ADD_CPUMASK_LOCK 2009-09-10 19:35:16.000000000 +0200
+++ CPUHP/kernel/cpuset.c 2009-09-10 20:06:39.000000000 +0200
@@ -92,6 +92,7 @@ struct cpuset {
struct cgroup_subsys_state css;
unsigned long flags; /* "unsigned long" so bitops work */
+ spinlock_t cpumask_lock; /* protects ->cpus_allowed */
cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */
@@ -891,7 +892,9 @@ static int update_cpumask(struct cpuset
is_load_balanced = is_sched_load_balance(trialcs);
mutex_lock(&callback_mutex);
+ spin_lock(&cs->cpumask_lock);
cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
+ spin_unlock(&cs->cpumask_lock);
mutex_unlock(&callback_mutex);
/*
@@ -1781,6 +1784,8 @@ static struct cgroup_subsys_state *cpuse
cs = kmalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return ERR_PTR(-ENOMEM);
+
+ spin_lock_init(&cs->cpumask_lock);
if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL)) {
kfree(cs);
return ERR_PTR(-ENOMEM);
@@ -1981,8 +1986,10 @@ static void scan_for_empty_cpusets(struc
/* Remove offline cpus and mems from this cpuset. */
mutex_lock(&callback_mutex);
+ spin_lock(&cp->cpumask_lock);
cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
cpu_online_mask);
+ spin_unlock(&cp->cpumask_lock);
nodes_and(cp->mems_allowed, cp->mems_allowed,
node_states[N_HIGH_MEMORY]);
mutex_unlock(&callback_mutex);
@@ -2030,7 +2037,9 @@ static int cpuset_track_online_cpus(stru
cgroup_lock();
mutex_lock(&callback_mutex);
+ spin_lock(&top_cpuset.cpumask_lock);
cpumask_copy(top_cpuset.cpus_allowed, cpu_online_mask);
+ spin_unlock(&top_cpuset.cpumask_lock);
mutex_unlock(&callback_mutex);
scan_for_empty_cpusets(&top_cpuset);
ndoms = generate_sched_domains(&doms, &attr);
>From oleg@redhat.com Thu Sep 10 21:22:21 2009
Date: Thu, 10 Sep 2009 21:22:21 +0200
From: Oleg Nesterov <oleg@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Gautham Shenoy <ego@in.ibm.com>, Ingo Molnar <mingo@elte.hu>,
Jiri Slaby <jirislaby@gmail.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Li Zefan <lizf@cn.fujitsu.com>, Miao Xie <miaox@cn.fujitsu.com>,
Paul Menage <menage@google.com>,
Peter Zijlstra <peterz@infradead.org>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Subject: [PATCH 2/3] cpusets: rework guarantee_online_cpus() to fix
deadlock with cpu_down()
Message-ID: <20090910192221.GA610@redhat.com>
Suppose that task T, bound to CPU, takes callback_mutex. If cpu_down(CPU)
happens before T drops callback_mutex, we have a deadlock:
stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
cpuset_lock() and hangs forever, because CPU is already dead and thus
T can't be scheduled.
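In other words, the interleaving is (no code, just the timeline):
    task T (bound to CPU)               cpu_down(CPU)
    ---------------------               -------------
    mutex_lock(&callback_mutex);
                                        __stop_machine() preempts T,
                                        CPU leaves cpu_online_mask
                                        migration_call(CPU_DEAD):
                                          cpuset_lock();
                                          /* waits for T, but T can only
                                             run on the dead CPU -> hang */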
To simplify the testing, I added schedule_timeout_interruptible(3*HZ)
to cpuset_sprintf_cpulist() under mutex_lock(callback_mutex). Then,
	mkdir -p /dev/cpuset
	mount -t cgroup -ocpuset cpuset /dev/cpuset
	mkdir -p /dev/cpuset/XXX
	taskset -p 2 $$
	cat /dev/cpuset/XXX/cpuset.cpus
and, on another console,
	echo 0 >> /sys/devices/system/cpu/cpu1/online
Before this patch the cpuset/hotplug subsystems deadlock; after this patch
everything is OK.
Since the previous patch we can access cs->cpus_allowed safely either
under the global callback_mutex or cs->cpumask_lock.
This patch:
- kills cpuset_lock() and cpuset_cpus_allowed_locked()
- converts move_task_off_dead_cpu() to use cpuset_cpus_allowed()
Then we change guarantee_online_cpus(),
- take cs->cpumask_lock instead of callback_mutex
- do NOT scan cs->parent cpusets. If there are no online CPUs in
cs->cpus_allowed, we use cpu_online_mask. This is only possible
when we are called by cpu_down() hooks, in that case
cpuset_track_online_cpus(CPU_DEAD) will fix things later.
In particular, this all means sched_setaffinity() takes the per-cpuset
spinlock instead of the global callback_mutex, and hotplug can forget about
the complications added by cpusets.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/cpuset.h | 13 -----------
kernel/sched.c | 4 ---
kernel/cpuset.c | 56 ++++++++++---------------------------------------
3 files changed, 13 insertions(+), 60 deletions(-)
--- CPUHP/include/linux/cpuset.h~2_KILL_CPUSET_LOCK 2009-09-10 18:48:25.000000000 +0200
+++ CPUHP/include/linux/cpuset.h 2009-09-10 20:27:29.000000000 +0200
@@ -21,8 +21,6 @@ extern int number_of_cpusets; /* How man
extern int cpuset_init(void);
extern void cpuset_init_smp(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
-extern void cpuset_cpus_allowed_locked(struct task_struct *p,
- struct cpumask *mask);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -69,9 +67,6 @@ struct seq_file;
extern void cpuset_task_status_allowed(struct seq_file *m,
struct task_struct *task);
-extern void cpuset_lock(void);
-extern void cpuset_unlock(void);
-
extern int cpuset_mem_spread_node(void);
static inline int cpuset_do_page_mem_spread(void)
@@ -105,11 +100,6 @@ static inline void cpuset_cpus_allowed(s
{
cpumask_copy(mask, cpu_possible_mask);
}
-static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
- struct cpumask *mask)
-{
- cpumask_copy(mask, cpu_possible_mask);
-}
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
@@ -157,9 +147,6 @@ static inline void cpuset_task_status_al
{
}
-static inline void cpuset_lock(void) {}
-static inline void cpuset_unlock(void) {}
-
static inline int cpuset_mem_spread_node(void)
{
return 0;
--- CPUHP/kernel/sched.c~2_KILL_CPUSET_LOCK 2009-09-10 18:48:25.000000000 +0200
+++ CPUHP/kernel/sched.c 2009-09-10 20:27:29.000000000 +0200
@@ -7136,7 +7136,7 @@ again:
/* No more Mr. Nice Guy. */
if (dest_cpu >= nr_cpu_ids) {
- cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
+ cpuset_cpus_allowed(p, &p->cpus_allowed);
dest_cpu = cpumask_any_and(cpu_online_mask, &p->cpus_allowed);
/*
@@ -7550,7 +7550,6 @@ migration_call(struct notifier_block *nf
case CPU_DEAD:
case CPU_DEAD_FROZEN:
- cpuset_lock(); /* around calls to cpuset_cpus_allowed_lock() */
migrate_live_tasks(cpu);
rq = cpu_rq(cpu);
kthread_stop(rq->migration_thread);
@@ -7565,7 +7564,6 @@ migration_call(struct notifier_block *nf
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
spin_unlock_irq(&rq->lock);
- cpuset_unlock();
migrate_nr_uninterruptible(rq);
BUG_ON(rq->nr_running != 0);
calc_global_load_remove(rq);
--- CPUHP/kernel/cpuset.c~2_KILL_CPUSET_LOCK 2009-09-10 20:06:39.000000000 +0200
+++ CPUHP/kernel/cpuset.c 2009-09-10 20:28:11.000000000 +0200
@@ -268,16 +268,23 @@ static struct file_system_type cpuset_fs
* Call with callback_mutex held.
*/
-static void guarantee_online_cpus(const struct cpuset *cs,
- struct cpumask *pmask)
+static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
{
- while (cs && !cpumask_intersects(cs->cpus_allowed, cpu_online_mask))
- cs = cs->parent;
if (cs)
+ spin_lock(&cs->cpumask_lock);
+ /*
+ * cs->cpus_allowed must include online CPUs, or we are called
+ * from cpu_down() hooks. In that case use cpu_online_mask
+ * temporary until scan_for_empty_cpusets() moves us to ->parent
+ * cpuset.
+ */
+ if (cs && cpumask_intersects(cs->cpus_allowed, cpu_online_mask))
cpumask_and(pmask, cs->cpus_allowed, cpu_online_mask);
else
cpumask_copy(pmask, cpu_online_mask);
- BUG_ON(!cpumask_intersects(pmask, cpu_online_mask));
+
+ if (cs)
+ spin_unlock(&cs->cpumask_lock);
}
/*
@@ -2106,20 +2113,8 @@ void __init cpuset_init_smp(void)
* subset of cpu_online_map, even if this means going outside the
* tasks cpuset.
**/
-
void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
{
- mutex_lock(&callback_mutex);
- cpuset_cpus_allowed_locked(tsk, pmask);
- mutex_unlock(&callback_mutex);
-}
-
-/**
- * cpuset_cpus_allowed_locked - return cpus_allowed mask from a tasks cpuset.
- * Must be called with callback_mutex held.
- **/
-void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask)
-{
task_lock(tsk);
guarantee_online_cpus(task_cs(tsk), pmask);
task_unlock(tsk);
@@ -2311,33 +2306,6 @@ int __cpuset_node_allowed_hardwall(int n
}
/**
- * cpuset_lock - lock out any changes to cpuset structures
- *
- * The out of memory (oom) code needs to mutex_lock cpusets
- * from being changed while it scans the tasklist looking for a
- * task in an overlapping cpuset. Expose callback_mutex via this
- * cpuset_lock() routine, so the oom code can lock it, before
- * locking the task list. The tasklist_lock is a spinlock, so
- * must be taken inside callback_mutex.
- */
-
-void cpuset_lock(void)
-{
- mutex_lock(&callback_mutex);
-}
-
-/**
- * cpuset_unlock - release lock on cpuset changes
- *
- * Undo the lock taken in a previous cpuset_lock() call.
- */
-
-void cpuset_unlock(void)
-{
- mutex_unlock(&callback_mutex);
-}
-
-/**
* cpuset_mem_spread_node() - On which node to begin search for a page
*
* If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
>From oleg@redhat.com Thu Sep 10 21:22:25 2009
Date: Thu, 10 Sep 2009 21:22:25 +0200
From: Oleg Nesterov <oleg@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Gautham Shenoy <ego@in.ibm.com>, Ingo Molnar <mingo@elte.hu>,
Jiri Slaby <jirislaby@gmail.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Li Zefan <lizf@cn.fujitsu.com>, Miao Xie <miaox@cn.fujitsu.com>,
Paul Menage <menage@google.com>,
Peter Zijlstra <peterz@infradead.org>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Rusty Russell <rusty@rustcorp.com.au>, linux-kernel@vger.kernel.org
Subject: [PATCH 3/3] cpu hotplug: don't play with current->cpus_allowed
Message-ID: <20090910192225.GA618@redhat.com>
(replaces cpu_hotplug-dont-affect-current-tasks-affinity.patch)
_cpu_down() changes the current task's affinity and then restores it at
the end. The problems are well known: we can't restore old_allowed if it
was bound to the now-dead cpu, and we can race with userspace, which can
change cpu affinity during unplug.
_cpu_down() should not play with current->cpus_allowed at all. Instead,
take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
removes the dying cpu from cpu_online_mask.
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/sched.h | 1 +
kernel/sched.c | 2 +-
kernel/cpu.c | 19 ++++++-------------
3 files changed, 8 insertions(+), 14 deletions(-)
--- CPUHP/include/linux/sched.h~3_CPU_DOWN_AFFINITY 2009-08-01 04:28:56.000000000 +0200
+++ CPUHP/include/linux/sched.h 2009-09-10 20:54:00.000000000 +0200
@@ -1794,6 +1794,7 @@ extern void sched_clock_idle_sleep_event
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#ifdef CONFIG_HOTPLUG_CPU
+extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
extern void idle_task_exit(void);
#else
static inline void idle_task_exit(void) {}
--- CPUHP/kernel/sched.c~3_CPU_DOWN_AFFINITY 2009-09-10 20:27:29.000000000 +0200
+++ CPUHP/kernel/sched.c 2009-09-10 20:54:00.000000000 +0200
@@ -7118,7 +7118,7 @@ static int __migrate_task_irq(struct tas
/*
* Figure out where task on dead CPU should go, use force if necessary.
*/
-static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
+void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
{
int dest_cpu;
const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(dead_cpu));
--- CPUHP/kernel/cpu.c~3_CPU_DOWN_AFFINITY 2009-08-01 04:28:56.000000000 +0200
+++ CPUHP/kernel/cpu.c 2009-09-10 20:54:00.000000000 +0200
@@ -163,6 +163,7 @@ static inline void check_for_tasks(int c
}
struct take_cpu_down_param {
+ struct task_struct *caller;
unsigned long mod;
void *hcpu;
};
@@ -171,6 +172,7 @@ struct take_cpu_down_param {
static int __ref take_cpu_down(void *_param)
{
struct take_cpu_down_param *param = _param;
+ unsigned int cpu = (unsigned long)param->hcpu;
int err;
/* Ensure this CPU doesn't handle any more interrupts. */
@@ -181,6 +183,8 @@ static int __ref take_cpu_down(void *_pa
raw_notifier_call_chain(&cpu_chain, CPU_DYING | param->mod,
param->hcpu);
+ if (task_cpu(param->caller) == cpu)
+ move_task_off_dead_cpu(cpu, param->caller);
/* Force idle task to run as soon as we yield: it should
immediately notice cpu is offline and die quickly. */
sched_idle_next();
@@ -191,10 +195,10 @@ static int __ref take_cpu_down(void *_pa
static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
{
int err, nr_calls = 0;
- cpumask_var_t old_allowed;
void *hcpu = (void *)(long)cpu;
unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
struct take_cpu_down_param tcd_param = {
+ .caller = current,
.mod = mod,
.hcpu = hcpu,
};
@@ -205,9 +209,6 @@ static int __ref _cpu_down(unsigned int
if (!cpu_online(cpu))
return -EINVAL;
- if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL))
- return -ENOMEM;
-
cpu_hotplug_begin();
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
hcpu, -1, &nr_calls);
@@ -221,11 +222,6 @@ static int __ref _cpu_down(unsigned int
goto out_release;
}
- /* Ensure that we are not runnable on dying cpu */
- cpumask_copy(old_allowed, &current->cpus_allowed);
- set_cpus_allowed_ptr(current,
- cpumask_of(cpumask_any_but(cpu_online_mask, cpu)));
-
err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
if (err) {
/* CPU didn't die: tell everyone. Can't complain. */
@@ -233,7 +229,7 @@ static int __ref _cpu_down(unsigned int
hcpu) == NOTIFY_BAD)
BUG();
- goto out_allowed;
+ goto out_release;
}
BUG_ON(cpu_online(cpu));
@@ -251,8 +247,6 @@ static int __ref _cpu_down(unsigned int
check_for_tasks(cpu);
-out_allowed:
- set_cpus_allowed_ptr(current, old_allowed);
out_release:
cpu_hotplug_done();
if (!err) {
@@ -260,7 +254,6 @@ out_release:
hcpu) == NOTIFY_BAD)
BUG();
}
- free_cpumask_var(old_allowed);
return err;
}
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-10 17:30 ` Oleg Nesterov
@ 2010-03-10 18:01 ` Peter Zijlstra
2010-03-10 18:33 ` Oleg Nesterov
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2010-03-10 18:01 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On Wed, 2010-03-10 at 18:30 +0100, Oleg Nesterov wrote:
> On 03/10, Peter Zijlstra wrote:
> >
> > On Tue, 2010-03-09 at 19:06 +0100, Oleg Nesterov wrote:
> > > In particular, see http://marc.info/?l=linux-kernel&m=125261083613103
> >
> > /me puts it on the to-review stack.
>
> Great, thanks. In fact, you already acked it before ;)
>
> > > But now I have another question. Since 5da9a0fb673a0ea0a093862f95f6b89b3390c31e
> > > cpuset_cpus_allowed_locked() is called without callback_mutex held by
> > > try_to_wake_up().
> > >
> > > And, without callback_mutex held, isn't it possible to race with, say,
> > > update_cpumask() which changes cpuset->cpus_allowed? Yes, update_tasks_cpumask()
> > > should fixup task->cpus_allowed later. But isn't it possible (at least
> > > in theory) that try_to_wake_up() gets, say, all-zeroes in task->cpus_allowed
> > > after select_fallback_rq()->cpuset_cpus_allowed_locked() if we race with
> > > update_cpumask()->cpumask_copy() ?
> >
> > Hurmm,.. good point,.. yes I think that might be possible.
> > p->cpus_allowed is synchronized properly, but cs->cpus_allowed is not,
> > bugger.
> >
> > I guess the quick fix is to really bail and always use cpu_online_mask
> > in select_fallback_rq().
>
> Yes, but this breaks cpusets.
Arguably, so, but you can also argue that binding a task to a cpu and
then unplugging that cpu without first fixing up the affinity is a 'you
get to keep both pieces' situation.
> Peter, please see the attached mbox with the last discussion and patches.
> Of course, the changes in sched.c need the trivial fixups. I'll resend
> if you think these changes make sense.
Ah, right.. Damn I must be getting old, there wasn't even a spark of
recollection.
Right, so if you refresh these patches, I'll feed them to mingo and they
should eventually end up in -linus, how's that? :-)
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-10 18:01 ` Peter Zijlstra
@ 2010-03-10 18:33 ` Oleg Nesterov
2010-03-11 14:52 ` Oleg Nesterov
0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2010-03-10 18:33 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On 03/10, Peter Zijlstra wrote:
>
> Right, so if you refresh these patches, I'll feed them to mingo and they
> should eventually end up in -linus, how's that? :-)
Great. Will redo/resend tomorrow ;)
Oleg.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-10 18:33 ` Oleg Nesterov
@ 2010-03-11 14:52 ` Oleg Nesterov
2010-03-11 15:22 ` Oleg Nesterov
2010-03-11 15:35 ` Peter Zijlstra
0 siblings, 2 replies; 13+ messages in thread
From: Oleg Nesterov @ 2010-03-11 14:52 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On 03/10, Oleg Nesterov wrote:
>
> On 03/10, Peter Zijlstra wrote:
> >
> > Right, so if you refresh these patches, I'll feed them to mingo and they
> > should eventually end up in -linus, how's that? :-)
>
> Great. Will redo/resend tomorrow ;)
That was a great plan, but it doesn't work.
With the recent changes we have even more problems with
cpuset_cpus_allowed_locked(). Not only does it miss cpuset_lock() (which
doesn't work anyway and must die imho), it is very wrong to even call
this function from try_to_wake_up() - this can deadlock,
because task_cs() is protected by task_lock(), which is not irq-safe,
and cpuset_cpus_allowed_locked() takes this lock.
We need more changes in cpuset.c. Btw, select_fallback_rq() takes
rcu_read_lock() around cpuset_cpus_allowed_locked(). Why? I must have
missed something, but afaics this buys nothing.
From the previous email:
On 03/10, Peter Zijlstra wrote:
>
> On Wed, 2010-03-10 at 18:30 +0100, Oleg Nesterov wrote:
> > On 03/10, Peter Zijlstra wrote:
> > >
> > > I guess the quick fix is to really bail and always use cpu_online_mask
> > > in select_fallback_rq().
> >
> > Yes, but this breaks cpusets.
>
> Arguably, so, but you can also argue that binding a task to a cpu and
> then unplugging that cpu without first fixing up the affinity is a 'you
> get to keep both pieces' situation.
Well yes, but still, the kernel was supposed to handle this case
correctly: the task shouldn't escape its cpuset.
However, currently I don't see another option. I think we should fix the
possible deadlocks and kill cpuset_lock()/cpuset_cpus_allowed_locked() first,
then try to fix cpusets.
See the trivial (uncompiled) patch below. It just states the fact that the
cpuset_cpus_allowed() logic is broken.
How can we fix this later? Perhaps we can change
cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
fix up the tasks with the wrong ->cpus_allowed == cpu_possible_mask.
At first glance this should work in the try_to_wake_up(p) case: we can't
race with cpuset_change_cpumask()/etc because of the TASK_WAKING logic.
But I am not sure how we can fix move_task_off_dead_cpu(). I think
__migrate_task_irq() itself is fine, but if select_fallback_rq() is
called from move_task_off_dead_cpu(), nothing protects ->cpus_allowed.
We can race with cpusets, or even with a plain set_cpus_allowed().
Probably nothing really bad can happen: if the resulting cpumask
doesn't have online cpus due to the racing memcpys, we should retry
after __migrate_task_irq() fails. Or we can take cpu_rq(cpu)->lock
around cpumask_copy(p->cpus_allowed, cpu_possible_mask).
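I.e., roughly (untested; whether the rq lock can be taken safely in every
caller is exactly the open question, hence the XXX in the patch below):
	raw_spin_lock_irqsave(&cpu_rq(cpu)->lock, flags);
	cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
	raw_spin_unlock_irqrestore(&cpu_rq(cpu)->lock, flags);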
sched_exec() seems fine: the task is current and running, so the
"No more Mr. Nice Guy." case is not possible.
What do you think?
Btw, I think there is a small bug in set_cpus_allowed_ptr():
wake_up_process(rq->migration_thread) can hit NULL; we should do
wake_up_process(mt).
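I.e., something like this (untested, assuming the surrounding code in
set_cpus_allowed_ptr() already caches and pins the thread as below):
	struct task_struct *mt = rq->migration_thread;
	get_task_struct(mt);
	task_rq_unlock(rq, &flags);
	wake_up_process(mt);	/* not rq->migration_thread, which can be NULL by now */
	put_task_struct(mt);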
Oleg.
include/linux/cpuset.h | 10 ----------
kernel/cpuset.c | 29 +----------------------------
kernel/sched.c | 9 +++------
3 files changed, 4 insertions(+), 44 deletions(-)
--- x/include/linux/cpuset.h
+++ x/include/linux/cpuset.h
@@ -21,8 +21,6 @@ extern int number_of_cpusets; /* How man
extern int cpuset_init(void);
extern void cpuset_init_smp(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
-extern void cpuset_cpus_allowed_locked(struct task_struct *p,
- struct cpumask *mask);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -69,9 +67,6 @@ struct seq_file;
extern void cpuset_task_status_allowed(struct seq_file *m,
struct task_struct *task);
-extern void cpuset_lock(void);
-extern void cpuset_unlock(void);
-
extern int cpuset_mem_spread_node(void);
static inline int cpuset_do_page_mem_spread(void)
@@ -105,11 +100,6 @@ static inline void cpuset_cpus_allowed(s
{
cpumask_copy(mask, cpu_possible_mask);
}
-static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
- struct cpumask *mask)
-{
- cpumask_copy(mask, cpu_possible_mask);
-}
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
--- x/kernel/cpuset.c
+++ x/kernel/cpuset.c
@@ -2148,7 +2148,7 @@ void cpuset_cpus_allowed(struct task_str
* cpuset_cpus_allowed_locked - return cpus_allowed mask from a tasks cpuset.
* Must be called with callback_mutex held.
**/
-void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask)
+static void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask)
{
task_lock(tsk);
guarantee_online_cpus(task_cs(tsk), pmask);
@@ -2341,33 +2341,6 @@ int __cpuset_node_allowed_hardwall(int n
}
/**
- * cpuset_lock - lock out any changes to cpuset structures
- *
- * The out of memory (oom) code needs to mutex_lock cpusets
- * from being changed while it scans the tasklist looking for a
- * task in an overlapping cpuset. Expose callback_mutex via this
- * cpuset_lock() routine, so the oom code can lock it, before
- * locking the task list. The tasklist_lock is a spinlock, so
- * must be taken inside callback_mutex.
- */
-
-void cpuset_lock(void)
-{
- mutex_lock(&callback_mutex);
-}
-
-/**
- * cpuset_unlock - release lock on cpuset changes
- *
- * Undo the lock taken in a previous cpuset_lock() call.
- */
-
-void cpuset_unlock(void)
-{
- mutex_unlock(&callback_mutex);
-}
-
-/**
* cpuset_mem_spread_node() - On which node to begin search for a page
*
* If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
--- x/kernel/sched.c
+++ x/kernel/sched.c
@@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
/* No more Mr. Nice Guy. */
if (dest_cpu >= nr_cpu_ids) {
- rcu_read_lock();
- cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
- rcu_read_unlock();
- dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
+ // XXX: take cpu_rq(cpu)->lock ???
+ cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
+ dest_cpu = cpumask_any(cpu_active_mask);
/*
* Don't tell them about moving exiting tasks or
@@ -5929,7 +5928,6 @@ migration_call(struct notifier_block *nf
case CPU_DEAD:
case CPU_DEAD_FROZEN:
- cpuset_lock(); /* around calls to cpuset_cpus_allowed_lock() */
migrate_live_tasks(cpu);
rq = cpu_rq(cpu);
kthread_stop(rq->migration_thread);
@@ -5943,7 +5941,6 @@ migration_call(struct notifier_block *nf
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
raw_spin_unlock_irq(&rq->lock);
- cpuset_unlock();
migrate_nr_uninterruptible(rq);
BUG_ON(rq->nr_running != 0);
calc_global_load_remove(rq);
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-11 14:52 ` Oleg Nesterov
@ 2010-03-11 15:22 ` Oleg Nesterov
2010-03-11 15:41 ` Peter Zijlstra
2010-03-11 15:35 ` Peter Zijlstra
1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2010-03-11 15:22 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On 03/11, Oleg Nesterov wrote:
>
> How can we fix this later? Perhaps we can change
> cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.
Wait. We need to fix the CPU_DEAD case anyway?
Hmm. 6ad4c18884e864cf4c77f9074d3d1816063f99cd
"sched: Fix balance vs hotplug race" did s/CPU_DEAD/CPU_DOWN_PREPARE/
in cpuset_track_online_cpus(). This doesn't look exactly right to me,
we shouldn't do remove_tasks_in_empty_cpuset() at CPU_DOWN_PREPARE
stage, it can fail.
Otoh, this means that move_task_off_dead_cpu() can never see a
task without active cpus in ->cpus_allowed; it is called later, at
CPU_DEAD. So, cpuset_lock() is not needed at all.
Oleg.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-11 14:52 ` Oleg Nesterov
2010-03-11 15:22 ` Oleg Nesterov
@ 2010-03-11 15:35 ` Peter Zijlstra
2010-03-11 16:19 ` Oleg Nesterov
1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2010-03-11 15:35 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On Thu, 2010-03-11 at 15:52 +0100, Oleg Nesterov wrote:
> On 03/10, Oleg Nesterov wrote:
> >
> > On 03/10, Peter Zijlstra wrote:
> > >
> > > Right, so if you refresh these patches, I'll feed them to mingo and they
> > > should eventually end up in -linus, how's that? :-)
> >
> > Great. Will redo/resend tomorrow ;)
>
> That was a great plan, but it doesn't work.
>
> With the recent changes we have even more problems with
> cpuset_cpus_allowed_locked(). Not only does it miss cpuset_lock() (which
> doesn't work anyway and must die imho), it is also very wrong to even call
> this function from try_to_wake_up() - this can deadlock.
>
> Because task_cs() is protected by task_lock() which is not irq-safe,
> and cpuset_cpus_allowed_locked() takes this lock.
You're right, and lockdep doesn't normally warn about that because
nobody really hits this path :/
> We need more changes in cpuset.c. Btw, select_fallback_rq() takes
> rcu_read_lock around cpuset_cpus_allowed_locked(). Why? I must have
> missed something, but afaics this buys nothing.
for task_cs() iirc.
> From the previous email:
>
> On 03/10, Peter Zijlstra wrote:
> >
> > On Wed, 2010-03-10 at 18:30 +0100, Oleg Nesterov wrote:
> > > On 03/10, Peter Zijlstra wrote:
> > > >
> > > > I guess the quick fix is to really bail and always use cpu_online_mask
> > > > in select_fallback_rq().
> > >
> > > Yes, but this breaks cpusets.
> >
> > Arguably, so, but you can also argue that binding a task to a cpu and
> > then unplugging that cpu without first fixing up the affinity is a 'you
> > get to keep both pieces' situation.
>
> Well yes, but it was still supposed that the kernel handles this case
> correctly; the task shouldn't escape its cpuset.
>
> However, currently I don't see another option. I think we should fix the
> possible deadlocks and kill cpuset_lock/cpuset_cpus_allowed_locked first,
> then try to fix cpusets.
>
> See the trivial (uncompiled) patch below. It just states the fact that
> the cpuset_cpus_allowed() logic is broken.
>
> How can we fix this later? Perhaps we can change
> cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.
Problem is, we can't really fix up tasks, wakeup must be able to find a
suitable cpu.
> At first glance this should work in try_to_wake_up(p) case, we can't
> race with cpuset_change_cpumask()/etc because of TASK_WAKING logic.
Well, cs->cpus_possible can still go funny on us.
> But I am not sure how can we fix move_task_off_dead_cpu(). I think
> __migrate_task_irq() itself is fine, but if select_fallback_rq() is
> called from move_task_off_dead_cpu() nothing protects ->cpus_allowed.
It has that retry loop in case the migration fails, right?
> We can race with cpusets, or even with the plain set_cpus_allowed().
> Probably nothing really bad can happen, if the resulting cpumask
> doesn't have online cpus due to the racing memcpys, we should retry
> after __migrate_task_irq() fails. Or we can take cpu_rq(cpu)-lock
> around cpumask_copy(p->cpus_allowed, cpu_possible_mask).
It does the retry thing.
> sched_exec() seems fine, the task is current and running,
> "No more Mr. Nice Guy." case is not possible.
>
> What do you think?
>
> Btw, I think there is a small bug in set_cpus_allowed_ptr(),
> wake_up_process(rq->migration_thread) can hit NULL, we should do
> wake_up_process(mt).
Agreed.
> @@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
>
> /* No more Mr. Nice Guy. */
> if (dest_cpu >= nr_cpu_ids) {
> - rcu_read_lock();
> - cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> - rcu_read_unlock();
> - dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> + // XXX: take cpu_rq(cpu)->lock ???
> + cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> + dest_cpu = cpumask_any(cpu_active_mask);
Right, this seems safe.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-11 15:22 ` Oleg Nesterov
@ 2010-03-11 15:41 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2010-03-11 15:41 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On Thu, 2010-03-11 at 16:22 +0100, Oleg Nesterov wrote:
> On 03/11, Oleg Nesterov wrote:
> >
> > How can we fix this later? Perhaps we can change
> > cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> > fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.
>
> Wait. We need to fix the CPU_DEAD case anyway?
>
> Hmm. 6ad4c18884e864cf4c77f9074d3d1816063f99cd
> "sched: Fix balance vs hotplug race" did s/CPU_DEAD/CPU_DOWN_PREPARE/
> in cpuset_track_online_cpus(). This doesn't look exactly right to me,
> we shouldn't do remove_tasks_in_empty_cpuset() at CPU_DOWN_PREPARE
> stage, it can fail.
Sure, tough luck for those few tasks.
> Otoh, this means that move_task_off_dead_cpu() can never see a
> task without active cpus in ->cpus_allowed; it is called later, at
> CPU_DEAD. So, cpuset_lock() is not needed at all.
Right,.. so the whole problem is that cpumask ops are terribly expensive
since we got this CONFIG_CPUMASK_OFFSTACK muck, so we try to reduce
these ops in the regular scheduling paths. In the patch you referenced
above, the tradeoff was between fixing the sched_domains up too often vs
adding a cpumask_and in a hot path; guess who won ;-)
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-11 15:35 ` Peter Zijlstra
@ 2010-03-11 16:19 ` Oleg Nesterov
2010-03-11 16:29 ` Peter Zijlstra
2010-03-13 19:28 ` Oleg Nesterov
0 siblings, 2 replies; 13+ messages in thread
From: Oleg Nesterov @ 2010-03-11 16:19 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On 03/11, Peter Zijlstra wrote:
>
> On Thu, 2010-03-11 at 15:52 +0100, Oleg Nesterov wrote:
>
> > Btw, select_fallback_rq() takes
> > rcu_read_lock around cpuset_cpus_allowed_locked(). Why? I must have
> > missed something, but afaics this buys nothing.
>
> for task_cs() iirc.
it should be stable under task_lock()... Never mind.
> > How can we fix this later? Perhaps we can change
> > cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> > fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.
>
> Problem is, we can't really fix up tasks, wakeup must be able to find a
> suitable cpu.
Yes sure. I meant, wakeup()->select_fallback_rq() sets cpus_allowed =
cpu_possible_map as we discussed. Then cpuset_track_online_cpus(CPU_DEAD)
fixes the affected tasks.
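Something along these lines, very rough (cpuset_for_each_task() is a made-up
iterator; the real thing would have to go through the cgroup/cpuset task
iterators):

	/*
	 * At CPU_DEAD: pull back every task whose affinity was widened to
	 * cpu_possible_mask by the wakeup fallback.
	 */
	static void cpuset_fixup_fallback_tasks(struct cpuset *cs)
	{
		struct task_struct *p;

		cpuset_for_each_task(cs, p) {		/* made-up iterator */
			if (cpumask_equal(&p->cpus_allowed, cpu_possible_mask))
				set_cpus_allowed_ptr(p, cs->cpus_allowed);
		}
	}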
> > At first glance this should work in try_to_wake_up(p) case, we can't
> > race with cpuset_change_cpumask()/etc because of TASK_WAKING logic.
>
> Well, cs->cpus_possible can still go funny on us.
What do you mean? Afaics, cpusets always uses set_cpus_allowed() to
change task->cpus_allowed.
> > But I am not sure how can we fix move_task_off_dead_cpu(). I think
> > __migrate_task_irq() itself is fine, but if select_fallback_rq() is
> > called from move_task_off_dead_cpu() nothing protects ->cpus_allowed.
>
> It has that retry loop in case the migration fails, right?
>
> > We can race with cpusets, or even with the plain set_cpus_allowed().
> > Probably nothing really bad can happen, if the resulting cpumask
> > doesn't have online cpus due to the racing memcpys, we should retry
> > after __migrate_task_irq() fails. Or we can take cpu_rq(cpu)-lock
> > around cpumask_copy(p->cpus_allowed, cpu_possible_mask).
>
> It does the retry thing.
Yes, I mentioned retry logic too. But it can't always help, even without
cpusets.
Suppose a task T is bound to the dead CPU, and move_task_off_dead_cpu()
races with set_cpus_allowed(new_mask). I think it is fine if T gets
either new_mask or cpu_possible_map in ->cpus_allowed. But, it can get
a "random" mix if 2 memcpy() run in parallel. And it is possible that
__migrate_task_irq() will not fail if dest_cpu falls into resulting mask.
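If it helps, a silly userspace analogue of that race (purely illustrative,
nothing kernel-specific about it):

	#include <pthread.h>
	#include <stdio.h>
	#include <string.h>

	/* stand-in for a two-word cpumask, NR_CPUS > 64 */
	static volatile unsigned long mask[2];

	static void *writer_a(void *arg)	/* "set_cpus_allowed(new_mask)" */
	{
		const unsigned long new_mask[2] = { 0x0f, 0x00 };
		for (;;)
			memcpy((void *)mask, new_mask, sizeof(new_mask));
	}

	static void *writer_b(void *arg)	/* "cpumask_copy(..., cpu_possible_mask)" */
	{
		const unsigned long possible[2] = { ~0UL, ~0UL };
		for (;;)
			memcpy((void *)mask, possible, sizeof(possible));
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, writer_a, NULL);
		pthread_create(&b, NULL, writer_b, NULL);

		for (;;) {
			unsigned long w0 = mask[0], w1 = mask[1];

			/* a mix of the two masks belongs to neither writer */
			if ((w0 == 0x0f) != (w1 == 0x00)) {
				printf("torn copy: %lx %lx\n", w1, w0);
				return 0;
			}
		}
	}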
> > @@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
> >
> > /* No more Mr. Nice Guy. */
> > if (dest_cpu >= nr_cpu_ids) {
> > - rcu_read_lock();
> > - cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> > - rcu_read_unlock();
> > - dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> > + // XXX: take cpu_rq(cpu)->lock ???
> > + cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> > + dest_cpu = cpumask_any(cpu_active_mask);
>
>
> Right, this seems safe.
OK, I'll try to read this code a bit more and then send this patch.
Oleg.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-11 16:19 ` Oleg Nesterov
@ 2010-03-11 16:29 ` Peter Zijlstra
2010-03-13 19:28 ` Oleg Nesterov
1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2010-03-11 16:29 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On Thu, 2010-03-11 at 17:19 +0100, Oleg Nesterov wrote:
> > > How can we fix this later? Perhaps we can change
> > > cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> > > fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.
> >
> > Problem is, we can't really fix up tasks, wakeup must be able to find a
> > suitable cpu.
>
> Yes sure. I meant, wakeup()->select_fallback_rq() sets cpus_allowed =
> cpu_possible_map as we discussed. Then cpuset_track_online_cpus(CPU_DEAD)
> fixes the affected tasks.
Ah, have that re-validate the p->cpus_allowed for all cpuset tasks, ok
that might work.
> > > At first glance this should work in try_to_wake_up(p) case, we can't
> > > race with cpuset_change_cpumask()/etc because of TASK_WAKING logic.
> >
> > Well, cs->cpus_possible can still go funny on us.
>
> What do you mean? Afaics, cpusets always uses set_cpus_allowed() to
> change task->cpus_allowed.
Confusion^2 ;-), I failed to grasp your fixup idea and got confused,
which confused you.
> > > But I am not sure how can we fix move_task_off_dead_cpu(). I think
> > > __migrate_task_irq() itself is fine, but if select_fallback_rq() is
> > > called from move_task_off_dead_cpu() nothing protects ->cpus_allowed.
> >
> > It has that retry loop in case the migration fails, right?
> >
> > > We can race with cpusets, or even with the plain set_cpus_allowed().
> > > Probably nothing really bad can happen, if the resulting cpumask
> > > doesn't have online cpus due to the racing memcpys, we should retry
> > > after __migrate_task_irq() fails. Or we can take cpu_rq(cpu)-lock
> > > around cpumask_copy(p->cpus_allowed, cpu_possible_mask).
> >
> > It does the retry thing.
>
> Yes, I mentioned retry logic too. But it can't always help, even without
> cpusets.
>
> Suppose a task T is bound to the dead CPU, and move_task_off_dead_cpu()
> races with set_cpus_allowed(new_mask). I think it is fine if T gets
> either new_mask or cpu_possible_map in ->cpus_allowed. But, it can get
> a "random" mix if 2 memcpy() run in parallel. And it is possible that
> __migrate_task_irq() will not fail if dest_cpu falls into resulting mask.
Ah indeed. One would almost want to construct a cpumask_assign() that uses
RCU atomic pointer assignment for all this stupid cpumask juggling :/
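Something like this, strictly a napkin sketch (rcu_cpumask and cpumask_assign()
are made up, and GFP_KERNEL/synchronize_rcu() in the update path are obviously
too heavy or wrong for the hot paths as-is):

	struct rcu_cpumask {
		struct cpumask *mask;		/* readers: rcu_dereference() */
	};

	/* update side, writers serialized by some outer lock */
	static int cpumask_assign(struct rcu_cpumask *rc, const struct cpumask *new)
	{
		struct cpumask *copy, *old;

		copy = kmalloc(cpumask_size(), GFP_KERNEL);
		if (!copy)
			return -ENOMEM;
		cpumask_copy(copy, new);

		old = rc->mask;
		rcu_assign_pointer(rc->mask, copy);
		synchronize_rcu();		/* would want call_rcu() instead */
		kfree(old);
		return 0;
	}

	/* read side never sees a half-updated mask, only old or new */
	static int cpumask_any_and_rcu(struct rcu_cpumask *rc, const struct cpumask *andmask)
	{
		int cpu;

		rcu_read_lock();
		cpu = cpumask_any_and(rcu_dereference(rc->mask), andmask);
		rcu_read_unlock();
		return cpu;
	}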
> > > @@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
> > >
> > > /* No more Mr. Nice Guy. */
> > > if (dest_cpu >= nr_cpu_ids) {
> > > - rcu_read_lock();
> > > - cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> > > - rcu_read_unlock();
> > > - dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> > > + // XXX: take cpu_rq(cpu)->lock ???
> > > + cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> > > + dest_cpu = cpumask_any(cpu_active_mask);
> >
> >
> > Right, this seems safe.
>
> OK, I'll try to read this code a bit more and then send this patch.
Thanks!
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-11 16:19 ` Oleg Nesterov
2010-03-11 16:29 ` Peter Zijlstra
@ 2010-03-13 19:28 ` Oleg Nesterov
2010-03-14 2:11 ` Peter Zijlstra
1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2010-03-13 19:28 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1053 bytes --]
On 03/11, Oleg Nesterov wrote:
>
> > > @@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
> > >
> > > /* No more Mr. Nice Guy. */
> > > if (dest_cpu >= nr_cpu_ids) {
> > > - rcu_read_lock();
> > > - cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> > > - rcu_read_unlock();
> > > - dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> > > + // XXX: take cpu_rq(cpu)->lock ???
> > > + cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> > > + dest_cpu = cpumask_any(cpu_active_mask);
> >
> >
> > Right, this seems safe.
>
> OK, I'll try to read this code a bit more and then send this patch.
No, it is not safe :/
Peter, I seem to see a simple fix for the discussed cpuset problems,
but it turns out sched.c has more problems even without cpusets.
I'll try to send the whole series on Monday, but perhaps you can look
at the attached compile-tested patches (especially 2 and 3) to quickly
correct me if the 3rd one is wrong.
The subsequent fix in cpuset.c depends on the locking rules enforced
by 2-3.
Oleg.
[-- Attachment #2: 1_KILL_CPUSET_LOCK.patch --]
[-- Type: text/plain, Size: 5625 bytes --]
[PATCH 1/5] kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code
This patch just states the fact that the cpusets/cpuhotplug interaction is
broken and removes the deadlockable code which only pretends to work.
- cpuset_lock() doesn't really work. It is needed for
cpuset_cpus_allowed_locked() but we can't take this lock in
try_to_wake_up()->select_fallback_rq() path.
- cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
cpuset_lock() and hangs forever because CPU is already dead and thus
T can't be scheduled.
- cpuset_cpus_allowed_locked() is deadlockable as well. It takes task_lock()
which is not irq-safe, but try_to_wake_up() can be called from irq.
Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
we currently do without CONFIG_CPUSETS.
Also, with or without this patch, with or without CONFIG_CPUSETS,
move_task_off_dead_cpu() and sched_exec() are wrong: they play with
->cpus_allowed locklessly and can race with set_cpus_allowed() paths.
The subsequent patches try to fix these problems.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/cpuset.h | 13 -------------
kernel/cpuset.c | 27 +--------------------------
kernel/sched.c | 10 +++-------
3 files changed, 4 insertions(+), 46 deletions(-)
--- 34-rc1/include/linux/cpuset.h~1_KILL_CPUSET_LOCK 2009-06-17 14:11:26.000000000 +0200
+++ 34-rc1/include/linux/cpuset.h 2010-03-11 18:17:58.000000000 +0100
@@ -21,8 +21,6 @@ extern int number_of_cpusets; /* How man
extern int cpuset_init(void);
extern void cpuset_init_smp(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
-extern void cpuset_cpus_allowed_locked(struct task_struct *p,
- struct cpumask *mask);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -69,9 +67,6 @@ struct seq_file;
extern void cpuset_task_status_allowed(struct seq_file *m,
struct task_struct *task);
-extern void cpuset_lock(void);
-extern void cpuset_unlock(void);
-
extern int cpuset_mem_spread_node(void);
static inline int cpuset_do_page_mem_spread(void)
@@ -105,11 +100,6 @@ static inline void cpuset_cpus_allowed(s
{
cpumask_copy(mask, cpu_possible_mask);
}
-static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
- struct cpumask *mask)
-{
- cpumask_copy(mask, cpu_possible_mask);
-}
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
@@ -157,9 +147,6 @@ static inline void cpuset_task_status_al
{
}
-static inline void cpuset_lock(void) {}
-static inline void cpuset_unlock(void) {}
-
static inline int cpuset_mem_spread_node(void)
{
return 0;
--- 34-rc1/kernel/cpuset.c~1_KILL_CPUSET_LOCK 2009-12-18 19:05:38.000000000 +0100
+++ 34-rc1/kernel/cpuset.c 2010-03-11 18:20:47.000000000 +0100
@@ -2140,19 +2140,10 @@ void __init cpuset_init_smp(void)
void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
{
mutex_lock(&callback_mutex);
- cpuset_cpus_allowed_locked(tsk, pmask);
- mutex_unlock(&callback_mutex);
-}
-
-/**
- * cpuset_cpus_allowed_locked - return cpus_allowed mask from a tasks cpuset.
- * Must be called with callback_mutex held.
- **/
-void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask)
-{
task_lock(tsk);
guarantee_online_cpus(task_cs(tsk), pmask);
task_unlock(tsk);
+ mutex_unlock(&callback_mutex);
}
void cpuset_init_current_mems_allowed(void)
@@ -2341,22 +2332,6 @@ int __cpuset_node_allowed_hardwall(int n
}
/**
- * cpuset_lock - lock out any changes to cpuset structures
- *
- * The out of memory (oom) code needs to mutex_lock cpusets
- * from being changed while it scans the tasklist looking for a
- * task in an overlapping cpuset. Expose callback_mutex via this
- * cpuset_lock() routine, so the oom code can lock it, before
- * locking the task list. The tasklist_lock is a spinlock, so
- * must be taken inside callback_mutex.
- */
-
-void cpuset_lock(void)
-{
- mutex_lock(&callback_mutex);
-}
-
-/**
* cpuset_unlock - release lock on cpuset changes
*
* Undo the lock taken in a previous cpuset_lock() call.
--- 34-rc1/kernel/sched.c~1_KILL_CPUSET_LOCK 2010-03-11 13:11:50.000000000 +0100
+++ 34-rc1/kernel/sched.c 2010-03-13 19:48:15.000000000 +0100
@@ -2288,11 +2288,9 @@ static int select_fallback_rq(int cpu, s
return dest_cpu;
/* No more Mr. Nice Guy. */
- if (dest_cpu >= nr_cpu_ids) {
- rcu_read_lock();
- cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
- rcu_read_unlock();
- dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
+ if (unlikely(dest_cpu >= nr_cpu_ids)) {
+ cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
+ dest_cpu = cpumask_any(cpu_active_mask);
/*
* Don't tell them about moving exiting tasks or
@@ -5929,7 +5927,6 @@ migration_call(struct notifier_block *nf
case CPU_DEAD:
case CPU_DEAD_FROZEN:
- cpuset_lock(); /* around calls to cpuset_cpus_allowed_lock() */
migrate_live_tasks(cpu);
rq = cpu_rq(cpu);
kthread_stop(rq->migration_thread);
@@ -5943,7 +5940,6 @@ migration_call(struct notifier_block *nf
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
raw_spin_unlock_irq(&rq->lock);
- cpuset_unlock();
migrate_nr_uninterruptible(rq);
BUG_ON(rq->nr_running != 0);
calc_global_load_remove(rq);
[-- Attachment #3: 2_MTODC_TAKE_RQ_LOCK.patch --]
[-- Type: text/plain, Size: 2121 bytes --]
[PATCH 2/5] change move_task_off_dead_cpu() to take rq->lock around select_fallback_rq()
move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
locklessly. We can race with set_cpus_allowed() running in parallel.
Change it to take rq->lock around select_fallback_rq(). Note that it is not
trivial to move this spin_lock() into select_fallback_rq(), we must recheck
the task was not migrated after we take the lock and other callers do not
need this lock.
We can't race with other callers of select_fallback_rq() which rely on
TASK_WAKING, try_to_wake_up() and wake_up_new_task(), they must protect
themselves against cpu hotplug anyway.
Also, change it to not assume irqs are disabled and absorb __migrate_task_irq().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/sched.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
--- 34-rc1/kernel/sched.c~2_MTODC_TAKE_RQ_LOCK 2010-03-13 19:48:15.000000000 +0100
+++ 34-rc1/kernel/sched.c 2010-03-13 19:50:57.000000000 +0100
@@ -5509,29 +5509,28 @@ static int migration_thread(void *data)
}
#ifdef CONFIG_HOTPLUG_CPU
-
-static int __migrate_task_irq(struct task_struct *p, int src_cpu, int dest_cpu)
-{
- int ret;
-
- local_irq_disable();
- ret = __migrate_task(p, src_cpu, dest_cpu);
- local_irq_enable();
- return ret;
-}
-
/*
* Figure out where task on dead CPU should go, use force if necessary.
*/
static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
{
- int dest_cpu;
-
+ struct rq *rq = cpu_rq(dead_cpu);
+ int migrated, dest_cpu;
+ unsigned long flags;
again:
- dest_cpu = select_fallback_rq(dead_cpu, p);
+ local_irq_save(flags);
+ raw_spin_lock(&rq->lock);
+ migrated = (task_cpu(p) != dead_cpu);
+ if (!migrated)
+ dest_cpu = select_fallback_rq(dead_cpu, p);
+ raw_spin_unlock(&rq->lock);
+
+ if (!migrated)
+ migrated = __migrate_task(p, dead_cpu, dest_cpu);
+ local_irq_restore(flags);
/* It can have affinity changed while we were choosing. */
- if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
+ if (unlikely(!migrated))
goto again;
}
[-- Attachment #4: 3_SCHED_EXEC_DONT_FALLBACK.patch --]
[-- Type: text/plain, Size: 1918 bytes --]
[PATCH 3/5] sched_exec() should use select_fallback_rq() logic
sched_exec()->select_task_rq() reads/updates ->cpus_allowed locklessly.
This can race with other CPUs updating our ->cpus_allowed, and this
looks meaningless to me.
The task is current and running, so it must have online cpus in ->cpus_allowed
and the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu,
this likely means we raced with set_cpus_allowed() which was called
for a reason; why should sched_exec() retry and call ->select_task_rq()
again?
Change the code to call sched_class->select_task_rq() directly and do
nothing if the returned cpu is wrong after re-checking under rq->lock.
From now on, select_fallback_rq() is always called with either rq->lock or
TASK_WAKING held.
TODO: update the comments.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
--- 34-rc1/kernel/sched.c~3_SCHED_EXEC_DONT_FALLBACK 2010-03-13 19:50:57.000000000 +0100
+++ 34-rc1/kernel/sched.c 2010-03-13 19:51:11.000000000 +0100
@@ -3123,9 +3123,8 @@ void sched_exec(void)
unsigned long flags;
struct rq *rq;
-again:
this_cpu = get_cpu();
- dest_cpu = select_task_rq(p, SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
if (dest_cpu == this_cpu) {
put_cpu();
return;
@@ -3133,18 +3132,12 @@ again:
rq = task_rq_lock(p, &flags);
put_cpu();
-
/*
* select_task_rq() can race against ->cpus_allowed
*/
- if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)
- || unlikely(!cpu_active(dest_cpu))) {
- task_rq_unlock(rq, &flags);
- goto again;
- }
-
- /* force the process onto the specified CPU */
- if (migrate_task(p, dest_cpu, &req)) {
+ if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
+ likely(cpu_active(dest_cpu)) &&
+ migrate_task(p, dest_cpu, &req)) {
/* Need to wait for migration thread (might exit: take ref). */
struct task_struct *mt = rq->migration_thread;
[-- Attachment #5: 4_CPU_DOWN_AFFINITY.patch --]
[-- Type: text/plain, Size: 4178 bytes --]
[PATCH 4/5] _cpu_down: don't play with current->cpus_allowed
_cpu_down() changes the current task's affinity and then recovers it at
the end. The problems are well known: we can't restore old_allowed if it
was bound to the now-dead cpu, and we can race with userspace, which
can change cpu affinity during unplug.
_cpu_down() should not play with current->cpus_allowed at all. Instead,
take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
removes the dying cpu from cpu_online_mask.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/sched.h | 1 +
kernel/sched.c | 2 +-
kernel/cpu.c | 18 ++++++------------
3 files changed, 8 insertions(+), 13 deletions(-)
--- 34-rc1/include/linux/sched.h~4_CPU_DOWN_AFFINITY 2010-03-13 17:05:55.000000000 +0100
+++ 34-rc1/include/linux/sched.h 2010-03-13 19:53:22.000000000 +0100
@@ -1843,6 +1843,7 @@ extern void sched_clock_idle_sleep_event
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#ifdef CONFIG_HOTPLUG_CPU
+extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
extern void idle_task_exit(void);
#else
static inline void idle_task_exit(void) {}
--- 34-rc1/kernel/sched.c~4_CPU_DOWN_AFFINITY 2010-03-13 19:51:11.000000000 +0100
+++ 34-rc1/kernel/sched.c 2010-03-13 19:53:22.000000000 +0100
@@ -5505,7 +5505,7 @@ static int migration_thread(void *data)
/*
* Figure out where task on dead CPU should go, use force if necessary.
*/
-static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
+void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
{
struct rq *rq = cpu_rq(dead_cpu);
int migrated, dest_cpu;
--- 34-rc1/kernel/cpu.c~4_CPU_DOWN_AFFINITY 2010-03-13 17:05:55.000000000 +0100
+++ 34-rc1/kernel/cpu.c 2010-03-13 19:53:22.000000000 +0100
@@ -163,6 +163,7 @@ static inline void check_for_tasks(int c
}
struct take_cpu_down_param {
+ struct task_struct *caller;
unsigned long mod;
void *hcpu;
};
@@ -171,6 +172,7 @@ struct take_cpu_down_param {
static int __ref take_cpu_down(void *_param)
{
struct take_cpu_down_param *param = _param;
+ unsigned int cpu = (unsigned long)param->hcpu;
int err;
/* Ensure this CPU doesn't handle any more interrupts. */
@@ -181,6 +183,8 @@ static int __ref take_cpu_down(void *_pa
raw_notifier_call_chain(&cpu_chain, CPU_DYING | param->mod,
param->hcpu);
+ if (task_cpu(param->caller) == cpu)
+ move_task_off_dead_cpu(cpu, param->caller);
/* Force idle task to run as soon as we yield: it should
immediately notice cpu is offline and die quickly. */
sched_idle_next();
@@ -191,10 +195,10 @@ static int __ref take_cpu_down(void *_pa
static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
{
int err, nr_calls = 0;
- cpumask_var_t old_allowed;
void *hcpu = (void *)(long)cpu;
unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
struct take_cpu_down_param tcd_param = {
+ .caller = current,
.mod = mod,
.hcpu = hcpu,
};
@@ -205,9 +209,6 @@ static int __ref _cpu_down(unsigned int
if (!cpu_online(cpu))
return -EINVAL;
- if (!alloc_cpumask_var(&old_allowed, GFP_KERNEL))
- return -ENOMEM;
-
cpu_hotplug_begin();
set_cpu_active(cpu, false);
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
@@ -224,10 +225,6 @@ static int __ref _cpu_down(unsigned int
goto out_release;
}
- /* Ensure that we are not runnable on dying cpu */
- cpumask_copy(old_allowed, &current->cpus_allowed);
- set_cpus_allowed_ptr(current, cpu_active_mask);
-
err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
if (err) {
set_cpu_active(cpu, true);
@@ -236,7 +233,7 @@ static int __ref _cpu_down(unsigned int
hcpu) == NOTIFY_BAD)
BUG();
- goto out_allowed;
+ goto out_release;
}
BUG_ON(cpu_online(cpu));
@@ -254,8 +251,6 @@ static int __ref _cpu_down(unsigned int
check_for_tasks(cpu);
-out_allowed:
- set_cpus_allowed_ptr(current, old_allowed);
out_release:
cpu_hotplug_done();
if (!err) {
@@ -263,7 +258,6 @@ out_release:
hcpu) == NOTIFY_BAD)
BUG();
}
- free_cpumask_var(old_allowed);
return err;
}
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: select_fallback_rq() && cpuset_lock()
2010-03-13 19:28 ` Oleg Nesterov
@ 2010-03-14 2:11 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2010-03-14 2:11 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Ingo Molnar, Lai Jiangshan, Tejun Heo, linux-kernel
On Sat, 2010-03-13 at 20:28 +0100, Oleg Nesterov wrote:
>
> Peter, I seem to see a simple fix for the discussed cpuset problems,
> but it turns out sched.c has more problems even without cpusets.
>
> I'll try to send the whole series on Monday, but perhaps you can look
> at the attached compile-tested patches (especially 2 and 3) to quickly
> correct me if the 3rd one is wrong.
>
> The subsequent fix in cpuset.c depends on the locking rules enforced
> by 2-3.
In as far as I can trust my jet-lagged brain, they look good. I'll try
and have another look later.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2010-03-14 2:11 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-09 18:06 Q: select_fallback_rq() && cpuset_lock() Oleg Nesterov
2010-03-10 16:40 ` Peter Zijlstra
2010-03-10 17:30 ` Oleg Nesterov
2010-03-10 18:01 ` Peter Zijlstra
2010-03-10 18:33 ` Oleg Nesterov
2010-03-11 14:52 ` Oleg Nesterov
2010-03-11 15:22 ` Oleg Nesterov
2010-03-11 15:41 ` Peter Zijlstra
2010-03-11 15:35 ` Peter Zijlstra
2010-03-11 16:19 ` Oleg Nesterov
2010-03-11 16:29 ` Peter Zijlstra
2010-03-13 19:28 ` Oleg Nesterov
2010-03-14 2:11 ` Peter Zijlstra