* CPU Hotplug rework @ 2012-03-19 14:44 Srivatsa S. Bhat 2012-03-19 14:48 ` Srivatsa S. Bhat 2012-03-19 23:42 ` Rusty Russell 0 siblings, 2 replies; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-03-19 14:44 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list Hi, There had been some discussion on CPU Hotplug redesign/rework some time ago, but it was buried under a thread with a different subject. (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) So I am opening a new thread with an appropriate subject to discuss what needs to be done and how to go about it, as part of the rework. Peter Zijlstra and Paul McKenney had come up with TODO lists for the rework, and here are their extracts from the previous discussion: On Tue, Jan 31, 2012 at 02:01:56PM +0100, Peter Zijlstra wrote: > I paged out most details again, but it goes something like: > > - read and understand the current generic code > > - and all architecture code, at which point you'll probably boggle > at all the similarities that are all subtly different (there's > about 3 actually different ways in the arch code). > > - pick one, preferably one that keeps additional state and doesn't > fully rely on the online bits and pull it into generic code and > provide a small vector of arch specific functions. > > - convert all archs over. > > > Also related: > > - figure out why cpu_down needs kstopmachine, I'm not sure it does.. > we should be able to tear down a cpu using synchronize_sched() and a > single stop_one_cpu(). (someday when there's time I might actually > try to implement this). On 02/02/2012 06:03 AM, Paul E. 
McKenney wrote: > Currently, a number of the CPU_DYING notifiers assume that they are > running in stop-machine context, including those of RCU. > > However, this is not an inherent property of RCU -- DYNIX/ptx's > CPU-offline process did not stop the whole machine, after all, and RCU > (we called it rclock, but whatever) was happy with this arrangement. > In fact, if the outgoing CPU could be made to stop in that context > instead of returning to the scheduler and the idle loop, it would make > my life a bit easier. > > My question is why aren't the notifiers executed in the opposite > order going down and coming up, with the coming-up order matching the > boot order? Also, why can't the CPU's exit from this world be driven > out of the idle loop? That way, the CPU wouldn't mark itself offline > (thus in theory to be ignored by CPU), and then immediately dive into > the scheduler and who knows what all else, using RCU all the time. ;-) > > (RCU handles this by keeping a separate set of books for online CPUs. > It considers a CPU online at CPU_UP_PREPARE time, and doesn't consider > it offline until CPU_DEAD time. To handle the grace periods between, > force_quiescent_state() allows the grace period to run a few jiffies > before checking cpu_online_map, which allows a given CPU to safely use > RCU for at least one jiffy before marking itself online and for at least > one jiffy after marking itself offline.) On Fri, Feb 03, 2012 at 09:32:35AM -0800, Paul E. McKenney wrote: > Starting from the top, what does CPU hotplug need to do? > > 1. preempt_disable() or something similarly lightweight and > unconditional must block removal of any CPU that was > in cpu_online_map at the start of the "critical section". > (I will identify these as hotplug read-side critical sections.) > > I don't believe that there is any prohibition against a CPU > appearing suddenly, but some auditing would be required to > confirm this. But see below. > > 2. 
A subsystem not involved in the CPU-hotplug process must be able > to test if a CPU is online and be guaranteed that this test > remains valid (the CPU remains fully functional) for the duration > of the hotplug read-side critical section. > > 3. If a subsystem needs to operate on all currently online CPUs, > then it must participate in the CPU-hotplug process. My > belief is that if some code needs to test whether a CPU is > present, and needs an "offline" indication to persist, then > that code's subsystem must participate in CPU-hotplug operations. > > 4. There must be a way to register/unregister for CPU-hotplug events. > This is currently cpu_notifier(), register_cpu_notifier(), > and unregister_cpu_notifier(). > > n-1. CPU-hotplug operations should be reasonably fast. Tens of > milliseconds is OK, multiple seconds not so much. > > n. (Your additional constraints here.) > > How to do this? Here is one possible approach, probably full of holes: > > a. Maintain the cpu_online_map, as currently, but the meaning > of a set bit is that the CPU is fully functional. If there > is any service that the CPU no longer offers, its bit is > cleared. > > b. Continue to use preempt_enable()/preempt_disable() to mark > hotplug read-side critical sections. > > c. Instead of using __stop_machine(), use a per-CPU variable that > is checked in the idle loop. Possibly another TIF_ bit. > > d. The CPU notifiers are like today, except that CPU_DYING() is > invoked by the CPU after it sees that its per-CPU variable > telling it to go offline. As today, the CPU_DYING notifiers > are invoked with interrupts disabled, but other CPUs are still > running. Of course, the CPU_DYING notifiers need to be audited > and repaired. There are fewer than 20 of them, so not so bad. > RCU's is an easy fix: Just re-introduce locking and the global > RCU callback orphanage. My guesses for the others at the end. > > e. 
Getting rid of __stop_machine() means that the final step of the > CPU going offline will no longer be seen as atomic by other CPUs. > This will require more careful tracking of dependencies among > different subsystems. The required tracking can be reduced > by invoking notifiers in registration order for CPU-online > operations and invoking them in the reverse of registration > order for CPU-offline operations. > > For example, the scheduler uses RCU. If notifiers are invoked in > the same order for all CPU-hotplug operations, then on CPU-offline > operations, during the time between when RCU's notifier is invoked > and when the scheduler's notifier is invoked, the scheduler must > deal with a CPU on which RCU isn't working. (RCU currently > works around this by allowing a one-jiffy time period after > notification when it still pays attention to the CPU.) > > In contrast, if notifiers are invoked in reverse-registration > order for CPU-offline operations, then any time the scheduler > sees a CPU as online, RCU also is treating it as online. > > f. There will be some circular dependencies. For example, the > scheduler uses RCU, but in some configurations, RCU also uses > kthreads. These dependencies must be handled on a case-by-case > basis. For example, the scheduler could invoke an RCU API > to tell RCU when to shut down its per-CPU kthreads and when > to start them up. Or RCU could deal with its kthreads in the > CPU_DOWN_PREPARE and CPU_ONLINE notifiers. Either way, RCU > needs to correctly handle the interval when it cannot use > kthreads on a given CPU that it is still handling, for example, > by switching to running the RCU core code in softirq context. > > g. Most subsystems participating in CPU-hotplug operations will need > to keep their own copy of CPU online/offline state. For example, > RCU uses the ->qsmaskinit fields in the rcu_node structure for > this purpose. > > h. So CPU-offline handling looks something like the following: > > i. 
Acquire the hotplug mutex. > > ii. Invoke the CPU_DOWN_PREPARE notifiers. If there > are objections, invoke the CPU_DOWN_FAILED notifiers > and return an error. > > iii. Clear the CPU's bit in cpu_online_map. > > iv. Invoke synchronize_sched() to ensure that all future hotplug > read-side critical sections ignore the outgoing CPU. > > v. Set a per-CPU variable telling the CPU to take itself > offline. There would need to be something here to > help the CPU get to idle quickly, possibly requiring > another round of notifiers. CPU_DOWN? > > vi. When the dying CPU gets to the idle loop, it invokes the > CPU_DYING notifiers and updates its per-CPU variable to > indicate that it is ready to die. It then spins in a > tight loop (or does some other architecture-specified > operation to wait to be turned off). > > Note that there is no need for RCU to guess how long the > CPU might be executing RCU read-side critical sections. > > vii. When the task doing the offline operation sees the > updated per-CPU variable, it calls __cpu_die(). > > viii. The CPU_DEAD notifiers are invoked. > > ix. The check_for_tasks() function is invoked. > > x. Release the hotplug mutex. > > xi. Invoke the CPU_POST_DEAD notifiers. > > i. I do not believe that the CPU-online handling needs to change > much. > > > CPU_DYING notifiers as of 3.2: > > o vfp_hotplug(): I believe that this works as-is. > o s390_nohz_notify(): I believe that this works as-is. > o x86_pmu_notifier(): I believe that this works as-is. > o perf_ibs_cpu_notifier(): I don't know enough about > APIC to say. > o tboot_cpu_callback(): I believe that this works as-is, > but this one returns NOTIFY_BAD to a CPU_DYING notifier, > which is badness. But it looks like that case is a > "cannot happen" case. Still needs to be fixed. > o clockevents_notify(): This one acquires a global lock, > so it should be safe as-is. 
> o console_cpu_notify(): This one takes the same action > for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and > CPU_UP_CANCELLED that it does for CPU_DYING, so it > should be OK. > o rcu_cpu_notify(): This one needs adjustment as noted > above, but nothing major. > o migration_call(): I defer to Peter on this one. > It looks to me like it is written to handle other > CPUs, but... > o workqueue_cpu_callback(): Might need help, does a > non-atomic OR. > o kvm_cpu_hotplug(): Uses a global spinlock, so should > be OK as-is. Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 14:44 CPU Hotplug rework Srivatsa S. Bhat @ 2012-03-19 14:48 ` Srivatsa S. Bhat 2012-03-20 11:28 ` Peter Zijlstra 2012-04-05 17:39 ` Paul E. McKenney 2012-03-19 23:42 ` Rusty Russell 1 sibling, 2 replies; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-03-19 14:48 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: > Hi, > > There had been some discussion on CPU Hotplug redesign/rework > some time ago, but it was buried under a thread with a different > subject. > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > So I am opening a new thread with an appropriate subject to discuss > what needs to be done and how to go about it, as part of the rework. > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > rework, and here are their extracts from the previous discussion: > > On Tue, Jan 31, 2012 at 02:01:56PM +0100, Peter Zijlstra wrote: >> I paged out most details again, but it goes something like: >> >> - read and understand the current generic code >> >> - and all architecture code, at which point you'll probably boggle >> at all the similarities that are all subtly different (there's >> about 3 actually different ways in the arch code). >> >> - pick one, preferably one that keeps additional state and doesn't >> fully rely on the online bits and pull it into generic code and >> provide a small vector of arch specific functions. >> >> - convert all archs over. >> >> >> Also related: >> >> - figure out why cpu_down needs kstopmachine, I'm not sure it does.. >> we should be able to tear down a cpu using synchronize_sched() and a >> single stop_one_cpu(). 
(someday when there's time I might actually >> try to implement this). > > > > On 02/02/2012 06:03 AM, Paul E. McKenney wrote: >> Currently, a number of the CPU_DYING notifiers assume that they are >> running in stop-machine context, including those of RCU. >> >> However, this is not an inherent property of RCU -- DYNIX/ptx's >> CPU-offline process did not stop the whole machine, after all, and RCU >> (we called it rclock, but whatever) was happy with this arrangement. >> In fact, if the outgoing CPU could be made to stop in that context >> instead of returning to the scheduler and the idle loop, it would make >> my life a bit easier. >> >> My question is why aren't the notifiers executed in the opposite >> order going down and coming up, with the coming-up order matching the >> boot order? Also, why can't the CPU's exit from this world be driven >> out of the idle loop? That way, the CPU wouldn't mark itself offline >> (thus in theory to be ignored by CPU), and then immediately dive into >> the scheduler and who knows what all else, using RCU all the time. ;-) >> >> (RCU handles this by keeping a separate set of books for online CPUs. >> It considers a CPU online at CPU_UP_PREPARE time, and doesn't consider >> it offline until CPU_DEAD time. To handle the grace periods between, >> force_quiescent_state() allows the grace period to run a few jiffies >> before checking cpu_online_map, which allows a given CPU to safely use >> RCU for at least one jiffy before marking itself online and for at least >> one jiffy after marking itself offline.) > > > > On Fri, Feb 03, 2012 at 09:32:35AM -0800, Paul E. McKenney wrote: > >> Starting from the top, what does CPU hotplug need to do? >> >> 1. preempt_disable() or something similarly lightweight and >> unconditional must block removal of any CPU that was >> in cpu_online_map at the start of the "critical section". >> (I will identify these as hotplug read-side critical sections.) 
>> >> I don't believe that there is any prohibition against a CPU >> appearing suddenly, but some auditing would be required to >> confirm this. But see below. >> >> 2. A subsystem not involved in the CPU-hotplug process must be able >> to test if a CPU is online and be guaranteed that this test >> remains valid (the CPU remains fully functional) for the duration >> of the hotplug read-side critical section. >> >> 3. If a subsystem needs to operate on all currently online CPUs, >> then it must participate in the CPU-hotplug process. My >> belief is that if some code needs to test whether a CPU is >> present, and needs an "offline" indication to persist, then >> that code's subsystem must participate in CPU-hotplug operations. >> >> 4. There must be a way to register/unregister for CPU-hotplug events. >> This is currently cpu_notifier(), register_cpu_notifier(), >> and unregister_cpu_notifier(). >> >> n-1. CPU-hotplug operations should be reasonably fast. Tens of >> milliseconds is OK, multiple seconds not so much. >> >> n. (Your additional constraints here.) >> >> How to do this? Here is one possible approach, probably full of holes: >> >> a. Maintain the cpu_online_map, as currently, but the meaning >> of a set bit is that the CPU is fully functional. If there >> is any service that the CPU no longer offers, its bit is >> cleared. >> >> b. Continue to use preempt_enable()/preempt_disable() to mark >> hotplug read-side critical sections. >> >> c. Instead of using __stop_machine(), use a per-CPU variable that >> is checked in the idle loop. Possibly another TIF_ bit. >> >> d. The CPU notifiers are like today, except that CPU_DYING() is >> invoked by the CPU after it sees that its per-CPU variable >> telling it to go offline. As today, the CPU_DYING notifiers >> are invoked with interrupts disabled, but other CPUs are still >> running. Of course, the CPU_DYING notifiers need to be audited >> and repaired. There are fewer than 20 of them, so not so bad. 
>> RCU's is an easy fix: Just re-introduce locking and the global >> RCU callback orphanage. My guesses for the others at the end. >> >> e. Getting rid of __stop_machine() means that the final step of the >> CPU going offline will no longer be seen as atomic by other CPUs. >> This will require more careful tracking of dependencies among >> different subsystems. The required tracking can be reduced >> by invoking notifiers in registration order for CPU-online >> operations and invoking them in the reverse of registration >> order for CPU-offline operations. >> >> For example, the scheduler uses RCU. If notifiers are invoked in >> the same order for all CPU-hotplug operations, then on CPU-offline >> operations, during the time between when RCU's notifier is invoked >> and when the scheduler's notifier is invoked, the scheduler must >> deal with a CPU on which RCU isn't working. (RCU currently >> works around this by allowing a one-jiffy time period after >> notification when it still pays attention to the CPU.) >> >> In contrast, if notifiers are invoked in reverse-registration >> order for CPU-offline operations, then any time the scheduler >> sees a CPU as online, RCU also is treating it as online. >> >> f. There will be some circular dependencies. For example, the >> scheduler uses RCU, but in some configurations, RCU also uses >> kthreads. These dependencies must be handled on a case-by-case >> basis. For example, the scheduler could invoke an RCU API >> to tell RCU when to shut down its per-CPU kthreads and when >> to start them up. Or RCU could deal with its kthreads in the >> CPU_DOWN_PREPARE and CPU_ONLINE notifiers. Either way, RCU >> needs to correctly handle the interval when it cannot use >> kthreads on a given CPU that it is still handling, for example, >> by switching to running the RCU core code in softirq context. >> >> g. Most subsystems participating in CPU-hotplug operations will need >> to keep their own copy of CPU online/offline state. 
For example, >> RCU uses the ->qsmaskinit fields in the rcu_node structure for >> this purpose. >> >> h. So CPU-offline handling looks something like the following: >> >> i. Acquire the hotplug mutex. >> >> ii. Invoke the CPU_DOWN_PREPARE notifiers. If there >> are objections, invoke the CPU_DOWN_FAILED notifiers >> and return an error. >> >> iii. Clear the CPU's bit in cpu_online_map. >> >> iv. Invoke synchronize_sched() to ensure that all future hotplug >> read-side critical sections ignore the outgoing CPU. >> >> v. Set a per-CPU variable telling the CPU to take itself >> offline. There would need to be something here to >> help the CPU get to idle quickly, possibly requiring >> another round of notifiers. CPU_DOWN? >> >> vi. When the dying CPU gets to the idle loop, it invokes the >> CPU_DYING notifiers and updates its per-CPU variable to >> indicate that it is ready to die. It then spins in a >> tight loop (or does some other architecture-specified >> operation to wait to be turned off). >> >> Note that there is no need for RCU to guess how long the >> CPU might be executing RCU read-side critical sections. >> >> vii. When the task doing the offline operation sees the >> updated per-CPU variable, it calls __cpu_die(). >> >> viii. The CPU_DEAD notifiers are invoked. >> >> ix. The check_for_tasks() function is invoked. >> >> x. Release the hotplug mutex. >> >> xi. Invoke the CPU_POST_DEAD notifiers. >> >> i. I do not believe that the CPU-online handling needs to change >> much. >> >> >> CPU_DYING notifiers as of 3.2: >> >> o vfp_hotplug(): I believe that this works as-is. >> o s390_nohz_notify(): I believe that this works as-is. >> o x86_pmu_notifier(): I believe that this works as-is. >> o perf_ibs_cpu_notifier(): I don't know enough about >> APIC to say. >> o tboot_cpu_callback(): I believe that this works as-is, >> but this one returns NOTIFY_BAD to a CPU_DYING notifier, >> which is badness. But it looks like that case is a >> "cannot happen" case. 
Still needs to be fixed. >> o clockevents_notify(): This one acquires a global lock, >> so it should be safe as-is. >> o console_cpu_notify(): This one takes the same action >> for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and >> CPU_UP_CANCELLED that it does for CPU_DYING, so it >> should be OK. >> o rcu_cpu_notify(): This one needs adjustment as noted >> above, but nothing major. >> o migration_call(): I defer to Peter on this one. >> It looks to me like it is written to handle other >> CPUs, but... >> o workqueue_cpu_callback(): Might need help, does a >> non-atomic OR. >> o kvm_cpu_hotplug(): Uses a global spinlock, so should >> be OK as-is. > > Additional things that I would like to add to the list: 1. Fix issues with CPU Hotplug callback registration. Currently there is no totally-race-free way to register callbacks and do setup for already online cpus. I had posted an incomplete patchset some time ago regarding this, which gives an idea of the direction I had in mind. http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 2. There is a mismatch between the code and the documentation around the difference between [un/register]_hotcpu_notifier and [un/register]_cpu_notifier. And I remember seeing several places in the code that uses them inconsistently. Not terribly important, but good to fix it up while we are at it. 3. There was another thread where stuff related to CPU hotplug had been discussed. It had exposed some new challenges to CPU hotplug, if we were to support asynchronous smp booting. http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 4. Because the current CPU offline code depends on stop_machine(), every online CPU must cooperate with the offline event. 
This means, whenever we do a preempt_disable(), it ensures not only that that particular CPU won't go offline, but also that *any* CPU cannot go offline. This is more like a side-effect of using stop_machine(). So when trying to move over to stop_one_cpu(), we have to carefully audit places where preempt_disable() has been used in that manner (ie., preempt_disable used as a light-weight and non-blocking form of get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, a preempt disabled section will prevent only that particular CPU from going offline. I haven't audited preempt_disable() calls yet, but one such use was there in brlocks (include/linux/lglock.h) until quite recently. 5. Given the point above (#4), we might need a new way to disable CPU hotplug (at least CPU offline) of any CPU in a non-blocking manner, as a replacement for preempt disabled sections. Of course, if all the existing code just depends on the current CPU being online, then we are good, as it is. Else, we'll have to come up with something here.. (I was thinking on the lines of an rwlock being taken inside stop_one_cpu() before calling the CPU_DYING notifiers... Then the non-blocking code that needs to disable CPU offlining of any CPU, can grab this lock and prevent the offline event from proceeding). If there is anything I missed out, please feel free to add them here. And suggestions are of course, always welcome :-) Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 14:48 ` Srivatsa S. Bhat @ 2012-03-20 11:28 ` Peter Zijlstra 2012-04-05 17:39 ` Paul E. McKenney 1 sibling, 0 replies; 38+ messages in thread From: Peter Zijlstra @ 2012-03-20 11:28 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Paul E. McKenney, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, linux-alpha, Mike Frysinger On Mon, 2012-03-19 at 20:18 +0530, Srivatsa S. Bhat wrote: > > If there is anything I missed out, please feel free to add them here. > And suggestions are of course, always welcome :-) > OK, so I haven't fully read your list, but looking through the code today I found that alpha and blackfin call CPU_STARTING after set_cpu_online(, true), whereas all the other archs call CPU_STARTING before. Aside from that probably wanting to get fixed (not sure it actually breaks anything atm), this is another reason most of that wants to be generic code. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 14:48 ` Srivatsa S. Bhat 2012-03-20 11:28 ` Peter Zijlstra @ 2012-04-05 17:39 ` Paul E. McKenney 2012-04-05 17:55 ` Paul E. McKenney 2012-04-06 19:52 ` Srivatsa S. Bhat 1 sibling, 2 replies; 38+ messages in thread From: Paul E. McKenney @ 2012-04-05 17:39 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, Mar 19, 2012 at 08:18:42PM +0530, Srivatsa S. Bhat wrote: > On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: > > > Hi, > > > > There had been some discussion on CPU Hotplug redesign/rework > > some time ago, but it was buried under a thread with a different > > subject. > > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > > > So I am opening a new thread with an appropriate subject to discuss > > what needs to be done and how to go about it, as part of the rework. > > > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > > rework, and here are their extracts from the previous discussion: Finally getting around to looking at this in more detail... > Additional things that I would like to add to the list: > > 1. Fix issues with CPU Hotplug callback registration. Currently there > is no totally-race-free way to register callbacks and do setup > for already online cpus. > > I had posted an incomplete patchset some time ago regarding this, > which gives an idea of the direction I had in mind. > http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 Another approach is to have the registration function return the CPU mask corresponding to the instant at which registration occurred, perhaps via an additional function argument that points to a cpumask_var_t that can be NULL if you don't care. 
Then you can do setup for the CPUs indicated in the mask. Or am I missing the race you had in mind? Or is the point to make sure that the notifiers execute in order? > 2. There is a mismatch between the code and the documentation around > the difference between [un/register]_hotcpu_notifier and > [un/register]_cpu_notifier. And I remember seeing several places in > the code that uses them inconsistently. Not terribly important, but > good to fix it up while we are at it. The following lead me to believe that they were the same: #define register_hotcpu_notifier(nb) register_cpu_notifier(nb) #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) What am I missing here? > 3. There was another thread where stuff related to CPU hotplug had been > discussed. It had exposed some new challenges to CPU hotplug, if we > were to support asynchronous smp booting. > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 Good points! ;-) > 4. Because the current CPU offline code depends on stop_machine(), every > online CPU must cooperate with the offline event. This means, whenever > we do a preempt_disable(), it ensures not only that that particular > CPU won't go offline, but also that *any* CPU cannot go offline. This > is more like a side-effect of using stop_machine(). > > So when trying to move over to stop_one_cpu(), we have to carefully audit > places where preempt_disable() has been used in that manner (ie., > preempt_disable used as a light-weight and non-blocking form of > get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, > a preempt disabled section will prevent only that particular CPU from > going offline. 
> > I haven't audited preempt_disable() calls yet, but one such use was there > in brlocks (include/linux/lglock.h) until quite recently. I was thinking in terms of the offline code path doing a synchronize_sched() to allow preempt_disable() to retain a reasonable approximation of its current semantics. This would require a pair of CPU masks, one for code using CPU-based primitives (e.g., sending IPIs) and another for code implementing those primitives. Or is there some situation that makes this approach fail? Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-05 17:39 ` Paul E. McKenney @ 2012-04-05 17:55 ` Paul E. McKenney 2012-04-05 23:06 ` Paul E. McKenney 2012-04-06 19:52 ` Srivatsa S. Bhat 1 sibling, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-05 17:55 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Thu, Apr 05, 2012 at 10:39:18AM -0700, Paul E. McKenney wrote: > On Mon, Mar 19, 2012 at 08:18:42PM +0530, Srivatsa S. Bhat wrote: > > On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: > > > > > Hi, > > > > > > There had been some discussion on CPU Hotplug redesign/rework > > > some time ago, but it was buried under a thread with a different > > > subject. > > > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > > > > > So I am opening a new thread with an appropriate subject to discuss > > > what needs to be done and how to go about it, as part of the rework. > > > > > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > > > rework, and here are their extracts from the previous discussion: > > Finally getting around to looking at this in more detail... > > > Additional things that I would like to add to the list: > > > > 1. Fix issues with CPU Hotplug callback registration. Currently there > > is no totally-race-free way to register callbacks and do setup > > for already online cpus. > > > > I had posted an incomplete patchset some time ago regarding this, > > which gives an idea of the direction I had in mind. 
> > http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 > > Another approach is to have the registration function return the > CPU mask corresponding to the instant at which registration occurred, > perhaps via an additional function argument that points to a > cpumask_var_t that can be NULL if you don't care. Then you can > do setup for the CPUs indicated in the mask. > > Or am I missing the race you had in mind? Or is the point to make > sure that the notifiers execute in order? > > > 2. There is a mismatch between the code and the documentation around > > the difference between [un/register]_hotcpu_notifier and > > [un/register]_cpu_notifier. And I remember seeing several places in > > the code that uses them inconsistently. Not terribly important, but > > good to fix it up while we are at it. > > The following lead me to believe that they were the same: > > #define register_hotcpu_notifier(nb) register_cpu_notifier(nb) > #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) > > What am I missing here? > > > 3. There was another thread where stuff related to CPU hotplug had been > > discussed. It had exposed some new challenges to CPU hotplug, if we > > were to support asynchronous smp booting. > > > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 > > Good points! ;-) > > > 4. Because the current CPU offline code depends on stop_machine(), every > > online CPU must cooperate with the offline event. This means, whenever > > we do a preempt_disable(), it ensures not only that that particular > > CPU won't go offline, but also that *any* CPU cannot go offline. This > > is more like a side-effect of using stop_machine(). 
> > > > So when trying to move over to stop_one_cpu(), we have to carefully audit > > places where preempt_disable() has been used in that manner (ie., > > preempt_disable used as a light-weight and non-blocking form of > > get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, > > a preempt disabled section will prevent only that particular CPU from > > going offline. > > > > I haven't audited preempt_disable() calls yet, but one such use was there > > in brlocks (include/linux/lglock.h) until quite recently. > > I was thinking in terms of the offline code path doing a synchronize_sched() > to allow preempt_disable() to retain a reasonable approximation of its > current semantics. This would require a pair of CPU masks, one for code > using CPU-based primitives (e.g., sending IPIs) and another for code > implementing those primitives. > > Or is there some situation that makes this approach fail? Hmmm... I suppose that -rt's use of migrate_disable() needs some other approach in any case, unless -rt's offlining waits for all pre-existing migrate_disable() sections to finish. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-05 17:55 ` Paul E. McKenney @ 2012-04-05 23:06 ` Paul E. McKenney 2012-04-06 20:15 ` Srivatsa S. Bhat 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-05 23:06 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list Hello, Here is my attempt at a summary of the discussion. Srivatsa, I left out the preempt_disable() pieces, but would be happy to add them in when you let me know what you are thinking to do for de-stop_machine()ing CPU hotplug. Thanx, Paul ------------------------------------------------------------------------ CPU-hotplug work breakout: 1. Read and understand the current generic code. Srivatsa Bhat has done this, as have Paul E. McKenney and Peter Zijlstra to a lesser extent. 2. Read and understand the architecture-specific code, looking for opportunities to consolidate additional function into core code. a. Carry out any indicated consolidation. b. Convert all architectures to make use of the consolidated implementation. Not started. Low priority from a big.LITTLE perspective. 3. Address the current kthread creation/teardown/migration performance issues. (More details below.) Highest priority from a big.LITTLE perspective. 4. Wean CPU-hotplug offlining from stop_machine(). (More details below.) Moderate priority from a big.LITTLE perspective. ADDRESSING KTHREAD CREATION/TEARDOWN/MIGRATION PERFORMANCE ISSUES 1. Evaluate approaches. Approaches currently under consideration include: a. Park the kthreads rather than tearing them down or migrating them. RCU currently takes this sort of approach. Note that RCU currently relies on both preempt_disable() and local_bh_disable() blocking the current CPU from going offline. b. 
Allow in-kernel kthreads to avoid the delay required to work around a bug in old versions of bash. (This bug is a failure to expect receiving a SIGCHILD signal corresponding to a child created by a fork() system call that has not yet returned.) This might be implemented using an additional CLONE_ flag. This should allow kthreads to be created and torn down much more quickly. c. Have some other TBD way to "freeze" a kthread. (As in "your clever idea here".) 2. Implement the chosen approach or approaches. (Different kernel subsystems might have different constraints, possibly requiring different kthread handling.) WEAN CPU-HOTPLUG OFFLINING FROM stop_machine() 1. CPU_DYING notifier fixes needed as of 3.2: o vfp_hotplug(): I believe that this works as-is. o s390_nohz_notify(): I believe that this works as-is. o x86_pmu_notifier(): I believe that this works as-is. o perf_ibs_cpu_notifier(): I don't know enough about APIC to say. o tboot_cpu_callback(): I believe that this works as-is, but this one returns NOTIFY_BAD to a CPU_DYING notifier, which is badness. But it looks like that case is a "cannot happen" case. Still needs to be fixed. o clockevents_notify(): This one acquires a global lock, so it should be safe as-is. o console_cpu_notify(): This one takes the same action for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and CPU_UP_CANCELLED that it does for CPU_DYING, so it should be OK. * rcu_cpu_notify(): This one needs adjustment as noted above, but nothing major. Patch has been posted, probably needs a bit of debugging. o migration_call(): I defer to Peter on this one. It looks to me like it is written to handle other CPUs, but... * workqueue_cpu_callback(): Might need help, does a non-atomic OR. o kvm_cpu_hotplug(): Uses a global spinlock, so should be OK as-is. 2. Evaluate designs for stop_machine()-free CPU hotplug. Implement the chosen design. An outline for a particular design is shown below, but the actual design might be quite different. 3. 
Fix issues with CPU Hotplug callback registration. Currently there is no totally-race-free way to register callbacks and do setup for already online cpus. Srivatsa had posted an incomplete patchset some time ago regarding this, which gives an idea of the direction he had in mind. http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 4. There is a mismatch between the code and the documentation around the difference between [un/register]_hotcpu_notifier and [un/register]_cpu_notifier. And I remember seeing several places in the code that uses them inconsistently. Not terribly important, but good to fix it up while we are at it. 5. There was another thread where stuff related to CPU hotplug had been discussed. It had exposed some new challenges to CPU hotplug, if we were to support asynchronous smp booting. http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 6. If preempt_disable() no longer blocks CPU offlining, then uses of preempt_disable() in the kernel need to be inspected to see which are relying on blocking offlining, and any identified will need adjustment. DRAFT REQUIREMENTS FOR stop_machine()-FREE CPU HOTPLUG 1. preempt_disable() or something similarly lightweight and unconditional must block removal of any CPU that was in cpu_online_map at the start of the "critical section". (I will identify these as hotplug read-side critical sections.) I don't believe that there is any prohibition against a CPU appearing suddenly, but some auditing would be required to confirm this. But see below. 2. A subsystem not involved in the CPU-hotplug process must be able to test if a CPU is online and be guaranteed that this test remains valid (the CPU remains fully functional) for the duration of the hotplug read-side critical section. 3. 
If a subsystem needs to operate on all currently online CPUs, then it must participate in the CPU-hotplug process. My belief is that if some code needs to test whether a CPU is present, and needs an "offline" indication to persist, then that code's subsystem must participate in CPU-hotplug operations. 4. There must be a way to register/unregister for CPU-hotplug events. This is currently cpu_notifier(), register_cpu_notifier(), and unregister_cpu_notifier(). n-1. CPU-hotplug operations should be reasonably fast. A few milliseconds is OK, multiple seconds not so much. n. (Your additional constraints here.) STRAWMAN DESIGN FOR stop_machine()-FREE CPU HOTPLUG a. Maintain the cpu_online_map, as currently, but the meaning of a set bit is that the CPU is fully functional. If there is any service that the CPU no longer offers, its bit is cleared. b. Continue to use preempt_enable()/preempt_disable() to mark hotplug read-side critical sections. c. Instead of using __stop_machine(), use a per-CPU variable that is checked in the idle loop. Possibly another TIF_ bit. d. The CPU notifiers are like today, except that CPU_DYING() is invoked by the CPU after it sees that its per-CPU variable telling it to go offline. As today, the CPU_DYING notifiers are invoked with interrupts disabled, but other CPUs are still running. Of course, the CPU_DYING notifiers need to be audited and repaired. There are fewer than 20 of them, so not so bad. RCU's is an easy fix: Just re-introduce locking and the global RCU callback orphanage. My guesses for the others at the end. e. Getting rid of __stop_machine() means that the final step of the CPU going offline will no longer be seen as atomic by other CPUs. This will require more careful tracking of dependencies among different subsystems. The required tracking can be reduced by invoking notifiers in registration order for CPU-online operations and invoking them in the reverse of registration order for CPU-offline operations. 
For example, the scheduler uses RCU. If notifiers are invoked in the same order for all CPU-hotplug operations, then on CPU-offline operations, during the time between when RCU's notifier is invoked and when the scheduler's notifier is invoked, the scheduler must deal with a CPU on which RCU isn't working. (RCU currently works around this by allowing a one-jiffy time period after notification when it still pays attention to the CPU.) In contrast, if notifiers are invoked in reverse-registration order for CPU-offline operations, then any time the scheduler sees a CPU as online, RCU also is treating it as online. f. There will be some circular dependencies. For example, the scheduler uses RCU, but in some configurations, RCU also uses kthreads. These dependencies must be handled on a case-by-case basis. For example, the scheduler could invoke an RCU API to tell RCU when to shut down its per-CPU kthreads and when to start them up. Or RCU could deal with its kthreads in the CPU_DOWN_PREPARE and CPU_ONLINE notifiers. Either way, RCU needs to correctly handle the interval when it cannot use kthreads on a given CPU that it is still handling, for example, by switching to running the RCU core code in softirq context. g. Most subsystems participating in CPU-hotplug operations will need to keep their own copy of CPU online/offline state. For example, RCU uses the ->qsmaskinit fields in the rcu_node structure for this purpose. h. So CPU-offline handling looks something like the following: i. Acquire the hotplug mutex. ii. Invoke the CPU_DOWN_PREPARE notifiers. If there are objections, invoke the CPU_DOWN_FAILED notifiers and return an error. iii. Clear the CPU's bit in cpu_online_map. iv. Invoke synchronize_sched() to ensure that all future hotplug read-side critical sections ignore the outgoing CPU. v. Set a per-CPU variable telling the CPU to take itself offline. 
There would need to be something here to help the CPU get to idle quickly, possibly requiring another round of notifiers. CPU_DOWN? vi. When the dying CPU gets to the idle loop, it invokes the CPU_DYING notifiers and updates its per-CPU variable to indicate that it is ready to die. It then spins in a tight loop (or does some other architecture-specified operation to wait to be turned off). Note that there is no need for RCU to guess how long the CPU might be executing RCU read-side critical sections. vii. When the task doing the offline operation sees the updated per-CPU variable, it calls __cpu_die(). viii. The CPU_DEAD notifiers are invoked. ix. The check_for_tasks() function is invoked. x. Release the hotplug mutex. xi. Invoke the CPU_POST_DEAD notifiers. i. I do not believe that the CPU-online handling needs to change much. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-05 23:06 ` Paul E. McKenney @ 2012-04-06 20:15 ` Srivatsa S. Bhat 2012-04-09 16:46 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-06 20:15 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On 04/06/2012 04:36 AM, Paul E. McKenney wrote: > Hello, > > Here is my attempt at a summary of the discussion. > Thanks for the summary, it is really helpful :-) > Srivatsa, I left out the preempt_disable() pieces, but would be happy > to add them in when you let me know what you are thinking to do for > de-stop_machine()ing CPU hotplug. > Ok.. > Thanx, Paul > > ------------------------------------------------------------------------ > > CPU-hotplug work breakout: > > 1. Read and understand the current generic code. > Srivatsa Bhat has done this, as have Paul E. McKenney and > Peter Zijlstra to a lesser extent. "lesser extent"?? Hell no! :-) ;-) > > 2. Read and understand the architecture-specific code, looking > for opportunities to consolidate additional function into > core code. > > a. Carry out any indicated consolidation. > > b. Convert all architectures to make use of the > consolidated implementation. > > Not started. Low priority from a big.LITTLE perspective. Recently this unexpectedly assumed high priority due to some scheduler changes and things got fixed up temporarily. And in that context, Peter Zijlstra gave some more technical pointers on what is wrong and needs to be done right. Link: https://lkml.org/lkml/2012/3/22/149 Nikunj (in CC) has offered to work with me on this consolidation. > > 3. Address the current kthread creation/teardown/migration > performance issues. (More details below.) > > Highest priority from a big.LITTLE perspective. 
> > 4. Wean CPU-hotplug offlining from stop_machine(). > (More details below.) > > Moderate priority from a big.LITTLE perspective. > > > ADDRESSING KTHREAD CREATION/TEARDOWN/MIGRATION PERFORMANCE ISSUES > > 1. Evaluate approaches. Approaches currently under > consideration include: > > a. Park the kthreads rather than tearing them down or > migrating them. RCU currently takes this sort of > approach. Note that RCU currently relies on both > preempt_disable() and local_bh_disable() blocking the > current CPU from going offline. > > b. Allow in-kernel kthreads to avoid the delay > required to work around a bug in old versions of > bash. (This bug is a failure to expect receiving > a SIGCHILD signal corresponding to a child > created by a fork() system call that has not yet > returned.) > > This might be implemented using an additional > CLONE_ flag. This should allow kthreads to > be created and torn down much more quickly. > > c. Have some other TBD way to "freeze" a kthread. > (As in "your clever idea here".) > > 2. Implement the chosen approach or approaches. (Different > kernel subsystems might have different constraints, possibly > requiring different kthread handling.) > > > WEAN CPU-HOTPLUG OFFLINING FROM stop_machine() > > > 1. CPU_DYING notifier fixes needed as of 3.2: > > o vfp_hotplug(): I believe that this works as-is. > o s390_nohz_notify(): I believe that this works as-is. > o x86_pmu_notifier(): I believe that this works as-is. > o perf_ibs_cpu_notifier(): I don't know enough about > APIC to say. > o tboot_cpu_callback(): I believe that this works as-is, > but this one returns NOTIFY_BAD to a CPU_DYING notifier, > which is badness. But it looks like that case is a > "cannot happen" case. Still needs to be fixed. > o clockevents_notify(): This one acquires a global lock, > so it should be safe as-is. 
> o console_cpu_notify(): This one takes the same action > for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and > CPU_UP_CANCELLED that it does for CPU_DYING, so it > should be OK. > * rcu_cpu_notify(): This one needs adjustment as noted > above, but nothing major. Patch has been posted, > probably needs a bit of debugging. > o migration_call(): I defer to Peter on this one. > It looks to me like it is written to handle other > CPUs, but... > * workqueue_cpu_callback(): Might need help, does a > non-atomic OR. > o kvm_cpu_hotplug(): Uses a global spinlock, so should > be OK as-is. > > 2. Evaluate designs for stop_machine()-free CPU hotplug. > Implement the chosen design. An outline for a particular > design is shown below, but the actual design might be > quite different. > > 3. Fix issues with CPU Hotplug callback registration. Currently > there is no totally-race-free way to register callbacks and do > setup for already online cpus. > > Srivatsa had posted an incomplete patchset some time ago > regarding this, which gives an idea of the direction he had > in mind. > http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 > Gah, this has been "incomplete" for quite some time now.. I'll try to speed up things a bit :-) > 4. There is a mismatch between the code and the documentation around > the difference between [un/register]_hotcpu_notifier and > [un/register]_cpu_notifier. And I remember seeing several places > in the code that uses them inconsistently. Not terribly important, > but good to fix it up while we are at it. > > 5. There was another thread where stuff related to CPU hotplug had > been discussed. It had exposed some new challenges to CPU hotplug, > if we were to support asynchronous smp booting. 
> > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 > > 6. If preempt_disable() no longer blocks CPU offlining, then > uses of preempt_disable() in the kernel need to be inspected > to see which are relying on blocking offlining, and any > identified will need adjustment. > > > DRAFT REQUIREMENTS FOR stop_machine()-FREE CPU HOTPLUG > > 1. preempt_disable() or something similarly lightweight and > unconditional must block removal of any CPU that was > in cpu_online_map at the start of the "critical section". > (I will identify these as hotplug read-side critical sections.) > > I don't believe that there is any prohibition against a CPU > appearing suddenly, but some auditing would be required to > confirm this. But see below. > > 2. A subsystem not involved in the CPU-hotplug process must be able > to test if a CPU is online and be guaranteed that this test > remains valid (the CPU remains fully functional) for the duration > of the hotplug read-side critical section. > > 3. If a subsystem needs to operate on all currently online CPUs, > then it must participate in the CPU-hotplug process. My > belief is that if some code needs to test whether a CPU is > present, and needs an "offline" indication to persist, then > that code's subsystem must participate in CPU-hotplug operations. > > 4. There must be a way to register/unregister for CPU-hotplug events. > This is currently cpu_notifier(), register_cpu_notifier(), > and unregister_cpu_notifier(). > > n-1. CPU-hotplug operations should be reasonably fast. A few > milliseconds is OK, multiple seconds not so much. > > n. (Your additional constraints here.) > > > STRAWMAN DESIGN FOR stop_machine()-FREE CPU HOTPLUG > > a. 
Maintain the cpu_online_map, as currently, but the meaning > of a set bit is that the CPU is fully functional. If there > is any service that the CPU no longer offers, its bit is > cleared. > > b. Continue to use preempt_enable()/preempt_disable() to mark > hotplug read-side critical sections. > > c. Instead of using __stop_machine(), use a per-CPU variable that > is checked in the idle loop. Possibly another TIF_ bit. > > d. The CPU notifiers are like today, except that CPU_DYING() is > invoked by the CPU after it sees that its per-CPU variable > telling it to go offline. As today, the CPU_DYING notifiers > are invoked with interrupts disabled, but other CPUs are still > running. Of course, the CPU_DYING notifiers need to be audited > and repaired. There are fewer than 20 of them, so not so bad. > RCU's is an easy fix: Just re-introduce locking and the global > RCU callback orphanage. My guesses for the others at the end. > > e. Getting rid of __stop_machine() means that the final step of the > CPU going offline will no longer be seen as atomic by other CPUs. > This will require more careful tracking of dependencies among > different subsystems. The required tracking can be reduced > by invoking notifiers in registration order for CPU-online > operations and invoking them in the reverse of registration > order for CPU-offline operations. > > For example, the scheduler uses RCU. If notifiers are invoked in > the same order for all CPU-hotplug operations, then on CPU-offline > operations, during the time between when RCU's notifier is invoked > and when the scheduler's notifier is invoked, the scheduler must > deal with a CPU on which RCU isn't working. (RCU currently > works around this by allowing a one-jiffy time period after > notification when it still pays attention to the CPU.) 
> > In contrast, if notifiers are invoked in reverse-registration > order for CPU-offline operations, then any time the scheduler > sees a CPU as online, RCU also is treating it as online. > > f. There will be some circular dependencies. For example, the > scheduler uses RCU, but in some configurations, RCU also uses > kthreads. These dependencies must be handled on a case-by-case > basis. For example, the scheduler could invoke an RCU API > to tell RCU when to shut down its per-CPU kthreads and when > to start them up. Or RCU could deal with its kthreads in the > CPU_DOWN_PREPARE and CPU_ONLINE notifiers. Either way, RCU > needs to correctly handle the interval when it cannot use > kthreads on a given CPU that it is still handling, for example, > by switching to running the RCU core code in softirq context. > > g. Most subsystems participating in CPU-hotplug operations will need > to keep their own copy of CPU online/offline state. For example, > RCU uses the ->qsmaskinit fields in the rcu_node structure for > this purpose. > > h. So CPU-offline handling looks something like the following: > > i. Acquire the hotplug mutex. > > ii. Invoke the CPU_DOWN_PREPARE notifiers. If there > are objections, invoke the CPU_DOWN_FAILED notifiers > and return an error. > > iii. Clear the CPU's bit in cpu_online_map. > > iv. Invoke synchronize_sched() to ensure that all future hotplug > read-side critical sections ignore the outgoing CPU. > > v. Set a per-CPU variable telling the CPU to take itself > offline. There would need to be something here to > help the CPU get to idle quickly, possibly requiring > another round of notifiers. CPU_DOWN? > > vi. When the dying CPU gets to the idle loop, it invokes the > CPU_DYING notifiers and updates its per-CPU variable to > indicate that it is ready to die. It then spins in a > tight loop (or does some other architecture-specified > operation to wait to be turned off). 
> > Note that there is no need for RCU to guess how long the > CPU might be executing RCU read-side critical sections. > > vii. When the task doing the offline operation sees the > updated per-CPU variable, it calls __cpu_die(). > > viii. The CPU_DEAD notifiers are invoked. > > ix. The check_for_tasks() function is invoked. > > x. Release the hotplug mutex. > > xi. Invoke the CPU_POST_DEAD notifiers. > > i. I do not believe that the CPU-online handling needs to change > much. Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-06 20:15 ` Srivatsa S. Bhat @ 2012-04-09 16:46 ` Paul E. McKenney 2012-04-10 7:56 ` Nikunj A Dadhania 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-09 16:46 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On Sat, Apr 07, 2012 at 01:45:44AM +0530, Srivatsa S. Bhat wrote: > On 04/06/2012 04:36 AM, Paul E. McKenney wrote: > > > Hello, > > > > Here is my attempt at a summary of the discussion. > > > > > Thanks for the summary, it is really helpful :-) > > > Srivatsa, I left out the preempt_disable() pieces, but would be happy > > to add them in when you let me know what you are thinking to do for > > de-stop_machine()ing CPU hotplug. > > > > > Ok.. > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > CPU-hotplug work breakout: > > > > 1. Read and understand the current generic code. > > Srivatsa Bhat has done this, as have Paul E. McKenney and > > Peter Zijlstra to a lesser extent. > > > "lesser extent"?? Hell no! :-) ;-) Certainly to a lesser extent on my part, but yes, I should not speak for Peter. > > 2. Read and understand the architecture-specific code, looking > > for opportunities to consolidate additional function into > > core code. > > > > a. Carry out any indicated consolidation. > > > > b. Convert all architectures to make use of the > > consolidated implementation. > > > > Not started. Low priority from a big.LITTLE perspective. > > > Recently this unexpectedly assumed high priority due to some scheduler > changes and things got fixed up temporarily. And in that context, > Peter Zijlstra gave some more technical pointers on what is wrong and needs > to be done right. 
Link: https://lkml.org/lkml/2012/3/22/149 > > Nikunj (in CC) has offered to work with me on this consolidation. Very cool! I have added the following: ------------------------------------------------------------------------ CONSOLIDATE ARCHITECTURE-SPECIFIC CPU-HOTPLUG CODE 1. Ensure that all CPU_STARTING notifiers complete before the incoming CPU is marked online (the blackfin architecture fails to do this). 2. Ensure that interrupts are disabled throughout the CPU_STARTING notifiers. Currently, blackfin, cris, m32r, mips, sh, sparc64, um, and x86 fail to do this properly. 3. Ensure that all architectures that use CONFIG_USE_GENERIC_SMP_HELPERS hold ipi_call_lock() over the entire CPU-online process. Currently, alpha, arm, m32r, mips, sh, and sparc32 seem to fail to do this properly. 4. Additional memory barriers are likely to be needed, for example, an smp_wmb() after setting cpu_active and an smp_rmb() in select_fallback_rq() before reading cpu_active. Srivatsa Bhat (srivatsa.bhat@linux.vnet.ibm.com) and Nikunj A Dadhania (nikunj@linux.vnet.ibm.com) are taking on this work. ------------------------------------------------------------------------ Please let me know if adjustments are needed. > > 3. Address the current kthread creation/teardown/migration > > performance issues. (More details below.) > > > > Highest priority from a big.LITTLE perspective. > > > > 4. Wean CPU-hotplug offlining from stop_machine(). > > (More details below.) > > > > Moderate priority from a big.LITTLE perspective. > > > > > > ADDRESSING KTHREAD CREATION/TEARDOWN/MIGRATION PERFORMANCE ISSUES > > > > 1. Evaluate approaches. Approaches currently under > > consideration include: > > > > a. Park the kthreads rather than tearing them down or > > migrating them. RCU currently takes this sort of > > approach. Note that RCU currently relies on both > > preempt_disable() and local_bh_disable() blocking the > > current CPU from going offline. > > > > b. 
Allow in-kernel kthreads to avoid the delay > > required to work around a bug in old versions of > > bash. (This bug is a failure to expect receiving > > a SIGCHILD signal corresponding to a child > > created by a fork() system call that has not yet > > returned.) > > > > This might be implemented using an additional > > CLONE_ flag. This should allow kthreads to > > be created and torn down much more quickly. > > > > c. Have some other TBD way to "freeze" a kthread. > > (As in "your clever idea here".) > > > > 2. Implement the chosen approach or approaches. (Different > > kernel subsystems might have different constraints, possibly > > requiring different kthread handling.) > > > > > > WEAN CPU-HOTPLUG OFFLINING FROM stop_machine() > > > > > > 1. CPU_DYING notifier fixes needed as of 3.2: > > > > o vfp_hotplug(): I believe that this works as-is. > > o s390_nohz_notify(): I believe that this works as-is. > > o x86_pmu_notifier(): I believe that this works as-is. > > o perf_ibs_cpu_notifier(): I don't know enough about > > APIC to say. > > o tboot_cpu_callback(): I believe that this works as-is, > > but this one returns NOTIFY_BAD to a CPU_DYING notifier, > > which is badness. But it looks like that case is a > > "cannot happen" case. Still needs to be fixed. > > o clockevents_notify(): This one acquires a global lock, > > so it should be safe as-is. > > o console_cpu_notify(): This one takes the same action > > for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and > > CPU_UP_CANCELLED that it does for CPU_DYING, so it > > should be OK. > > * rcu_cpu_notify(): This one needs adjustment as noted > > above, but nothing major. Patch has been posted, > > probably needs a bit of debugging. > > o migration_call(): I defer to Peter on this one. > > It looks to me like it is written to handle other > > CPUs, but... > > * workqueue_cpu_callback(): Might need help, does a > > non-atomic OR. > > o kvm_cpu_hotplug(): Uses a global spinlock, so should > > be OK as-is. > > > > 2. 
Evaluate designs for stop_machine()-free CPU hotplug. > > Implement the chosen design. An outline for a particular > > design is shown below, but the actual design might be > > quite different. > > > > 3. Fix issues with CPU Hotplug callback registration. Currently > > there is no totally-race-free way to register callbacks and do > > setup for already online cpus. > > > > Srivatsa had posted an incomplete patchset some time ago > > regarding this, which gives an idea of the direction he had > > in mind. > > http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 > > Gah, this has been "incomplete" for quite some time now.. I'll try to speed up > things a bit :-) Sounds good to me! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-09 16:46 ` Paul E. McKenney @ 2012-04-10 7:56 ` Nikunj A Dadhania 0 siblings, 0 replies; 38+ messages in thread From: Nikunj A Dadhania @ 2012-04-10 7:56 UTC (permalink / raw) To: paulmck, Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, 9 Apr 2012 09:46:28 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > 2. Read and understand the architecture-specific code, looking > > > for opportunities to consolidate additional function into > > > core code. > > > > > > a. Carry out any indicated consolidation. > > > > > > b. Convert all architectures to make use of the > > > consolidated implementation. > > > > > > Not started. Low priority from a big.LITTLE perspective. > > > > > > Recently this unexpectedly assumed high priority due to some scheduler > > changes and things got fixed up temporarily. And in that context, > > Peter Zijlstra gave some more technical pointers on what is wrong and needs > > to be done right. Link: https://lkml.org/lkml/2012/3/22/149 > > > > Nikunj (in CC) has offered to work with me on this consolidation. > > Very cool! I have added the following: > > ------------------------------------------------------------------------ > > CONSOLIDATE ARCHITECTURE-SPECIFIC CPU-HOTPLUG CODE > > 1. Ensure that all CPU_STARTING notifiers complete before the > incoming CPU is marked online (the blackfin architecture > fails to do this). > > 2. Ensure that interrupts are disabled throughout the CPU_STARTING > notifiers. Currently, blackfin, cris, m32r, mips, sh, sparc64, > um, and x86 fail to do this properly. > > 3. Ensure that all architectures that use CONFIG_USE_GENERIC_SMP_HELPERS > hold ipi_call_lock() over the entire CPU-online process. 
Currently, > alpha, arm, m32r, mips, sh, and sparc32 seem to fail to do this > properly. > > 4. Additional memory barriers are likely to be needed, for example, > an smp_wmb() after setting cpu_active and an smp_rmb() in > select_fallback_rq() before reading cpu_active. > > Srivatsa Bhat (srivatsa.bhat@linux.vnet.ibm.com) and Nikunj A Dadhania > (nikunj@linux.vnet.ibm.com) are taking on this work. > Sounds good :-) Regards Nikunj ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-05 17:39 ` Paul E. McKenney 2012-04-05 17:55 ` Paul E. McKenney @ 2012-04-06 19:52 ` Srivatsa S. Bhat 2012-04-09 17:13 ` Paul E. McKenney 2012-04-11 0:09 ` Steven Rostedt 1 sibling, 2 replies; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-06 19:52 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On 04/05/2012 11:09 PM, Paul E. McKenney wrote: > On Mon, Mar 19, 2012 at 08:18:42PM +0530, Srivatsa S. Bhat wrote: >> On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: >> >>> Hi, >>> >>> There had been some discussion on CPU Hotplug redesign/rework >>> some time ago, but it was buried under a thread with a different >>> subject. >>> (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) >>> >>> So I am opening a new thread with an appropriate subject to discuss >>> what needs to be done and how to go about it, as part of the rework. >>> >>> Peter Zijlstra and Paul McKenney had come up with TODO lists for the >>> rework, and here are their extracts from the previous discussion: > > Finally getting around to looking at this in more detail... > >> Additional things that I would like to add to the list: >> >> 1. Fix issues with CPU Hotplug callback registration. Currently there >> is no totally-race-free way to register callbacks and do setup >> for already online cpus. >> >> I had posted an incomplete patchset some time ago regarding this, >> which gives an idea of the direction I had in mind. 
>> http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826
>
> Another approach is to have the registration function return the
> CPU mask corresponding to the instant at which registration occurred,
> perhaps via an additional function argument that points to a
> cpumask_var_t that can be NULL if you don't care. Then you can
> do setup for the CPUs indicated in the mask.

That would look something like this, right?

register_cpu_notifier(nb, mask);
do_setup(mask);

If yes, then here is our problem:

1. A CPU hotplug operation can happen in between these two function calls.
2. No matter how we try to wrap these up with get/put_online_cpus(), it
   will still lead to either a race or a deadlock, depending on how we
   do it.

Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our
purpose, since the race with CPU hotplug would still exist, just like
before. So, let's consider what happens when we wrap both the functions
within get/put_online_cpus():

get_online_cpus();
register_cpu_notifier(nb, mask);
do_setup(mask);
put_online_cpus();

Unfortunately this leads to an ABBA deadlock (see below).

> Or am I missing the race you had in mind? Or is the point to make
> sure that the notifiers execute in order?

No, I wasn't referring to the order of execution of the notifiers here.
Well, the race I was concerned about was the one between a CPU hotplug
operation and a callback registration operation, which might lead to an
ABBA deadlock, with the two locks in question being the cpu_hotplug lock
and the cpu_add_remove_lock.

The heart of the problem is that something as simple as the following is
prone to an ABBA deadlock (note that I am not even talking about
do_setup() here):

get_online_cpus()
register_cpu_notifier() or any other variant
put_online_cpus()

In the above, the lock ordering is:
grab cpu_hotplug lock -> grab cpu_add_remove_lock.

But in a usual CPU hotplug operation, the lock ordering is the exact
reverse:
grab cpu_add_remove_lock -> grab cpu_hotplug lock.
Hence, we have an ABBA deadlock possibility.

I had posted a pictorial representation of the deadlock condition here:
https://lkml.org/lkml/2012/3/19/289

So, to get rid of this deadlock, the approach I proposed came out of the
following observation:

A code sequence such as:

register_cpu_notifier();

//Go on doing stuff that is irrelevant to CPU hotplug.

is totally safe from any kind of race or deadlock, the reason being that
CPU hotplug operations begin by taking the cpu_add_remove_lock, and
callback registration functions like the above also take only this
particular lock. IOW, the mutex that protects against a race between a
CPU hotplug operation and callback registration is the
cpu_add_remove_lock. The cpu_hotplug lock is actually irrelevant here!

    CPU0                                CPU 1
    register_cpu_notifier()             cpu_down()
    ----------------------------------------------------------------------
    acquire cpu_add_remove_lock
                                        Blocked on cpu_add_remove_lock
    do callback registration
    release cpu_add_remove_lock
                                        Only now we can proceed and acquire
                                        the cpu_add_remove_lock.

                                        acquire cpu_hotplug lock
                                        do cpu_down related work
                                        release cpu_hotplug lock

                                        release cpu_add_remove_lock

So my approach is: consider whatever we want to do while registering our
callback, including doing setup for already online cpus, and do *all of
it* within the section marked as "do callback registration". IOW, we
piggyback on something _known_ to be perfectly fine and hence avoid all
races and deadlocks.

>> 2. There is a mismatch between the code and the documentation around
>> the difference between [un/register]_hotcpu_notifier and
>> [un/register]_cpu_notifier. And I remember seeing several places in
>> the code that use them inconsistently. Not terribly important, but
>> good to fix it up while we are at it.
>
> The following led me to believe that they were the same:
>
> #define register_hotcpu_notifier(nb) register_cpu_notifier(nb)
> #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb)
>
> What am I missing here?
At first, even I thought that they were exactly the same. But then,
looking at the #ifdef magic in include/linux/cpu.h, I realized that there
is a subtle difference between the two:

The register_cpu_notifier() family of functions is defined whenever:
a. CONFIG_HOTPLUG_CPU is set
b. or, CONFIG_HOTPLUG_CPU is not set, but we are in core code,
   i.e., !defined(MODULE)

The hotcpu family of functions is different in the sense that these
functions are defined only when CONFIG_HOTPLUG_CPU is set. If
CONFIG_HOTPLUG_CPU is not set, they are defined as pretty much empty
functions, irrespective of whether we are calling them from core code or
from module code.

And the reasoning behind this difference is: strictly speaking,
CONFIG_HOTPLUG_CPU needs to be set only if we want to *offline* CPUs at
some point in time, for example during suspend/resume. Otherwise, we
don't need to set CONFIG_HOTPLUG_CPU.

But there is subsystem- and arch-specific code which needs to take action
even during CPU online operations (like booting), and hence its callback
registrations must work even when CONFIG_HOTPLUG_CPU is not set. Such
code must use the native register_cpu_notifier() family of functions.

Code that only cares about CPU hotplug when offlining is actually
possible can use the hotcpu_* family instead and get optimized out in the
!HOTPLUG_CPU case.

None of this seems to be documented in Documentation/cpu-hotplug.txt. In
fact, what that document explains (something about early init/late init)
is actually misleading, because it is outdated beyond recognition! :-(

Now coming to the inconsistent uses of these functions: if some core code
which needs to register its callback irrespective of CONFIG_HOTPLUG_CPU
uses the hotcpu* variants instead, it'll break in the !HOTPLUG_CPU case.
The other case (using the register_cpu* variants everywhere) is less
worrisome, in fact harmless, but such code could have been optimized out
in the !HOTPLUG_CPU case if the hotcpu* variants had been used.

>
>> 3.
There was another thread where stuff related to CPU hotplug had been >> discussed. It had exposed some new challenges to CPU hotplug, if we >> were to support asynchronous smp booting. >> >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 > > Good points! ;-) Thank you :-) > >> 4. Because the current CPU offline code depends on stop_machine(), every >> online CPU must cooperate with the offline event. This means, whenever >> we do a preempt_disable(), it ensures not only that that particular >> CPU won't go offline, but also that *any* CPU cannot go offline. This >> is more like a side-effect of using stop_machine(). >> >> So when trying to move over to stop_one_cpu(), we have to carefully audit >> places where preempt_disable() has been used in that manner (ie., >> preempt_disable used as a light-weight and non-blocking form of >> get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, >> a preempt disabled section will prevent only that particular CPU from >> going offline. >> >> I haven't audited preempt_disable() calls yet, but one such use was there >> in brlocks (include/linux/lglock.h) until quite recently. > > I was thinking in terms of the offline code path doing a synchronize_sched() > to allow preempt_disable() to retain a reasonable approximation of its > current semantics. This would require a pair of CPU masks, one for code > using CPU-based primitives (e.g., sending IPIs) and another for code > implementing those primitives. > > Or is there some situation that makes this approach fail? > Oh, right, using synchronize_sched() in the offline path would pretty much solve our problem.. I guess I overlooked that. Thanks for pointing it out! Regards, Srivatsa S. 
Bhat
* Re: CPU Hotplug rework 2012-04-06 19:52 ` Srivatsa S. Bhat @ 2012-04-09 17:13 ` Paul E. McKenney 2012-04-10 13:41 ` Srivatsa S. Bhat 2012-04-11 0:09 ` Steven Rostedt 1 sibling, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-09 17:13 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Sat, Apr 07, 2012 at 01:22:23AM +0530, Srivatsa S. Bhat wrote: > On 04/05/2012 11:09 PM, Paul E. McKenney wrote: > > > On Mon, Mar 19, 2012 at 08:18:42PM +0530, Srivatsa S. Bhat wrote: > >> On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: > >> > >>> Hi, > >>> > >>> There had been some discussion on CPU Hotplug redesign/rework > >>> some time ago, but it was buried under a thread with a different > >>> subject. > >>> (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > >>> > >>> So I am opening a new thread with an appropriate subject to discuss > >>> what needs to be done and how to go about it, as part of the rework. > >>> > >>> Peter Zijlstra and Paul McKenney had come up with TODO lists for the > >>> rework, and here are their extracts from the previous discussion: > > > > Finally getting around to looking at this in more detail... > > > >> Additional things that I would like to add to the list: > >> > >> 1. Fix issues with CPU Hotplug callback registration. Currently there > >> is no totally-race-free way to register callbacks and do setup > >> for already online cpus. > >> > >> I had posted an incomplete patchset some time ago regarding this, > >> which gives an idea of the direction I had in mind. 
> >> http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826
> >
> > Another approach is to have the registration function return the
> > CPU mask corresponding to the instant at which registration occurred,
> > perhaps via an additional function argument that points to a
> > cpumask_var_t that can be NULL if you don't care. Then you can
> > do setup for the CPUs indicated in the mask.
>
> That would look something like this, right?
>
> register_cpu_notifier(nb, mask);
> do_setup(mask);
>
> If yes, then here is our problem:
> 1. A CPU hotplug operation can happen in between these 2 function calls.
> 2. No matter how we try to wrap these up with get/put_online_cpus(), it
>    will still lead to either a race, or a deadlock, depending on how we
>    do it.

Hmmm... The call to register_cpu_notifier() excludes CPU-hotplug
operations, so any CPU-hotplug operation that happens after the
register_cpu_notifier() will invoke the notifier. This does mean that
we can then see concurrent invocation of the notifier from do_setup()
and from a CPU-hotplug operation, but it should not be hard to create
a reasonable locking scheme to prevent this.

BTW, I am assuming that it is OK for the notifiers to run out of order
in this case. Perhaps you are trying to preserve ordering?

The reason that I believe that it should be OK to allow misordering of
notifiers is that the subsystem in question is not fully initialized yet.
It is therefore probably not yet actually servicing requests; for
example, something like RCU might be refusing to start grace periods.
If it is not yet servicing requests, it should not matter what order it
sees the CPUs come online. Once do_setup() returns, it will see
subsequent CPU-hotplug operations in order, so at that point it might be
able to start servicing requests.

Or am I missing something else?

> Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our
> purpose, since the race with CPU Hotplug would still exist, just like
> before.
So, let's consider what happens when we wrap both the functions > within get/put_online_cpus(): > > get_online_cpus(); > register_cpu_notifier(nb, mask); > do_setup(mask); > put_online_cpus(); > > Unfortunately this leads to an ABBA deadlock (see below). > > > > Or am I missing the race you had in mind? Or is the point to make > > sure that the notifiers execute in order? > > No, I wasn't referring to the order of execution of the notifiers here. OK, then I am still missing something... > Well, the race I was concerned about was the one between a CPU hotplug > operation and a callback registration operation, which might lead to an > ABBA deadlock, with the 2 locks in question being cpu hotplug lock and > cpu_add_remove_lock. > > The heart of the problem is that, something as simple as the following > is prone to an ABBA deadlock (note that I am not even talking about > do_setup() here): > > get_online_cpus() > register_cpu_notifier() or any other variant > put_online_cpus() > > In the above, the lock ordering is: > grab cpu_hotplug lock -> grab cpu_add_remove_lock. But why do we need the get_online_cpus() in this case? What would it be preventing from happening and why do we care? > But in a usual CPU hotplug operation, the lock ordering is the exact reverse: > grab cpu_add_remove_lock -> grab cpu_hotplug lock. > > Hence, we have an ABBA deadlock possibility. > > I had posted a pictorial representation of the deadlock condition here: > https://lkml.org/lkml/2012/3/19/289 And Peter replied to this saying that he can remove the get_online_cpus() call, which gets rid of the ABBA deadlock condition, correct? > So, to get rid of this deadlock, the approach I proposed came out of the > following observation: > > A code such as: > > register_cpu_notifier(); > > //Go on doing stuff that are irrelevant to CPU hotplug. 
> is totally safe from any kind of race or deadlock, the reason being that
> CPU Hotplug operations begin by taking the cpu_add_remove_lock, and
> callback registration functions like the above also take only this
> particular lock. IOW, the mutex that protects against a race between a
> CPU hotplug operation and callback registration is the
> cpu_add_remove_lock. The cpu_hotplug lock is actually irrelevant here!

Yep, agreed, despite my confusion some weeks ago.

>     CPU0                                CPU 1
>     register_cpu_notifier()             cpu_down()
>     ----------------------------------------------------------------------
>     acquire cpu_add_remove_lock
>                                         Blocked on cpu_add_remove_lock
>     do callback registration
>     release cpu_add_remove_lock
>                                         Only now we can proceed and acquire
>                                         the cpu_add_remove_lock.
>
>                                         acquire cpu_hotplug lock
>                                         do cpu_down related work
>                                         release cpu_hotplug lock
>
>                                         release cpu_add_remove_lock
>
> So my approach is: consider whatever we want to do while registering our
> callback, including doing setup for already online cpus, and do *all of
> it* within the section marked as "do callback registration". IOW, we
> piggyback on something _known_ to be perfectly fine and hence avoid all
> races and deadlocks.

OK, interesting approach. But does this actually save anything in
any real situation, given that CPU hotplug notifiers are run during
initialization, before the subsystem being initialized is really doing
anything?

> >> 2. There is a mismatch between the code and the documentation around
> >> the difference between [un/register]_hotcpu_notifier and
> >> [un/register]_cpu_notifier. And I remember seeing several places in
> >> the code that use them inconsistently. Not terribly important, but
> >> good to fix it up while we are at it.
> >
> > The following led me to believe that they were the same:
> >
> > #define register_hotcpu_notifier(nb) register_cpu_notifier(nb)
> > #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb)
> >
> > What am I missing here?
> > At first, even I thought that they were exactly same. But then looking at the > #ifdef magic in include/linux/cpu.h, I realized that there is a subtle > difference between the two: > > The register_cpu_notifier() family of functions are defined whenever: > a. CONFIG_HOTPLUG_CPU is set > b. or, CONFIG_HOTPLUG_CPU is not set, but we are in core code > ie., !defined(MODULE) > > The hotcpu family of functions are different in the sense that they are > defined only when CONFIG_HOTPLUG_CPU is set. However, if CONFIG_HOTPLUG_CPU > is not set, they are defined as pretty much empty functions, irrespective of > whether we are calling them from core code or from module code. > > And the reasoning behind this difference is: strictly speaking, > CONFIG_HOTPLUG_CPU needs to be set only if we want to *offline* CPUs at > some point in time, like suspend/resume for example. Otherwise, we don't need > to set CONFIG_HOTPLUG_CPU. > > But there are subsystems/arch specific stuff which need to take action even > during CPU online operations (like booting), and hence their callback > registrations must work even when CONFIG_HOTPLUG_CPU is not set. So they must > use the native register_cpu_notifier() family of functions. > > If some code doesn't really care much when offlining of CPUs is not required, > then they can use the hotcpu_* family and optimize out for the !HOTPLUG_CPU > case. > > None of this seems to be documented in Documentation/cpu-hotplug.txt. In fact > what that document explains (something about early init/late init) is actually > misleading because it is out-dated beyond recognition! :-( Indeed, it appears that the last substantive change to this file was more than two years ago, and most of the work on the file happened more than five years ago. It certainly needs attention! I have added documentation updates to the list. 
> Now coming to the inconsistent uses of these functions, if some core code which > needs to register its callback irrespective of CONFIG_HOTPLUG_CPU, uses > hotcpu* variants instead, it'll break stuff in the !HOTPLUG_CPU case. > The other case (using register_cpu* variants in every place) is less worrisome, > in fact harmless, but could have been optimized out in the !HOTPLUG_CPU case, > if the hotcpu* variants had been used. OK, got it, thank you for the tutorial! > >> 3. There was another thread where stuff related to CPU hotplug had been > >> discussed. It had exposed some new challenges to CPU hotplug, if we > >> were to support asynchronous smp booting. > >> > >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 > >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 > >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 > >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 > > > > Good points! ;-) > > Thank you :-) > > >> 4. Because the current CPU offline code depends on stop_machine(), every > >> online CPU must cooperate with the offline event. This means, whenever > >> we do a preempt_disable(), it ensures not only that that particular > >> CPU won't go offline, but also that *any* CPU cannot go offline. This > >> is more like a side-effect of using stop_machine(). > >> > >> So when trying to move over to stop_one_cpu(), we have to carefully audit > >> places where preempt_disable() has been used in that manner (ie., > >> preempt_disable used as a light-weight and non-blocking form of > >> get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, > >> a preempt disabled section will prevent only that particular CPU from > >> going offline. > >> > >> I haven't audited preempt_disable() calls yet, but one such use was there > >> in brlocks (include/linux/lglock.h) until quite recently. 
> > > > I was thinking in terms of the offline code path doing a synchronize_sched() > > to allow preempt_disable() to retain a reasonable approximation of its > > current semantics. This would require a pair of CPU masks, one for code > > using CPU-based primitives (e.g., sending IPIs) and another for code > > implementing those primitives. > > > > Or is there some situation that makes this approach fail? > > Oh, right, using synchronize_sched() in the offline path would pretty much > solve our problem.. I guess I overlooked that. Thanks for pointing it out! Here is hoping. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-09 17:13 ` Paul E. McKenney @ 2012-04-10 13:41 ` Srivatsa S. Bhat 2012-04-10 15:46 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-10 13:41 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On 04/09/2012 10:43 PM, Paul E. McKenney wrote: > On Sat, Apr 07, 2012 at 01:22:23AM +0530, Srivatsa S. Bhat wrote: >>> >>>> Additional things that I would like to add to the list: >>>> >>>> 1. Fix issues with CPU Hotplug callback registration. Currently there >>>> is no totally-race-free way to register callbacks and do setup >>>> for already online cpus. >>>> >>>> I had posted an incomplete patchset some time ago regarding this, >>>> which gives an idea of the direction I had in mind. >>>> http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 >>> >>> Another approach is to have the registration function return the >>> CPU mask corresponding to the instant at which registration occurred, >>> perhaps via an additional function argument that points to a >>> cpumask_var_t that can be NULL if you don't care. Then you can >>> do setup for the CPUs indicated in the mask. >> >> That would look something like this right? >> >> register_cpu_notifier(nb, mask); >> do_setup(mask); >> >> If yes, then here is our problem: >> 1. A CPU hotplug operation can happen, in-between these 2 function calls. >> 2. No matter how we try to wrap these up with get/put_online_cpus(), it >> will still lead to either a race, or a deadlock, depending on how we >> do it. > > Hmmm... The call to register_cpu_notifier() excludes CPU-hotplug > operations, so any CPU-hotplug operation that happens after the > register_cpu_notifier() will invoke the notifier. Right. 
> This does mean that > we can then see concurrent invocation of the notifier from do_setup() > and from a CPU-hotplug operation, Exactly! > but it should not be hard to create > a reasonably locking scheme to prevent this. > Ah, this is where our approach differs ;-) My approach is to *avoid* the concurrent invocation. Yours seems to be to *handle* the concurrent invocation. > BTW, I am assuming that it is OK for the notifiers to run out of order > in this case. Perhaps you are trying to preserve ordering? > No, I am not worried about the ordering. Ordering is an issue only when we allow concurrent invocations and try to handle them. If we prevent concurrent invocation of notifiers, (mis)ordering of notifiers for subsequent CPU hotplug operations won't even come into the picture, right? > The reason that I believe that it should be OK to allow misordering of > notifiers is that the subsystem in question is not fully initialized yet. > It is therefore probably not yet actually servicing requests, for example, > something like RCU would might be refusing to start grace periods. > If it is not yet servicing requests, it should not matter what order > it sees the CPUs come online. Once do_setup() returns, it will see > subsequent CPU-hotplug operations in order, so at that point it might > be able to start servicing requests. > > Or am I missing something else? > Looks like the two of us are trying to solve the same problem in two different ways, hence the difficulty in understanding each other ;-) The problem: During registration of notifier, CPU hotplug cannot happen - guaranteed. Beyond that, CPU hotplug can happen, which includes the do_setup() phase. Hence, concurrent invocations of notifiers might take place and mess up things. Your solution/approach seems to be: 1. Register cpu hotplug notifiers. 2. Run setup for already online cpus, *knowing* that the notifier invocations can happen due to 2 sources (CPU hotplug and do_setup). 
Add appropriate locking to handle this.

This raises a new issue: notifiers can get executed out of order, because
there are 2 sources that are running the notifiers. Ignore this issue
because it doesn't appear to be troublesome in any way.

My thoughts on this:

1. The extra locking needs to be added to every initialization code path
that does this "register + do_setup()" thing.

2. At least I am not able to think of any simple locking scheme. Consider
some of the problems it has to address:

* do_setup is running for CPU 3, which is already online.
* Now somebody happens to take CPU 3 offline. What happens to the
notifiers? CPU_UP_PREPARE/CPU_ONLINE notifiers from do_setup() and
CPU_DOWN_PREPARE/CPU_DEAD notifiers from the CPU hotplug operation get
interleaved? That would be bad, wouldn't it?

(CPUs getting offlined while do_setup() is running might look like a
totally theoretical or rather impractical scenario. But it's not.
Consider the fact that initialization of this kind happens in the
module_init() code of some modules. And module load/unload racing with
CPU hotplug is not that much of a theoretical case.)

My solution/approach is:

1. Register cpu hotplug notifiers. (We know CPU hotplug is disabled
throughout this.)

2. Run setup for already online cpus, but ensure that the "CPU hotplug
disabled" phase which started in the previous step extends throughout
this step too, without any gaps in between. This would *prevent*
concurrent invocation of notifiers. This would also *prevent* misordering
of notifiers, because when we re-enable CPU hotplug after our do_setup(),
all the registered notifiers run in the right order for future CPU
hotplug operations.

My thoughts on this:

1. No extra locking necessary. Also, no locking added to each call site/
initialization code. Since all changes are made to the generic
register_cpu_notifier() function, it simply boils down to all the
initialization code adapting to the new form of callback registration by
providing appropriate parameters.

2. This is immune to any sort of CPU hotplug (offline/online) at any
point in time, without assumptions such as "CPU offline won't happen
during early boot" or "nobody will do CPU hotplug when loading/unloading
modules".

3. Looks simple (at least in my opinion), since no extra locking logic is
necessary :-) We just piggyback on what is already in place. And also, no
need to handle some arcanely complex scenarios like what I mentioned
above (interleaving of online/offline notifiers), since that scenario
simply won't occur here.

>> Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our
>> purpose, since the race with CPU Hotplug would still exist, just like
>> before. So, let's consider what happens when we wrap both the functions
>> within get/put_online_cpus():
>>
>> get_online_cpus();
>> register_cpu_notifier(nb, mask);
>> do_setup(mask);
>> put_online_cpus();
>>
>> Unfortunately this leads to an ABBA deadlock (see below).
>>
>>> Or am I missing the race you had in mind? Or is the point to make
>>> sure that the notifiers execute in order?
>>
>> No, I wasn't referring to the order of execution of the notifiers here.
>
> OK, then I am still missing something...
>
>> Well, the race I was concerned about was the one between a CPU hotplug
>> operation and a callback registration operation, which might lead to an
>> ABBA deadlock, with the 2 locks in question being the cpu_hotplug lock
>> and the cpu_add_remove_lock.
>>
>> The heart of the problem is that something as simple as the following
>> is prone to an ABBA deadlock (note that I am not even talking about
>> do_setup() here):
>>
>> get_online_cpus()
>> register_cpu_notifier() or any other variant
>> put_online_cpus()
>>
>> In the above, the lock ordering is:
>> grab cpu_hotplug lock -> grab cpu_add_remove_lock.
>
> But why do we need the get_online_cpus() in this case? What would it be
> preventing from happening and why do we care?

It was a case of too much short-circuiting, I admit.
I intended to show the concept of "CPU hotplug disabled" phase as being persistent throughout callback registration + do_setup(), using some known primitives for disabling CPU hotplug, like get/put_online_cpus(). That is: get_online_cpus() register_cpu_notifier() do_setup() put_online_cpus() Then I wanted to show that this can lead to an ABBA deadlock; and also point out that do_setup() has got nothing to do with the deadlock - the interplay between get/put_online_cpus and register_cpu_notifier() is the actual culprit. So I short-circuited all this into the code above ;-) > >> But in a usual CPU hotplug operation, the lock ordering is the exact reverse: >> grab cpu_add_remove_lock -> grab cpu_hotplug lock. >> >> Hence, we have an ABBA deadlock possibility. >> >> I had posted a pictorial representation of the deadlock condition here: >> https://lkml.org/lkml/2012/3/19/289 > > And Peter replied to this saying that he can remove the get_online_cpus() > call, which gets rid of the ABBA deadlock condition, correct? > Yes, but that is a special case - his code is an early-initcall (pre-smp). So there is no possibility of CPU hotplug at that point; only one CPU is up and running. (Even the async booting patch[1] doesn't alter this behaviour and hence we are safe.) This is not the case with initialization code that can occur much later in the boot sequence or which can be triggered after booting is complete, like module initialization code. >> So, to get rid of this deadlock, the approach I proposed came out of the >> following observation: >> >> A code such as: >> >> register_cpu_notifier(); >> >> //Go on doing stuff that are irrelevant to CPU hotplug. >> >> is totally safe from any kind of race or deadlock, the reason being that >> CPU Hotplug operations begin by taking the cpu_add_remove_lock, and >> callback registration functions like the above also take only this >> particular lock. 
IOW, the mutex that protects a race between CPU hotplug >> operation vs callback registration is the cpu_add_remove_lock. The >> cpu_hotplug lock is actually irrelevant here! > > Yep, agreed, despite my confusion some weeks ago. > >> >> CPU0 CPU 1 >> register_cpu_notifier() cpu_down() >> ---------------------------------------------------------------------- >> >> acquire cpu_add_remove_lock >> Blocked on cpu_add_remove_lock >> >> do callback registration >> >> release cpu_add_remove_lock >> >> Only now we can proceed and acquire >> the cpu_add_remove_lock. >> >> acquire cpu_hotplug lock >> do cpu_down related work >> release cpu_hotplug lock >> >> release cpu_add_remove_lock >> >> >> So my approach is: consider whatever we want to do while registering our >> callback, including doing setup for already online cpus; - and do *all of it* >> within the section marked as "do callback registration". IOW, we piggyback >> on something _known_ to be perfectly fine and hence avoid all races and >> deadlocks. > > OK, interesting approach. But does this actually save anything in > any real situation, given that CPU hotplug notifiers are run during > initialization, before the subsystem being initialized is really doing > anything? > Yes - we know that if we run multiple instances of a non-reentrant code simultaneously without locking, it will crash. Here the non-reentrant code is the individual CPU hotplug notifiers/callbacks. Why are they non-reentrant? Because it is a given that notifiers are always run one after the other in the CPU hotplug path. And hence none of the notifiers take care (locking) to see that they don't race with themselves; which was perfectly OK previously. What was the assumption earlier? During bootup, the onlining of CPUs won't run in parallel with the rest of the kernel initialization. IOW, when do_setup() is running (as part of initcalls), no CPU hotplug (online/offline) operation can occur. 
[Even the CPU onlining as part of booting won't happen in parallel with do_setup()] What changed that? The async boot patch[1]. It made CPU onlining during boot run in parallel with the rest of the kernel initialization. (This didn't go upstream though, because the kernel was not ready for such a change, as shown by these problems). What was the effect? Except for early initcalls, which are pre-SMP, all other initcalls can race with CPU Hotplug. That is, do_setup() can race with notifiers from ongoing CPU hotplug (CPU online operations, as part of booting) and hence the notifiers suddenly need extra locking or something to be reentrant. Why does my approach help? It ensures that do_setup() will never occur in parallel with CPU hotplug, at any time. Hence the individual notifiers need not watch their back - they can continue to be non-reentrant and still everything will work fine because we fix it at the callback registration level itself. Honestly, I wrote this patchset to fix issues opened up by the async booting patch[1]. That patch caused boot failures on powerpc [2] because of CPU Hotplug notifier races. And I believe the solution I proposed will fix it. Without the async booting patch, this was more or less a theoretical race. That patch made it not only real but also severe enough to cause boot failures. So, if the async booting design is not being pushed any further, then I guess we can simply ignore this theoretical race altogether and focus on more important issues (I am totally OK with that) ... and possibly revisit this race whenever it bites us again ;-) What do you think? [1]. http://thread.gmane.org/gmane.linux.kernel/1246209 [2]. http://thread.gmane.org/gmane.linux.kernel.next/20726/focus=20757 Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-10 13:41 ` Srivatsa S. Bhat @ 2012-04-10 15:46 ` Paul E. McKenney 2012-04-10 17:26 ` Srivatsa S. Bhat 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-10 15:46 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On Tue, Apr 10, 2012 at 07:11:50PM +0530, Srivatsa S. Bhat wrote: > On 04/09/2012 10:43 PM, Paul E. McKenney wrote: > > > On Sat, Apr 07, 2012 at 01:22:23AM +0530, Srivatsa S. Bhat wrote: > >>> > >>>> Additional things that I would like to add to the list: > >>>> > >>>> 1. Fix issues with CPU Hotplug callback registration. Currently there > >>>> is no totally-race-free way to register callbacks and do setup > >>>> for already online cpus. > >>>> > >>>> I had posted an incomplete patchset some time ago regarding this, > >>>> which gives an idea of the direction I had in mind. > >>>> http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 > >>> > >>> Another approach is to have the registration function return the > >>> CPU mask corresponding to the instant at which registration occurred, > >>> perhaps via an additional function argument that points to a > >>> cpumask_var_t that can be NULL if you don't care. Then you can > >>> do setup for the CPUs indicated in the mask. > >> > >> That would look something like this right? > >> > >> register_cpu_notifier(nb, mask); > >> do_setup(mask); > >> > >> If yes, then here is our problem: > >> 1. A CPU hotplug operation can happen, in-between these 2 function calls. > >> 2. No matter how we try to wrap these up with get/put_online_cpus(), it > >> will still lead to either a race, or a deadlock, depending on how we > >> do it. > > > > Hmmm... 
The call to register_cpu_notifier() excludes CPU-hotplug > > operations, so any CPU-hotplug operation that happens after the > > register_cpu_notifier() will invoke the notifier. > > Right. > > > This does mean that > > we can then see concurrent invocation of the notifier from do_setup() > > and from a CPU-hotplug operation, > > Exactly! > > > but it should not be hard to create > > a reasonable locking scheme to prevent this. > > Ah, this is where our approach differs ;-) > My approach is to *avoid* the concurrent invocation. Yours seems to be to > *handle* the concurrent invocation. I am lazy, I know. But I didn't see any reasonable way to prevent the concurrency. > > BTW, I am assuming that it is OK for the notifiers to run out of order > > in this case. Perhaps you are trying to preserve ordering? > > No, I am not worried about the ordering. Ordering is an issue only when we > allow concurrent invocations and try to handle them. If we prevent concurrent > invocation of notifiers, (mis)ordering of notifiers for subsequent CPU > hotplug operations won't even come into the picture, right? Right -- after all, we are probably initializing the pre-existing CPUs in some order other than their onlining order, right? ;-) > > The reason that I believe that it should be OK to allow misordering of > > notifiers is that the subsystem in question is not fully initialized yet. > > It is therefore probably not yet actually servicing requests, for example, > > something like RCU might be refusing to start grace periods. > > If it is not yet servicing requests, it should not matter what order > > it sees the CPUs come online. Once do_setup() returns, it will see > > subsequent CPU-hotplug operations in order, so at that point it might > > be able to start servicing requests. > > > > Or am I missing something else?
> > Looks like the two of us are trying to solve the same problem in two different > ways, hence the difficulty in understanding each other ;-) That usually does make things a bit more difficult. ;-) > The problem: > During registration of notifier, CPU hotplug cannot happen - guaranteed. > Beyond that, CPU hotplug can happen, which includes the do_setup() phase. > Hence, concurrent invocations of notifiers might take place and mess up things. > > Your solution/approach seems to be: > 1. Register cpu hotplug notifiers. > 2. Run setup for already online cpus, *knowing* that the notifier invocations > can happen due to 2 sources (CPU hotplug and do_setup). Add appropriate > locking to handle this. This raises a new issue: notifiers can get executed > out-of-order, because there are 2 sources that are running the notifiers. > Ignore this issue because it doesn't appear to be troublesome in any way. Yep! > My thoughts on this: > 1. The extra locking needs to be added to every initialization code that does > this "register + do_setup()" thing. > 2. At least I am not able to think of any simple locking scheme. Consider some > of the problems it has to address: > > * do_setup is running for CPU 3 which is already online. > * Now somebody happens to take CPU 3 offline. What happens to the notifiers? > CPU_UP_PREPARE/CPU_ONLINE notifiers from do_setup() and CPU_DOWN_PREPARE/ > CPU_DEAD notifiers from the CPU Hotplug operation get interleaved? > That would be bad, wouldn't it? > > (CPUs getting offlined while do_setup() is running might look like a totally > theoretical or rather impractical scenario. But it's not. Consider the fact > that the initialization of this kind happens in module_init() code of some > modules. And module load/unload racing with CPU hotplug is not that much of > a theoretical case). > > My solution/approach is: > 1. Register cpu hotplug notifiers. (we know CPU hotplug is disabled throughout > this) > 2.
Run setup for already online cpus, but ensure that the "CPU hotplug > disabled" phase which started in the previous step extends throughout this > step too, without any gaps in between. This would *prevent* concurrent > invocation of notifiers. This would also *prevent* misordering of notifiers, > because when we re-enable CPU hotplug after our do_setup(), all the > registered notifiers run in the right order for future CPU hotplug > operations. > > My thoughts on this: > 1. No extra locking necessary. Also, no locking added to each call site/ > initialization code. Since all changes are made to the generic > register_cpu_notifier() function, it simply boils down to all the > initialization code adapting to the new form of callback registration by > providing appropriate parameters. > 2. This is immune to any sort of CPU Hotplug (offline/online) at any point > in time, without assumptions such as 'CPU offline won't happen during early > boot' or 'nobody will do CPU Hotplug when loading/unloading modules'. > 3. Looks simple (at least in my opinion) since no extra locking logic is > necessary :-) We just piggyback on what is already in place. > And also, no need to handle some arcanely complex scenarios like what > I mentioned above (interleaving of online/offline notifiers), since that > scenario simply won't occur here. I was thinking in terms of the CPU_DOWN_PREPARE or CPU_DEAD notifiers checking the corresponding bit of the mask returned by the notifier registration function. If the bit is set, just clear it, otherwise execute the normal notifier function. With appropriate locking, of course. (Waves hands wildly.) Note that this does require that the notifier registration function store the cpumask while under the protection of cpu_maps_update_begin(). It also requires two wrapper functions for the underlying notifier function, one to be called from the actual notifier and the other from the initialization/setup code.
> >> Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our > >> purpose, since the race with CPU Hotplug would still exist, just like > >> before. So, let's consider what happens when we wrap both the functions > >> within get/put_online_cpus(): > >> > >> get_online_cpus(); > >> register_cpu_notifier(nb, mask); > >> do_setup(mask); > >> put_online_cpus(); > >> > >> Unfortunately this leads to an ABBA deadlock (see below). > >> > >> > >>> Or am I missing the race you had in mind? Or is the point to make > >>> sure that the notifiers execute in order? > >> > >> No, I wasn't referring to the order of execution of the notifiers here. > > > > OK, then I am still missing something... > > > >> Well, the race I was concerned about was the one between a CPU hotplug > >> operation and a callback registration operation, which might lead to an > >> ABBA deadlock, with the 2 locks in question being cpu hotplug lock and > >> cpu_add_remove_lock. > >> > >> The heart of the problem is that, something as simple as the following > >> is prone to an ABBA deadlock (note that I am not even talking about > >> do_setup() here): > >> > >> get_online_cpus() > >> register_cpu_notifier() or any other variant > >> put_online_cpus() > >> > >> In the above, the lock ordering is: > >> grab cpu_hotplug lock -> grab cpu_add_remove_lock. > > > > But why do we need the get_online_cpus() in this case? What would it be > > preventing from happening and why do we care? > > It was a case of too much short-circuiting, I admit. I intended to show the > concept of "CPU hotplug disabled" phase as being persistent throughout callback > registration + do_setup(), using some known primitives for disabling CPU > hotplug, like get/put_online_cpus(). 
> > That is: get_online_cpus() > register_cpu_notifier() > do_setup() > put_online_cpus() > > Then I wanted to show that this can lead to an ABBA deadlock; and also point > out that do_setup() has got nothing to do with the deadlock - the interplay > between get/put_online_cpus and register_cpu_notifier() is the actual culprit. > So I short-circuited all this into the code above ;-) OK. ;-) > >> But in a usual CPU hotplug operation, the lock ordering is the exact reverse: > >> grab cpu_add_remove_lock -> grab cpu_hotplug lock. > >> > >> Hence, we have an ABBA deadlock possibility. > >> > >> I had posted a pictorial representation of the deadlock condition here: > >> https://lkml.org/lkml/2012/3/19/289 > > > > And Peter replied to this saying that he can remove the get_online_cpus() > > call, which gets rid of the ABBA deadlock condition, correct? > > Yes, but that is a special case - his code is an early-initcall (pre-smp). > So there is no possibility of CPU hotplug at that point; only one CPU is up > and running. (Even the async booting patch[1] doesn't alter this behaviour and > hence we are safe.) This is not the case with initialization code that can > occur much later in the boot sequence or which can be triggered after booting > is complete, like module initialization code. But as long as the cpumask is written under the protection of cpu_maps_update_begin(), use of a common lock for the cpumask should suffice. If more scalability is required, multiple locks can be used, each protecting a different section of the cpumask.

From the notifiers, execution would be as follows:

    Acquire the subsystem's cpumask lock.
    Check the bit -- if going offline and the bit is set:
        Clear the bit.
        Release the subsystem's cpumask lock.
        Return.
    Invoke the underlying notifier function.
    Release the subsystem's cpumask lock.
Notifiers cannot run concurrently with initialization of the cpumask because the notifier registration function initializes the cpumask under protection of cpu_maps_update_begin(), which excludes CPU-hotplug operations.

From setup/initialization, execution would be as follows:

    Acquire the subsystem's cpumask lock.
    Find the first set bit. If there is no bit set:
        Release the subsystem's cpumask lock.
        Return "done" indication.
    Clear the bit.
    Invoke the underlying notifier function.
    Release the subsystem's cpumask lock.

Seem reasonable, or am I missing something? > >> So, to get rid of this deadlock, the approach I proposed came out of the > >> following observation: > >> > >> A code such as: > >> > >> register_cpu_notifier(); > >> > >> //Go on doing stuff that are irrelevant to CPU hotplug. > >> > >> is totally safe from any kind of race or deadlock, the reason being that > >> CPU Hotplug operations begin by taking the cpu_add_remove_lock, and > >> callback registration functions like the above also take only this > >> particular lock. IOW, the mutex that protects a race between CPU hotplug > >> operation vs callback registration is the cpu_add_remove_lock. The > >> cpu_hotplug lock is actually irrelevant here! > > > > Yep, agreed, despite my confusion some weeks ago. > > > >> > >> CPU0 CPU 1 > >> register_cpu_notifier() cpu_down() > >> ---------------------------------------------------------------------- > >> > >> acquire cpu_add_remove_lock > >> Blocked on cpu_add_remove_lock > >> > >> do callback registration > >> > >> release cpu_add_remove_lock > >> > >> Only now we can proceed and acquire > >> the cpu_add_remove_lock.
> >> > >> acquire cpu_hotplug lock > >> do cpu_down related work > >> release cpu_hotplug lock > >> > >> release cpu_add_remove_lock > >> > >> > >> So my approach is: consider whatever we want to do while registering our > >> callback, including doing setup for already online cpus; - and do *all of it* > >> within the section marked as "do callback registration". IOW, we piggyback > >> on something _known_ to be perfectly fine and hence avoid all races and > >> deadlocks. > > > > OK, interesting approach. But does this actually save anything in > > any real situation, given that CPU hotplug notifiers are run during > > initialization, before the subsystem being initialized is really doing > > anything? > > Yes - we know that if we run multiple instances of a non-reentrant code > simultaneously without locking, it will crash. Here the non-reentrant code > is the individual CPU hotplug notifiers/callbacks. Right -- and registering CPU hotplug notifiers during early boot remains a very nice labor-saving approach, because CPU-hotplug operations cannot run that early. However, CPU hotplug notifiers registered by loadable modules do not have this choice, and must therefore correctly handle CPU-hotplug operations running concurrently with module initialization. > Why are they non-reentrant? > Because it is a given that notifiers are always run one after the other in > the CPU hotplug path. And hence none of the notifiers take care (locking) > to see that they don't race with themselves; which was perfectly OK previously. And that is still the case -unless- your code must register CPU-hotplug notifiers after early boot. Right? > What was the assumption earlier? > During bootup, the onlining of CPUs won't run in parallel with the rest of > the kernel initialization. IOW, when do_setup() is running (as part of > initcalls), no CPU hotplug (online/offline) operation can occur. 
[Even the > CPU onlining as part of booting won't happen in parallel with do_setup()] > > What changed that? > The async boot patch[1]. It made CPU onlining during boot run in parallel > with the rest of the kernel initialization. (This didn't go upstream though, > because the kernel was not ready for such a change, as shown by these > problems). But even so, CPU online happens after the scheduler is initialized. So things like RCU can still work as they used to, relying on the fact that they are initializing before CPU-hotplug operations can run. Besides, this vulnerability existed beforehand -- a driver really could bring a CPU down during late boot, which would confuse drivers initializing concurrently, right? I doubt that anyone does this, but there is nothing stopping them from doing it. ;-) > What was the effect? > Except for early initcalls, which are pre-SMP, all other initcalls can race with > CPU Hotplug. That is, do_setup() can race with notifiers from ongoing CPU > hotplug (CPU online operations, as part of booting) and hence the notifiers > suddenly need extra locking or something to be reentrant. > > Why does my approach help? At this point, I must confess that I have lost track of exactly what your approach is... > It ensures that do_setup() will never occur in parallel with CPU hotplug, > at any time. Hence the individual notifiers need not watch their back - > they can continue to be non-reentrant and still everything will work fine > because we fix it at the callback registration level itself. > > Honestly, I wrote this patchset to fix issues opened up by the async booting > patch[1]. That patch caused boot failures on powerpc [2] because of CPU > Hotplug notifier races. And I believe the solution I proposed will fix it. > > Without the async booting patch, this was more or less a theoretical race. > That patch made it not only real but also severe enough to cause boot > failures.
> > So, if the async booting design is not being pushed any further, then I > guess we can simply ignore this theoretical race altogether and focus on > more important issues (I am totally OK with that) ... and possibly revisit > this race whenever it bites us again ;-) > > What do you think? > > [1]. http://thread.gmane.org/gmane.linux.kernel/1246209 > [2]. http://thread.gmane.org/gmane.linux.kernel.next/20726/focus=20757 Neither of the above two URLs points to a patch, so I am still a bit lost. That said, it is only a matter of time before async boot reappears -- systems are still getting more CPUs, and boot-time constraints are becoming more severe. So something will eventually need to be done, and we might as well deal with it ahead of time. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-10 15:46 ` Paul E. McKenney @ 2012-04-10 17:26 ` Srivatsa S. Bhat 0 siblings, 0 replies; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-10 17:26 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On 04/10/2012 09:16 PM, Paul E. McKenney wrote: > On Tue, Apr 10, 2012 at 07:11:50PM +0530, Srivatsa S. Bhat wrote: [This is a quick reply to give the links you requested. I'll reply to the other things after I read what you wrote more thoroughly.] >> Why does my approach help? > > At this point, I must confess that I have lost track of exactly what > your approach is... The same old "incomplete" patchset ;-) (Note that the patch 1/3 is complete. The "incomplete" tag is just because it is followed by changes only to powerpc (2/3) and sparc (3/3), while actually, many other places need to be changed. But the first patch in the series is definitely in full form.) https://lkml.org/lkml/2012/3/1/39 https://lkml.org/lkml/2012/3/1/40 https://lkml.org/lkml/2012/3/1/41 > >> It ensures that do_setup() will never occur in parallel with CPU hotplug, >> at any time. Hence the individual notifiers need not watch their back - >> they can continue to be non-reentrant and still everything will work fine >> because we fix it at the callback registration level itself. >> >> Honestly, I wrote this patchset to fix issues opened up by the async booting >> patch[1]. That patch caused boot failures in powerpc [2] because of CPU >> Hotplug notifier races. And I believe the solution I proposed will fix it. >> >> Without the async booting patch, this was more or less a theoretical race. >> That patch made it not only real but also severe enough to cause boot >> failures.
>> >> So, if the async booting design is not being pushed any further, then I >> guess we can simply ignore this theoretical race altogether and focus on >> more important issues (I am totally OK with that) ... and possibly revisit >> this race whenever it bites us again ;-) >> >> What do you think? >> >> [1]. http://thread.gmane.org/gmane.linux.kernel/1246209 >> [2]. http://thread.gmane.org/gmane.linux.kernel.next/20726/focus=20757 > > Neither of the above two URLs points to a patch, ?? Well, the first one points to the async booting patch and the second one points to a verbal root-cause analysis of the boot failure on powerpc, caused by that patch. Let me give equivalent links from lkml.org: [1]. https://lkml.org/lkml/2012/1/31/286 [2]. https://lkml.org/lkml/2012/2/13/383 Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-06 19:52 ` Srivatsa S. Bhat 2012-04-09 17:13 ` Paul E. McKenney @ 2012-04-11 0:09 ` Steven Rostedt 2012-04-11 0:28 ` Paul E. McKenney 1 sibling, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2012-04-11 0:09 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: paulmck, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Sat, 2012-04-07 at 01:22 +0530, Srivatsa S. Bhat wrote: > Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our > purpose, since the race with CPU Hotplug would still exist, just like > before. So, let's consider what happens when we wrap both the functions > within get/put_online_cpus(): > > get_online_cpus(); > register_cpu_notifier(nb, mask); > do_setup(mask); > put_online_cpus(); > > Unfortunately this leads to an ABBA deadlock (see below). > Just to throw out the stupid silly approach. What about creating a "__register_cpu_notifier()" that just does:

    int __ref __register_cpu_notifier(struct notifier_block *nb)
    {
            return raw_notifier_chain_register(&cpu_chain, nb);
    }

Also making cpu_maps_update_begin/done() global (and probably rename them). and then in the above code do:

    cpu_maps_update_begin();
    __register_cpu_notifier(nb);
    do_setup();
    cpu_maps_update_done();

Just saying, -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-11 0:09 ` Steven Rostedt @ 2012-04-11 0:28 ` Paul E. McKenney 2012-04-11 0:37 ` Steven Rostedt 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-11 0:28 UTC (permalink / raw) To: Steven Rostedt Cc: Srivatsa S. Bhat, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, Apr 10, 2012 at 08:09:59PM -0400, Steven Rostedt wrote: > On Sat, 2012-04-07 at 01:22 +0530, Srivatsa S. Bhat wrote: > > > Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our > > purpose, since the race with CPU Hotplug would still exist, just like > > before. So, let's consider what happens when we wrap both the functions > > within get/put_online_cpus(): > > > > get_online_cpus(); > > register_cpu_notifier(nb, mask); > > do_setup(mask); > > put_online_cpus(); > > > > Unfortunately this leads to an ABBA deadlock (see below). > > > > Just to throw out the stupid silly approach. > > What about creating a "__register_cpu_notifier()" that just does: > > int __ref __register_cpu_notifier(struct notifier_block *nb) > { > return raw_notifier_chain_register(&cpu_chain, nb); > } > > Also making cpu_maps_update_begin/done() global (and probably rename > them). > > and then in the above code do: > > cpu_maps_update_begin(); > __register_cpu_notifier(nb); > do_setup(); > cpu_maps_update_done(); > > > Just saying, That does have some attractive properties, now that you mention it. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-11 0:28 ` Paul E. McKenney @ 2012-04-11 0:37 ` Steven Rostedt 2012-04-11 1:00 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2012-04-11 0:37 UTC (permalink / raw) To: paulmck Cc: Srivatsa S. Bhat, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, 2012-04-10 at 17:28 -0700, Paul E. McKenney wrote: > > Just to throw out the stupid silly approach. > > > > What about creating a "__register_cpu_notifier()" that just does: > > > > int __ref __register_cpu_notifier(struct notifier_block *nb) > > { > > return raw_notifier_chain_register(&cpu_chain, nb); > > } > > > > Also making cpu_maps_update_begin/done() global (and probably rename > > them). I just noticed that the cpu_maps_update_begin/done() are already global. > > > > and then in the above code do: > > > > cpu_maps_update_begin(); > > __register_cpu_notifier(nb); > > do_setup(); > > cpu_maps_update_done(); > > > > > > Just saying, > > That does have some attractive properties, now that you mention it. ;-) Which property? Stupid or Silly ;-) -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-11 0:37 ` Steven Rostedt @ 2012-04-11 1:00 ` Paul E. McKenney 2012-04-11 6:02 ` Srivatsa S. Bhat 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-11 1:00 UTC (permalink / raw) To: Steven Rostedt Cc: Srivatsa S. Bhat, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, Apr 10, 2012 at 08:37:18PM -0400, Steven Rostedt wrote: > On Tue, 2012-04-10 at 17:28 -0700, Paul E. McKenney wrote: > > > > Just to throw out the stupid silly approach. > > > > > > What about creating a "__register_cpu_notifier()" that just does: > > > > > > int __ref __register_cpu_notifier(struct notifier_block *nb) > > > { > > > return raw_notifier_chain_register(&cpu_chain, nb); > > > } > > > > > > Also making cpu_maps_update_begin/done() global (and probably rename > > > them). > > I just noticed that the cpu_maps_update_begin/done() are already global. > > > > > > > and then in the above code do: > > > > > > cpu_maps_update_begin(); > > > __register_cpu_notifier(nb); > > > do_setup(); > > > cpu_maps_update_done(); > > > > > > > > > Just saying, > > > > That does have some attractive properties, now that you mention it. ;-) > > Which property? Stupid or Silly ;-) As with any piece of software, no matter how small, both. ;-) Of course, __register_cpu_notifier() would need lockdep checking to make sure that it wasn't called without the benefit of cpu_maps_update_begin(). I might be missing something, but as long as that was in place, seems like it is a lot simpler and easier to use than the alternatives that Srivatsa and I were kicking around. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-11 1:00 ` Paul E. McKenney @ 2012-04-11 6:02 ` Srivatsa S. Bhat 2012-04-11 12:28 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-11 6:02 UTC (permalink / raw) To: paulmck Cc: Steven Rostedt, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On 04/11/2012 06:30 AM, Paul E. McKenney wrote: > On Tue, Apr 10, 2012 at 08:37:18PM -0400, Steven Rostedt wrote: >> On Tue, 2012-04-10 at 17:28 -0700, Paul E. McKenney wrote: >> >>>> Just to throw out the stupid silly approach. >>>> >>>> What about creating a "__register_cpu_notifier()" that just does: >>>> >>>> int __ref __register_cpu_notifier(struct notifier_block *nb) >>>> { >>>> return raw_notifier_chain_register(&cpu_chain, nb); >>>> } >>>> >>>> Also making cpu_maps_update_begin/done() global (and probably rename >>>> them). >> >> I just noticed that the cpu_maps_update_begin/done() are already global. >> >>>> >>>> and then in the above code do: >>>> >>>> cpu_maps_update_begin(); >>>> __register_cpu_notifier(nb); >>>> do_setup(); >>>> cpu_maps_update_done(); >>>> >>>> Wow! Believe it or not, this is precisely the crux of the approach I was suggesting all along!! :-) Just that when put to code, it looked slightly different than this.. Sorry for not being clear. 
So here is what I proposed, in a simplified form: Modify the existing register_cpu_notifier() to this (by possibly giving it a different name): int __ref register_cpu_notifier(struct notifier_block *nb, int (*do_setup)(void)) { int ret; cpu_maps_update_begin(); ret = raw_notifier_chain_register(&cpu_chain, nb); do_setup(); cpu_maps_update_done(); return ret; } and then, in the caller, do: register_cpu_notifier(nb, do_setup); If the caller doesn't need any such extra setup, just do: register_cpu_notifier(nb, NULL); Of course, register_cpu_notifier() should handle NULL properly. (My patch [1] handles it, along with some other special cases.) That's it! Also, it is to be noted that cpu_maps_update_begin/done() are global, but not exported symbols - so modules can't use them. With the above approach, we need not make them exported symbols, since the caller need not care about these locks at all. >>>> Just saying, >>> >>> That does have some attractive properties, now that you mention it. ;-) >> >> Which property? Stupid or Silly ;-) > > As with any piece of software, no matter how small, both. ;-) > > Of course, __register_cpu_notifier() would need lockdep checking to make > sure that it wasn't called without the benefit of cpu_maps_update_begin(). Not with my approach ;-) Its all automatically handled :-) > I might be missing something, but as long as that was in place, seems > like it is a lot simpler and easier to use than the alternatives that > Srivatsa and I were kicking around. > Hehe :-) Thanks for simplifying things, Steve! [1]. https://lkml.org/lkml/2012/3/1/39 Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
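The shape of this pattern can be modelled in userspace (purely illustrative: a plain mutex stands in for cpu_maps_update_begin/done(), a singly-linked list for the raw notifier chain, and all fake_* names are invented here):

```c
#include <pthread.h>
#include <stddef.h>

/* Minimal stand-ins: a plain mutex for cpu_maps_update_begin/done()
 * and a singly-linked list for the raw notifier chain. */
struct notifier_block {
	int (*notifier_call)(void *info);
	struct notifier_block *next;
};

static pthread_mutex_t fake_maps_lock = PTHREAD_MUTEX_INITIALIZER;
static struct notifier_block *fake_chain;

static int setup_ran;	/* set by the sample setup callback below */

static int sample_setup(void)
{
	setup_ran = 1;
	return 0;
}

/*
 * Register a notifier and run an optional setup callback under the
 * same lock, so no hotplug event can slip in between registration
 * and setup.  A NULL do_setup means "no extra setup needed".
 */
static int fake_register_cpu_notifier(struct notifier_block *nb,
				      int (*do_setup)(void))
{
	int ret = 0;

	pthread_mutex_lock(&fake_maps_lock);	/* cpu_maps_update_begin() */
	nb->next = fake_chain;			/* chain registration */
	fake_chain = nb;
	if (do_setup)
		ret = do_setup();
	pthread_mutex_unlock(&fake_maps_lock);	/* cpu_maps_update_done() */
	return ret;
}
```

Callers that need no setup pass NULL; the lock never escapes the registration function, which is why cpu_maps_update_begin/done() would not need to become exported symbols under this scheme.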
* Re: CPU Hotplug rework 2012-04-11 6:02 ` Srivatsa S. Bhat @ 2012-04-11 12:28 ` Paul E. McKenney 0 siblings, 0 replies; 38+ messages in thread From: Paul E. McKenney @ 2012-04-11 12:28 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Steven Rostedt, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On Wed, Apr 11, 2012 at 11:32:57AM +0530, Srivatsa S. Bhat wrote: > On 04/11/2012 06:30 AM, Paul E. McKenney wrote: > > > On Tue, Apr 10, 2012 at 08:37:18PM -0400, Steven Rostedt wrote: > >> On Tue, 2012-04-10 at 17:28 -0700, Paul E. McKenney wrote: > >> > >>>> Just to throw out the stupid silly approach. > >>>> > >>>> What about creating a "__register_cpu_notifier()" that just does: > >>>> > >>>> int __ref __register_cpu_notifier(struct notifier_block *nb) > >>>> { > >>>> return raw_notifier_chain_register(&cpu_chain, nb); > >>>> } > >>>> > >>>> Also making cpu_maps_update_begin/done() global (and probably rename > >>>> them). > >> > >> I just noticed that the cpu_maps_update_begin/done() are already global. > >> > >>>> > >>>> and then in the above code do: > >>>> > >>>> cpu_maps_update_begin(); > >>>> __register_cpu_notifier(nb); > >>>> do_setup(); > >>>> cpu_maps_update_done(); > >>>> > >>>> > > > Wow! Believe it or not, this is precisely the crux of the approach I was > suggesting all along!! :-) Just that when put to code, it looked slightly > different than this.. Sorry for not being clear. 
> > So here is what I proposed, in a simplified form: > > Modify the existing register_cpu_notifier() to this (by possibly giving > it a different name): > > int __ref register_cpu_notifier(struct notifier_block *nb, > int (*do_setup)(void)) > { > int ret; > > cpu_maps_update_begin(); > ret = raw_notifier_chain_register(&cpu_chain, nb); > do_setup(); > cpu_maps_update_done(); > > return ret; > } > > and then, in the caller, do: > > register_cpu_notifier(nb, do_setup); > > If the caller doesn't need any such extra setup, just do: > > register_cpu_notifier(nb, NULL); > > > Of course, register_cpu_notifier() should handle NULL properly. > (My patch [1] handles it, along with some other special cases.) > > That's it! > > Also, it is to be noted that cpu_maps_update_begin/done() are global, but > not exported symbols - so modules can't use them. With the above approach, > we need not make them exported symbols, since the caller need not care about > these locks at all. > > >>>> Just saying, > >>> > >>> That does have some attractive properties, now that you mention it. ;-) > >> > >> Which property? Stupid or Silly ;-) > > > > As with any piece of software, no matter how small, both. ;-) > > > > Of course, __register_cpu_notifier() would need lockdep checking to make > > sure that it wasn't called without the benefit of cpu_maps_update_begin(). > > > Not with my approach ;-) Its all automatically handled :-) Good point, looks good! Thanx, Paul > > I might be missing something, but as long as that was in place, seems > > like it is a lot simpler and easier to use than the alternatives that > > Srivatsa and I were kicking around. > > > > > Hehe :-) Thanks for simplifying things, Steve! > > > [1]. https://lkml.org/lkml/2012/3/1/39 > > Regards, > Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 14:44 CPU Hotplug rework Srivatsa S. Bhat 2012-03-19 14:48 ` Srivatsa S. Bhat @ 2012-03-19 23:42 ` Rusty Russell 2012-03-20 10:42 ` Peter Zijlstra 1 sibling, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-19 23:42 UTC (permalink / raw) To: Srivatsa S. Bhat, Peter Zijlstra, Paul E. McKenney Cc: Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, 19 Mar 2012 20:14:25 +0530, "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote: > Hi, > > There had been some discussion on CPU Hotplug redesign/rework > some time ago, but it was buried under a thread with a different > subject. > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > So I am opening a new thread with an appropriate subject to discuss > what needs to be done and how to go about it, as part of the rework. > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > rework, and here are their extracts from the previous discussion: This is possible, but quite a lot of tricky auditing work. There's an underlying assumption that stop_machine is the slow part, since it feels so heavy. Unfortunately, this doesn't seem to be the case in my testing. The time for hotplug seems to be moving all the threads around. So how about: (1) Let's not shutdown per-cpu kthreads, just leave them there to run if the CPU comes back. (2) Do something more efficient with userspace threads than migrating them one at a time. Otherwise, we risk doing a great deal of work and gaining nothing (cleanups aside, of course). Thanks, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 23:42 ` Rusty Russell @ 2012-03-20 10:42 ` Peter Zijlstra 2012-03-20 23:00 ` Rusty Russell 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2012-03-20 10:42 UTC (permalink / raw) To: Rusty Russell Cc: Srivatsa S. Bhat, Paul E. McKenney, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, 2012-03-20 at 10:12 +1030, Rusty Russell wrote: > On Mon, 19 Mar 2012 20:14:25 +0530, "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote: > > Hi, > > > > There had been some discussion on CPU Hotplug redesign/rework > > some time ago, but it was buried under a thread with a different > > subject. > > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > > > So I am opening a new thread with an appropriate subject to discuss > > what needs to be done and how to go about it, as part of the rework. > > > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > > rework, and here are their extracts from the previous discussion: > > This is possible, but quite a lot of tricky auditing work. There's an > underlying assumption that stop_machine is the slow part, since it feels > so heavy. Depends on the machine and the needs. For the regular desktop with a regular kernel, the stop_machine in hotplug isn't really a problem. For _BIG_ machines stop_machine is a problem, for -RT stop_machine is a problem. So if we're going to re-architect hotplug anyway, it would be very good to get rid of it, because I really don't see any hard reasons why we would need it. > Unfortunately, this doesn't seem to be the case in my testing. The time > for hotplug seems to be moving all the threads around. So how about: Agreed, the thread creation on online is the most expensive operation. 
> (1) Let's not shutdown per-cpu kthreads, just leave them there to run > if the CPU comes back. Wasn't as easy as it sounds, but should be doable. > (2) Do something more efficient with userspace threads than migrating > them one at a time. Sadly that can't really be done. We need to pick up every task (userspace, but also running kernel threads) and update their state. > Otherwise, we risk doing a great deal of work and gaining nothing > (cleanups aside, of course). I don't really think it's possible to spend too much time cleaning up hotplug at this point :-) ^ permalink raw reply [flat|nested] 38+ messages in thread
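For reference, the "leave them there" approach is essentially what later landed in mainline as kthread parking (kthread_park()/kthread_unpark() and the smpboot infrastructure). A rough userspace model of the park handshake follows; all fake_* names are invented for illustration, and the real kernel code uses completions and TASK_PARKED rather than condition variables:

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

struct fake_kthread {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool should_park;	/* request from the "hotplug" side */
	bool parked;		/* acknowledgement from the thread */
	bool should_stop;
	long work_done;		/* stand-in for real per-cpu work  */
};

static void *fake_kthread_fn(void *arg)
{
	struct fake_kthread *kt = arg;

	pthread_mutex_lock(&kt->lock);
	while (!kt->should_stop) {
		if (kt->should_park) {
			/* CPU went down: acknowledge and sleep in place
			 * instead of exiting; unpark resumes us. */
			kt->parked = true;
			pthread_cond_broadcast(&kt->cond);
			while (kt->should_park && !kt->should_stop)
				pthread_cond_wait(&kt->cond, &kt->lock);
			kt->parked = false;
			continue;
		}
		kt->work_done++;
		pthread_cond_broadcast(&kt->cond);
		/* Drop the lock briefly so park/stop requests get in. */
		pthread_mutex_unlock(&kt->lock);
		sched_yield();
		pthread_mutex_lock(&kt->lock);
	}
	pthread_mutex_unlock(&kt->lock);
	return NULL;
}

static void fake_kthread_create(struct fake_kthread *kt, pthread_t *tid)
{
	pthread_mutex_init(&kt->lock, NULL);
	pthread_cond_init(&kt->cond, NULL);
	kt->should_park = kt->parked = kt->should_stop = false;
	kt->work_done = 0;
	pthread_create(tid, NULL, fake_kthread_fn, kt);
}

/* Block until the thread has done at least n units of work. */
static void fake_kthread_wait_for_work(struct fake_kthread *kt, long n)
{
	pthread_mutex_lock(&kt->lock);
	while (kt->work_done < n)
		pthread_cond_wait(&kt->cond, &kt->lock);
	pthread_mutex_unlock(&kt->lock);
}

/* Ask the thread to park and wait until it has done so. */
static void fake_kthread_park(struct fake_kthread *kt)
{
	pthread_mutex_lock(&kt->lock);
	kt->should_park = true;
	pthread_cond_broadcast(&kt->cond);
	while (!kt->parked)
		pthread_cond_wait(&kt->cond, &kt->lock);
	pthread_mutex_unlock(&kt->lock);
}

static void fake_kthread_unpark(struct fake_kthread *kt)
{
	pthread_mutex_lock(&kt->lock);
	kt->should_park = false;
	pthread_cond_broadcast(&kt->cond);
	pthread_mutex_unlock(&kt->lock);
}

static void fake_kthread_stop(struct fake_kthread *kt, pthread_t tid)
{
	pthread_mutex_lock(&kt->lock);
	kt->should_stop = true;
	pthread_cond_broadcast(&kt->cond);
	pthread_mutex_unlock(&kt->lock);
	pthread_join(tid, NULL);
}
```

The point of the handshake is that the same thread survives the offline/online cycle: no teardown, no re-creation cost on onlining.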
* Re: CPU Hotplug rework 2012-03-20 10:42 ` Peter Zijlstra @ 2012-03-20 23:00 ` Rusty Russell 2012-03-21 9:01 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-20 23:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Srivatsa S. Bhat, Paul E. McKenney, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, 20 Mar 2012 11:42:31 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Tue, 2012-03-20 at 10:12 +1030, Rusty Russell wrote: > Depends on the machine and the needs. For the regular desktop with a > regular kernel, the stop_machine in hotplug isn't really a problem. For > _BIG_ machines stop_machine is a problem I tested on a PPC 64-way a few years ago, but let's get a really big machine and re-run the tests. Simplest is to benchmark module removal, which uses a very trivial stop_machine call. Old test code below, but it'll need to be updated (module insertion no longer uses stop_machine). > for -RT stop_machine is a problem. If this is really about -RT, let's say so. There's nothing *wrong* with that, but it feels more honest. > So if we're going to re-architect hotplug anyway, it would be very good > to get rid of it, because I really don't see any hard reasons why we > would need it. Absolutely. It was an easy way to introduce it, but it's not fundamental. > > Unfortunately, this doesn't seem to be the case in my testing. The time > > for hotplug seems to be moving all the threads around. So how about: > > Agreed, the thread creation on online is the most expensive operation. > > > (1) Let's not shutdown per-cpu kthreads, just leave them there to run > > if the CPU comes back. > > Wasn't as easy as it sounds, but should be doable. > > > (2) Do something more efficient with userspace threads than migrating > > them one at a time. 
> > Sadly that can't really be done. We need to pick up every task > (userspace, but also running kernel threads) and update their state. What if we had an "orphan" runqueue which everyone pulled from? Then we could grab the lock, move them all to the fake rq, then let stuff happen normally. Maybe that's crap, but at least we could move the migration out of the hotplug callback somehow. > > Otherwise, we risk doing a great deal of work and gaining nothing > > (cleanups aside, of course). > > I don't really think it's possible to spend too much time cleaning up hotplug at this point :-) There is that, yes :) Cheers, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org

/* measure_latency.c */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/time.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <string.h>
#include "time_diff.h"

/* "Copyright 2007 <kathy.staples@au.ibm.com> IBM Corp" */

#define streq(a,b) (strcmp((a),(b)) == 0)

static uint64_t timeval_to_usecs(struct timeval convert)
{
	/* this function works out the number of microseconds */
	return (convert.tv_sec * (uint64_t)1000000 + convert.tv_usec);
}

static void *grab_file(const char *filename, unsigned long *size)
{
	unsigned int max = 16384;
	int ret, fd;
	void *buffer = malloc(max);

	if (!buffer)
		return NULL;

	if (streq(filename, "-"))
		fd = dup(STDIN_FILENO);
	else
		fd = open(filename, O_RDONLY, 0);
	if (fd < 0)
		return NULL;

	*size = 0;
	while ((ret = read(fd, buffer + *size, max - *size)) > 0) {
		*size += ret;
		if (*size == max)
			buffer = realloc(buffer, max *= 2);
	}
	if (ret < 0) {
		free(buffer);
		buffer = NULL;
	}
	close(fd);
	return buffer;
}

extern long init_module(void *, unsigned long, const char *);

/* If module is NULL, merely go through the motions. */
static void do_modprobe(int cpu, int pollfd, int secs, const char *module)
{
	struct sched_param sparam = { .sched_priority = 99 };
	cpu_set_t this_cpu;
	fd_set readfds;
	int error;
	struct timeval timeout = { .tv_sec = 5 };
	void *file;
	unsigned long len;

	if (module) {
		file = grab_file(module, &len);
		if (!file)
			err(1, "Loading file %s", module);
	}

	CPU_ZERO(&this_cpu);
	CPU_SET(cpu, &this_cpu);

	if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &this_cpu) != 0)
		err(1, "Could not move modprobe to cpu %i", cpu);

	if (sched_setscheduler(getpid(), SCHED_FIFO, &sparam) != 0)
		err(1, "Could not set FIFO scheduler for modprobe");

	/* Wait for go signal. */
	FD_ZERO(&readfds);
	FD_SET(pollfd, &readfds);

	/* We can timeout. */
	if (select(pollfd + 1, &readfds, NULL, NULL, &timeout) != 1)
		exit(1);

	/* Sleep until halfway through. */
	usleep(secs * 500000);

	if (module) {
		error = init_module(file, len, "");
		if (error)
			err(1, "init_module '%s'", module);
	}
	printf("Modprobe done on cpu %i\n", cpu);
	exit(0);
}

static void measure_latency(int cpu, int secs, int writefd, int pollfd)
{
	struct timeval start_time, now, elapsed_time, previous_time, diff;
	uint64_t least, max_diff, no_of_diffs;
	cpu_set_t this_cpu;
	fd_set readfds;
	struct timeval timeout = { .tv_sec = 5 };
	char buf[1024];
	struct sched_param sparam = { .sched_priority = 50 };

	least = UINT64_MAX;
	max_diff = 0;
	no_of_diffs = 0;

	CPU_ZERO(&this_cpu);
	CPU_SET(cpu, &this_cpu);

	if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &this_cpu) != 0)
		err(1, "Could not move to cpu %i", cpu);

	if (sched_setscheduler(getpid(), SCHED_FIFO, &sparam) != 0)
		err(1, "Could not set FIFO scheduler");

	/* Note that we're ready. */
	write(writefd, "", 1);

	/* Wait for go signal. */
	FD_ZERO(&readfds);
	FD_SET(pollfd, &readfds);

	/* We can timeout. */
	if (select(pollfd + 1, &readfds, NULL, NULL, &timeout) != 1)
		exit(1);

	gettimeofday(&start_time, NULL);
	previous_time = start_time;

	do {
		gettimeofday(&now, NULL);
		/* call conv_timeval func; apply to now and previous time; calc diff */
		time_diff(&previous_time, &now, &diff);

		if (timeval_to_usecs(diff) > max_diff)
			max_diff = timeval_to_usecs(diff);
		if (timeval_to_usecs(diff) < least)
			/* This should always return 0 */
			least = timeval_to_usecs(diff);

		/* Work out time to elapse since the starting time */
		time_diff(&start_time, &now, &elapsed_time);

		/* reset previous_time to now */
		previous_time = now;
		no_of_diffs++;
	} while (elapsed_time.tv_sec < secs);

	sprintf(buf, "CPU %u: %llu diffs, min/avg/max = %llu/%llu/%llu\n",
		cpu, no_of_diffs, least,
		timeval_to_usecs(elapsed_time) / no_of_diffs, max_diff);
	write(STDOUT_FILENO, buf, strlen(buf));
	exit(0);
}

int main(int argc, char *argv[])
{
	int i, secs, status, tochildren[2], fromchild[2], arg;
	const char *module;

	if (argc < 3) {
		printf("Usage: %s [--modprobe=<module>] <seconds> <cpunum>...\n",
		       argv[0]);
		exit(1);
	}

	arg = 1;
	if (strncmp(argv[arg], "--modprobe=", strlen("--modprobe=")) == 0) {
		module = argv[arg] + 11;
		arg++;
	} else
		module = NULL;

	if (pipe(tochildren) != 0 || pipe(fromchild) != 0)
		err(1, "Creating pipes");

	secs = atoi(argv[arg++]);

	switch (fork()) {
	case -1:
		err(1, "fork failed");
	case 0:
		do_modprobe(atoi(argv[arg]), tochildren[0], secs, module);
	}

	for (i = arg + 1; i < argc; i++) {
		char c;

		switch (fork()) {
		case -1:
			err(1, "fork failed");
		case 0:
			measure_latency(atoi(argv[i]), secs,
					fromchild[1], tochildren[0]);
		}
		if (read(fromchild[0], &c, 1) != 1)
			err(1, "Read from child failed");
	}

	/* Tell them to go. */
	write(tochildren[1], "", 1);

	/* Wait for the children. */
	status = 0;
	for (i = arg; i < argc; i++) {
		if (status == 0) {
			wait(&status);
			if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
				status = 1;
		} else
			wait(NULL);
	}
	return status;
}

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-20 23:00 ` Rusty Russell @ 2012-03-21 9:01 ` Peter Zijlstra 2012-03-22 4:25 ` Rusty Russell 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2012-03-21 9:01 UTC (permalink / raw) To: Rusty Russell Cc: Srivatsa S. Bhat, Paul E. McKenney, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Wed, 2012-03-21 at 09:30 +1030, Rusty Russell wrote: > > > (2) Do something more efficient with userspace threads than migrating > > > them one at a time. > > > > Sadly that can't really be done. We need to pick up every task > > (userspace, but also running kernel threads) and update their state. > > What if we had an "orphan" runqueue which everyone pulled from? Then we > could grab the lock, move them all to the fake rq, then let stuff happen > normally. Well, we could simply let them sit where they are and fudge load-balance to consider it a source but not a destination until it's empty, but it might be somewhat tricky to make it fast enough to not introduce noticeable latencies. Also, you really don't want everyone to pull, that's a serialization/scalability problem. Also, since we really only move the currently runnable tasks it shouldn't be too many in the first place. Is it really that expensive? > Maybe that's crap, but at least we could move the migration out of the > hotplug callback somehow. Thing is, if it's really too much for some people, they can orchestrate it such that it's not. Just move everybody in a cpuset, clear the to be offlined cpu from the cpuset's mask -- this will migrate everybody away. Then hotplug will find an empty runqueue and it's fast, no? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-21 9:01 ` Peter Zijlstra @ 2012-03-22 4:25 ` Rusty Russell 2012-03-22 22:49 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-22 4:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Srivatsa S. Bhat, Paul E. McKenney, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Wed, 2012-03-21 at 09:30 +1030, Rusty Russell wrote: > > > > (2) Do something more efficient with userspace threads than migrating > > > > them one at a time. > > > > > > Sadly that can't really be done. We need to pick up every task > > > (userspace, but also running kernel threads) and update their state. > > > > What if we had an "orphan" runqueue which everyone pulled from? Then we > > could grab the lock, move them all to the fake rq, then let stuff happen > > normally. > > Well, we could simply let them sit where they are and fudge load-balance > to consider it a source but not a destination until its empty, but it > might be somewhat tricky to make it fast enough to not introduce > noticable latencies. Also, you really don't want everyone to pull, > that's a serialization/scalability problem. > > Also, since we really only move the currently runnable tasks it > shouldn't be too many in the first place. Is it really that expensive? Good question, requires measurement to answer. > > Maybe that's crap, but at least we could move the migration out of the > > hotplug callback somehow. > > Thing is, if its really too much for some people, they can orchestrate > it such that its not. Just move everybody in a cpuset, clear the to be > offlined cpu from the cpuset's mask -- this will migrate everybody away. > Then hotplug will find an empty runqueue and its fast, no? 
I like this solution better. Thanks, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-22 4:25 ` Rusty Russell @ 2012-03-22 22:49 ` Paul E. McKenney 2012-03-23 23:27 ` Rusty Russell 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-03-22 22:49 UTC (permalink / raw) To: Rusty Russell Cc: Peter Zijlstra, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Thu, Mar 22, 2012 at 02:55:04PM +1030, Rusty Russell wrote: > On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > On Wed, 2012-03-21 at 09:30 +1030, Rusty Russell wrote: > > > > > (2) Do something more efficient with userspace threads than migrating > > > > > them one at a time. > > > > > > > > Sadly that can't really be done. We need to pick up every task > > > > (userspace, but also running kernel threads) and update their state. > > > > > > What if we had an "orphan" runqueue which everyone pulled from? Then we > > > could grab the lock, move them all to the fake rq, then let stuff happen > > > normally. > > > > Well, we could simply let them sit where they are and fudge load-balance > > to consider it a source but not a destination until its empty, but it > > might be somewhat tricky to make it fast enough to not introduce > > noticable latencies. Also, you really don't want everyone to pull, > > that's a serialization/scalability problem. > > > > Also, since we really only move the currently runnable tasks it > > shouldn't be too many in the first place. Is it really that expensive? > > Good question, requires measurement to answer. > > > > Maybe that's crap, but at least we could move the migration out of the > > > hotplug callback somehow. > > > > Thing is, if its really too much for some people, they can orchestrate > > it such that its not. 
Just move everybody in a cpuset, clear the to be > > offlined cpu from the cpuset's mask -- this will migrate everybody away. > > Then hotplug will find an empty runqueue and its fast, no? > > I like this solution better. As long as we have some way to handle kthreads that are algorithmically tied to a given CPU. There are coding conventions to handle this, for example, do everything with preemption disabled and just after each preempt_disable() verify that you are in fact running on the correct CPU, but it is easy to imagine improvements. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
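The coding convention Paul describes (pin yourself with preempt_disable(), then verify you are where you think you are) might look like this in kernel-style code. A sketch only; my_percpu_work() and the retry policy are made up for illustration:

```c
/* Sketch of the "check the CPU right after preempt_disable()" idiom. */
static int my_percpu_work(int target_cpu)
{
	int ret = -EAGAIN;

	preempt_disable();
	if (smp_processor_id() == target_cpu) {
		/* We cannot migrate until preempt_enable(), so it is
		 * safe to touch this CPU's private state here. */
		/* ... operate on per-cpu data ... */
		ret = 0;
	}
	preempt_enable();
	return ret;	/* caller retries elsewhere on -EAGAIN */
}
```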
* Re: CPU Hotplug rework 2012-03-22 22:49 ` Paul E. McKenney @ 2012-03-23 23:27 ` Rusty Russell 2012-03-24 0:23 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-23 23:27 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Thu, 22 Mar 2012 15:49:20 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Thu, Mar 22, 2012 at 02:55:04PM +1030, Rusty Russell wrote: > > On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > Thing is, if its really too much for some people, they can orchestrate > > > it such that its not. Just move everybody in a cpuset, clear the to be > > > offlined cpu from the cpuset's mask -- this will migrate everybody away. > > > Then hotplug will find an empty runqueue and its fast, no? > > > > I like this solution better. > > As long as we have some way to handle kthreads that are algorithmically > tied to a given CPU. There are coding conventions to handle this, for > example, do everything with preemption disabled and just after each > preempt_disable() verify that you are in fact running on the correct > CPU, but it is easy to imagine improvements. I don't think we should move per-cpu kthreads at all. Let's stop trying to save a few bytes of memory, and just leave them frozen. They'll run again if/when the CPU returns. Cheers, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-23 23:27 ` Rusty Russell @ 2012-03-24 0:23 ` Paul E. McKenney 2012-03-26 0:41 ` Rusty Russell 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-03-24 0:23 UTC (permalink / raw) To: Rusty Russell Cc: Peter Zijlstra, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Sat, Mar 24, 2012 at 09:57:32AM +1030, Rusty Russell wrote: > On Thu, 22 Mar 2012 15:49:20 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > On Thu, Mar 22, 2012 at 02:55:04PM +1030, Rusty Russell wrote: > > > On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > Thing is, if its really too much for some people, they can orchestrate > > > > it such that its not. Just move everybody in a cpuset, clear the to be > > > > offlined cpu from the cpuset's mask -- this will migrate everybody away. > > > > Then hotplug will find an empty runqueue and its fast, no? > > > > > > I like this solution better. > > > > As long as we have some way to handle kthreads that are algorithmically > > tied to a given CPU. There are coding conventions to handle this, for > > example, do everything with preemption disabled and just after each > > preempt_disable() verify that you are in fact running on the correct > > CPU, but it is easy to imagine improvements. > > I don't think we should move per-cpu kthreads at all. Let's stop trying > to save a few bytes of memory, and just leave them frozen. They'll run > again if/when the CPU returns. OK, that would work for me. So, how do I go about freezing RCU's per-CPU kthreads? Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-24 0:23 ` Paul E. McKenney @ 2012-03-26 0:41 ` Rusty Russell 2012-03-26 8:02 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-26 0:41 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Fri, 23 Mar 2012 17:23:47 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Sat, Mar 24, 2012 at 09:57:32AM +1030, Rusty Russell wrote: > > On Thu, 22 Mar 2012 15:49:20 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > On Thu, Mar 22, 2012 at 02:55:04PM +1030, Rusty Russell wrote: > > > > On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > > Thing is, if its really too much for some people, they can orchestrate > > > > > it such that its not. Just move everybody in a cpuset, clear the to be > > > > > offlined cpu from the cpuset's mask -- this will migrate everybody away. > > > > > Then hotplug will find an empty runqueue and its fast, no? > > > > > > > > I like this solution better. > > > > > > As long as we have some way to handle kthreads that are algorithmically > > > tied to a given CPU. There are coding conventions to handle this, for > > > example, do everything with preemption disabled and just after each > > > preempt_disable() verify that you are in fact running on the correct > > > CPU, but it is easy to imagine improvements. > > > > I don't think we should move per-cpu kthreads at all. Let's stop trying > > to save a few bytes of memory, and just leave them frozen. They'll run > > again if/when the CPU returns. > > OK, that would work for me. So, how do I go about freezing RCU's > per-CPU kthreads? Good question. 
Obviously, having callbacks hanging around until the CPU comes back is not viable, nor is blocking preempt during the callbacks. Calling get_online_cpus() is too heavy. I can think of three approaches:

1) Put the being-processed rcu calls into a per-cpu var, and pull them off that list with preempt disabled. This lets us clean up after the thread gets frozen as its CPU goes offline, but doesn't solve the case of going offline during a callback.

2) Sync with the thread somehow during a notifier callback. This is the same kind of logic as shutting the thread down, so it's not really attractive from a simplicity POV.

3) Create a per-cpu rwsem to stop a specific CPU from going down, and just grab that while we're processing rcu callbacks.

If this pattern of kthread is common, then #3 (or some equivalent lightweight way of stopping a specific CPU from going offline) is looking attractive. Cheers, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-26 0:41 ` Rusty Russell @ 2012-03-26 8:02 ` Peter Zijlstra 2012-03-26 13:09 ` Steven Rostedt 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2012-03-26 8:02 UTC (permalink / raw) To: Rusty Russell Cc: paulmck, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, 2012-03-26 at 11:11 +1030, Rusty Russell wrote: > Obviously, having callbacks hanging around until the CPU comes back is > not viable, nor is blocking preempt during the callbacks. Calling > get_online_cpus() is too heavy. -rt has patches (albeit somewhat ugly) to make get_online_cpus() a br-style rw-lock. That is, it makes the read side a per-cpu lock/counter so that it's much faster and only touches cpu-local state in the common case. It would be good to clean those up and make get_online_cpus() fast and usable. ^ permalink raw reply [flat|nested] 38+ messages in thread
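The shape of such a br-style lock can be modelled in userspace with C11 atomics. This is a deliberately simplified single-writer sketch, not the actual -rt patch: readers touch only their own "CPU" slot on the fast path, spinning stands in for sleeping, and all fake_* names are invented here.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_FAKE_CPUS 4

/* One reader counter per "CPU"; writer_waiting blocks new readers
 * while the hotplug (write) side is draining them. */
static atomic_int reader_count[NR_FAKE_CPUS];
static atomic_bool writer_waiting;

/* Read side: the common case touches only this CPU's counter. */
static void fake_get_online_cpus(int cpu)
{
	for (;;) {
		atomic_fetch_add(&reader_count[cpu], 1);
		if (!atomic_load(&writer_waiting))
			return;		/* fast, cpu-local path */
		/* A writer sneaked in: back out and wait it out. */
		atomic_fetch_sub(&reader_count[cpu], 1);
		while (atomic_load(&writer_waiting))
			;		/* the real code sleeps here */
	}
}

static void fake_put_online_cpus(int cpu)
{
	atomic_fetch_sub(&reader_count[cpu], 1);
}

/* Write side (cpu_down()): forbid new readers, then wait for every
 * CPU's reader count to drain to zero. */
static void fake_cpu_hotplug_begin(void)
{
	atomic_store(&writer_waiting, true);
	for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++)
		while (atomic_load(&reader_count[cpu]) != 0)
			;		/* the real code sleeps here too */
}

static void fake_cpu_hotplug_done(void)
{
	atomic_store(&writer_waiting, false);
}
```

The read side costs one atomic on a cpu-local counter in the common case; only the rare write side ever sweeps all the counters, which is the property being asked for above.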
* Re: CPU Hotplug rework 2012-03-26 8:02 ` Peter Zijlstra @ 2012-03-26 13:09 ` Steven Rostedt 2012-03-26 13:38 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2012-03-26 13:09 UTC (permalink / raw) To: Peter Zijlstra Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, 2012-03-26 at 10:02 +0200, Peter Zijlstra wrote: > On Mon, 2012-03-26 at 11:11 +1030, Rusty Russell wrote: > > Obviously, having callbacks hanging around until the CPU comes back is > > not viable, nor is blocking preempt during the callbacks. Calling > > get_online_cpus() is too heavy. > > -rt has patches (albeit somewhat ugly) to make get_online_cpus() a br > style rw-lock. Unfortunately, -rt still has broken cpu hotplug. -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework
From: Peter Zijlstra @ 2012-03-26 13:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 09:09 -0400, Steven Rostedt wrote:
> On Mon, 2012-03-26 at 10:02 +0200, Peter Zijlstra wrote:
> > On Mon, 2012-03-26 at 11:11 +1030, Rusty Russell wrote:
> > > Obviously, having callbacks hanging around until the CPU comes back is
> > > not viable, nor is blocking preempt during the callbacks. Calling
> > > get_online_cpus() is too heavy.
> >
> > -rt has patches (albeit somewhat ugly) to make get_online_cpus() a br
> > style rw-lock.
>
> Unfortunately, -rt still has broken cpu hotplug.

This is still the issue where we take the hotplug lock for write, but
then rely on other threads to complete -- say CPU_DOWN_PREPARE notifiers
doing flush_workqueue(), and the worker kthread needing to acquire the
hotplug lock for read, right?

Yes, that is bothersome, but it doesn't hinder the work required to get
get_online_cpus() usably fast, right?
* Re: CPU Hotplug rework
From: Steven Rostedt @ 2012-03-26 15:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 15:38 +0200, Peter Zijlstra wrote:
> This is still the issue where we take the hotplug lock for write, but
> then rely on other threads to complete -- say CPU_DOWN_PREPARE notifiers
> doing flush_workqueue(), and the worker kthread needing to acquire the
> hotplug lock for read, right?

Right. I did get it working, but it required making the current "ugly"
code even uglier, and Thomas nacked it for that reason. He wanted a
cleanup of hotplug instead.

> Yes, that is bothersome, but doesn't hinder the work required to get
> get_online_cpus() usably fast, right?

Yep.

What we do in -rt currently (without my updates) is create a thread
pinned to the CPU going down. This thread counts the "readers" that are
on the CPU -- that is, every user that wanted to do a
get_online_cpus(), and as -rt allows preemption while doing so, there
may be more than one of these for a given CPU.

Once the count goes to zero, new tasks that enter this path will block
on the hotplug lock. Then the CPU task we created is done and the cpu
offline continues. But then we call the cpu offline notifiers. Some of
the notifiers wake up threads and tell them to exit using
kthread_stop(). But if the thread being waited for does something that
enters the above path, it will block on the hotplug lock too, and cause
a deadlock. Unfortunately, in -rt the sleeping spinlocks go into this
path, thus any thread that is going down and grabs a sleeping spinlock
will hit this deadlock.

There's also the case where the thread that is going offline, or even
the notifier itself, may grab a lock that happens to be held by another
thread that is on the offlining CPU but is blocked on the hotplug lock.
This also causes a deadlock.

The workaround I added was to do several things:

1) Instead of blocking on the hotplug lock, a task tries to migrate
itself instead. If it succeeds, then we don't need to worry about that
thread. But if the thread is pinned to the CPU, then we need to worry
about it. I first had it block only in this case, but that wasn't good
enough, so I let them just continue.

2) Had the CPU thread that was created run in multiple phases. In the
first phase it still waited for the cpu ref counter to go to zero, but
instead of having tasks block, tasks would instead try to migrate, and
if they could not, just continue.

3) After all the notifiers are finished, notify the created CPU thread
to sync tasks. Now that the notifiers are done, we can make any
remaining task block. That is, the old method is used, where the CPU
thread waits for the ref counter to go to zero, and new tasks block on
the hotplug lock. Because this happens after the notifiers, we do not
need to worry about the previous deadlocks.

The above changes mostly worked. Where they broke was in paths in
bringing up the CPU that may call code with sleeping spinlocks while
preemption is disabled. But this is actually unrelated to cpu hotplug
itself, and more related to the normal problems caused by the sleeping
spinlocks added by -rt. The above changes also required reverting the
-rt changes done to workqueue.

Now, the issues we have:

1) We need to get tasks off the CPU going down. For most tasks this is
not an issue, but for CPU-specific kernel threads it can be. Getting
tasks off of the CPU is required before the notifiers are called, to
keep them from creating work on the CPU, because after the notifiers
there should be no more work added to the CPU.

2) Some tasks are going to go down and exit. We can audit all the
notifier callbacks for CPU offlining and see if we can just make those
threads dormant instead of killing them. As Rusty said, it may not be
that important to save the memory of these tasks.

3) Some tasks do not go offline; instead they just get moved to another
CPU. This is the case for ksoftirqd, as it is killed after the CPU is
down (POST_DEAD) -- at least in -rt it is.

All that is needed now is, at the beginning of taking the CPU down, to
push off tasks from the CPU that may migrate, then call the notifiers,
and then block the rest and take the CPU down. This seems to work fine;
it was just the implementation I proposed that was a bit too ugly for
Thomas's taste.

-- Steve
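[Editor's note: the multi-phase teardown Steven describes can be expressed as a small state machine -- first a migrate-or-continue phase, then the notifiers, then a hard sync where latecomers block. This is a hypothetical single-threaded userspace sketch of that ordering, not the -rt patch itself; all names are invented.]

```c
#include <stdbool.h>

/* Phases of the offlining CPU: normal operation, the pre-notifier
 * drain (tasks migrate away if they can), and the post-notifier sync
 * (everyone still around must block). */
enum hp_phase { HP_RUNNING, HP_MIGRATE, HP_SYNC };

struct hp_cpu {
    enum hp_phase phase;
    int pinned_readers;   /* readers that could not migrate away */
};

/* A task entering the hotplug read path during teardown. Returns
 * true if it may keep running on this CPU; false means it either
 * migrated to another CPU or is now blocked on the hotplug lock. */
static bool hp_task_enter(struct hp_cpu *c, bool can_migrate)
{
    switch (c->phase) {
    case HP_RUNNING:
        return true;                  /* no offline in progress */
    case HP_MIGRATE:
        if (can_migrate)
            return false;             /* pushed to another CPU */
        c->pinned_readers++;          /* pinned: let it continue */
        return true;
    case HP_SYNC:
    default:
        return false;                 /* after notifiers: block */
    }
}

/* The offline may complete only after the notifiers ran (HP_SYNC)
 * and every pinned reader has left its critical section. */
static bool hp_can_finish(struct hp_cpu *c)
{
    return c->phase == HP_SYNC && c->pinned_readers == 0;
}
```

The key property being modeled: pinned tasks are never blocked before the notifiers run, so the flush_workqueue()/kthread_stop() deadlocks described above cannot occur; blocking only starts in the final sync phase.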
* Re: CPU Hotplug rework
From: Peter Zijlstra @ 2012-03-26 16:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 11:22 -0400, Steven Rostedt wrote:
> The workaround I added was to do several things:
>
> 1) instead of blocking on the hotplug lock, try to migrate itself
> instead. If it succeeds, then we don't need to worry about this thread.
> But if the thread is pinned to the CPU, then we need to worry about it.
> I first had it block only in this case, but that wasn't good enough, so
> I let them just continue.
>
> 2) had the CPU thread that was created do a multi-phase. The first phase
> it still waited for the cpu ref counter to go to zero, but instead of
> having tasks block, tasks would instead try to migrate and if they could
> not then just continue.
>
> 3) after all the notifiers are finished, notify the created CPU thread
> to sync tasks. Now that the notifiers are done, we can make any
> remaining task block. That is, the old method is done, where the CPU
> thread waits for the ref counter to go to zero, and new tasks will block
> on the hotplug lock. Because this happens after the notifiers, we do not
> need to worry about the previous deadlocks.

So how about we add another variant of kthread_freezable_should_stop(),
maybe call it kthread_bound_should_stop(), that checks whether the cpu
it is bound to goes awol, and if so, parks it.

Then after CPU_DOWN_PREPARE, wait for all such threads (as registered
per kthread_bind()) to pass through kthread_bound_should_stop() and get
frozen.

This should restore PF_THREAD_BOUND to mean the task is actually bound
to this cpu, since if the cpu goes down, the task won't actually run at
all. Which means you can again use PF_THREAD_BOUND to bypass the whole
get_online_cpus()/pin_curr_cpu() muck.

Any subsystem that can still accrue state after this (e.g. softirq/rcu
and possibly kworker) needs to register a CPU_DYING or CPU_DEAD notifier
to either complete the state or take it away and give it to someone
else.

> Now what are the issues we have:
>
> 1) We need to get tasks off the CPU going down. For most tasks this is
> not an issue. But for CPU specific kernel threads, this can be an issue.
> To get tasks off of the CPU is required before the notifiers are called.
> This is to keep them from creating work on the CPU, because after the
> notifiers, there should be no more work added to the CPU.

This is hard for things like ksoftirqd, because for as long as
interrupts are enabled we can trigger softirqs. And since we need to
deal with that, we might as well deal with it for all and not bother.

See the CPU_DYING/DEAD notifier as described above that can deal with
this.

> 2) Some tasks are going to go down and exit. We can audit all the
> notifier callbacks for CPU offlining, and see if we can just make them
> dormant instead of killing them. As Rusty said, it may not be that
> important to save the memory of these tasks.

Right, this shouldn't be a difficult task, but it isn't required for -rt
afaict; it's just good practice.

> 3) Some tasks do not go offline, instead they just get moved to another
> CPU. This is the case of ksoftirqd. As it is killed after the CPU is
> down (POST_DEAD) (at least in -rt it is).

No, we should really stop allowing tasks that were kthread_bind() to run
anywhere else. Breaking the strict affinity and letting them run
someplace else to complete their work is what gets us into a whole heap
of trouble.

> All that is needed now is, at the beginning of taking the CPU down is to
> push off tasks from the CPU that may migrate. Then call the notifiers,
> and then block the rest and take the CPU down. This seems to work fine.
> It was just the implementation I proposed was a bit too ugly for
> Thomas's taste.

I really don't see the point in that.
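[Editor's note: the proposed kthread_bound_should_stop() loop can be sketched as follows. This is a hypothetical userspace model -- the flag-based "parking" stands in for what the real thing would do by calling schedule() until the CPU returns; the names are invented here, although a similar interface (kthread_park()/kthread_parkme()) did land in later kernels.]

```c
#include <stdbool.h>

enum cpu_state { CPU_UP, CPU_GOING_DOWN };

/* State a bound kthread consults in its main work loop. */
struct bound_kthread {
    enum cpu_state *cpu;   /* state of the CPU we are bound to */
    bool should_stop;      /* kthread_stop() was requested */
    bool parked;           /* thread is sitting out the offline */
};

/* Returns true only when the thread must break out of its work loop
 * and exit. If the bound CPU is going down, the thread parks instead
 * of stopping, so it never has to run on another CPU. */
static bool kthread_bound_should_stop(struct bound_kthread *k)
{
    if (k->should_stop)
        return true;       /* a real stop request: exit the loop */
    if (*k->cpu == CPU_GOING_DOWN) {
        k->parked = true;  /* real code: schedule() until CPU_UP */
        return false;
    }
    k->parked = false;     /* CPU is back: resume doing work */
    return false;
}
```

This is what restores the PF_THREAD_BOUND invariant Peter mentions: a parked thread never runs at all while its CPU is offline, so strict affinity is never broken.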
* Re: CPU Hotplug rework
From: Steven Rostedt @ 2012-03-26 17:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 18:13 +0200, Peter Zijlstra wrote:
> So how about we add another variant of kthread_freezable_should_stop(),
> maybe call it kthread_bound_should_stop() that checks if the cpu its
> bound to goes awol, if so, park it.

Do you mean to have this function automate the "park"? When it is
called, if the cpu is going down, should it simply schedule off and not
return until the CPU comes back online?

Actually, why not just keep kthread_should_stop() and instead create a
kthread_park(), and call that instead of kthread_stop()? Then when the
task calls kthread_should_stop(), that can park the thread.

> Then after CPU_DOWN_PREPARE, wait for all such threads (as registered
> per kthread_bind()) to pass through kthread_bound_should_stop() and get
> frozen.

We could have the notifiers call kthread_park().

> This should restore PF_THREAD_BOUND to mean its actually bound to this
> cpu, since if the cpu goes down, the task won't actually run at all.
> Which means you can again use PF_THREAD_BOUND to by-pass the whole
> get_online_cpus()/pin_curr_cpu() muck.
>
> Any subsystem that can still accrue state after this (eg, softirq/rcu
> and possible kworker) need to register a CPU_DYING or CPU_DEAD notifier
> to either complete the state or take it away and give it to someone
> else.

I'm afraid that this part is easier said than done.

> > Now what are the issues we have:
> >
> > 1) We need to get tasks off the CPU going down. For most tasks this is
> > not an issue. But for CPU specific kernel threads, this can be an issue.
> > To get tasks off of the CPU is required before the notifiers are called.
> > This is to keep them from creating work on the CPU, because after the
> > notifiers, there should be no more work added to the CPU.
>
> This is hard for things like ksoftirqd, because for as long as interrupts
> are enabled we can trigger softirqs. And since we need to deal with
> that, we might as well deal with it for all and not bother.

Heh, at least for -rt we don't need to worry about that, as interrupts
are threads and are moved to other CPUs. Although I'm not sure that's
true of the timer softirq.

> See the CPU_DYING/DEAD notifier as described above that can deal with
> this.
>
> > 2) Some tasks are going to go down and exit. We can audit all the
> > notifier callbacks for CPU offlining, and see if we can just make them
> > dormant instead of killing them. As Rusty said, it may not be that
> > important to save the memory of these tasks.
>
> Right, this shouldn't be a difficult task, but isn't required for -rt
> afaict, its just good practise.

If we get a consensus to do this, then sure.

> > 3) Some tasks do not go offline, instead they just get moved to another
> > CPU. This is the case of ksoftirqd. As it is killed after the CPU is
> > down (POST_DEAD) (at least in -rt it is).
>
> No, we should really stop allowing tasks that were kthread_bind() to run
> anywhere else. Breaking the strict affinity and letting them run
> someplace else to complete their work is what gets us into a whole heap
> of trouble.

Agreed, but fixing this is not an easy problem.

> > All that is needed now is, at the beginning of taking the CPU down is to
> > push off tasks from the CPU that may migrate. Then call the notifiers,
> > and then block the rest and take the CPU down. This seems to work fine.
> > It was just the implementation I proposed was a bit too ugly for
> > Thomas's taste.
>
> I really don't see the point in that.

It was the easiest fix for the current state of the kernel.

-- Steve
* Re: CPU Hotplug rework
From: Peter Zijlstra @ 2012-03-26 17:59 UTC (permalink / raw)
To: Steven Rostedt
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 13:05 -0400, Steven Rostedt wrote:
> On Mon, 2012-03-26 at 18:13 +0200, Peter Zijlstra wrote:
> > So how about we add another variant of kthread_freezable_should_stop(),
> > maybe call it kthread_bound_should_stop() that checks if the cpu its
> > bound to goes awol, if so, park it.
>
> Do you mean to have this function automate the "park". When it is
> called, if the cpu is going down it should simply schedule off and not
> return until the CPU comes back on line?

Yep.

> Actually, why not just keep "kthread_should_stop()" and instead create a
> "kthread_park()", and call that instead of kthread_stop(). Then when the
> task calls kthread_should_stop(), that can park the thread then.

That would add an

	if ((current->flags & PF_THREAD_BOUND) && kthread_should_park(cpu))

conditional to every kthread_stop() invocation. So, following the
example of kthread_freezable_should_stop(), I opted for another
function.

Note that kernel/workqueue.c should be fixed to use kthread_stop() or
whatever variant we implement, as it currently uses a home-brewed
solution to stop threads.

> > Then after CPU_DOWN_PREPARE, wait for all such threads (as registered
> > per kthread_bind()) to pass through kthread_bound_should_stop() and get
> > frozen.
>
> We could have the notifiers call kthread_park().

You mean to avoid having to track them through kthread_bind()? The
advantage of tracking them is that it's harder to 'forget' about one.

> > This should restore PF_THREAD_BOUND to mean its actually bound to this
> > cpu, since if the cpu goes down, the task won't actually run at all.
> > Which means you can again use PF_THREAD_BOUND to by-pass the whole
> > get_online_cpus()/pin_curr_cpu() muck.
> >
> > Any subsystem that can still accrue state after this (eg, softirq/rcu
> > and possible kworker) need to register a CPU_DYING or CPU_DEAD notifier
> > to either complete the state or take it away and give it to someone
> > else.
>
> I'm afraid that this part sounds easier than done.

Got anything particularly difficult in mind?

Workqueues can give the gcwq to unbound threads -- it doesn't guarantee
the per-cpu-ness of work items anyway. Softirqs can be run from
CPU_DYING, since interrupts will never be enabled again at that point.
RCU would have to make sure the cpu doesn't complete a grace period,
and fix up from CPU_DEAD: have it complete any outstanding grace
periods, move it to extended idle and steal the callback list.

I'm not sure there's anything really hard there.

> > > Now what are the issues we have:
> > >
> > > 1) We need to get tasks off the CPU going down. For most tasks this is
> > > not an issue. But for CPU specific kernel threads, this can be an issue.
> > > To get tasks off of the CPU is required before the notifiers are called.
> > > This is to keep them from creating work on the CPU, because after the
> > > notifiers, there should be no more work added to the CPU.
> >
> > This is hard for things like ksoftirq, because for as long as interrupts
> > are enabled we can trigger softirqs. And since we need to deal with
> > that, we might as well deal with it for all and not bother.
>
> Heh, at least for -rt we don't need to worry about that. As interrupts
> are threads and are moved to other CPUS. Although I'm not sure that's
> true about the timer softirq.

It's a problem for -rt too, since as long as interrupts are enabled (and
we can schedule) interrupts can come in and wake their respective
threads; this can happen during the CPU_DOWN_PREPARE notifiers just
fine. For both -rt and mainline we can schedule right up until we call
stop-machine; mainline (!threadirq) will continue servicing interrupts
another few instructions until the stop_machine bits disable interrupts
on all cpus. The difference is really not that big.

> > > 3) Some tasks do not go offline, instead they just get moved to another
> > > CPU. This is the case of ksoftirqd. As it is killed after the CPU is
> > > down (POST_DEAD) (at least in -rt it is).
> >
> > No, we should really stop allowing tasks that were kthread_bind() to run
> > anywhere else. Breaking the strict affinity and letting them run
> > someplace else to complete their work is what gets us into a whole heap
> > of trouble.
>
> Agreed, but to fix this is not an easy problem.

I'm not sure it's that hard, just work. If we get the above stuff done,
we should be able to put BUG_ON(p->flags & PF_THREAD_BOUND) in
select_fallback_rq().

Also, I think you should opt for the solution that has the
cleanest/strongest semantics, so you can add more debug infrastructure
around it to enforce it.
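[Editor's note: the "take it away and give it to someone else" option Peter describes for a CPU_DEAD notifier -- e.g. RCU stealing the dead CPU's callback list -- amounts to a list splice onto a surviving CPU. This is an illustrative userspace sketch; the types and names are invented here and are not the kernel's rcu_data structures.]

```c
#include <stddef.h>

/* A pending callback on some per-cpu singly linked list. */
struct callback {
    struct callback *next;
};

struct percpu_state {
    struct callback *head;  /* pending-callback list */
    int nr_pending;
};

/* Called from a CPU_DEAD notifier on a surviving CPU: splice the dead
 * CPU's pending callbacks onto our own list and empty the dead CPU's
 * state, so no work is lost when the CPU goes away. */
static void adopt_callbacks(struct percpu_state *live,
                            struct percpu_state *dead)
{
    struct callback **tail = &live->head;

    while (*tail)               /* walk to the end of our own list */
        tail = &(*tail)->next;
    *tail = dead->head;         /* splice the orphaned callbacks */
    live->nr_pending += dead->nr_pending;
    dead->head = NULL;          /* dead CPU now owns nothing */
    dead->nr_pending = 0;
}
```

Splicing preserves ordering within each list, which matters for callback semantics: the adopted callbacks still run, just on a different CPU.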
* Re: CPU Hotplug rework
From: Rusty Russell @ 2012-03-27 1:32 UTC (permalink / raw)
To: Steven Rostedt, Peter Zijlstra
Cc: paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 26 Mar 2012 13:05:41 -0400, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2012-03-26 at 18:13 +0200, Peter Zijlstra wrote:
> > So how about we add another variant of kthread_freezable_should_stop(),
> > maybe call it kthread_bound_should_stop() that checks if the cpu its
> > bound to goes awol, if so, park it.
>
> Do you mean to have this function automate the "park". When it is
> called, if the cpu is going down it should simply schedule off and not
> return until the CPU comes back on line?
>
> Actually, why not just keep "kthread_should_stop()" and instead create a
> "kthread_park()", and call that instead of kthread_stop(). Then when the
> task calls kthread_should_stop(), that can park the thread then.

Do we ever have to do this? Why not say we never move kthreads, and if
anyone really needs this, they have to do it themselves.

We *will* need a way for the kthreads to say "don't take this cpu away
now" in a very lightweight way, but I'm happy with the -rt solution that
Steven mentioned.

Basically, it's time to revisit cpu hotplug entirely. Let's not get too
caught up in what we've got now.

Cheers,
Rusty.

PS. And I'm sure you'll finish the work while I'm on honeymoon :)

--
How could I marry someone with more hair than me? http://baldalex.org
* Re: CPU Hotplug rework
From: Steven Rostedt @ 2012-03-27 3:05 UTC (permalink / raw)
To: Rusty Russell
Cc: Peter Zijlstra, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Tue, 2012-03-27 at 12:02 +1030, Rusty Russell wrote:
> PS. And I'm sure you'll finish the work while I'm on honeymoon :)

I'll be getting ready for Linux Collaboration. But have fun on your
honeymoon, and congrats!

-- Steve